Knowledge Matters: Importance of Prior Information for Optimization
Abstract
We explore the effect of introducing prior information into the intermediate level of deep supervised neural networks for a learning task on which all the black-box state-of-the-art machine learning algorithms tested have failed to learn. We motivate our work from the hypothesis that there is an optimization obstacle involved in the nature of such tasks, and that humans learn useful intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset for which each image input contains three sprites, and the binary target class is 1 if all three have the same shape. Black-box machine learning algorithms performed only at chance level on this task. Standard deep supervised neural networks also failed. However, using a particular structure and guiding the learner by providing intermediate targets in the form of intermediate concepts (the presence of each object) makes it possible to solve the task. Much better than chance but imperfect results are also obtained by exploring architecture and optimization variants, pointing towards a difficult optimization task. We hypothesize that the learning difficulty is due to the composition of two highly nonlinear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of effective local minima (possibly due to ill-conditioning and the training procedure not being able to escape what appears like a local minimum).
Département d’informatique et de recherche opérationnelle
Université de Montréal, Montréal, QC, Canada Yoshua Bengio bengioy@iro.umontreal.ca
Département d’informatique et de recherche opérationnelle
Université de Montréal, Montréal, QC, Canada
Keywords: Deep Learning, Neural Networks, Optimization, Evolution of Culture, Curriculum Learning, Training with Hints
1 Introduction
There is a recent emerging interest in different fields of science for cultural learning (Henrich and McElreath, 2003) and how groups of individuals exchanging information can learn in ways superior to individual learning. This is also witnessed by the emergence of new research fields such as “Social Neuroscience”. Learning from other agents in an environment by means of cultural transmission of knowledge with peer-to-peer communication is an efficient and natural way of acquiring or propagating common knowledge. The most popular belief on how the information is transmitted between individuals is that bits of information are transmitted by small units, called memes, which share some characteristics of genes, such as self-replication, mutation and response to selective pressures (Dawkins, 1976).
This paper is based on the hypothesis (which is further elaborated in Bengio (2013a)) that human culture and the evolution of ideas have been crucial to counter an optimization issue: this difficulty would otherwise make it difficult for human brains to capture high level knowledge of the world without the help of other educated humans. In this paper machine learning experiments are used to investigate some elements of this hypothesis by seeking answers for the following questions: are there machine learning tasks which are intrinsically hard for a lone learning agent but that may become very easy when intermediate concepts are provided by another agent as additional intermediate learning cues, in the spirit of Curriculum Learning (Bengio et al., 2009b)? What makes such learning tasks more difficult? Can specific initial values of the neural network parameters yield success when random initialization yields complete failure? Is it possible to verify whether the problem being faced is an optimization problem or a regularization problem? These are the questions discussed (if not completely addressed) here, which relate to the following broader question: how can humans (and potentially one day, machines) learn complex concepts?
In this paper, results of different machine learning algorithms on an artificial learning task involving binary 64×64 images are presented. In that task, each image in the dataset contains 3 Pentomino tetris sprites (simple shapes). The task is to figure out if all the sprites in the image are the same or if there are different sprite shapes in the image. Several state-of-the-art machine learning algorithms have been tested and none of them could perform better than a random predictor on the test set. Nevertheless, by providing hints about the intermediate concepts (the presence and location of particular sprite classes), the problem can easily be solved, whereas the same-architecture neural network without the intermediate concepts guidance fails. Surprisingly, our attempts at solving this problem with unsupervised pretraining algorithms failed. However, with specific variations in the network architecture or training procedure, it is found that one can make a big dent in the problem. To show the impact of intermediate level guidance, we experimented with a two-tiered neural network, with supervised pretraining of the first part to recognize the category of sprites independently of their orientation and scale, at different locations, while the second part learns from the output of the first part and predicts the binary task of interest.
The objective of this paper is not to propose a novel learning algorithm or architecture, but rather to refine our understanding of the learning difficulties involved with composed tasks (here a logical formula composed with the detection of object classes), in particular the training difficulties involved for deep neural networks. The results also bring empirical evidence in favor of some of the hypotheses from Bengio (2013a), discussed below, as well as introducing a particular form of curriculum learning (Bengio et al., 2009b).
Building difficult AI problems has a long history in computer science. Specifically, hard AI problems have been studied to create CAPTCHAs that are easy to solve for humans, but hard to solve for machines (Von Ahn et al., 2003). In this paper we are investigating a difficult problem for the off-the-shelf black-box machine learning algorithms. [Footnote 1: The source code of some experiments presented in this paper, together with their hyperparameters, can be accessed here: https://github.com/caglar/kmatters]
1.1 Curriculum Learning and Cultural Evolution Against Effective Local Minima
What Bengio (2013a) calls an effective local minimum is a point where iterative training stalls, either because of an actual local minimum or because the optimization algorithm is unable (in reasonable time) to find a descent path (e.g., because of serious ill-conditioning). In this paper, it is hypothesized that some more abstract learning tasks such as those obtained by composing simpler tasks are more likely to yield effective local minima for neural networks, and are generally hard for general-purpose machine learning algorithms.
The idea that learning can be enhanced by guiding the learner through intermediate easier tasks is old, starting with animal training by shaping (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009). Bengio et al. (2009b) introduce a computational hypothesis related to a presumed issue with effective local minima when directly learning the target task: the good solutions correspond to hard-to-find-by-chance effective local minima, and intermediate tasks prepare the learner’s internal configuration (parameters) in a way similar to continuation methods in global optimization (which go through a sequence of intermediate optimization problems, starting with a convex one where local minima are no issue, and gradually morphing into the target task of interest).
In a related vein, Bengio (2013a) makes the following inferences based on experimental observations of deep learning and neural network learning:

Point 2: It is much easier to train a neural network with supervision (where examples are provided to it of when a concept is present and when it is not present in a variety of examples) than to expect unsupervised learning to discover the concept (which may also happen but usually leads to poorer renditions of the concept). The poor results obtained with unsupervised pretraining reinforce that hypothesis.

Point 3: Directly training all the layers of a deep network together not only makes it difficult to exploit all the extra modeling power of a deeper architecture but in many cases it actually yields worse results as the number of required layers is increased (Larochelle et al., 2009; Erhan et al., 2010). The experiments performed here also reinforce that hypothesis.

Point 4: Erhan et al. (2010) observed that no two training trajectories ended up in the same effective local minimum, out of hundreds of runs, even when comparing solutions as functions from input to output, rather than in parameter space (thus eliminating from the picture the presence of symmetries and multiple local minima due to relabeling and other reparametrizations). This suggests that the number of different effective local minima (even when considering them only in function space) must be huge.

Point 5: Unsupervised pretraining, which changes the initial conditions of the descent procedure, sometimes makes it possible to reach substantially better effective local minima (in terms of generalization error!), and these better local minima do not appear to be reachable by chance alone (Erhan et al., 2010). The experiments performed here provide another piece of evidence in favor of the hypothesis that where random initialization can yield rather poor results, specifically targeted initialization can have a drastic impact, i.e., that effective local minima are not just numerous but that some small subset of them are much better and hard to reach by chance. [Footnote 2: Recent work showed that rather deep feedforward networks can be very successfully trained when large quantities of labeled data are available (Ciresan et al., 2010; Glorot et al., 2011a; Krizhevsky et al., 2012). Nonetheless, the experiments reported here suggest that it all depends on the task being considered, since even with very large quantities of labeled examples, the deep networks trained here were unsuccessful.]
Based on the above points, Bengio (2013a) then proposed the following hypotheses regarding learning of high-level abstractions.

Optimization Hypothesis: When it learns, a biological agent performs an approximate optimization with respect to some implicit objective function.

Deep Abstractions Hypothesis: Higher level abstractions represented in brains require deeper computations (involving the composition of more nonlinearities).

Local Descent Hypothesis: The brain of a biological agent relies on approximate local descent and gradually improves itself while learning.

Effective Local Minima Hypothesis: The learning process of a single human learner (not helped by others) is limited by effective local minima.

Deeper Harder Hypothesis: Effective local minima are more likely to hamper learning as the required depth of the architecture increases.

Abstractions Harder Hypothesis: High-level abstractions are unlikely to be discovered by a single human learner by chance, because these abstractions are represented by a deep subnetwork of the brain, which learns by local descent.

Guided Learning Hypothesis: A human brain can learn high-level abstractions if guided by the signals produced by other agents that act as hints or indirect supervision for these high-level abstractions.

Memes Divide-and-Conquer Hypothesis: Linguistic exchange, individual learning and the recombination of memes constitute an efficient evolutionary recombination operator in the meme-space. This helps human learners to collectively build better internal representations of their environment, including fairly high-level abstractions.
This paper is focused on “Point 1” and testing the “Guided Learning Hypothesis”, using machine learning algorithms to provide experimental evidence. The experiments performed also provide evidence in favor of the “Deeper Harder Hypothesis” and associated “Abstractions Harder Hypothesis”. Machine learning is still far below the current capabilities of humans, and it is important to tackle the remaining obstacles to approach AI. For this purpose, the question to be answered is why machine learning algorithms fail miserably on tasks that humans learn effortlessly from very few examples.
2 Culture and Optimization Difficulty
As hypothesized in the “Local Descent Hypothesis”, human brains would rely on a local approximate descent, just like a Multi-Layer Perceptron trained by a gradient-based iterative optimization. The main argument in favor of this hypothesis relies on the biologically-grounded assumption that although firing patterns in the brain change rapidly, synaptic strengths underlying these neural activities change only gradually, making sure that behaviors are generally consistent across time. If a learning algorithm is based on a form of local (e.g. gradient-based) descent, it can be sensitive to effective local minima (Bengio, 2013a).
When one trains a neural network, at some point in the training phase the evaluation of error seems to saturate, even if new examples are introduced. In particular Erhan et al. (2010) find that early examples have a much larger weight in the final solution. It looks like the learner is stuck in or near a local minimum. But since it is difficult to verify if this is near a true local minimum or simply an effect of strong ill-conditioning, we call such a “stuck” configuration an effective local minimum, whose definition depends not just on the optimization objective but also on the limitations of the optimization algorithm.
Erhan et al. (2010) highlighted both the issue of effective local minima and a regularization effect when initializing a deep network with unsupervised pretraining. Interestingly, as the network gets deeper the difficulty due to effective local minima seems to get more pronounced. That might be because the number of effective local minima increases (more like an actual local minima issue), or maybe because the good ones are harder to reach (more like an ill-conditioning issue), and more work will be needed to clarify this question.
As a result of Point 4 we hypothesize that it is very difficult for an individual’s brain to discover some higher level abstractions by chance only. As mentioned in the “Guided Learning Hypothesis”, humans get hints from other humans and learn high-level concepts with the guidance of other humans. [Footnote 3: But some high-level concepts may also be hardwired in the brain, as assumed in the universal grammar hypothesis (Montague, 1970), or in nature vs nurture discussions in cognitive science.] Curriculum learning (Bengio et al., 2009a) and incremental learning (Solomonoff, 1989) are examples of this. This is done by properly choosing the sequence of examples seen by the learner, where simpler examples are introduced first and more complex examples are shown when the learner is ready for them. One of the hypotheses on why a curriculum works states that curriculum learning acts as a continuation method that allows one to discover a good minimum, by first finding a good minimum of a smoother error function. Recent experiments on human subjects also indicate that humans teach by using a curriculum strategy (Khan et al., 2011).
Some parts of the human brain are known to have a hierarchical organization (e.g., the visual cortex) consistent with the deep architectures studied in machine learning papers. As we go from the sensory level to higher levels of the visual cortex, we find higher level areas corresponding to more abstract concepts. This is consistent with the Deep Abstractions Hypothesis.
Training neural networks and machine learning algorithms by decomposing the learning task into subtasks and exploiting prior information about the task is well-established and in fact constitutes the main approach to solving industrial problems with machine learning. The contribution of this paper is rather on rendering explicit the effective local minima issue and providing evidence on the type of problems for which this difficulty arises. This prior information and hints given to the learner can be viewed as inductive bias for a particular task, an important ingredient to obtain a good generalization error (Mitchell, 1980). An interesting earlier finding in that line of research was done with Explanation Based Neural Networks (EBNN) in which a neural network transfers knowledge across multiple learning tasks. An EBNN uses previously learned domain knowledge as an initialization or search bias (i.e. to constrain the learner in the parameter space) (O’Sullivan, 1996; Mitchell and Thrun, 1993).
Another related line of work in machine learning, mainly focused on reinforcement learning algorithms, incorporates prior knowledge in the form of logical rules into the learning algorithm in order to speed up and bias learning (Kunapuli et al., 2010; Towell and Shavlik, 1994).
As discussed in the “Memes Divide-and-Conquer Hypothesis”, societies can be viewed as distributed computational processing systems. In civilized societies knowledge is distributed across different individuals, which yields a space efficiency. Moreover, computation is also divided across the individuals in the society, i.e. each individual can specialize on a particular task/topic, and hence this yields a computational efficiency. Considering the limitations of the human brain, the whole processing cannot be done efficiently by a single agent. A recent study in paleoanthropology states that there has been a substantial decline in the endocranial volume of the brain in the last 30,000 years (Henneberg, 1988). The volume of the brain shrank to 1241 ml from 1502 ml (Henneberg and Steyn, 1993). One of the hypotheses on the reduction of the volume of the skull claims that the decline in the volume of the brain might be related to the functional changes in the brain that arose as a result of cultural development and the emergence of societies, given that this time period overlaps with the transition from a hunter-gatherer lifestyle to agricultural societies.
3 Experimental Setup
Some tasks, which seem reasonably easy for humans to learn [Footnote 4: Keeping in mind that humans can exploit prior knowledge, either from previous learning or innate knowledge.], nonetheless appear almost impossible to learn for current generic state-of-the-art machine learning algorithms.
Here we study more closely such a task, which becomes learnable if one provides hints to the learner about appropriate intermediate concepts. Interestingly, the task we used in our experiments is not only hard for deep neural networks but also for nonparametric machine learning algorithms such as SVMs, boosting and decision trees.
The results of the experiments for varying dataset sizes with several off-the-shelf black-box machine learning algorithms and some popular deep learning algorithms are provided in Table 1. The detailed explanations about the algorithms and the hyperparameters used for those algorithms are given in the Appendix Section A.2. We also provide some explanations about the methodology followed for the experiments in Section 3.2.
3.1 Pentomino Dataset
In order to test our hypothesis, an artificial dataset for object recognition using 64×64 binary images is designed. [Footnote 5: The source code for the script that generates the artificial Pentomino datasets (ArcadeUniverse) is available at: https://github.com/caglar/ArcadeUniverse. This implementation is based on Olivier Breuleux’s bugland dataset generator.] If the task is two-tiered (i.e., with guidance provided), the task in the first part is to recognize and locate each Pentomino object class in the image. [Footnote 6: A human learner does not seem to need to be taught the shape categories of each Pentomino sprite in order to solve the task. On the other hand, humans have lots of previously learned knowledge about the notion of shape and how central it is in defining categories.] The second part/final binary classification task is to figure out if all the Pentominos in the image are of the same shape class or not. If a neural network learned to detect the categories of each object at each location in an image, the remaining task becomes an XOR-like operation between the detected object categories. The types of Pentomino objects that are used for generating the dataset are as follows:
Pentomino sprites N, P, F, Y, J, and Q, along with the Pentomino N2 sprite (mirror of “Pentomino N” sprite), the Pentomino F2 sprite (mirror of “Pentomino F” sprite), and the Pentomino Y2 sprite (mirror of “Pentomino Y” sprite).
As shown in Figures 1(a) and 1(b), the synthesized images are fairly simple and do not have any texture. Foreground pixels are “1” and background pixels are “0”. Images of the training and test sets are generated iid. For notational convenience, assume that the domain of raw input images is $X$, the set of sprites is $S$, the set of intermediate object categories is $Y$ for each possible location in the image and the set of final binary task outcomes is $Z = \{0, 1\}$. Two different types of rigid body transformation are performed: sprite rotation by an angle $\gamma$, where $\gamma$ is a multiple of 90 degrees, and scaling by $\alpha$, where $\alpha$ is the scaling factor. The data generating procedure is summarized below.

Sprite transformations: Before placing the sprites in an empty image, for each image $x \in X$, a value for the binary target $z \in Z$ is randomly sampled, which determines whether or not the image contains three sprites of the same shape. Conditioned on the constraint given by $z$, three sprites are randomly selected from $S$ without replacement. Using a uniform probability distribution over all possible scales, a scale is chosen and each sprite image is scaled accordingly. Then each sprite is randomly rotated by a multiple of 90 degrees.

Sprite placement: Upon completion of sprite transformations, a 64×64 uniform grid is generated which is divided into 8×8 blocks, each block being of size 8×8 pixels. Three different blocks are then randomly selected from the 64 = 8×8 blocks on the grid, and the transformed objects are placed into different blocks (so they cannot overlap, by construction).
Each sprite is centered in the block in which it is located. Thus there is no object translation inside the blocks. The only translation invariance is due to the location of the block inside the image.
A Pentomino sprite is guaranteed to not overflow the block in which it is located, and there are no collisions or overlaps between sprites, making the task simpler. The largest possible Pentomino sprite can be fit into an 8×4 mask.
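The placement procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' generator: the three sprite masks, the function name `make_image`, and the rule used for "different" images (at least two sprites differ) are hypothetical choices made for this example, and scaling is left out for brevity.

```python
import numpy as np

# Hypothetical 5-cell sprites on small binary masks (illustrative shapes only,
# not the sprites produced by the authors' generator).
SPRITES = {
    "P": np.array([[1, 1], [1, 1], [1, 0]], dtype=np.uint8),
    "Y": np.array([[0, 1], [1, 1], [0, 1], [0, 1]], dtype=np.uint8),
    "N": np.array([[0, 1], [0, 1], [1, 1], [1, 0]], dtype=np.uint8),
}


def make_image(rng, same_shape, grid=8, block=8):
    """Return (image, label): three sprites placed in three distinct blocks."""
    names = list(SPRITES)
    if same_shape:
        chosen = [names[rng.integers(len(names))]] * 3
    else:
        # Assumption: "different" means at least two of the three shapes differ.
        while True:
            chosen = [names[i] for i in rng.integers(len(names), size=3)]
            if len(set(chosen)) > 1:
                break
    image = np.zeros((grid * block, grid * block), dtype=np.uint8)
    # Three different blocks out of grid * grid, so sprites cannot overlap.
    cells = rng.choice(grid * grid, size=3, replace=False)
    for name, cell in zip(chosen, cells):
        # Rotate by a random multiple of 90 degrees.
        sprite = np.rot90(SPRITES[name], k=int(rng.integers(4)))
        h, w = sprite.shape
        row, col = divmod(int(cell), grid)
        # Center the sprite inside its block (no translation within a block).
        top = row * block + (block - h) // 2
        left = col * block + (block - w) // 2
        image[top:top + h, left:left + w] = sprite
    return image, int(same_shape)


rng = np.random.default_rng(0)
image_same, label_same = make_image(rng, same_shape=True)
image_diff, label_diff = make_image(rng, same_shape=False)
```

Following the abstract, the label is 1 when all three sprites share a shape. Because each sprite is confined to its own block, the number of foreground pixels is always three times the sprite size.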
3.2 Learning Algorithms Evaluated
Initially the models are cross-validated by using 5-fold cross-validation. With 40,000 examples, this gives 32,000 examples for training and 8,000 examples for testing. For neural network algorithms, stochastic gradient descent (SGD) is used for training. The following standard learning algorithms were first evaluated: decision trees, SVMs with Gaussian kernel, ordinary fully-connected Multi-Layer Perceptrons, Random Forests, k-Nearest Neighbors, Convolutional Neural Networks, and Stacked Denoising Auto-Encoders with supervised fine-tuning. More details of the configurations and hyperparameters for each of them are given in Appendix Section A.2. The only better than chance results were obtained with variations of the Structured Multi-Layer Perceptron described below.
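The split sizes quoted above follow directly from 5-fold cross-validation: each fold holds out one fifth of the 40,000 examples. A small sketch, assuming a plain shuffled split (the function name `five_fold_indices` and the seed are hypothetical; the paper does not describe how folds were drawn):

```python
import numpy as np


def five_fold_indices(n_examples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)
    folds = np.array_split(perm, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        # Training indices are all folds except the held-out one.
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k]
        )
        yield train_idx, test_idx


splits = list(five_fold_indices(40_000))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

With 40,000 examples, every entry of `fold_sizes` is a (32000, 8000) pair, matching the numbers in the text.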
3.2.1 Structured MultiLayer Perceptron (SMLP)
The neural network architecture that is used to solve this task is called the SMLP (Structured Multi-Layer Perceptron), a deep neural network with two parts as illustrated in Figures 5 and 7:
The lower part, P1NN (Part 1 Neural Network, as it is called in the rest of the paper), has shared weights and local connectivity, with one identical MLP instance of the P1NN for each patch of the image, and typically an 11-element output vector per patch (unless otherwise noted). The idea is that these 11 outputs per patch could represent the detection of the sprite shape category (or the absence of sprite in the patch). The upper part, P2NN (Part 2 Neural Network) is a fully connected one hidden layer MLP that takes the concatenation of the outputs of all patch-wise P1NNs as input. Note that the first layer of P1NN is similar to a convolutional layer but where the stride equals the kernel size, so that windows do not overlap, i.e., P1NN can be decomposed into separate networks sharing the same parameters but applied on different patches of the input image, so that each network can actually be trained patch-wise in the case where a target is provided for the P1NN outputs. The P1NN output for patch $p_i$, which is extracted from the image $x$, is computed as follows:

$$f_\theta(p_i) = g_2\left(V\, g_1(U p_i + b) + c\right) \qquad (1)$$

where $p_i \in \mathbb{R}^{d}$ is the input patch/receptive field extracted from location $i$ of a single image. $U \in \mathbb{R}^{d_h \times d}$ is the weight matrix for the first layer of P1NN and $b \in \mathbb{R}^{d_h}$ is the vector of biases for the first layer of P1NN. $g_1(\cdot)$ is the activation function of the first layer and $g_2(\cdot)$ is the activation function of the second layer. In many of the experiments, best results were obtained with $g_1(\cdot)$ a rectifying nonlinearity (a.k.a. RELU), which is $\max(0, a)$ applied elementwise (Jarrett et al., 2009b; Nair and Hinton, 2010; Glorot et al., 2011a; Krizhevsky et al., 2012). $V \in \mathbb{R}^{d_o \times d_h}$ is the second layer’s weight matrix and $c \in \mathbb{R}^{d_o}$ are the biases of the second layer of the P1NN, with $d_o$ expected to be smaller than $d_h$.
In this way, $g_1(U p_i + b)$ is an overcomplete representation of the input patch that can potentially represent all the possible Pentomino shapes for all factors of variations in the patch (rotation, scaling and Pentomino shape type). On the other hand, when trained with hints, $f_\theta(p_i)$ is expected to be the lower dimensional representation of a Pentomino shape category invariant to scaling and rotation in the given patch.
In the experiments with SMLP trained with hints (targets at the output of P1NN), the P1NN is expected to perform classification of each of the 8×8 non-overlapping patches of the original 64×64 input image without having any prior knowledge of whether that specific patch contains a Pentomino shape or not. P1NN in SMLP without hints just outputs the local activations for each patch, and gradients on the P1NN outputs are backpropagated from the upper layers. In both cases P1NN produces the input representation for the Part 2 Neural Net (P2NN). Thus the input representation of P2NN is the concatenated output of P1NN across all the 64 patch locations:

$$h_o = [f_\theta(p_1), \ldots, f_\theta(p_i), \ldots, f_\theta(p_N)]$$

where $f_\theta(p_i)$ denotes the P1NN output for patch $p_i$, $N$ is the number of patches and $h_o \in \mathbb{R}^{N d_o}$, with $d_o$ the number of P1NN outputs per patch. $h_o$ is the concatenated output of the P1NN at each patch.
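The patch-wise computation and concatenation described above can be written compactly with array reshaping. The sketch below is illustrative only: the symbols `U`, `b`, `V`, `c` for the two P1NN layers, the random initialization, and the choice of softmax for the second activation (the one the text reports for the hints variant) are assumptions made for this example, and no training is performed.

```python
import numpy as np


def extract_patches(image, block=8):
    """Split an (H, W) image into flattened non-overlapping block x block patches."""
    h, w = image.shape
    patches = image.reshape(h // block, block, w // block, block)
    patches = patches.transpose(0, 2, 1, 3)        # (rows, cols, block, block)
    return patches.reshape(-1, block * block)      # (N, d)


def p1nn_forward(patches, U, b, V, c):
    """Apply the same two-layer MLP to every patch (shared weights)."""
    hidden = np.maximum(0.0, patches @ U.T + b)    # first layer: rectifier
    logits = hidden @ V.T + c                      # second layer pre-activation
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
d, d_h, d_o = 64, 1024, 11                         # patch size, hidden, outputs
U = rng.normal(scale=0.01, size=(d_h, d))
b = np.zeros(d_h)
V = rng.normal(scale=0.01, size=(d_o, d_h))
c = np.zeros(d_o)

image = rng.integers(0, 2, size=(64, 64)).astype(np.float64)
patches = extract_patches(image)                   # 64 patches of 64 pixels
h_o = p1nn_forward(patches, U, b, V, c).reshape(-1)  # concatenated P1NN output
```

Because the stride equals the patch size, a reshape is enough to extract the patches; the resulting vector has 64 × 11 = 704 entries and is what the upper network would receive as input.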
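There is a standardization layer on top of the output of P1NN that centers the activations and performs divisive normalization by dividing by the standard deviation over a minibatch of the activations of that layer. We denote the standardization function $z(\cdot)$. Standardization makes use of the mean and standard deviation computed for each hidden unit such that each hidden unit of the P1NN output layer $h_o$ will have zero mean activation and unit standard deviation on average over the minibatch. $X$ is the set of Pentomino images in the minibatch, where $X \in \mathbb{R}^{d_{in} \times n}$ is a matrix with $n$ images. $h_o^{(i)}(x_j)$ is the activation of the $i$-th hidden unit of hidden layer $h_o(x_j)$ for the $j$-th example, with $x_j \in X$.

$$\mu_{h_o^{(i)}} = \frac{1}{n} \sum_{x_j \in X} h_o^{(i)}(x_j) \qquad (2)$$

$$\sigma_{h_o^{(i)}} = \sqrt{\frac{1}{n} \sum_{x_j \in X} \left(h_o^{(i)}(x_j) - \mu_{h_o^{(i)}}\right)^2 + \epsilon} \qquad (3)$$

$$z\left(h_o^{(i)}(x_j)\right) = \frac{h_o^{(i)}(x_j) - \mu_{h_o^{(i)}}}{\max\left(\sigma_{h_o^{(i)}}, \epsilon\right)} \qquad (4)$$

where $\epsilon$ is a very small constant that is used to prevent numerical underflows in the standard deviation. P1NN is trained on each of the 8×8 patches extracted from the image.
$h_o$ is standardized for each training and test sample separately. Different values of $\epsilon$ were used for SMLP-hints and SMLP-nohints.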
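The standardization step amounts to a per-unit centering and rescaling over the minibatch. A minimal NumPy sketch follows; the function name `standardize`, the default value of `eps`, and the exact placement of `eps` (inside the square root and as a floor on the denominator) are assumptions of this example rather than settings confirmed by the text.

```python
import numpy as np


def standardize(h, eps=1e-8):
    """Standardize activations h of shape (n_examples, n_units) per unit.

    eps is a small constant guarding against a vanishing standard deviation;
    its value here is a placeholder, not the paper's setting.
    """
    mu = h.mean(axis=0)                                   # per-unit mean
    sigma = np.sqrt(((h - mu) ** 2).mean(axis=0) + eps)   # per-unit std
    return (h - mu) / np.maximum(sigma, eps)              # center and rescale


rng = np.random.default_rng(0)
activations = rng.normal(loc=3.0, scale=2.0, size=(128, 5))
activations[:, 4] = 7.0        # a constant unit, to exercise the eps guard
standardized = standardize(activations)
```

Units with non-trivial variance end up with mean close to 0 and standard deviation close to 1 over the minibatch, while the constant unit maps to zeros instead of producing a division by zero.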
The concatenated output of P1NN is fed as an input to the P2NN. P2NN is a feedforward MLP with a sigmoid output layer using a single RELU hidden layer. The task of P2NN is to perform a nonlinear logical operation on the representation provided at the output of P1NN.
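To make the logical operation concrete: if P1NN delivered a perfect category label for every patch, the function P2NN has to learn would reduce to checking whether all non-empty patches carry the same label. The sketch below assumes a hypothetical encoding (label 0 for an empty patch, labels 1 and above for shape classes) and follows the abstract in using target 1 when all sprites share a shape.

```python
import numpy as np


def same_shape_target(patch_labels, empty_label=0):
    """Return 1 if all non-empty patches hold the same category, else 0.

    patch_labels: integer array with one category per patch.
    empty_label: value marking a patch without a sprite (assumed encoding).
    """
    present = patch_labels[patch_labels != empty_label]
    return int(np.unique(present).size == 1)


labels_same = np.zeros(64, dtype=int)
labels_same[[3, 20, 41]] = 5           # three sprites of category 5

labels_diff = labels_same.copy()
labels_diff[41] = 7                    # one sprite now has another category

target_same = same_shape_target(labels_same)
target_diff = same_shape_target(labels_diff)
```

The rule is trivial once the categories are known, which is why the difficulty of the full task is attributed to learning both stages jointly from pixels.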
3.2.2 Structured Multi-Layer Perceptron Trained with Hints (SMLP-hints)
The SMLP-hints architecture exploits a hint about the presence and category of Pentomino objects, specifying a semantics for the P1NN outputs. P1NN is trained with the intermediate target $Y$, specifying the type of Pentomino sprite shape present (if any) at each of the 64 patches (8×8 non-overlapping blocks) of the image. Because a possible answer at a given location can be “none of the object types”, i.e., an empty patch, the target $y_i$ (for patch $p_i$) can take one of the 11 possible values, 1 for rejection and the rest for the Pentomino shape classes, illustrated in Figure 2.
A similar task has been studied by Fleuret et al. (2011) (at SI appendix Problem 17), who compared the performance of humans vs computers.
The SMLP-hints architecture takes advantage of dividing the task into two subtasks during training with prior information about intermediate-level relevant factors. Because the sum of the training losses decomposes into the loss on each patch, the P1NN can be pretrained patch-wise. Each patch-specific component of the P1NN is a fully connected MLP with 8×8 inputs and 11 outputs with a softmax output layer. SMLP-hints uses the standardization given in Equation 3, with its own setting of the constant $\epsilon$.
The standardization is a crucial step for training the SMLP on the Pentomino dataset, and yields much sparser outputs, as seen in Figures 3 and 4. If the standardization is not used, even SMLP-hints could not solve the Pentomino task. In general, the standardization step dampens the small activations and augments larger ones (reducing the noise). Centering the activations of each feature detector in a neural network has been studied in Raiko et al. (2012) and Vatanen et al. (2013). They proposed that transforming the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average makes first-order optimization methods closer to second-order techniques.
By default, the SMLP uses rectifier hidden units as activation function; we found a significant boost by using rectification compared to hyperbolic tangent and sigmoid activation functions. The P1NN has a highly overcomplete architecture with 1024 hidden units per patch, and L1 and L2 weight decay regularization coefficients on the weights (not the biases) are respectively 1e-6 and 1e-5. The learning rate for the P1NN is 0.75. One training epoch was enough for the P1NN to learn the features of Pentomino shapes perfectly on the 40,000 training examples. The P2NN has 2048 hidden units. L1 and L2 penalty coefficients for the P2NN are 1e-6, and the learning rate is 0.1. These were selected by trial and error based on validation set error. Both P1NN (for each patch) and P2NN are fully-connected neural networks, even though P1NN globally is a special kind of convolutional neural network.
3.2.3 Deep and Structured Supervised MLP without Hints (SMLP-nohints)
SMLP-nohints uses the same connectivity pattern (and deep architecture) that is also used in the SMLP-hints architecture, but without using the intermediate targets ($Y$). It directly predicts the final outcome of the task ($Z$), using the same number of hidden units, the same connectivity and the same activation function for the hidden units as SMLP-hints. 120 hyperparameter configurations have been evaluated by randomly selecting the number of hidden units from a set of candidate sizes and randomly sampling 20 learning rates uniformly in the log-domain within a fixed interval. Two fully connected hidden layers with 1024 hidden units (same as P1NN) per patch are used, and 2048 hidden units (same as P2NN) for the last hidden layer, with twenty training epochs. For this network the best results are obtained with a learning rate of 0.05. [Footnote 7: The source code of the structured MLP is available at the github repository: https://github.com/caglar/structured_mlp]
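Sampling learning rates uniformly in the log-domain means drawing the logarithm of the rate uniformly and exponentiating. The sketch below shows one way such a search could be set up; the candidate hidden-unit sizes, the learning-rate interval, and the pairing of every size with 20 sampled rates are placeholder assumptions chosen so that the total comes to 120 configurations, not values taken from the paper.

```python
import numpy as np


def sample_configs(hidden_choices, lr_low, lr_high, n_lr=20, seed=0):
    """Pair each hidden-layer size with n_lr log-uniformly sampled learning rates."""
    rng = np.random.default_rng(seed)
    configs = []
    for n_hidden in hidden_choices:
        # Uniform in log space, then mapped back with exp.
        log_lrs = rng.uniform(np.log(lr_low), np.log(lr_high), size=n_lr)
        for lr in np.exp(log_lrs):
            configs.append(
                {"n_hidden": int(n_hidden), "learning_rate": float(lr)}
            )
    return configs


# Placeholder search space (hypothetical values for illustration only).
hidden_choices = [64, 128, 256, 512, 1024, 2048]
configs = sample_configs(hidden_choices, lr_low=1e-3, lr_high=1.0)
```

Log-uniform sampling spreads the trials evenly across orders of magnitude, which is the usual motivation for searching learning rates this way.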
We chose to experiment with various SMLP-nohints architectures and optimization procedures, trying unsuccessfully to achieve as good results with SMLP-nohints as with SMLP-hints.
Rectifier Non-Linearity
A rectifier nonlinearity is used for the activations of MLP hidden layers. We observed that using a piecewise linear activation function such as the rectifier can make the optimization more tractable.
Intermediate Layer
The output of the P1NN is considered as an intermediate layer of the SMLP. For the SMLP-hints, only softmax output activations have been tried at the intermediate layer, and that sufficed to learn the task. Since results were not nearly as good with the SMLP-nohints, several different activation functions have been tried: softmax, tanh, sigmoid and linear.
Standardization Layer
Normalization at the last layer of convolutional neural networks has been used occasionally to encourage competition between the hidden units. Jarrett et al. (2009a) used a local contrast normalization layer in their architecture, which performs subtractive and divisive normalization. A local contrast normalization layer enforces a local competition between adjacent features in the feature map and between features at the same spatial location in different feature maps. Similarly, Krizhevsky et al. (2012) observed that a local response normalization layer aids generalization.
Standardization has been observed to be crucial for the SMLP trained both with and without hints. In both SMLP-hints and SMLP-nohints experiments, the neural network was not able to generalize or even learn the training set without standardization in the SMLP intermediate layer, performing only at chance level. More specifically, in the SMLP-nohints architecture, standardization is part of the computational graph, hence the gradients are backpropagated through it. The mean and the standard deviation are computed for each hidden unit separately at the intermediate layer as in Equation 4. But in order to prevent numerical underflows or overflows during backpropagation we have used (Equation 3).
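A minimal sketch of per-unit standardization at the intermediate layer is given below, assuming the common form (activation minus mean) divided by a floored standard deviation. The floor `eps`, its value, and the function name `standardize` are illustrative assumptions that stand in for the numerical safeguard mentioned above; they are not taken from the paper.

```python
import math

def standardize(acts, eps=1e-8):
    """Standardize each hidden unit (column) across the examples (rows).

    acts: list of rows, one row of unit activations per example.
    eps:  assumed floor on the standard deviation (illustrative value).
    """
    n = len(acts)
    n_units = len(acts[0])
    out = [[0.0] * n_units for _ in range(n)]
    for j in range(n_units):
        col = [row[j] for row in acts]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        denom = max(std, eps)  # avoids dividing by a vanishing deviation
        for i in range(n):
            out[i][j] = (acts[i][j] - mean) / denom
    return out

batch = [[0.1, 5.0], [0.3, 5.0], [0.5, 5.0]]  # second unit is constant
z = standardize(batch)
```

The first unit ends up with zero mean and unit variance over the batch, while the constant second unit maps to zeros instead of triggering a division by zero.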
The benefit of having sparse activations may be specifically important for ill-conditioned problems, for the following reasons. When a hidden unit is “off”, its gradient (the derivative of the loss with respect to its output) is usually close to 0 as well, as seen here. That means that all off-diagonal second derivatives involving that hidden unit (e.g. its input weights) are also near 0. This is basically like removing some columns and rows from the Hessian matrix associated with a particular example. It has been observed that the condition number of the Hessian matrix (specifically, its largest eigenvalue) increases as the size of the network increases, making training considerably slower and less efficient (Dauphin and Bengio, 2013). Hence one would expect that as the sparsity of the gradients (obtained because of the sparsity of the activations) increases, training would become more efficient, as if we were training a smaller sub-network for each example, with shared weights across examples, as in dropout (Hinton et al., 2012).
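The argument can be illustrated on a toy network with rectifier units, where an “off” unit passes exactly zero gradient to its input weights. The network, the squared-error loss and the name `relu_layer_grad` are hypothetical choices made for this sketch only.

```python
def relu_layer_grad(W, v, x, target):
    """Gradient of 0.5 * (y - target)**2 w.r.t. W, for y = v . relu(W x)."""
    pre = [sum(w_jk * x_k for w_jk, x_k in zip(row, x)) for row in W]
    hidden = [max(0.0, a) for a in pre]
    y = sum(v_j * h_j for v_j, h_j in zip(v, hidden))
    err = y - target
    # An "off" unit (pre-activation <= 0) passes no gradient to its weights.
    grad = [[err * v_j * (1.0 if a > 0 else 0.0) * x_k for x_k in x]
            for v_j, a in zip(v, pre)]
    return hidden, grad

W = [[1.0, -1.0], [-2.0, 0.5], [0.5, 0.5]]  # one row of input weights per unit
v = [1.0, 1.0, 1.0]
x = [1.0, 2.0]
hidden, grad = relu_layer_grad(W, v, x, target=0.0)
off_units = [j for j, h in enumerate(hidden) if h == 0.0]
```

For this example, the rows of `grad` belonging to the units in `off_units` are all zero, so only the weights of the active unit are updated, as if a smaller sub-network were being trained.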
Figure 9 shows the activation of each hidden unit as a bar chart: the effect of standardization is significant, making the activations sparser.
Figure 10 shows the activation histogram of the SMLP-nohints intermediate layer, i.e., the distribution of activation values, before and after standardization. Again the sparsifying effect of standardization is very apparent.
In Figures 9 and 10, the intermediate-level activations of SMLP-nohints are shown before and after standardization. These are for the same SMLP-nohints architecture whose results are presented in Table 1. For that same SMLP, the Adadelta (Zeiler, 2012) adaptive learning rate scheme has been used, with 512 hidden units for the hidden layer of P1NN and the rectifier activation function. For the output of the P1NN, 11 sigmoidal units have been used, while P2NN had 1200 hidden units with the rectifier activation function. The output nonlinearity of the P2NN is a sigmoid and the training objective is the binary cross-entropy.
Adaptive Learning Rates
We have experimented with several different adaptive learning rate algorithms. We tried rmsprop (footnote 8), Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2010) and a linearly (1/t) decaying learning rate (Bengio, 2013b). For the SMLP-nohints with sigmoid activation function we found that Adadelta (Zeiler, 2012) converged faster to an effective local minimum and usually yielded better generalization error than the others.
(Footnote 8: rmsprop is a learning rate scaling method discussed by G. Hinton in his Video Lecture 6.5, “rmsprop: Divide the gradient by a running average of its recent magnitude”, COURSERA: Neural Networks for Machine Learning, 2012.)
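For reference, the Adadelta update of Zeiler (2012) can be sketched in a few lines of plain Python. The decay `rho=0.95` and constant `eps=1e-6` are illustrative defaults and not necessarily the settings used in these experiments; the quadratic toy objective and the function names are likewise assumptions of this sketch.

```python
import math

def adadelta(grad_fn, theta, steps, rho=0.95, eps=1e-6):
    """Run Adadelta updates on a list of scalar parameters."""
    theta = list(theta)                 # do not modify the caller's list
    avg_sq_grad = [0.0] * len(theta)    # running average of g^2
    avg_sq_step = [0.0] * len(theta)    # running average of (delta theta)^2
    for _ in range(steps):
        grads = grad_fn(theta)
        for i, g in enumerate(grads):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g * g
            # Per-parameter step: ratio of the two RMS values times -g.
            step = (-math.sqrt(avg_sq_step[i] + eps)
                    / math.sqrt(avg_sq_grad[i] + eps) * g)
            avg_sq_step[i] = rho * avg_sq_step[i] + (1 - rho) * step * step
            theta[i] += step
    return theta

def quad_grad(theta):
    """Gradient of the toy objective sum(t**2 for t in theta)."""
    return [2.0 * t for t in theta]

start = [1.0, -2.0]
end = adadelta(quad_grad, start, steps=50)
```

No global learning rate appears in the update: the step size of each parameter is set by the ratio of the running RMS of past steps to the running RMS of past gradients.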
3.2.4 Deep and Structured MLP with Unsupervised Pre-Training
Several experiments have been conducted using an architecture similar to the SMLP-nohints, but with unsupervised pre-training of P1NN, using Denoising Auto-Encoders (DAE) and/or Contractive Auto-Encoders (CAE). Supervised fine-tuning proceeds as in the deep and structured MLP without hints. Because an unsupervised learner may not focus the representation just on the shapes, a larger number of intermediate-level units at the output of P1NN has been explored: previous work on unsupervised pre-training generally found that larger hidden layers were optimal, because not all unsupervised features will be relevant to the task at hand. Instead of limiting to 11 units per patch, we experimented with networks with up to 20 hidden (i.e., code) units per patch in the second-layer patch-wise auto-encoder.
In Appendix A.1 we also provide the results of some experiments with binary-binary RBMs trained on 8×8 patches from the 40k training dataset.
In the unsupervised pre-training experiments in this paper, both a contractive auto-encoder (CAE) with sigmoid nonlinearity and binary cross-entropy cost function and a denoising auto-encoder (DAE) have been used. In the second layer, experiments were performed with a DAE with rectifier hidden units, using L1 sparsity and weight decay on the weights of the auto-encoder. A greedy layer-wise unsupervised training procedure is used to train the deep auto-encoder architecture (Bengio et al., 2007). In the unsupervised pre-training experiments, tied weights have been used. Different combinations of CAE and DAE for unsupervised pre-training have been tested, but none of the configurations tested managed to learn the Pentomino task, as shown in Table 1.
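To make the notion of tied weights concrete, the following sketch computes the reconstruction loss of a small tied-weight denoising auto-encoder on a single binary input. The masking noise, the sigmoid units, the layer sizes and all names (`dae_loss`, `corruption`) are assumptions of this illustration rather than the configuration used in the experiments.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dae_loss(x, W, b_hid, b_vis, corruption, rng):
    """Reconstruction loss of a tied-weight denoising auto-encoder."""
    # Masking noise: zero each input with probability `corruption`
    # (an assumed noise type).
    x_tilde = [0.0 if rng.random() < corruption else v for v in x]
    # The encoder uses W; the decoder reuses its transpose (tied weights).
    h = [sigmoid(sum(w * v for w, v in zip(row, x_tilde)) + b)
         for row, b in zip(W, b_hid)]
    x_hat = [sigmoid(sum(W[j][k] * h[j] for j in range(len(W))) + b_vis[k])
             for k in range(len(x))]
    # Binary cross-entropy against the clean input, not the corrupted one.
    loss = -sum(v * math.log(r) + (1 - v) * math.log(1 - r)
                for v, r in zip(x, x_hat))
    return loss, x_hat

rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0]
W = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(3)]  # 3 code units
loss, x_hat = dae_loss(x, W, [0.0] * 3, [0.0] * 4, 0.25, rng)
```

With tied weights the decoder adds no weight matrix of its own, so the only parameters are `W` and the two bias vectors.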
Algorithm | 20k Training Error (%) | 20k Test Error (%) | 40k Training Error (%) | 40k Test Error (%) | 80k Training Error (%) | 80k Test Error (%)
SVM RBF | 26.2 | 50.2 | 28.2 | 50.2 | 30.2 | 49.6
K Nearest Neighbors | 24.7 | 50.0 | 25.3 | 49.5 | 25.6 | 49.0
Decision Tree | 5.8 | 48.6 | 6.3 | 49.4 | 6.9 | 49.9
Randomized Trees | 3.2 | 49.8 | 3.4 | 50.5 | 3.5 | 49.1
MLP | 26.5 | 49.3 | 33.2 | 49.9 | 27.2 | 50.1
Convnet/Lenet5 | 50.6 | 49.8 | 49.4 | 49.8 | 50.2 | 49.8
Maxout Convnet | 14.5 | 49.5 | 0.0 | 50.1 | 0.0 | 44.6
2 layer sDA | 49.4 | 50.3 | 50.2 | 50.3 | 49.7 | 50.3
Struct. Supervised MLP w/o hints | 0.0 | 48.6 | 0.0 | 36.0 | 0.0 | 12.4
Struct. MLP+CAE, Supervised Finetuning | 50.5 | 49.7 | 49.8 | 49.7 | 50.3 | 49.7
Struct. MLP+CAE+DAE, Supervised Finetuning | 49.1 | 49.7 | 49.4 | 49.7 | 50.1 | 49.7
Struct. MLP+DAE+DAE, Supervised Finetuning | 49.5 | 50.3 | 49.7 | 49.8 | 50.3 | 49.7
Struct. MLP with Hints | 0.21 | 30.7 | 0 | 3.1 | 0 | 0.01
3.3 Experiments with 1-of-K representation
To explore the effect of changing the complexity of the input representation on the difficulty of the task, a set of experiments has been designed with symbolic representations of the information in each patch. In all cases an empty patch is represented with a 0 vector. These representations can be seen as an alternative input for a P2NN-like network, i.e., they were fed as input to an MLP or another black-box classifier.
The following four experiments have been conducted, each one using a different input representation for each patch:
 Experiment 1. One-hot representation without transformations:
In this experiment several trials have been done with a 10-input one-hot vector per patch. Each input corresponds to an object category given in clear, i.e., the ideal input for P2NN if a supervised P1NN perfectly did its job.
 Experiment 2. Disentangled representations:
In this experiment, we did trials with 16 binary inputs per patch: 10 one-hot bits for representing each object category, 4 for rotations and 2 for scaling, i.e., the whole information about the input is given, but it is perfectly disentangled. This would be the ideal input for P2NN if an unsupervised P1NN perfectly did its job.
 Experiment 3. One-hot representation with transformations:
For each of the ten object types there are 8 = 4 × 2 possible transformations. Two objects in two different patches are considered “the same” (for the final task) if their category is the same regardless of the transformations. The one-hot representation of a patch corresponds to the cross-product between the 10 object shape classes and the 4 × 2 transformations, i.e., one out of 80 = 10 × 8 possibilities represented in an 80-bit one-hot vector. This also contains all the information about the input image patch, but spread out in a kind of non-parametric and non-informative (not disentangled) way, like a perfect memory-based unsupervised learner (like clustering) could produce. Nevertheless, the shape class would be easier to read out from this representation than from the image representation (it would be an OR over 8 of the bits).
 Experiment 4. One-hot representation with 80 choices:
This representation has the same 1-of-80 one-hot representation per patch but the target task is defined differently. Two objects in two different patches are considered the same iff they have exactly the same 80-bit one-hot representation (i.e., are of the same object category with the same transformation applied).
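The per-patch encodings of the four experiments can be sketched as below. The bit ordering inside each vector and the function names are assumptions made for illustration; only the vector sizes (10, 16 and 80 bits) and the two notions of “same” come from the descriptions above.

```python
N_SHAPES = 10                            # object categories
N_ROTATIONS = 4
N_SCALES = 2
N_TRANSFORMS = N_ROTATIONS * N_SCALES    # 8 transformations per shape

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def encode_exp1(shape):
    """Experiment 1: 10-bit one-hot shape code."""
    return one_hot(shape, N_SHAPES)

def encode_exp2(shape, rotation, scale):
    """Experiment 2: disentangled code with 10 + 4 + 2 = 16 bits."""
    return (one_hot(shape, N_SHAPES) + one_hot(rotation, N_ROTATIONS)
            + one_hot(scale, N_SCALES))

def encode_exp3(shape, rotation, scale):
    """Experiments 3 and 4: one-hot over 10 x 8 = 80 (shape, transform) pairs."""
    transform = rotation * N_SCALES + scale
    return one_hot(shape * N_TRANSFORMS + transform, N_SHAPES * N_TRANSFORMS)

def shape_from_80(code):
    """Read the shape class back as an OR over the 8 bits of each shape."""
    return [int(any(code[s * N_TRANSFORMS:(s + 1) * N_TRANSFORMS]))
            for s in range(N_SHAPES)]

a = encode_exp3(3, 2, 1)
b = encode_exp3(3, 0, 0)
same_in_exp3 = shape_from_80(a) == shape_from_80(b)   # same shape class
same_in_exp4 = a == b                                  # identical 80-bit code
```

Here `a` and `b` show the same shape under different transformations, so they count as the same object under the criterion of Experiment 3 but not under that of Experiment 4. An empty patch would simply be the all-zero vector of the corresponding length.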
The first experiment is a sanity check. It was conducted with single-hidden-layer MLPs with rectifier and tanh nonlinearities, and the task was learned perfectly (0 error on both the training and test datasets) with very few training epochs.
The results of Experiment 2 are given in Table 2. To improve results, we experimented with the Maxout nonlinearity in a feedforward MLP (Goodfellow et al., 2013) with two hidden layers. Unlike the typical Maxout network described in the original paper, regularizers have been deliberately avoided in order to focus on the optimization issue, i.e., no weight decay, norm constraint on the weights, or dropout. Although learning from a disentangled representation is more difficult than learning from perfect object detectors, it is feasible with some architectures such as the Maxout network. Note that this is the kind of representation that one could hope an unsupervised learning algorithm could discover, at best, as argued in Bengio et al. (2012).
The only results obtained on the validation set for Experiment 3 and Experiment 4 are shown respectively in Table 3 and Table 4. In these experiments a tanh MLP with two hidden layers has been tested with the same hyperparameters. In Experiment 3 the complexity of the problem comes from the transformations (8 = 4 × 2) and the number of object types.
But in Experiment 4, the only source of complexity of the task is the number of different object types. These results are in between the complete failure and complete success observed with other experiments, suggesting that the task could become solvable with better training or more training examples. Figure 11 illustrates the progress of training a tanh MLP, on both the training and test error, for Experiments 3 and 4. Clearly, something has been learned, but the task is not nailed yet. On Experiment 3, for both maxout and tanh there was a long plateau where the training error and objective stayed almost the same. Maxout performed at chance level on this experiment for about 120 iterations on both the training and the test set. But after the 120th iteration the training and test errors started to decline, and eventually it was able to solve the task. Moreover, as seen from the curves in the two panels of Figure 11, the training and test error curves are almost the same for both tasks. This implies that for one-hot inputs, whether you increase the number of possible transformations for each object or the number of object categories, as long as the number of possible configurations is the same, the complexity of the problem is almost the same for the MLP.
3.4 Does the Effect Persist with Larger Training Set Sizes?
The results shown in this section indicate that the problem in the Pentomino task clearly is not just a regularization problem, but rather hinges on an optimization problem. Otherwise, we would expect test error to decrease as the number of training examples increases. This is shown first by studying the online case and then by studying the ordinary training case with a fixed-size training set but considering increasing training set sizes. In the online minibatch setting, parameter updates are performed as follows:
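$$\theta_{t+1} = \theta_t - \Delta\theta_t \qquad (5)$$
$$\Delta\theta_t = \epsilon\,\frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta_t} L(x_i, \theta_t) \qquad (6)$$
where $L(x_i, \theta_t)$ is the loss incurred on example $x_i$ with parameters $\theta_t$, where $N$ is the minibatch size, $t$ indexes the updates and $\epsilon$ is the learning rate.
Ordinary batch algorithms converge linearly to the optimum $\theta^{*}$, whereas the noisy gradient estimates in online SGD cause the parameters $\theta_t$ to fluctuate near the local optimum. However, online SGD directly optimizes the expected risk, because the examples are drawn iid from the ground-truth distribution (Bottou, 2010). Thus:
$$\mathbb{E}_{x}\left[\nabla_{\theta_t} L(x, \theta_t)\right] = \nabla_{\theta_t}\,\mathbb{E}_{x}\left[L(x, \theta_t)\right] \qquad (7)$$
where $\mathbb{E}_{x}[L(x, \theta_t)]$ is the generalization error. Therefore online SGD is trying to minimize the expected risk with noisy updates. Those noisy updates have the effect of a regularizer:
$$\theta_{t+1} = \theta_t - \epsilon\left(\nabla_{\theta_t}\,\mathbb{E}_{x}\left[L(x, \theta_t)\right] + \xi_t\right) \qquad (8)$$
where $\nabla_{\theta_t}\,\mathbb{E}_{x}[L(x, \theta_t)]$ is the true gradient and $\xi_t$ is the zero-mean stochastic gradient “noise” due to computing the gradient over a finite-size minibatch sample.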
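The decomposition of a minibatch gradient into a true gradient plus zero-mean noise can be checked numerically on a toy problem. In this sketch a finite set of 1000 points stands in for the data distribution and the loss is a simple squared error; both choices, and all names, are assumptions made for illustration only.

```python
import random

def grad(theta, batch):
    """Gradient of the mean of 0.5 * (theta - x)**2 over the batch."""
    return sum(theta - x for x in batch) / len(batch)

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]   # toy "population"
theta = 0.0
true_grad = grad(theta, data)

batch_size = 100
rng.shuffle(data)
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
# Noise of each minibatch gradient relative to the true gradient.
noise = [grad(theta, b) - true_grad for b in batches]
mean_noise = sum(noise) / len(noise)
```

Each entry of `noise` is non-zero, but their average over the minibatches vanishes up to floating-point error, which is the sense in which the noise is zero-mean.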
We would like to know if the problem with the Pentomino dataset is more a regularization or an optimization problem. An SMLP-nohints model was trained by online SGD with the randomly generated online Pentomino stream. The learning rate was adaptive, with the Adadelta procedure (Zeiler, 2012) on minibatches of 100 examples. In the online SGD experiments, two SMLP-nohints models, trained with and without standardization at the intermediate layer and with exactly the same hyperparameters, are tested. The SMLP-nohints P1NN patch-wise submodel has 2048 hidden units and the SMLP intermediate layer has hidden units. The nonlinearity that is used for the intermediate layer is the sigmoid. P2NN has 2048 hidden units.
SMLP-nohints has been trained either with or without standardization on top of the output units of the P1NN. The experiments illustrated in Figures 12 and 13 are with the same SMLP without hints architecture for which results are given in Table 1. In those graphs only the results for training on the randomly generated 545400 Pentomino samples are presented. As shown in the plots, SMLP-nohints was not able to generalize without standardization. Although without standardization the training loss seems to decrease initially, it eventually gets stuck in a plateau where the training loss does not change much.
Training of SMLP-nohints with online minibatch SGD is performed using standardization in the intermediate layer and Adadelta learning rate adaptation, on 1046000 training examples from the randomly generated Pentomino stream. At the end of training, test error is down to 27.5%, which is much better than chance but far from the near-0 error obtained with SMLP-hints.
In another SMLP-nohints experiment without standardization, the model is trained on 1580000 Pentomino examples using online minibatch SGD. P1NN has 2048 hidden units in its hidden layer and 16 sigmoidal outputs per patch. P2NN has 1024 hidden units in its hidden layer. Adadelta is used to adapt the learning rate. At the end of training this SMLP, the test error remained stuck at 50.1%.
3.4.1 Experiments with Increased Training Set Size
Here we consider the effect of training different learners with different numbers of training examples. For the experimental results shown in Table 1, three training set sizes (20k, 40k and 80k examples) were used. Each dataset was generated with different random seeds (so they do not overlap). Figure 14 also shows the error bars for an ordinary MLP with three hidden layers, for a larger range of training set sizes, between 40k and 320k examples. The number of training epochs is 8 (more did not help), and there are three hidden layers with 2048 feature detectors. The learning rate we used in our experiments is 0.01. The activation function of the MLP is a tanh nonlinearity, while the L1 and L2 penalty coefficients are both 1e-6.
Table 1 shows that, without guiding hints, none of the state-of-the-art learning algorithms could perform noticeably better than a random predictor on the test set. This shows the importance of the intermediate hints introduced in the SMLP. The decision trees and SVMs can overfit the training set but they could not generalize on the test set. Note that the numbers reported in the table are for hyperparameters selected based on validation set error, hence lower training errors are possible if avoiding all regularization and taking large enough models. On the training set, the MLP with two large hidden layers (several thousand units) could reach nearly 0% training error, but still did not manage to achieve good test error.
In the experimental results shown in Figure 14, we evaluate the impact of adding more training data for the fully-connected MLP. As mentioned before, for these experiments we have used an MLP with three hidden layers where each layer has 2048 hidden units. The tanh activation function is used with a 0.05 learning rate and minibatches of size 200.
As can be seen from the figure, adding more training examples did not help either training or test error (both are near 50%, with training error slightly lower and test error slightly higher), reinforcing the hypothesis that the difficulty encountered is one of optimization, not of regularization.
3.5 Experiments on Effect of Initializing with Hints
Initialization of the parameters in a neural network can have a big impact on learning and generalization (Glorot and Bengio, 2010). Previously, Erhan et al. (2010) showed that initializing the parameters of a neural network with unsupervised pre-training guides the learning towards basins of attraction of local minima that provide better generalization from the training dataset. In this section we analyze the effect of initializing the SMLP with hints and then continuing without hints for the rest of the training. For the experimental analysis of hint-based initialization, SMLP is trained for 1 training epoch using the hints and then for 60 epochs without hints on the 40k-example training set. We also compared the same architecture, with the same hyperparameters, against SMLP-nohints trained for 61 iterations on the same dataset. After one iteration of hint-based training SMLP obtained 9% training error and 39% test error. Following the hint-based training, SMLP is trained without hints for 60 epochs, but at epoch 18 it had already reached 0% training and 0% test error. The hyperparameters for this experiment are the same as those of the experiment whose results are shown for the SMLP-hints in Table 1. The test results for initialization with and without hints are shown in Figure 15. This figure suggests that initializing with hints can give the same generalization performance but training takes longer.
3.5.1 Further Experiments on Optimization for Pentomino Dataset
With extensive hyperparameter optimization and using standardization in the intermediate level of the SMLP with softmax nonlinearity, SMLP-nohints was able to get 5.3% training and 6.7% test error on the 80k Pentomino training dataset. We used 2050 hidden units for the hidden layer of P1NN and 11 softmax outputs per patch. For the P2NN, we used 1024 hidden units with sigmoid and learning rate 0.1, without using any adaptive learning rate method. This SMLP uses a rectifier nonlinearity for the hidden layers of both P1NN and P2NN. Considering that this architecture uses softmax as the intermediate activation function of SMLP-nohints, it is very likely that P1NN is trying to learn the presence of a specific Pentomino shape in a given patch. This architecture has a very large capacity in the P1NN, which probably provides it with enough capacity to learn the presence of Pentomino shapes at each patch effortlessly.
An MLP with 2 hidden layers, each with 1024 rectifier units, was trained using L-BFGS (the implementation from the scipy.optimize library) on 40k training examples, with gradients computed on batches of 10000 examples at each iteration. However, after convergence of training, the MLP was still performing at chance level on the test dataset.
We also observed that using linear units for the intermediate layer yields better generalization error without standardization compared to using activation functions such as sigmoid, tanh and ReLU for the intermediate layer. SMLP-nohints was able to get 25% generalization error with linear units without standardization, whereas all the other activation functions that have been tested failed to generalize with the same number of training iterations without standardization and hints. This suggests that using nonlinear intermediate-level activation functions without standardization introduces an optimization difficulty for the SMLP-nohints, maybe because the intermediate level acts like a bottleneck in this architecture.
4 Conclusion and Discussion
In this paper we have shown an example of a task which seems almost impossible to solve by standard black-box machine learning algorithms, but can be almost perfectly solved when one encourages a semantics for the intermediate-level representation that is guided by prior knowledge. The task has the particularity that it is defined by the composition of two nonlinear sub-tasks (object detection on one hand, and a nonlinear logical operation similar to XOR on the other hand).
What is interesting is that in the case of the neural network, we can compare two networks with exactly the same architecture but a different pre-training, one of which uses the known intermediate concepts to teach an intermediate representation to the network. With enough capacity and training time they can overfit, but they do not capture the essence of the task, as seen by test set performance.
We know that a structured deep network can learn the task, if it is initialized in the right place, and do it from very few training examples. Furthermore we have shown that if one pre-trains SMLP with hints for only one epoch, it can nail the task. But exactly the same architecture, started from random initialization, failed to generalize.
Consider the fact that even SMLP-nohints with standardization, after being trained using online SGD on 1046000 generated examples, still gets 27.5% test error. This is an indication that the problem is not a regularization problem but possibly an inability to find a good effective local minimum of generalization error.
What we hypothesize is that for most initializations and architectures (in particular the fully-connected ones), although it is possible to find a good effective local minimum of training error when enough capacity is provided, it is difficult (without the proper initialization) to find a good local minimum of generalization error. On the other hand, when the network architecture is constrained enough but still allows it to represent a good solution (such as the structured MLP of our experiments), it seems that the optimization problem can still be difficult and even training error remains stuck high if standardization is not used. Standardization evidently makes the training objective of the SMLP easier to optimize and helps it to find at least a better effective local minimum of training error. This finding suggests that by using specific architectural constraints and sometimes domain-specific knowledge about the problem, one can alleviate the optimization difficulty that generic neural network architectures face.
It could be that the combination of the network architecture and training procedure produces training dynamics that tend to lead to minima that are poor from the point of view of generalization error, even when they manage to nail training error by providing enough capacity. Of course, as the number of examples increases, we would expect this discrepancy to decrease, but then the optimization problem could still make the task infeasible in practice. Note however that our preliminary experiments with increasing the training set size (8-fold) for MLPs did not reveal signs of potential improvements in test error yet, as shown in Figure 14. Even using online training on 545400 Pentomino examples, the SMLP-nohints architecture was still far from perfect in terms of generalization error (Figure 12).
These findings bring supporting evidence to the “Guided Learning Hypothesis” and “Deeper Harder Hypothesis” from Bengio (2013a): higher-level abstractions, which are expressed by composing simpler concepts, are more difficult to learn (with the learner often getting stuck in an effective local minimum), but that difficulty can be overcome if another agent provides hints of the importance of learning other, intermediate-level abstractions which are relevant to the task.
Many interesting questions remain open. Would a network without any guiding hint eventually find the solution with enough training time and/or with alternate parametrizations? To what extent is ill-conditioning a core issue? The results with L-BFGS were disappointing, but changes in the architecture (such as standardization of the intermediate level) seem to make training much easier. Clearly, one can reach good solutions from an appropriate initialization, pointing in the direction of an issue with local minima, but it may be that good solutions are also reachable from other initializations, albeit going through a tortuous ill-conditioned path in parameter space. Why did our attempts at learning the intermediate concepts in an unsupervised way fail? Are these results specific to the task we are testing or a limitation of the unsupervised feature learning algorithms tested? Trying many more unsupervised variants and exploring explanatory hypotheses for the observed failures could help us answer that. Finally, and most ambitiously, can we solve these kinds of problems if we allow a community of learners to collaborate and collectively discover and combine partial solutions in order to obtain solutions to more abstract tasks like the one presented here? Indeed, we would like to discover learning algorithms that can solve such tasks without the use of prior knowledge as specific and strong as the one used in the SMLP here. These experiments could be inspired by and inform us about potential mechanisms for collective learning through cultural evolution in human societies.
Acknowledgments
We would like to thank the ICLR 2013 reviewers for their insightful comments, and NSERC, CIFAR, Compute Canada and Canada Research Chairs for funding.
References
 Ben-Hur and Weston (2010) A. Ben-Hur and J. Weston. A user’s guide to support vector machines. Methods in Molecular Biology, 609:223–239, 2010.
 Bengio et al. (2007) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS’2006, 2007.
 Bengio (2009) Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book. Now Publishers, 2009.
 Bengio (2013a) Yoshua Bengio. Evolving culture vs local minima. In T. Kowaliw, N. Bredeche, and R. Doursat, editors, Growing Adaptive Machines: Integrating Development and Learning in Artificial Neural Networks. Springer-Verlag, March 2013a. Also as arXiv:1203.2990v1. URL http://arxiv.org/abs/1203.2990.
 Bengio (2013b) Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade. Springer, 2013b.
 Bengio et al. (2009a) Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Léon Bottou and Michael Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML’09). ACM, 2009a.
 Bengio et al. (2009b) Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML’09, 2009b.
 Bengio et al. (2012) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. Technical Report arXiv:1206.5538, U. Montreal, 2012. URL http://arxiv.org/abs/1206.5538.
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 2013.
 Bergstra et al. (2010) James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
 Bottou (2010) Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 Breiman (2001) Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
 Ciresan et al. (2010) D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22:1–14, 2010.
 Dauphin and Bengio (2013) Yann Dauphin and Yoshua Bengio. Big neural networks waste capacity. Technical Report arXiv:1301.3583, Universite de Montreal, 2013.
 Dawkins (1976) Richard Dawkins. The Selfish Gene. Oxford University Press, 1976.
 Duchi et al. (2010) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2010.
 Erhan et al. (2010) Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, February 2010.
 Fleuret et al. (2011) François Fleuret, Ting Li, Charles Dubout, Emma K Wampler, Steven Yantis, and Donald Geman. Comparing machines and humans on a visual categorization test. Proceedings of the National Academy of Sciences, 108(43):17621–17625, 2011.
 Glorot et al. (2011a) X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011a.
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 249–256, May 2010.
 Glorot et al. (2011b) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), April 2011b.
 Goodfellow et al. (2013) Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In ICML, 2013.
 Henneberg (1988) Maciej Henneberg. Decrease of human skull size in the holocene. Human biology, pages 395–405, 1988.
 Henneberg and Steyn (1993) Maciej Henneberg and Maryna Steyn. Trends in cranial capacity and cranial index in sub-Saharan Africa during the Holocene. American Journal of Human Biology, 5(4):473–479, 1993.
 Henrich and McElreath (2003) J. Henrich and R. McElreath. The evolution of cultural evolution. Evolutionary Anthropology: Issues, News, and Reviews, 12(3):123–135, 2003.
 Hinton et al. (2006) Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
 Hinton et al. (2012) Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012.
 Hsu et al. (2003) C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al. A practical guide to support vector classification, 2003.
 Jarrett et al. (2009a) Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153. IEEE, 2009a.
 Jarrett et al. (2009b) Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In ICCV’09, 2009b.
 Khan et al. (2011) Faisal Khan, Xiaojin Zhu, and Bilge Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24 (NIPS’11), pages 1449–1457, 2011.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS’2012). 2012.
 Krueger and Dayan (2009) Kai A. Krueger and Peter Dayan. Flexible shaping: how learning in small steps helps. Cognition, 110:380–394, 2009.
 Kunapuli et al. (2010) G. Kunapuli, K.P. Bennett, R. Maclin, and J.W. Shavlik. The adviceptron: Giving advice to the perceptron. Proceedings of the Conference on Artificial Neural Networks In Engineering (ANNIE 2010), 2010.
 Larochelle et al. (2009) Hugo Larochelle, Yoshua Bengio, Jerome Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Mitchell (1980) T.M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ., 1980.
 Mitchell and Thrun (1993) T.M. Mitchell and S.B. Thrun. Explanation-based neural network learning for robot control. Advances in Neural information processing systems, pages 287–287, 1993.
 Montague (1970) R. Montague. Universal grammar. Theoria, 36(3):373–398, 1970.
 Nair and Hinton (2010) V. Nair and G. E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML’10, 2010.
 Olshen and Stone (1984) L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and regression trees. Belmont, Calif.: Wadsworth, 1984.
 O’Sullivan (1996) Joseph O’Sullivan. Integrating initialization bias and search bias in neural network learning, 1996.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830, 2011.
 Peterson (2004) Gail B. Peterson. A day of great illumination: B. F. Skinner’s discovery of shaping. Journal of the Experimental Analysis of Behavior, 82(3):317–328, 2004.
 Raiko et al. (2012) Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932, 2012.
 Rifai et al. (2011) Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML’2011, 2011.
 Rifai et al. (2012) Salah Rifai, Yoshua Bengio, Yann Dauphin, and Pascal Vincent. A generative process for sampling contractive auto-encoders. In Proceedings of the Twenty-nine International Conference on Machine Learning (ICML’12). ACM, 2012. URL http://icml.cc/discuss/2012/590.html.
 Salakhutdinov and Hinton (2009) R. Salakhutdinov and G.E. Hinton. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), volume 8, 2009.
 Skinner (1958) Burrhus F. Skinner. Reinforcement today. American Psychologist, 13:94–99, 1958.
 Solomonoff (1989) R.J. Solomonoff. A system for incremental learning based on algorithmic probability. In Proceedings of the Sixth Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, pages 515–527. Citeseer, 1989.
 Towell and Shavlik (1994) G.G. Towell and J.W. Shavlik. Knowledge-based artificial neural networks. Artificial intelligence, 70(1):119–165, 1994.
 Vatanen et al. (2013) Tommi Vatanen, Tapani Raiko, Harri Valpola, and Yann LeCun. Pushing stochastic gradient towards second-order methods – backpropagation learning with transformations in nonlinearities. arXiv preprint arXiv:1301.3476, 2013.
 Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, December 2010.
 Von Ahn et al. (2003) Luis Von Ahn, Manuel Blum, Nicholas J Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. In Advances in Cryptology – EUROCRYPT 2003, pages 294–311. Springer, 2003.
 Weston et al. (2008) Jason Weston, Frédéric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), pages 1168–1175, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390303.
 Zeiler (2012) Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Appendix
a.1 Binary-Binary RBMs on Pentomino Dataset
We trained binary-binary RBMs (both visible and hidden units are binary) on 8×8 patches extracted from the Pentomino dataset using PCD (stochastic maximum likelihood), a weight decay of 0.0001 and a sparsity penalty (implemented as TorontoSparsity in pylearn2; see the yaml file in the repository for more details). We used 256 hidden units and trained by SGD with a batch size of 32 and an annealing learning rate (Bengio, 2013b) starting from 1e-3 with annealing rate 1.000015. The RBM is trained with momentum starting from 0.5. The biases are initialized to -2 in order to get a sparse representation. The RBM is trained for 120 epochs (approximately 50 million updates).
After pretraining the RBM, its parameters are used to initialize the first layer of an SMLP-nohints network. As in the usual architecture of the SMLP-nohints, there is an intermediate layer on top of P1NN. Both P1NN and the intermediate layer have a sigmoid nonlinearity, and the intermediate layer has 11 units per location. This SMLP-nohints is trained with Adadelta and standardization at the intermediate layer (in our autoencoder experiments we fed features directly to P2NN, without standardization and Adadelta).
a.2 Experimental Setup and Hyperparameters
a.2.1 Decision Trees
We used the decision tree implementation in the scikit-learn (Pedregosa et al., 2011) Python package, which is an implementation of the CART (Classification and Regression Trees) algorithm. The CART algorithm constructs the decision tree recursively and partitions the input space such that samples belonging to the same category are grouped together (Olshen and Stone, 1984). We used the Gini index as the impurity criterion and evaluated the hyperparameter configurations with a grid search. We cross-validated the maximum depth () of the tree (to prevent the algorithm from severely overfitting the training set) and the minimum number of samples required to create a split (). 20 different configurations of hyperparameter values were evaluated. We obtained the best validation error with and .
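To make the impurity criterion concrete, the Gini index of a node and of a candidate split can be computed directly from class labels. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the function names `gini_impurity` and `split_impurity` are chosen here purely for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a node: 1 - sum_c p_c^2, with p_c the class frequency."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini index of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n


balanced = gini_impurity([0, 0, 1, 1])        # maximally mixed binary node
pure_split = split_impurity([0, 0], [1, 1])   # split that separates the classes
```

A tree learner of the CART family would evaluate `split_impurity` for many candidate thresholds and keep the split with the smallest value, stopping when constraints such as the maximum depth or the minimum number of samples per split are reached.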
a.2.2 Support Vector Machines
We used the “Support Vector Classifier (SVC)” implementation from the scikit-learn package, which in turn uses libsvm’s Support Vector Machine (SVM) implementation. Kernel-based SVMs are nonparametric models that map the data into a high-dimensional space and separate different classes with hyperplane(s) such that the support vectors for each category are separated by a large margin. We cross-validated three hyperparameters of the model using grid search: C, γ and the type of kernel (). C is the penalty term (weight decay) for the SVM and γ is a hyperparameter that controls the width of the Gaussian for the RBF kernel. For the polynomial kernel, γ controls the flexibility of the classifier (degree of the polynomial) as the number of parameters increases (Hsu et al., 2003; Ben-Hur and Weston, 2010). We evaluated forty-two hyperparameter configurations. These comprise two kernel types: ; three gammas: for the RBF kernel, for the polynomial kernel; and seven C values among: . As a result of the grid search and cross-validation, we obtained the best test error by using the RBF kernel, with and .
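The size of the search grid follows from the counts given above: two kernel types, three gammas and seven penalty values give 2 × 3 × 7 = 42 configurations. The sketch below enumerates such a grid and evaluates a Gaussian (RBF) kernel in plain Python. The numeric grid values are placeholders invented for illustration, not the values used in the experiments, and the same gammas are used for both kernels only to keep the example short.

```python
import itertools
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Placeholder grids (hypothetical values, NOT those of the experiments).
KERNELS = ["rbf", "poly"]
GAMMAS = [1e-3, 1e-2, 1e-1]
C_VALUES = [10.0 ** e for e in range(-3, 4)]   # seven penalty values

# Every (kernel, gamma, C) combination is one configuration to cross-validate.
grid = list(itertools.product(KERNELS, GAMMAS, C_VALUES))
```

A larger gamma makes the kernel value fall off faster with distance, which corresponds to a narrower Gaussian.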
a.2.3 Multi Layer Perceptron
We have our own implementation of the Multi-Layer Perceptron based on the Theano (Bergstra et al., 2010) machine learning library. We selected 2 hidden layers, the rectifier activation function, and 2048 hidden units per layer. We cross-validated three hyperparameters of the model using random search, sampling the learning rates in the log-domain and selecting the L1 and L2 regularization penalty coefficients from sets of fixed values, evaluating 64 hyperparameter configurations. The ranges of the hyperparameter values are , and . As a result, the following were selected: , and .
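Sampling a learning rate in the log-domain means drawing its logarithm uniformly, so that each order of magnitude is equally likely. The sketch below shows one way to generate 64 random configurations in plain Python. The bounds, the candidate penalty values and the key names `l1` and `l2` are assumptions made for illustration; they are not the settings used in the experiments.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_configs(n, rng):
    """Hypothetical search space; bounds and penalty sets are placeholders."""
    penalties = [0.0, 1e-6, 1e-5, 1e-4]
    return [
        {
            "learning_rate": sample_log_uniform(1e-4, 1e-1, rng),
            "l1": rng.choice(penalties),
            "l2": rng.choice(penalties),
        }
        for _ in range(n)
    ]


configs = sample_configs(64, random.Random(0))
```

Each dictionary in `configs` would then be used to train one model, and the configuration with the lowest validation error would be retained.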
a.2.4 Random Forests
We used scikit-learn’s implementation of “Random Forests” decision tree learning. The Random Forests algorithm creates an ensemble of decision trees by randomly selecting for each tree a subset of features and applying bagging to combine the individual decision trees (Breiman, 2001). We used grid search and cross-validated the , , and number of trees (). We performed the grid search on the following hyperparameter values: , , and . We obtained the best validation error with , and .
a.2.5 k-Nearest Neighbors
We used scikit-learn’s implementation of k-Nearest Neighbors (k-NN). k-NN is an instance-based, lazy learning algorithm that selects the training examples closest in Euclidean distance to the input query. It assigns a class label to the test example based on the categories of the closest neighbors. The hyperparameters we evaluated in the cross-validation are the number of neighbors (k) and weights. The weights hyperparameter can be either “uniform” or “distance”. With “uniform”, the value assigned to the query point is computed by a majority vote of the nearest neighbors. With “distance”, the value assigned to the query point is computed by a weighted majority vote, where the weights are the inverse distances between the query point and the neighbors. We have used and for hyperparameter search. As a result of cross-validation and grid search, we obtained the best validation error with and weights=“uniform”.
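The difference between the “uniform” and “distance” voting schemes can be illustrated with a small classifier written in plain Python. This is a sketch for illustration only, not the scikit-learn code; the function name `knn_predict` and the toy data are invented here.

```python
import math
from collections import defaultdict


def knn_predict(train_x, train_y, query, k, weights="uniform"):
    """Classify `query` from its k nearest training points (Euclidean distance)."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        if weights == "uniform":
            votes[label] += 1.0
        elif weights == "distance":
            # Inverse-distance weighting; an exact match dominates the vote.
            votes[label] += 1.0 / d if d > 0 else float("inf")
        else:
            raise ValueError("weights must be 'uniform' or 'distance'")
    return max(votes, key=votes.get)


# Toy 1-D data: one close neighbor of class 1, two farther ones of class 0.
train_x = [(0.0,), (3.0,), (3.5,)]
train_y = [1, 0, 0]
query = (1.0,)

uniform_label = knn_predict(train_x, train_y, query, k=3, weights="uniform")
distance_label = knn_predict(train_x, train_y, query, k=3, weights="distance")
```

On this toy data the two schemes disagree: the plain majority vote favors class 0 (two votes against one), whereas inverse-distance weighting favors class 1 because the single close neighbor outweighs the two distant ones.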
a.2.6 Convolutional Neural Nets
We used a Theano (Bergstra et al., 2010) implementation of Convolutional Neural Networks (CNN) from the deep learning tutorial at deeplearning.net, which is based on a vanilla version of a CNN (LeCun et al., 1998). Our CNN has two convolutional layers, each followed by a max-pooling layer. On top of the convolution-pooling-convolution-pooling layers there is an MLP with one hidden layer. In the cross-validation we sampled 36 learning rates in the log-domain in the range and the number of filters from the range uniformly. For the first convolutional layer we used 9×9 receptive fields in order to guarantee that each object fits inside the receptive field. As a result of random hyperparameter search and manual hyperparameter search on the validation dataset, the following values were selected:

The number of features used for the first layer is 30 and the second layer is 60.

For the second convolutional layer, 7×7 receptive fields were used. The stride for both convolutional layers is 1.

Convolved images are downsampled by a factor of 2×2 at each pooling operation.

The learning rate for CNN is 0.01 and it was trained for 8 epochs.
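The spatial size of the feature maps that reach the MLP follows from the receptive field sizes, the stride and the pooling factor listed above. The sketch below works through that arithmetic. It assumes a 64×64 input image and unpadded (“valid”) convolutions, neither of which is stated in this appendix, so the resulting numbers are illustrative only.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, factor):
    """Output width after non-overlapping pooling (border remainder dropped)."""
    return size // factor


def feature_map_size(input_size):
    """Width of the final feature maps for the conv-pool-conv-pool stack."""
    s = conv_out(input_size, 9)   # first convolutional layer, 9x9 fields
    s = pool_out(s, 2)            # 2x2 max-pooling
    s = conv_out(s, 7)            # second convolutional layer, 7x7 fields
    s = pool_out(s, 2)            # 2x2 max-pooling
    return s


# Assumed 64x64 input: 64 -> 56 -> 28 -> 22 -> 11.
final_size = feature_map_size(64)
# With 60 feature maps in the second layer, the MLP would receive
# 60 * 11 * 11 inputs under these assumptions.
mlp_inputs = 60 * final_size * final_size
```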
a.2.7 Maxout Convolutional Neural Nets
We used the pylearn2 (https://github.com/lisa-lab/pylearn2) implementation of maxout convolutional networks (Goodfellow et al., 2013). There are two convolutional layers in the selected architecture, without any pooling. In the last convolutional layer, there is a maxout nonlinearity. The following were selected by cross-validation: the learning rate, the number of channels for both convolutional layers, the number of kernels for the second layer, the number of units and pieces per maxout unit in the last layer, a linearly decaying learning rate, and momentum starting from 0.5 and saturating at 0.8 at the 200th epoch. Random search for the hyperparameters was used to evaluate 48 different hyperparameter configurations on the validation dataset. For the first convolutional layer, 8×8 kernels were selected to make sure that each Pentomino shape fits into the kernel. Early stopping was used, and the test error of the model with the best validation error is reported. Using a norm constraint on the fan-in of the final softmax units yields slightly better results on the validation dataset.
As a result of cross-validation and manual tuning, we used the following hyperparameters:

16 channels per convolutional layer. 600 hidden units for the maxout layer.

6×6 kernels for the second convolutional layer.

5 pieces per maxout unit for the convolutional layers and 4 pieces per maxout unit for the maxout layer.

We decayed the learning rate by a factor of 0.001, starting from an initial learning rate of 0.026367, and scaled the learning rate of the second convolutional layer by a constant factor of 0.6.

The norm constraint (on the incoming weights of each unit) is 1.9365.
Figure 19 shows the first layer filters of the maxout convolutional net, after being trained on the 80k training set for 85 epochs.
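A norm constraint on the incoming weights of a unit is typically enforced by rescaling the weight vector whenever its norm exceeds the limit after a gradient update. The sketch below shows this projection in plain Python for a single unit. It assumes the constraint is on the L2 norm, which the text does not state explicitly, and the helper name `clip_norm` is invented for illustration; it is not the pylearn2 API.

```python
import math


def clip_norm(weights, max_norm):
    """Rescale a unit's incoming weights so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm or norm == 0.0:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]


MAX_NORM = 1.9365  # constraint value reported in the list above

clipped = clip_norm([3.0, 4.0], MAX_NORM)     # norm 5.0, rescaled to MAX_NORM
unchanged = clip_norm([0.3, 0.4], MAX_NORM)   # norm 0.5, left untouched
```

The rescaling preserves the direction of the weight vector and only shrinks its length, so units whose weights stay within the limit are unaffected.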
a.2.8 Stacked Denoising Auto-Encoders
Denoising Auto-Encoders (DAE) are a form of regularized autoencoder (Bengio et al., 2013). The DAE forces the hidden layer to discover more robust features and prevents it from simply learning the identity by reconstructing the input from a corrupted version of it (Vincent et al., 2010). Two DAEs were stacked, resulting in an unsupervised transformation with two hidden layers of 1024 units each. Parameters of all layers are then fine-tuned with supervised fine-tuning, using logistic regression as the classifier and SGD as the gradient-based optimization algorithm. The stochastic corruption process is binomial (0 or 1 replacing each input value, with probability 0.2). The selected learning rate is for the DAE and for supervised fine-tuning. Both the L1 and L2 penalties for the DAEs and for the logistic regression layer are set to 1e-6.
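The corruption step can be sketched as follows. The description above leaves some room for interpretation, so this example assumes that each input value is independently replaced, with probability 0.2, by a 0 or a 1 chosen at random; the function name `corrupt` is invented for illustration.

```python
import random


def corrupt(x, p, rng):
    """With probability p, replace each input value by a random 0 or 1.

    Assumed reading of the binomial corruption process; other variants
    (for example, masking to 0 only) would change the replacement value.
    """
    return [rng.choice((0, 1)) if rng.random() < p else v for v in x]


rng = random.Random(0)
clean = [0.5] * 1000
noisy = corrupt(clean, 0.2, rng)

# Fraction of inputs that were actually replaced (close to p on average).
replaced_fraction = sum(v != 0.5 for v in noisy) / len(noisy)
```

During DAE training, the encoder would receive `noisy` while the reconstruction loss is computed against `clean`.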
CAE+MLP with Supervised Fine-tuning:
A regularized autoencoder which sometimes outperforms the DAE is the Contractive Auto-Encoder (CAE) (Rifai et al., 2012), which penalizes the Frobenius norm of the Jacobian matrix of derivatives of the hidden units with respect to the CAE’s inputs. The CAE serves as pretraining for an MLP, and in the supervised fine-tuning stage, the Adagrad method was used to automatically tune the learning rate (Duchi et al., 2010).
After training a CAE with 100 sigmoidal units patch-wise, the features extracted on each patch are concatenated and fed as input to an MLP. The selected Jacobian penalty coefficient is 2, the learning rate for pretraining is 0.082 with a batch size of 200, and 200 epochs of unsupervised learning are performed on the training set. For supervised fine-tuning, the learning rate is 0.12 over 100 epochs, the L1 and L2 regularization penalty terms are 1e-4 and 1e-6 respectively, and the top-level MLP has 6400 hidden units.
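For sigmoidal hidden units, the Jacobian penalty has a simple closed form, because the derivative of hidden unit j with respect to input i is h_j (1 - h_j) W_ji. The sketch below computes the squared Frobenius norm this way and checks it against a finite-difference estimate. It is a toy illustration in plain Python with invented weights, not the code used for the experiments.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def encode(W, b, x):
    """Hidden units h = sigmoid(W x + b); W holds one row per hidden unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


def contractive_penalty(W, b, x):
    """Squared Frobenius norm of dh/dx via dh_j/dx_i = h_j (1 - h_j) W_ji."""
    h = encode(W, b, x)
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row)
               for hj, row in zip(h, W))


def numeric_penalty(W, b, x, eps=1e-6):
    """Central finite-difference estimate of the same quantity."""
    total = 0.0
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        h_plus = encode(W, b, x_plus)
        h_minus = encode(W, b, x_minus)
        total += sum(((p - m) / (2 * eps)) ** 2
                     for p, m in zip(h_plus, h_minus))
    return total


# Toy parameters (hypothetical values chosen only for the check).
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.3]
x = [0.2, 0.7]
analytic = contractive_penalty(W, b, x)
numeric = numeric_penalty(W, b, x)
```

In CAE training this penalty, multiplied by the Jacobian penalty coefficient, is added to the reconstruction error for each training example.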
Greedy Layer-wise CAE+DAE Supervised Fine-tuning:
For this experiment we stack a CAE with sigmoid nonlinearities and then a DAE with rectifier nonlinearities during the pretraining phase. As recommended by Glorot et al. (2011b), we used a softplus nonlinearity for reconstruction, softplus(x) = log(1 + exp(x)). We used an L1 penalty on the rectifier outputs to obtain a sparser representation with the rectifier nonlinearity, and L2 regularization to keep the nonzero weights small.
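The softplus function, log(1 + exp(x)), is a smooth approximation of the rectifier max(0, x). A direct implementation overflows for large inputs, so the sketch below uses an algebraically equivalent form that is numerically stable. It is a plain Python illustration, not the Theano code used in the experiments.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    # max(x, 0) + log1p(exp(-|x|)) equals log(1 + exp(x)) and avoids overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rectifier(x):
    """Rectified linear unit: max(0, x)."""
    return max(x, 0.0)


# Softplus upper-bounds the rectifier and approaches it away from zero.
gaps = [softplus(v) - rectifier(v) for v in (-10.0, -1.0, 0.0, 1.0, 10.0)]
```

The gap between the two functions is largest at zero, where it equals log 2, and shrinks quickly as the magnitude of the input grows.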
The main difference between the DAE and CAE is that the DAE yields more robust reconstruction whereas the CAE obtains more robust features (Rifai et al., 2011).
As seen in Figure 7, the weights U and V are shared across patches, and we concatenate the outputs of the last autoencoder on each patch to feed them as input to an MLP with a large hidden layer.
We used 400 hidden units for the CAE and 100 hidden units for the DAE. The learning rate used for the CAE is 0.82 and for the DAE it is 9*1e-3. The corruption level for the DAE (binomial noise) is 0.25 and the contraction level for the CAE is 2.0. The L1 regularization penalty for the DAE is 2.25*1e-4 and the L2 penalty is 9.5*1e-5. For the supervised fine-tuning phase the learning rate used is 4*1e-4, with L1 and L2 penalties of 1e-5 and 1e-6 respectively. The top-level MLP has 6400 hidden units. The autoencoders are each trained for 150 epochs while the whole MLP is fine-tuned for 50 epochs.
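Sharing weights across patches amounts to applying one and the same encoder to every patch and concatenating the results. The sketch below illustrates this in plain Python. The 64×64 image size, the 8×8 patch size and the toy encoder are assumptions made for illustration; the real encoder would be the stacked autoencoders described in this section.

```python
def extract_patches(image, patch):
    """Split a square image (list of rows) into non-overlapping patches.

    Each patch is flattened to a list; patches are returned in row-major order.
    """
    n = len(image)
    patches = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches


def encode_image(image, patch, encoder):
    """Apply the same encoder (shared weights) to every patch and concatenate."""
    features = []
    for p in extract_patches(image, patch):
        features.extend(encoder(p))
    return features


def toy_encoder(p):
    """Hypothetical stand-in for the autoencoder stack: 100 features per patch."""
    mean = sum(p) / len(p)
    return [mean] * 100


image = [[0.0] * 64 for _ in range(64)]        # assumed 64x64 input
features = encode_image(image, 8, toy_encoder)  # 64 patches x 100 features
```

Under these assumptions the MLP input has 64 × 100 = 6400 entries, one block of 100 features for each of the 64 patch locations.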
Greedy Layer-wise DAE+DAE Supervised Fine-tuning:
For this architecture, we trained two layers of denoising autoencoders greedily and performed supervised fine-tuning after unsupervised pretraining. The motivation for using two denoising autoencoders is that rectifier nonlinearities work well with deep networks, but it is difficult to train CAEs with the rectifier nonlinearity. We used the same type of denoising autoencoder as in the greedy layer-wise CAE+DAE supervised fine-tuning experiment.
In this experiment we used 400 hidden units for the first-layer DAE and 100 hidden units for the second-layer DAE. The other hyperparameters for the DAE and for supervised fine-tuning are the same as in the CAE+DAE supervised fine-tuning experiment.