# Learning Internal Representations (COLT 1995)

## Abstract

Probably the most important problem in machine learning is the preliminary biasing of a learner’s hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for automatically learning or biasing the learner’s hypothesis space is introduced. It works by first learning an appropriate internal representation for a learning environment and then using that representation to bias the learner’s hypothesis space for the learning of future tasks drawn from the same environment.

An internal representation must be learnt by sampling from many similar tasks, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples $m$ per task required to ensure good generalisation from a representation learner obeys $m = O(a + b/n)$, where $n$ is the number of tasks being learnt and $a$ and $b$ are constants. If the tasks are learnt independently (i.e. without a common representation) then $m = O(a + b)$. It is argued that for learning environments such as speech and character recognition $b \gg a$, and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if $n = O(b)$ (with $m = O(a + b/n)$) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to $O(a)$ (as opposed to $O(a + b)$ if no representation is used).

It is shown that gradient descent can be used to train neural network representations and the results of an experiment are reported in which a neural network representation was learnt for an environment consisting of translationally invariant Boolean functions. The experiment provides strong qualitative support for the theoretical results.

## 1 Introduction

It has been argued elsewhere (for example, see [2]) that the main problem in machine learning is the biasing of the learner’s hypothesis space sufficiently well to ensure good generalisation from a relatively small number of examples. Once suitable biases have been found the actual learning task is relatively trivial. Despite this conclusion, much of machine learning theory is still concerned only with the problem of quantifying the conditions necessary for good generalisation once a suitable hypothesis space for the learner has been found; virtually no work appears to have been done on the problem of how the learner’s hypothesis space is to be selected in the first place. This paper presents a new method for automatically selecting a learner’s hypothesis space: internal representation learning.

The idea of automatically learning internal representations is not new to machine learning. In fact the huge increase in Artificial Neural Network (henceforth ANN) research over the last decade can be partially attributed to the promise—first given air in [5]—that neural networks can be used to automatically learn appropriate internal representations. However it is fair to say that despite some notable isolated successes, ANNs have failed to live up to this early promise. The main reason for this is not any inherent deficiency of ANNs as a machine learning model, but a failure to realise the true source of information necessary to generate a good internal representation.

Most machine learning theory and practice is concerned with learning a single task (such as “recognise the digit ‘1’”) or at most a handful of tasks (recognise the digits ‘0’ to ‘9’). However it is unlikely that the information contained in a small number of tasks is sufficient to determine an appropriate internal representation for the tasks. To see this, consider the problem of learning to recognise the handwritten digit ‘1’. The most extreme representation possible would be one that completely solves the classification problem, i.e. a representation that outputs ‘yes’ if its input is an image of a ‘1’, regardless of the position, orientation, noise or writer dependence of the original digit, and ‘no’ if any other image is presented to it. A learner equipped with such a representation would require only one positive and one negative example of the digit to learn to recognise it perfectly. Although the representation in this example certainly reduces the complexity of the learning problem, it does not really seem to capture what is meant by the term representation. What is wrong is that although the representation is an excellent one for learning to recognise the digit ‘1’, it could not be used for any other learning task. A representation that is appropriate for learning to recognise ‘1’ should also be appropriate for other character recognition problems: it should be good for learning other digits, or the letters of the alphabet, or Kanji characters, or Arabic letters, and so on. Thus the information necessary to determine a good representation is not contained in a single learning problem (recognising ‘1’), but is contained in many examples of similar learning problems. The same argument can be applied to other familiar learning domains, such as face recognition and speech recognition. A representation appropriate for learning a single face should be appropriate for learning all faces, and similarly a single word representation should be good for all words (to some extent even regardless of the language).

In the rest of this paper it is shown how to formally model the process of sampling from many similar learning problems and how information from such a sample can be used to learn an appropriate representation. If learning problems are learnt independently then the number of examples $m$ required per problem for good generalisation obeys $m = O(a + b)$, whereas if a common representation is learnt for all $n$ problems then $m = O(a + b/n)$. The origin of the constants $a$ and $b$ will be explained in section 3 and it is argued in section 2 that for common learning domains such as speech and image recognition $b \gg a$, hence representation learning in such environments can potentially yield a drastic reduction in the number of examples required for good generalisation. It will also be shown that if a representation is learnt on $n = O(b)$ tasks then with high probability it will be good for learning novel tasks and that the sampling burden for good generalisation on novel tasks will be reduced to $O(a)$, in contrast to $O(a + b)$ if no representation is used.

## 2 Mathematical framework

Haussler’s [3] statistical decision theoretic formulation of ordinary machine learning is used throughout this paper as it has the widest applicability of any formulation to date. This formulation may be summarised as follows. The learner is provided with a training set $\vec z = ((x_1, y_1), \dots, (x_m, y_m))$ where each example $z_i = (x_i, y_i)$ consists of an input $x_i \in X$ and an outcome $y_i \in Y$. The training set is generated by $m$ independent trials according to some (unknown) joint probability distribution $P$ on $Z = X \times Y$. In addition the learner is provided with an action space $A$, a loss function $l\colon Y \times A \to [0, M]$ and a hypothesis space $\mathcal{H}$ containing functions $h\colon X \to A$. Defining the true loss of hypothesis $h$ with respect to distribution $P$ as

$$E(h, P) = \int_{X \times Y} l(y, h(x))\, dP(x, y), \tag{1}$$

the goal of the learner is to produce a hypothesis $h \in \mathcal{H}$ that has true loss as small as possible. $l(y, a)$ is designed to give a measure of the loss the learner suffers, when, given an input $x$, it produces an action $a$ and is subsequently shown the outcome $y$.

If for each $h \in \mathcal{H}$ a function $l_h\colon Z \to [0, M]$ is defined by $l_h(z) = l(y, h(x))$ for all $z = (x, y) \in Z$, then $E(h, P)$ can be expressed as the expectation of $l_h$ with respect to $P$,

$$E(h, P) = \int_Z l_h(z)\, dP(z).$$

Let $l_{\mathcal{H}} = \{l_h : h \in \mathcal{H}\}$. The measure $P$ and the $\sigma$-algebra on $Z$ are assumed to be such that all the $l_h \in l_{\mathcal{H}}$ are $P$-measurable (see definition 3).

To minimize the true loss the learner searches for a hypothesis minimizing the empirical loss on the sample $\vec z$,

$$E(h, \vec z) = \frac{1}{m} \sum_{i=1}^{m} l(y_i, h(x_i)). \tag{2}$$

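To enable the learner to get some idea of the environment in which it is learning and hopefully then extract some of the bias inherent in the environment, it is assumed that the environment consists of a set of probability measures $\mathcal{P}$ and an environmental measure $Q$ on $\mathcal{P}$. Now the learner is not just supplied with a single sample $\vec z$, sampled according to some probability measure $P \in \mathcal{P}$, but with $n$ such samples $\vec z_1, \dots, \vec z_n$. Each sample $\vec z_i = (z_{i1}, \dots, z_{im})$, for $1 \le i \le n$, is generated by first sampling from $\mathcal{P}$ according to the environmental measure $Q$ to generate $P_i$, and then sampling $m$ times from $P_i$ to generate $\vec z_i$. Denote the entire sample by $\mathbf{z}$ and write it as an $n \times m$ ($n$ rows, $m$ columns) matrix over $Z$:

$$\mathbf{z} = \begin{pmatrix} z_{11} & \cdots & z_{1m} \\ \vdots & \ddots & \vdots \\ z_{n1} & \cdots & z_{nm} \end{pmatrix}.$$

Denote the $n \times m$ matrices over $Z$ by $Z^{(n,m)}$ and call a sample $\mathbf{z} \in Z^{(n,m)}$ generated by the above process an $(n,m)$-sample.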

To illustrate this formalism, consider the problem of character recognition. In this case $X$ would be the space of all images, $Y$ would be the set $\{0, 1\}$, and each probability measure $P \in \mathcal{P}$ would represent a distinct character or character-like object in the sense that $P(y|x) = 1$ if $x$ is an image of the character $P$ represents and $y = 1$, and $P(y|x) = 0$ otherwise. The marginal distribution $P(x)$ in each case could be formed by choosing a positive example of the character with probability half from the ‘background’ distribution over images of the character concerned, and similarly choosing a negative example with probability half. $Q$ would give the probability of occurrence of each character. The $(n,m)$-sample $\mathbf{z}$ is then simply a set of $n$ training sets, each row of $\mathbf{z}$ being a sequence of $m$ classified examples of some character. If the idea of character is widened to include other alphabets such as the Greek alphabet and the Japanese Kanji characters then the number of different characters to sample from is very large indeed.
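The two-stage sampling process that produces such a sample can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the environmental measure is taken to be uniform over a finite list of tasks, and the names `draw_nm_sample` and `make_threshold_task`, as well as the toy threshold tasks themselves, are hypothetical and do not come from the paper.

```python
import random

def draw_nm_sample(tasks, n, m, rng):
    """Return a sample with n rows (tasks) and m columns (examples per task).

    tasks: list of callables; tasks[k](rng) returns one (x, y) pair drawn
           from the k-th task distribution.
    Assumption: the environmental measure is uniform over `tasks`.
    """
    sample = []
    for _ in range(n):
        task = rng.choice(tasks)               # stage 1: draw a task from the environment
        row = [task(rng) for _ in range(m)]    # stage 2: draw m examples of that task
        sample.append(row)
    return sample

def make_threshold_task(t):
    """Toy task (hypothetical): label an integer x in 0..9 by whether x >= t."""
    def draw(rng):
        x = rng.randrange(10)
        return x, int(x >= t)
    return draw

tasks = [make_threshold_task(t) for t in range(1, 10)]
z = draw_nm_sample(tasks, n=3, m=5, rng=random.Random(0))
```

Each row of `z` plays the role of one training set; tasks are drawn with replacement, so the same task may generate more than one row.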

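To enable the learner to take advantage of the prior information contained in the $(n,m)$-sample $\mathbf{z}$, the hypothesis space $\mathcal{H}$ is split into two sections: $\mathcal{H} = \mathcal{G} \circ \mathcal{F}$, where $\mathcal{F}$ is a set of functions $f\colon X \to V$, $\mathcal{G}$ is a set of functions $g\colon V \to A$, and $V$ is an arbitrary set.^{1}

$\mathcal{F}$ will be called the representation space and an individual member $f$ of $\mathcal{F}$ will be called an internal representation or just a representation. $\mathcal{G}$ will be called the output function space.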

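Based on the information about the environment $(\mathcal{P}, Q)$ contained in $\mathbf{z}$, the learner searches for a good representation $f \in \mathcal{F}$. A good representation is one with a small empirical loss $E^*_{\mathcal{G}}(f, \mathbf{z})$ on $\mathbf{z}$, where this is defined by

$$E^*_{\mathcal{G}}(f, \mathbf{z}) = \frac{1}{n} \sum_{i=1}^{n} \inf_{g \in \mathcal{G}} E(g \circ f, \vec z_i), \tag{3}$$

where $\vec z_i$ denotes the $i$th row of $\mathbf{z}$ and $E(g \circ f, \vec z_i)$ is the empirical loss (2) of the hypothesis $g \circ f$ on that row. The empirical loss of $f$ with respect to $\mathbf{z}$ is a measure of how well the learner can learn $\mathbf{z}$ using $f$, assuming that the learner is able to find the best possible $g \in \mathcal{G}$ for any given sample $\vec z$. For example, if the empirical loss of $f$ on $\mathbf{z}$ is zero then it is possible for the learner to find an output function^{2} $g_i \in \mathcal{G}$ for each row $\vec z_i$ such that $g_i \circ f$ has zero empirical loss on $\vec z_i$. The true loss of $f$ with respect to the environmental measure $Q$ is defined by

$$E^*_{\mathcal{G}}(f, Q) = \int_{\mathcal{P}} \inf_{g \in \mathcal{G}} E(g \circ f, P)\, dQ(P). \tag{4}$$

The true loss of $f$ with respect to $Q$ is the expected best possible performance of $g \circ f$, over all $g \in \mathcal{G}$, on a distribution chosen at random from $\mathcal{P}$ according to $Q$. If $f$ has a small true loss then learning using $f$ on a random “task” $P$, drawn according to $Q$, will with high probability be successful.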
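The empirical loss of a representation (the average, over the tasks in the sample, of the best loss attainable by any output function composed with that representation) can be sketched as follows. The sketch assumes a finite output function space, so that the infimum becomes a minimum, and uses a 0/1 loss; the function names and the toy parity tasks are hypothetical.

```python
def empirical_loss(h, row, loss):
    """Average loss of hypothesis h on one task's examples [(x, y), ...]."""
    return sum(loss(y, h(x)) for x, y in row) / len(row)

def representation_loss(f, outputs, sample, loss):
    """Average over rows of the best loss achievable by g(f(x)), g in outputs.

    Assumption: `outputs` is a finite list, so the infimum is a minimum.
    """
    total = 0.0
    for row in sample:
        total += min(
            empirical_loss(lambda x, g=g: g(f(x)), row, loss) for g in outputs
        )
    return total / len(sample)

def zero_one(y, a):
    """0/1 loss: no loss when the action equals the outcome."""
    return 0.0 if y == a else 1.0

# Two toy tasks over the integers 0..3: "is odd" and "is even".
sample = [
    [(0, 0), (1, 1), (2, 0), (3, 1)],
    [(0, 1), (1, 0), (2, 1), (3, 0)],
]
outputs = [lambda v: v, lambda v: 1 - v]    # identity and negation

parity = lambda x: x % 2                    # separates odd from even inputs
constant = lambda x: 0                      # discards all input information

good = representation_loss(parity, outputs, sample, zero_one)
bad = representation_loss(constant, outputs, sample, zero_one)
```

Here `parity` serves both tasks (a different output function is chosen for each row), whereas `constant` cannot do better than chance on either, which is the sense in which the loss measures the usefulness of a representation across tasks rather than on a single one.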

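Note that the learner takes as input samples $\mathbf{z} \in Z^{(n,m)}$, for any values $n, m \ge 1$, and produces as output hypothesis representations $f \in \mathcal{F}$, so it is a map $\mathcal{A}$ from the space of all possible $(n,m)$ samples into $\mathcal{F}$,

$$\mathcal{A}\colon \bigcup_{n, m \ge 1} Z^{(n,m)} \to \mathcal{F}.$$

It may be that the $n$ tasks $\vec P = (P_1, \dots, P_n)$ generating the training set $\mathbf{z}$ are all that the learner is ever going to be required to learn, in which case it is more appropriate to suppose that in response to the $(n,m)$-sample $\mathbf{z}$ the learner generates $n$ hypotheses $\vec g \circ \bar f = (g_1 \circ f, \dots, g_n \circ f)$ all using the same representation $f$ and collectively minimizing

$$E(\vec g \circ \bar f, \mathbf{z}) = \frac{1}{n} \sum_{i=1}^{n} E(g_i \circ f, \vec z_i). \tag{5}$$

The true loss of the learner will then be

$$E(\vec g \circ \bar f, \vec P) = \frac{1}{n} \sum_{i=1}^{n} E(g_i \circ f, P_i). \tag{6}$$

Denoting the set of all functions $\vec g \circ \bar f$ for $g_1, \dots, g_n \in \mathcal{G}$ and $f \in \mathcal{F}$ by $\mathcal{G}^n \circ \bar{\mathcal{F}}$, the learner in this case is a map

$$\mathcal{A}\colon \bigcup_{n, m \ge 1} Z^{(n,m)} \to \bigcup_{n \ge 1} \mathcal{G}^n \circ \bar{\mathcal{F}}.$$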

If the learner is going to be using the representation $f$ to learn future tasks drawn according to the same environment $(\mathcal{P}, Q)$, it will do so by using the restricted hypothesis space $\mathcal{G} \circ f = \{g \circ f : g \in \mathcal{G}\}$. That is, the learner will be fed samples $\vec z \in Z^m$ drawn according to some distribution $P \in \mathcal{P}$, which in turn is drawn according to $Q$, and will search for a hypothesis $g \circ f \in \mathcal{G} \circ f$ with small empirical loss on $\vec z$. Intuitively, if $\mathcal{F}$ is much “bigger” than $\mathcal{G}$ then the number of examples required to learn using $\mathcal{G} \circ f$ will be much less than the number of examples required to learn using the full space $\mathcal{G} \circ \mathcal{F}$, a fact that is proved in the next section. Hence, if the learner can find a good representation $f$ and the sample $\mathbf{z}$ is large enough, learning using $\mathcal{G} \circ f$ will be considerably quicker and more reliable than learning using $\mathcal{G} \circ \mathcal{F}$. If the representation mechanism outlined here is the one employed by our brains then the fact that children learn to recognise characters and words from a relatively small number of examples is evidence of a small $\mathcal{G}$ in these cases. The fact that we are able to recognise human faces after being shown only a single example is evidence of an even smaller $\mathcal{G}$ for this task. Furthermore, the fact that most of the difficulty in machine learning lies in the initial bias of the learner’s hypothesis space [2] indicates that our ignorance concerning an appropriate representation $f$ is large, and hence the entire representation space $\mathcal{F}$ will have to be large to ensure that it contains a suitable representation. Thus it seems that at least for the examples outlined above the conditions ensuring that representation learning is a big improvement over ordinary learning will be satisfied.

The main issue in machine learning is that of quantifying the necessary sampling conditions ensuring good generalisation. In representation learning there are two measures of good generalisation. The first is the proximity of (5) above to the second form of the true loss (6). If the sample $\mathbf{z}$ is large enough to guarantee with high probability that these two quantities are close, then a learner that produces a good performance in training on $\mathbf{z}$ will be likely to perform well on future examples of any of the $n$ tasks used to generate $\mathbf{z}$. The second measure of generalisation performance is the proximity of (3) to the first form of the true loss (4). In this case good generalisation means that the learner should expect to perform well if it uses the representation $f$ to learn a new task $P$ drawn at random according to the environmental measure $Q$. Note that this is a new form of generalisation, one level of abstraction higher than the usual meaning of generalisation, for within this framework a learner generalises well if, after having learnt many different tasks, it is able to learn new tasks easily. Thus, not only is the learner required to generalise well in the ordinary sense by generalising well on the tasks in the training set, but also the learner is expected to have “learnt to learn” the tasks from the environment in general. Both the number of tasks $n$ generating $\mathbf{z}$ and the number of examples $m$ of each task must be sufficiently large to ensure good generalisation in this new sense.

To measure the deviation between (5) and (6), and (3) and (4), the following one-parameter family of metrics on $\mathbb{R}^+$, introduced in [3], will be used:

$$d_\nu(x, y) = \frac{|x - y|}{\nu + x + y},$$

for all $x, y \ge 0$ and $\nu > 0$. Thus, good generalisation in the first case is governed by the probability

(7) |

where the probability measure on is . In the second case it is governed by the probability

(8) |

This time the probability measure on is

for any measurable subset^{3}

## 3 Conditions for good generalisation

To state the main results some further definitions are required.

###### Definition 1.

For the structure and loss function define for any by . Let . For any probability measure on define the pseudo-metric on by

(9) |

Let be the size of the smallest -cover of and define the -capacity of to be

(10) |

where the supremum is over all probability measures .
For any probability measure on
define the pseudo-metric on by^{4}

Let be the corresponding -capacity.

### 3.1 Generalisation on tasks

The following theorem bounds the number of examples of each task in an -sample needed to ensure with high probability good generalisation from a representation learner on average on all the tasks. It uses the notion of a hypothesis space family—which is just a set of hypothesis spaces—and a generalisation of Pollard’s concept of permissibility [4] to cover hypothesis space families, called f-permissibility. The definition of f-permissibility is given in appendix B.

###### Theorem 1.

Let and be families of functions with the structure , let be a loss function and suppose and are such that the hypothesis space family is f-permissible. Let be probability measures on and let be an -sample generated by sampling times from according to each . For all and any representation learner

if

(11) |

where , then

Proof.
See appendix A.

Theorem 1 with corresponds to the ordinary
learning scenario in which a single task is learnt. Setting
and
gives for a single task while for tasks learnt
using a common representation^{5}

If and are Lipschitz bounded neural networks with weights in and weights in and is one of many loss functions used in practice (such as Euclidean loss, mean squared loss, cross entropy loss—see [3], section 7), then simple extensions of the methods of [3] can be used to show:

where and are constants not dependent on the number of weights or . Substituting these expressions into (11) and optimizing the bound with respect to gives:

which yields^{6}

(12) |

Although (12) results from a worst-case analysis and some rather crude approximations to the capacities of and , its general form is intuitively very appealing—particularly the behaviour of as a function of and the size of and . I would expect this general behaviour to survive in more accurate analyses of specific representation learning scenarios. This conclusion is certainly supported by the experiments of section 5.

### 3.2 Generalisation when learning to learn

The following theorem bounds the number of tasks and the number of examples of each task required of an -sample to ensure with high probability that a representation learnt on that sample will be good for learning future tasks drawn according to the same environment.

###### Theorem 2.

Let and be as in theorem 1. Let be an -sample from according to an environmental measure . For all , , , if

and is any representation learner,

then

Proof.
See appendix A.

Apart from the odd factor of two, the bound on in this theorem is the same as the bound in theorem 1. Using the notation introduced following theorem 1, the bound on is , which is very large. However it results again from a worst-case analysis (it corresponds to the worst possible environment ) and is only an approximation, so it is likely to be beaten in practice. The experimental results of section 5 verify this. The bound on now becomes , while the total number of examples . This is far greater than the examples that would be required to learn a single task using the full space , however a representation learnt on a single task cannot be used to reliably learn novel tasks. Also the representation learning phase is most likely to be an off-line process to generate the preliminary bias for later on-line learning of novel tasks, and as learning of novel tasks will be done with the biased hypothesis space rather than the full space , the number of examples required for good generalisation on novel tasks will be reduced to .

## 4 Representation learning via backpropagation

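If the function classes $\mathcal{F}$ and $\mathcal{G}$ consist of “backpropagation” type neural networks then it is possible to use a slight variation on ordinary gradient descent procedures to learn a neural network representation $f$.

In this case, based on an $(n,m)$-sample

$$\mathbf{z} = \begin{pmatrix} (x_{11}, y_{11}) & \cdots & (x_{1m}, y_{1m}) \\ \vdots & \ddots & \vdots \\ (x_{n1}, y_{n1}) & \cdots & (x_{nm}, y_{nm}) \end{pmatrix},$$

the learner searches for a representation $f \in \mathcal{F}$ minimizing the mean-squared representation error:

$$E^*_{\mathcal{G}}(f, \mathbf{z}) = \frac{1}{n} \sum_{i=1}^{n} \inf_{g \in \mathcal{G}} \frac{1}{m} \sum_{j=1}^{m} \bigl(g \circ f(x_{ij}) - y_{ij}\bigr)^2.$$

The most common procedure for training differentiable neural networks is to use some form of gradient descent algorithm (vanilla backprop, conjugate gradient, etc.) to minimize the error of the network on the sample being learnt. For example, in ordinary learning the learner would receive a single sample $\vec z = ((x_1, y_1), \dots, (x_m, y_m))$ and would perform some form of gradient descent to find a function $g \circ f \in \mathcal{G} \circ \mathcal{F}$ such that

$$E(g \circ f, \vec z) = \frac{1}{m} \sum_{i=1}^{m} \bigl(g \circ f(x_i) - y_i\bigr)^2 \tag{13}$$

is minimal. This procedure works because it is a relatively simple matter to compute the gradient, $\nabla_{\mathbf{w}} E(g \circ f, \vec z)$, where $\mathbf{w}$ are the parameters (weights) of the networks in $\mathcal{F}$ and $\mathcal{G}$.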

Applying this method directly to the problem of minimising above would mean calculating the gradient where now refers only to the parameters of . However, due to the presence of the infimum over in the formula for , calculating the gradient in this case is much more difficult than in the ordinary learning scenario. An easier way to proceed is to minimize (recall equation (5)) over all , for if is such that

then so too is

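The advantage of this approach is that essentially the same techniques used for computing the gradient in ordinary learning can be used to compute the gradient of $E(\vec g \circ \bar f, \mathbf{z})$, where $\vec g \circ \bar f = (g_1 \circ f, \dots, g_n \circ f)$. Note that in the present framework $E(\vec g \circ \bar f, \mathbf{z})$ is:

$$E(\vec g \circ \bar f, \mathbf{z}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \bigl(g_i \circ f(x_{ij}) - y_{ij}\bigr)^2, \tag{14}$$

which is simply the average of the mean-squared error of each $g_i \circ f$ on $\vec z_i$, the $i$th row of $\mathbf{z}$, whose $j$th example is $(x_{ij}, y_{ij})$.

An example of a neural network of the form $\vec g \circ \bar f$ is illustrated in figure 1. With reference to the figure, consider the problem of computing the derivative of $E(\vec g \circ \bar f, \mathbf{z})$ with respect to a weight in the $i$th output network $g_i$. Denoting the weight by $w$ and recalling equation (14), we have

$$\frac{\partial}{\partial w} E(\vec g \circ \bar f, \mathbf{z}) = \frac{1}{n} \frac{\partial}{\partial w} \left[ \frac{1}{m} \sum_{j=1}^{m} \bigl(g_i \circ f(x_{ij}) - y_{ij}\bigr)^2 \right], \tag{15}$$

which is just $1/n$ times the derivative of the ordinary learning error (13) of $g_i \circ f$ on sample $\vec z_i$ with respect to the weight $w$. This can be computed using the standard backpropagation formula for the derivative [5]. Alternatively, if $w$ is a weight in the representation network $f$ then

$$\frac{\partial}{\partial w} E(\vec g \circ \bar f, \mathbf{z}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \left[ \frac{1}{m} \sum_{j=1}^{m} \bigl(g_i \circ f(x_{ij}) - y_{ij}\bigr)^2 \right], \tag{16}$$

which is simply the average of the derivatives of the ordinary learning errors over all the samples $\vec z_1, \dots, \vec z_n$ and hence can also be computed using the backpropagation formula.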
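The structure of these two derivatives can be checked on a deliberately tiny model. The sketch below is not the network of figure 1: it assumes a scalar linear representation `x -> w * x` shared by all tasks and one scalar output weight per task, and all function names and data values are made up for illustration. The derivative with respect to an output weight involves only that task's row, scaled by one over the number of tasks, while the derivative with respect to the shared weight averages the per-task derivatives.

```python
def task_error(a, w, row):
    """Mean-squared error of the hypothesis x -> a * (w * x) on one task's row."""
    return sum((a * w * x - y) ** 2 for x, y in row) / len(row)

def total_error(heads, w, sample):
    """Average of the per-task mean-squared errors over all tasks."""
    return sum(task_error(a, w, row) for a, row in zip(heads, sample)) / len(sample)

def grad_head(i, heads, w, sample):
    """Derivative w.r.t. the i-th output weight: only task i contributes."""
    a, row = heads[i], sample[i]
    d_task = sum(2 * (a * w * x - y) * w * x for x, y in row) / len(row)
    return d_task / len(sample)

def grad_shared(heads, w, sample):
    """Derivative w.r.t. the shared weight: average of the per-task derivatives."""
    total = 0.0
    for a, row in zip(heads, sample):
        total += sum(2 * (a * w * x - y) * a * x for x, y in row) / len(row)
    return total / len(sample)

# Made-up data: two tasks, two examples each.
sample = [[(1.0, 2.0), (2.0, 4.1)], [(1.0, -1.0), (3.0, -2.9)]]
heads, w = [0.5, -0.3], 1.5

# Central finite differences as an independent check of the analytic gradients.
eps = 1e-6
num_shared = (total_error(heads, w + eps, sample)
              - total_error(heads, w - eps, sample)) / (2 * eps)
num_head0 = (total_error([heads[0] + eps, heads[1]], w, sample)
             - total_error([heads[0] - eps, heads[1]], w, sample)) / (2 * eps)
```

In a real network the per-task derivatives would come from backpropagation rather than a closed form, but the bookkeeping is the same: accumulate each task's gradient into the shared weights and divide by the number of tasks.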

## 5 Experiment: Learning translation invariance

In this section the results of an experiment are reported in which a neural network like the one in figure 1 was trained to perform a very simple “machine vision” task where it had to learn certain translationally invariant Boolean functions. All simulations reported here were performed on the 32-node CM5 at the South Australian Centre for Parallel Supercomputing.

The input space $X$ was viewed as a one-dimensional “retina” in which all the pixels could be either on (1) or off (0) (so in fact $X = \{0, 1\}^{10}$). However the network did not see all possible input vectors during the course of its training; the only vectors with a non-zero probability of appearing in the training set were those consisting of from one to four active adjacent pixels placed somewhere in the input (wrapping at the edge was allowed).

The functions in the environment of the network consisted of all possible translationally invariant Boolean functions over the input space (except the trivial “constant 0” and “constant 1” functions). The requirement of translation invariance means that the environment consisted of just 14 different functions: all the Boolean functions on four objects (of which there are $2^4 = 16$) less the two trivial ones. Thus the environment was highly restricted, both in the number of different input vectors seen (40 out of a possible 1024) and in the number of different functions to be learnt (14 out of a possible $2^{1024}$). $(n,m)$ samples were generated from this environment by firstly choosing $n$ functions (with replacement) uniformly from the fourteen possible, and then choosing $m$ input vectors (with replacement again), for each function, uniformly from the 40 possible input vectors.
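The counts quoted above (40 input vectors and 14 functions) can be reproduced by enumerating the environment directly. The sketch assumes a retina of ten pixels, which is inferred from the figure of 1024 possible input vectors rather than stated outright, and the helper names are hypothetical.

```python
import itertools

WIDTH = 10  # assumed retina size: 2**10 = 1024 possible input vectors

def object_vectors(width=WIDTH, max_len=4):
    """Map each admissible input vector (tuple of 0/1) to its object size."""
    vectors = {}
    for size in range(1, max_len + 1):
        for start in range(width):
            v = [0] * width
            for k in range(size):
                v[(start + k) % width] = 1   # adjacent pixels, wrapping at the edge
            vectors[tuple(v)] = size
    return vectors

def invariant_functions(max_len=4):
    """All non-constant 0/1 labelings of the objects, indexed by size - 1."""
    return [bits for bits in itertools.product((0, 1), repeat=max_len)
            if 0 < sum(bits) < max_len]

inputs = object_vectors()
functions = invariant_functions()
print(len(inputs), len(functions))  # prints: 40 14
```

A translationally invariant function is fully determined by the label it gives each of the four objects, so evaluating the function `bits` on an input vector `v` is simply `bits[inputs[v] - 1]`.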

The architecture of the network was similar to the one shown in figure 1, the only difference being that the output networks for this experiment had only one hidden layer, not two. The network in figure 1 is for learning samples (it has output networks), in general for learning samples the network will have output networks.

The network was trained on samples with ranging from to in steps of four and ranging from to in steps of . Conjugate-gradient descent was used with exact line search with the gradients for each weight computed using the backpropagation algorithm according to the formulae (16) and (15). Further details of the experimental procedure may be found in [1].

Once the network had successfully learnt the $(n,m)$ sample its generalisation ability was tested on all $n$ functions in the training set. In this case the generalisation error (i.e. the true error of equation (6)) could be computed exactly by calculating the network’s output (for all $n$ functions) for each of the 40 input vectors, and comparing the result with the desired output.

In an ordinary learning situation the generalisation error of a network would be plotted as a function of $m$, the number of examples in the training set, resulting in a learning curve. For representation learning there are two parameters, $n$ and $m$, so the learning curve becomes a learning surface. Plots of the learning surface are shown in figure 2 for three independent simulations. All three cases support the theoretical result that the number of examples $m$ required for good generalisation decreases with increasing $n$ (cf. theorem 1).

For samples that led to a generalisation error of less than , the representation network was extracted and tested for its true error, where this is defined as in equation (4) and in the current framework translates to

where are the input vectors seen by the learner and are all the functions in the environment. measures how useful the representation is for learning all functions in the environment, not just the ones used in generating the sample was trained on. To measure , entire training sets consisting of 40 input-output pairs were generated for each of the 14 functions in the environment and the training sets were learnt individually by fixing and performing conjugate gradient descent on the weights of . To be (nearly) certain that a minimal solution had been found for each of the functions, learning was started from different random weight initialisations for (this number was chosen so that the CM5 could perform all the restarts in parallel) and the best result from all 32 recorded. For each sample giving perfect generalisation, was calculated and then averaged over all samples with the same value of , and finally averaged over all three simulations, to give an indication of the behaviour of as a function of . This is plotted in figure 3, along with the representation error for the three simulations (i.e, the maximum error over all 14 functions and all 40 examples and over all three simulations). Qualitatively the curves support the theoretical conclusion that the representation error should decrease with an increasing number of tasks in the sample. However, note that the representation error is very small, even when the representation is derived from learning only one function from the environment. This can be explained as follows. For a representation to be a good one for learning in this environment it must be translationally invariant and distinguish all the four objects it sees (i.e. have different values on all four objects). For small values of , to achieve perfect generalisation the representation is forced to be at least approximately translationally invariant and so half of the problem is already solved. 
However depending upon the particular functions in the sample the representation may not have to distinguish all four objects, for example it may map two objects to the same element of if none of the functions in the sample distinguish those objects (a function distinguishes two objects if it has a different value on those two objects). However, because the representation network is continuous it is very unlikely that it will map different objects to exactly the same element of —there will always be slight differences. When the representation is used to learn a function that does distinguish the objects mapped to nearly the same element of , often an output network with sufficiently large weights can be found to amplify this difference and produce a function with small error. This is why a representation that is simply translationally invariant does quite well in general. This argument is supported by a plot in figure 4 of the representation’s behaviour for minimum-sized samples leading to a generalisation error of less than for . The four different symbols marking the points in the plots correspond to the four different input objects. For the plot the three and four pixel objects are well separated by the representation while the one and two pixel objects are not, except that a closer look reveals that there is a slight separation between the representation’s output for the one and two pixel objects. This separation can be exploited to learn a function that at least partially distinguishes the two objects. Note that the representation’s behaviour improves dramatically with increasing , in that all four objects become well separated and the variation in the representation’s output for individual objects decreases. 
This improvement manifests itself in superior learning curves for learning using a representation from a high -sample, although it is not necessarily reflected in the representation’s error because of the use of the infimum over all in the definition of that error.

### 5.1 Representation vs. no representation.

As well as reducing the sampling burden for the tasks in the training set, a representation learnt on sufficiently many tasks should be good for learning novel tasks and should greatly reduce the number of examples required of new tasks. This was experimentally verified by taking a representation $f$ known to be perfect for the environment above and using it to learn all the functions in the environment in turn. Hence the hypothesis space of the learner was $\mathcal{G} \circ f$, rather than the full space $\mathcal{G} \circ \mathcal{F}$. All the functions in the environment were also learnt using the full space. The learning curves (i.e. the generalisation error as a function of the number of examples in the training set) were calculated for all 14 functions in each case. The learning curves for all the functions were very similar and two are plotted in figure 5, for learning with a representation (Gof in the graphs) and without (GoF). These curves are the average of 32 different simulations obtained by using different random starting points for the weights in $\mathcal{G}$ (and $\mathcal{F}$ when using the full space to learn). Learning with a good representation is clearly far superior to learning without.

## Appendix A Sketch proofs of theorems 1 and 2

###### Definition 2.

Let be sets of functions mapping into . For any , let or simply denote the map for all . Let denote the set of all such functions. Given elements of , or equivalently an element of , , let . For any product probability measure on let . For any set define the pseudo-metric and -capacity as in (9) and (10).

The following lemma is a generalisation of a similar result (theorem 3) for ordinary learning in [3], which is in turn derived from results in [4]. It is proved in [1] where it is called the fundamental theorem. The definition of permissibility is given in appendix B.

###### Lemma 3.

Let be a permissible set of functions mapping into . Let be generated by independent trials from according to some product probability measure . For all , ,

### A.1 Proof sketch of theorem 1

Let denote the set of all functions where and , and let denote an individual such map. Recall that in theorem 1 the learner maps -samples into and note from definition 2 and equation (5) that where denotes matrix transposition. This gives,

### A.2 Proof sketch of theorem 2

To prove theorem 2 note that in the -sampling process a list of probability measures is implicitly generated in addition to the -sample . Defining and using the triangle inequality on , if

(17) |

and

(18) |

then

Inequality (17) can be bounded using essentially the same techniques as theorem 1, giving

where . Note that the probability in inequality (18) is less than or equal to

(19) |

Now, for each define by and let