We show that the number of unique function mappings in a neural network hypothesis space is inversely proportional to $\prod_l U_l!$, where $U_l$ is the number of neurons in the hidden layer $l$.
The effect of the choice of neural network depth and breadth on the size of its hypothesis space
Keywords: Deep learning, artificial neural networks
A shallow neural network is a universal function approximator, if allowed an unlimited number of neurons in its single hidden layer (Cybenko, 1989; Hornik et al., 1989). Since in theory a shallow network can do anything, what is the advantage of going deep? For one thing, deeper architectures are capable of encoding certain types of functions far more efficiently than their shallow counterparts (Montufar et al., 2014; Szymanski and McCane, 2014; Telgarsky, 2015). The efficiency of function encoding is important for two reasons:
deep learning can tackle problems that may be computationally intractable with the shallow approach;
While the first point is highly relevant for practical purposes, the latter is more interesting from the theoretical point of view. An approximation that can be made to the same level of accuracy with significantly fewer parameters is likely to give better generalisation. However, the notion of generalisation has little meaning when the function to be approximated is fully specified, as has been the case in theoretical comparisons of shallow versus deep architectures thus far. Also, because the desired mapping function is presupposed, the existing proofs do not establish that deep representations are richer in general, only that they are exceptionally efficient at certain types of approximations. This does not exclude the possibility that there are times when shallow representations are better. Although at the moment the empirical evidence suggests that going deeper does not hurt (Zagoruyko and Komodakis, 2016), we do not know that this is true in general.
In this paper we examine the capabilities of different choices of neural network architecture from a different point of view. Instead of contrasting the model complexity required for the same accuracy on a specified task, we compare the sizes of the hypothesis spaces from different variants of neural architecture of equivalent complexity (in terms of the total number of parameters). Our analysis is based on counting the number of equivalence classes in the set of possible states for a neural network of a particular architecture, where the equivalence relation corresponds to states that lead to the same function mapping. We prove that the upper bound on the number of unique functions a neural network can produce is $\frac{V^N}{\prod_l U_l!}$, where $N$ is the total number of parameters, $V$ is the cardinality of the set of values parameters can take, and $U_l$ is the number of neurons in hidden layer $l$. This implies that, given a fixed number of parameters, architecturally it is better to impart the computational complexity of the network into its depth rather than breadth in order to increase the model’s function mapping capability.
We also provide results of a numerical evaluation in small networks, which show that the actual number of unique function mappings, although much smaller than the theoretical bound and highly dependent on the choice of activation function, is nevertheless always larger in deeper architectures.
2 Neural network as a hypothesis space
A neural network with a particular architecture is a hypothesis space, denoted as $\mathcal{H}$. The architecture is specified through a set of hyperparameters. Some of these, such as the number of inputs $U_0$, are dictated by the attributes of the data the network needs to work with. Other hyperparameters, namely the number of hidden layers $L$, the number of hidden neurons $U_l$ in layers $l = 1, \ldots, L$, and the activation function $\sigma$, are chosen by the user. Once the choice of the hyperparameters is made, the input-output mapping that the network provides will depend on the values of the weights and the biases on the connections between the neurons. In this paper we will restrict ourselves to working with single-output networks. The function produced by such a network is:
$$h(x) = \sum_{j=1}^{U_L} w_j y_{Lj} + b,$$
where $w_j$ and $b$ are respectively the weights and bias of the single output neuron,
$$y_{lj} = \sigma\left(\sum_{i=1}^{U_{l-1}} w_{lji}\, y_{l-1,i} + b_{lj}\right)$$
is the output of the $j$-th neuron in layer $l$, where $\sigma$ is some activity function, $w_{lji}$ is a weight on the input from the $i$-th neuron of layer $l-1$, $b_{lj}$ is the bias, and $y_{0i} = x_i$ with $x_i$ being the $i$-th attribute of input $x$.
The total number of trainable parameters (weights + biases) in a fully connected single output feed-forward network is
$$N = \sum_{l=1}^{L} U_l \left(U_{l-1} + 1\right) + U_L + 1,$$
where, again, $U_0$ is the number of inputs.
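As a quick check of the parameter count, the sketch below tallies weights and biases layer by layer for a fully connected single-output network. The function name `count_parameters` and the two example architectures are illustrative assumptions, not taken from the paper.

```python
def count_parameters(num_inputs, hidden_layers):
    """Weights + biases of a fully connected, single-output feed-forward net.

    hidden_layers lists the number of neurons in each hidden layer.
    """
    total = 0
    fan_in = num_inputs
    for width in hidden_layers:
        # Each neuron has one weight per incoming connection plus one bias.
        total += width * (fan_in + 1)
        fan_in = width
    # Output neuron: one weight per neuron of the last hidden layer plus a bias.
    return total + fan_in + 1


print(count_parameters(1, [4]))     # 13
print(count_parameters(1, [2, 2]))  # 13
```

With a single input, one hidden layer of four neurons and two hidden layers of two neurons each happen to have the same number of parameters, which makes them a convenient pair for depth-versus-breadth comparisons.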
A particular assignment of values to the weights and biases will be referred to as the network’s state. The hypothesis space given by a neural network of a particular architecture is the set of all possible functions that this architecture is capable of producing through all possible choices of its state. Whenever there is a need to be explicit about the architecture, we will denote the corresponding hypothesis space $\mathcal{H}_{U_1, \ldots, U_L}$.
3 Equivalence classes
For a network of $N$ parameters, where each can take on values from a finite set of cardinality $V$, there is a total of $V^N$ states. However, different states can give rise to the same function mapping, and that is the equivalence relation we are interested in. Identical function mappings arising from different states are a consequence of the fact that the order of summation over a neuron’s weighted inputs does not matter with respect to its overall activity. A subset of states related by this equivalence relation forms an equivalence class. We want to establish how the choice of hyperparameters affects the total number of equivalence classes within all of the network’s states, and thus the number of unique function mappings, or the size of the hypothesis space, $|\mathcal{H}|$.
Let’s examine a mapping from input to output of a single hidden layer as shown in Figure 1 for an arbitrary choice of the weight values on the connections. A change of state that does not affect the overall mapping is synonymous with a change in the positions of two (or more) neurons behaving as beads on a string. The neuron/bead can exchange its position with another neuron/bead, each taking along the strings corresponding to its input and output connections. The state of the network changes through a permutation of the weight values on the connections, but the overall computation does not. As an example, the state change from Figure 1 to Figure 1 is analogous to neuron A exchanging its position with neuron C. Figure 1 shifts the neurons with respect to Figure 1 in such a way that A moves into the position of B, B to C, C to D and D to A. The neuron/bead analogy works for an arbitrary number of inputs and outputs, thus also encompassing bias weights, which can be thought of as weights of a constant value input to all neurons in the layer.
Following the neuron/bead movement analogy, it’s fairly obvious that for a layer of $U$ neurons, and a particular choice of values on the connections, there are up to $U!$ permutations of the order of the summation producing the same mapping, regardless of the number of inputs and outputs of the layer. There might be fewer than $U!$ distinct permutations for certain choices of the values of the connections if the weights on neurons match in such a way that two (or more) neuron permutations produce an identical state. For instance, if all the input weights have exactly the same value, and all the output weights have exactly the same value, then all the neuron permutations produce exactly the same state.
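The permutation argument can be checked numerically on a toy network. The sketch below assumes a single-input network with three hidden neurons, a tanh activation and arbitrary weight values; all names and numbers are illustrative and not taken from the paper.

```python
import itertools
import math


def forward(x, hidden, w_out, b_out, act=math.tanh):
    """hidden holds one (input_weight, bias) pair per hidden neuron."""
    return sum(w * act(wi * x + bi) for (wi, bi), w in zip(hidden, w_out)) + b_out


hidden = [(0.5, -1.0), (-2.0, 0.3), (1.5, 0.0)]  # three distinct neurons
w_out = [1.0, -0.7, 2.0]
xs = [-1.0, 0.0, 0.5, 2.0]
reference = [forward(x, hidden, w_out, 0.1) for x in xs]

states = set()
for perm in itertools.permutations(range(3)):
    # Move each neuron together with its input and output connections.
    h = [hidden[i] for i in perm]
    w = [w_out[i] for i in perm]
    outputs = [forward(x, h, w, 0.1) for x in xs]
    # The mapping is unchanged up to floating-point summation order.
    assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(outputs, reference))
    states.add((tuple(h), tuple(w)))

num_states = len(states)
print(num_states)  # 6 = 3! distinct states, all giving the same mapping
```

If two of the neurons were given identical weights, some of the six reorderings would coincide and `num_states` would drop below $3!$, which is the degenerate case described above.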
When accounting for the mapping capability of the combination of multiple layers, we need to account for all possible combinations of computation-preserving permutations of neurons of each layer. Figure 2 shows all these combinations for a network with two hidden layers of two neurons each. Since each layer has two neurons, individually each gives rise to $2! = 2$ equivalent permutations. Figure 2 represents the first permutation of the neuron order in each layer, Figure 2 the second permutation of the first layer (from the left) along with the first permutation of the second layer, Figure 2 the first permutation of the first layer and second permutation of the second layer, and finally Figure 2 depicts the second permutation of the neuron order in both layers. In general, depending on the choice of values of the parameters on the connections, there are up to $\prod_{l=1}^{L} U_l!$ permutations of the neurons that preserve the function mapping of the network.
If we take a finite set of $V$ values, then there are $V^N$ possible states for a network of an architecture with a total of $N$ parameters. If every state out of $V^N$ was part of an equivalence class of at least $\prod_{l=1}^{L} U_l!$ states producing the same function mapping, it would be trivially obvious that this network can give rise to no more than $\frac{V^N}{\prod_{l=1}^{L} U_l!}$ unique function mappings. The situation is not that simple, since there are states (with the same values on different parameters) that do not have distinguishable permutations. However, relying on fairly fundamental results from Group Theory (Rotman, 1995), we can establish that indeed the upper bound on unique function mappings is $\frac{V^N}{\prod_{l=1}^{L} U_l!}$.
Theorem 1 (Unique Solutions)
The upper bound on the size of the hypothesis space $\mathcal{H}$ of a fully connected neural network with arbitrary activity function is $\frac{V^N}{\prod_{l=1}^{L} U_l!}$, where $L$ is the number of hidden layers, $U_l$ is the number of neurons in layer $l$, and $N$ is the total number of parameters, each taking values from a set $\mathcal{V}$ of finite cardinality $V$.
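To see how the bound in Theorem 1 behaves for architectures of equal size, the following sketch evaluates $V^N / \prod_l U_l!$ exactly. The helper names, the choice of $V = 3$ and the two architectures are assumptions made for illustration only.

```python
from fractions import Fraction
from math import factorial, prod


def count_parameters(num_inputs, hidden_layers):
    """Weights + biases of a fully connected, single-output network."""
    total, fan_in = 0, num_inputs
    for width in hidden_layers:
        total += width * (fan_in + 1)
        fan_in = width
    return total + fan_in + 1


def mapping_bound(num_inputs, hidden_layers, num_values):
    """V^N divided by the product of U_l! over the hidden layers."""
    n = count_parameters(num_inputs, hidden_layers)
    return Fraction(num_values ** n, prod(factorial(u) for u in hidden_layers))


shallow = mapping_bound(1, [4], 3)     # 3^13 / 4!
deep = mapping_bound(1, [2, 2], 3)     # 3^13 / (2! * 2!)
print(float(shallow), float(deep), deep / shallow)  # 66430.125 398580.75 6
```

Both networks have 13 parameters, so the numerator is the same and the ratio of the two bounds is simply $4! / (2!\,2!) = 6$ in favour of the deeper network.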
3.1 Symbolic evaluation
In order to get a sense of the tightness of the bound on $|\mathcal{H}|$ given in Section 3, we can run a symbolic evaluation over all possible states of a network with parameters chosen from a set of $V$ symbols. We can evaluate and compare the symbolic output from neural networks of different architectures for all $V^N$ states and determine how many of these symbolic expressions are unique. Though only feasible for small $N$ and $V$, it still gives an idea of the tightness of the bound on $|\mathcal{H}|$ for larger $N$ and $V$.
Figure 3 shows the exact number (solid line) and the bound (dashed line) of unique symbolic solutions for the function mapping over an unspecified activation function $\sigma$, plotted against the number of parameters $N$ in a single-hidden-layer and a two-hidden-layer neural network. Note that the bound gets tighter as $V$ increases.
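A brute-force version of this count is easy to sketch for a single-input network with one hidden layer. The code below is not the authors' symbolic evaluation: it assumes that two states give the same symbolic expression exactly when they differ only in the order of the hidden neurons, and the function name and symbol sets are hypothetical.

```python
from itertools import product


def count_unique_single_layer(num_neurons, symbols):
    """Count states of a single-input, single-hidden-layer network
    up to reordering of the hidden neurons."""
    seen = set()
    # A hidden neuron is described by (input weight, bias, output weight).
    neuron_choices = list(product(symbols, repeat=3))
    for neurons in product(neuron_choices, repeat=num_neurons):
        canonical = tuple(sorted(neurons))  # neuron order is irrelevant
        for out_bias in symbols:
            seen.add((canonical, out_bias))
    return len(seen)


for symbols in (("a", "b"), ("a", "b", "c")):
    exact = count_unique_single_layer(2, symbols)
    leading = len(symbols) ** 7 / 2  # V^N / U! with N = 7, U = 2
    print(len(symbols), exact, leading)
# 2 72 64.0
# 3 1134 1093.5
```

For these small symbol sets the exact count sits slightly above $V^N / U!$, because states with two identical neurons have fewer than $U!$ distinct reorderings; the relative gap shrinks from 12.5% at $V = 2$ to under 4% at $V = 3$.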
3.2 Numerical evaluation
To get an idea of the number of possible mappings in a practical scenario, we need a numerical evaluation over a specific range of inputs and a choice of activation function. We can evaluate all possible function mappings of hypothesis space $\mathcal{H}$ by considering the model’s output over a range of inputs for each $h \in \mathcal{H}$. For single-input single-output networks we evaluate over points from regularly sampled range . For each hypothesis we compute the output vector . Some of the hypotheses that give different functions symbolically might give identical mappings over the chosen range of input in the numerical evaluation. Hence, we select the set of unique vectors (to within Euclidean distance) to form a set of mappings , which corresponds to $\mathcal{H}$ for the choice of activation over the selected range of input. Table 1 shows the symbolically evaluated number of unique hypotheses against the number of unique vectors after numerical evaluation for different choices of activation function. The evaluated hypothesis spaces are and , each with a total number of parameters. Numerical evaluation was done for where , and where .
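The numerical procedure can be sketched as follows for single-input networks. Everything specific here is an assumption: the input grid, the parameter values $\{-1, 1\}$, the tanh activation, the two architectures, and the use of rounding to a fixed number of decimals in place of a Euclidean distance threshold.

```python
import math
from itertools import product


def num_params(hidden_layers, num_inputs=1):
    fan_in, total = num_inputs, 0
    for width in hidden_layers:
        total += width * (fan_in + 1)
        fan_in = width
    return total + fan_in + 1


def network_output(x, params, hidden_layers, act):
    """Evaluate a single-input, single-output net from a flat parameter tuple."""
    it = iter(params)
    layer_in = [x]
    for width in hidden_layers:
        layer_out = []
        for _ in range(width):
            z = sum(next(it) * v for v in layer_in)  # weighted inputs
            z += next(it)                            # bias
            layer_out.append(act(z))
        layer_in = layer_out
    out = sum(next(it) * v for v in layer_in)        # output weights
    return out + next(it)                            # output bias


def count_unique_mappings(hidden_layers, values, act, xs, decimals=6):
    """Enumerate every state and count distinct (rounded) output vectors."""
    seen = set()
    for params in product(values, repeat=num_params(hidden_layers)):
        vec = tuple(round(network_output(x, params, hidden_layers, act), decimals)
                    for x in xs)
        seen.add(vec)
    return len(seen)


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical input grid
values = (-1.0, 1.0)               # hypothetical parameter set
shallow = count_unique_mappings([4], values, math.tanh, xs)
deep = count_unique_mappings([2, 2], values, math.tanh, xs)
print(shallow, deep)
```

Both architectures have 13 parameters, so each call enumerates $2^{13}$ states. Swapping `math.tanh` for another activation or changing `values` is enough to explore how those choices affect the counts.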
It is hardly surprising that the choices for input range, allowed parameter values and activation function have a significant impact on the size of the corresponding hypothesis space. The possibility of inputs and parameters of same value with opposite sign introduces additional symmetries in the internal computations of the network, thus reducing the number of unique function mappings. ReLU introduces many extra symmetries, because it produces the same output for all negative activity. So does tanh, because of its symmetry about 0. Sigmoid gives rise to the richest hypothesis space.
Note that, although for a given choice of $\sigma$ the number of unique functions is far below the upper bound given in Section 3, the deeper hypothesis space (fewer neurons per layer) is always richer than the shallower one (more neurons per layer).
We have shown that the upper bound on the size of the hypothesis space given by a neural network is dictated by the number of neurons per layer. For the same number of parameters, a deeper architecture (fewer neurons per layer over more layers) gives a hypothesis space capable of producing more function mappings than a shallower one (more neurons per layer over fewer layers).
- Anthony and Bartlett (2009) Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, New York, NY, USA, 2009.
- Bartlett et al. (2017) Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR, abs/1703.02930, 2017. URL http://arxiv.org/abs/1703.02930.
- Cybenko (1989) G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, December 1989.
- Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.
- Montufar et al. (2014) Guido Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. 2014.
- Rotman (1995) Joseph J. Rotman. An Introduction to the Theory of Groups. Springer New York, 4th edition, 1995.
- Szymanski and McCane (2014) Lech Szymanski and Brendan McCane. Deep networks are effective encoders of periodicity. IEEE Trans. Neural Netw. Learning Syst., 25(10):1816–1827, 2014.
- Telgarsky (2015) Matus Telgarsky. Representation benefits of deep feedforward networks. CoRR, abs/1509.08101, 2015.
- Vapnik (1998) Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
- Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
Appendix A Proof of Theorem 1
The proof is a straightforward application of Burnside’s Lemma to count the number of equivalence classes of the states producing the same function mapping in a neural network of a particular architecture. All the definitions and lemmas used here are proven in Rotman (1995).
Theorem 1 (Unique Solutions)
Proof. Let’s denote by $X$ the set of all possible states of a neural network of $N$ parameters. For our context, $|X| = V^N$. To bound the size of $\mathcal{H}$, we can partition $X$ into equivalence classes of identical hypotheses and count the number of such classes. For the sake of completeness, we include some definitions.
Definition 3 (Group; Rotman (1995), pg. 12)
A group is a nonempty set $G$ equipped with an associative operation $*$ containing an element $e$ such that:
(i) $e * a = a = a * e$ for all $a \in G$;
(ii) for every $a \in G$, there is an element $b \in G$ with $a * b = e = b * a$.
By Definition 3 the set of bijections (or permutations) of the parameters that do not affect the overall function mapping of the network is a group $G$. The operation is the composition of permutations. Indeed, we can apply a permutation to a permutation and obtain another permutation. The identity permutation is a permutation that maps every element onto itself. Following the explanations from Section 3 we can see that the group $G$ consists of parameter permutations and is isomorphic to the product of the groups of permutations of the order of neurons in each hidden layer $l$.
Definition 4 (G-set; Rotman (1995), pg. 55)
If $X$ is a set and $G$ is a group, then $X$ is a G-set if there is a function $\alpha: G \times X \to X$ (called an action), denoted by $\alpha: (g, x) \mapsto gx$, such that:
(i) $ex = x$ for all $x \in X$; and
(ii) $g(hx) = (gh)x$ for all $g, h \in G$ and $x \in X$.
$X$ is a G-set, because permutations from $G$ re-order the values of parameters of the network, creating another state in $X$. The action $gx$ is the re-ordering of the parameter values of state $x$ dictated by the permutation $g$. Condition (i) is satisfied by the identity permutation, which will map a network state $x$ to itself. Condition (ii) is satisfied by the fact that application of several permutations is associative.
Definition 5 (G-orbit; Rotman (1995), pg. 56)
If $X$ is a G-set and $x \in X$, then the G-orbit of $x$ is:
$$\mathcal{O}(x) = \{gx : g \in G\} \subseteq X.$$
The G-orbits we are interested in are the subsets of $X$ created by application of all neuron swapping permutations $g \in G$ to all states $x \in X$. These subsets partition $X$, each containing the states that produce the same hypothesis. We need to determine how many G-orbits there are in $X$.
Lemma 6 (Burnside’s Lemma; Rotman (1995), pg. 58)
If $X$ is a finite $G$-set and $|\mathcal{H}|$ is the number of $G$-orbits of $X$, then
$$|\mathcal{H}| = \frac{1}{|G|} \sum_{g \in G} F(g),$$
where, for $g \in G$, $F(g)$ is the number of $x \in X$ fixed by $g$.
We have established that when $\mathcal{V}$ is finite, the set of network states $X$ is a finite set, and it is a G-set acted on by permutations $g \in G$ of network parameters resulting from changing the order of summation of neuron outputs in network layers, where $|G| = \prod_{l=1}^{L} U_l!$. $|\mathcal{H}|$ is the number of G-orbits in $X$ created by actions of permutations from $G$, and thus it’s the number of unique function mappings that a neural network can produce. The last thing we need to evaluate in order to get $|\mathcal{H}|$ is $F(g)$.
In our context $F(g)$ is the number of states left unchanged by a permutation $g$ of $N$ elements when all possible choices of values from $\mathcal{V}$ for the elements are considered. The answer is given by the following lemma found in Rotman (1995) (we changed the notation and analogy from colours to parameter values):
Lemma 7 (Rotman (1995), pg. 60)
Let $\mathcal{V}$ be a set with $|\mathcal{V}| = V$, and let $G$ be a subset of all possible permutations of $N$ elements. If $g \in G$, then $F(g) = V^{t(g)}$, where $t(g)$ is the number of cycles occurring in the complete factorisation of $g$.
Every permutation can be expressed as a product of disjoint cycles. For example, a permutation written as denotes the following reordering of seven elements in cycles:
element swaps with element ;
element goes into place of element , which in turn goes into place of element , which goes into place of element ;
element is fixed, its position remains unchanged;
element is fixed.
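The relationship between cycles and fixed states in Lemma 7 can be verified by brute force. The permutation below is a hypothetical stand-in with the same cycle structure as the example (one swap, one 3-cycle and two fixed elements on seven positions); its labels are not those of the original example, and the value set is also an assumption.

```python
from itertools import product


def cycle_count(perm):
    """Number of disjoint cycles (fixed points included).

    perm is a list where perm[i] is the position element i moves to.
    """
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start in seen:
            continue
        cycles += 1
        j = start
        while j not in seen:
            seen.add(j)
            j = perm[j]
    return cycles


# Hypothetical permutation of 7 positions (0-indexed):
# 0 <-> 1, 2 -> 3 -> 4 -> 2, positions 5 and 6 fixed.
perm = [1, 0, 3, 4, 2, 5, 6]
t = cycle_count(perm)

values = (0, 1, 2)  # hypothetical parameter set with V = 3
# A state is fixed by perm when every element equals the one it replaces.
fixed = sum(
    all(x[perm[i]] == x[i] for i in range(len(perm)))
    for x in product(values, repeat=len(perm))
)
print(t, fixed, len(values) ** t)  # 4 81 81
```

A state is fixed exactly when all positions within a cycle share one value, so each cycle contributes one free choice out of $V$, giving $V^{t}$ fixed states.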
Since by Lemma 7 $F(g) = V^{t(g)}$, where $t(g)$ is the number of cycles, the sum in Lemma 6 will be dominated by the permutation with the largest number of cycles. For a permutation of $N$ elements, the largest possible number of cycles is $N$, and it’s given by the identity permutation, $e$. Hence, as $V$ increases, we have
$$|\mathcal{H}| = \frac{1}{|G|} \sum_{g \in G} V^{t(g)} \approx \frac{V^N}{|G|} = \frac{V^N}{\prod_{l=1}^{L} U_l!}.$$
Given that the set $X$ has $|\mathcal{H}|$ G-orbits with respect to all combinations of neuron-swapping permutations in all individual layers of the neural network, we have an upper bound on the number of functions a neural network of a particular architecture can generate. Thus the bound is $\frac{V^N}{\prod_{l=1}^{L} U_l!}$.
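Burnside's sum can also be evaluated directly for small architectures. The sketch below builds, for every combination of per-layer neuron permutations, the induced permutation of the parameters, counts its cycles and averages $V^{t(g)}$ over the group. The parameter labelling and function names are assumptions introduced here, not the authors' implementation.

```python
from itertools import permutations, product


def count_cycles(mapping):
    """Number of cycles of a permutation given as a dict key -> image."""
    seen, cycles = set(), 0
    for start in mapping:
        if start in seen:
            continue
        cycles += 1
        key = start
        while key not in seen:
            seen.add(key)
            key = mapping[key]
    return cycles


def orbit_count(num_inputs, hidden_layers, num_values):
    """(1/|G|) * sum over g of V^t(g) for neuron permutations in each hidden layer."""
    widths = [num_inputs] + list(hidden_layers)
    layer_perms = [list(permutations(range(u))) for u in hidden_layers]
    total, group_size = 0, 0
    for combo in product(*layer_perms):
        perms = [tuple(range(num_inputs))] + list(combo)  # inputs stay in place
        mapping = {}
        for l in range(1, len(widths)):
            for j in range(widths[l]):
                mapping[("b", l, j)] = ("b", l, perms[l][j])
                for i in range(widths[l - 1]):
                    mapping[("w", l, j, i)] = ("w", l, perms[l][j], perms[l - 1][i])
        for j in range(widths[-1]):
            mapping[("out", j)] = ("out", perms[-1][j])
        mapping[("out_bias",)] = ("out_bias",)
        total += num_values ** count_cycles(mapping)
        group_size += 1
    return total // group_size  # Burnside's lemma guarantees an integer


print(orbit_count(1, [4], 2), 2 ** 13 / 24)    # 660 341.33...
print(orbit_count(1, [2, 2], 2), 2 ** 13 / 4)  # 2336 2048.0
```

With only $V = 2$ values the non-identity permutations still contribute noticeably to the sum, so the exact orbit counts exceed the leading term $V^N / |G|$; the ratio approaches 1 as $V$ grows. In both the exact count and the leading term, the two-hidden-layer network comes out ahead of the single-hidden-layer network with the same 13 parameters.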
The tightness of the bound depends on the choice of activation function $\sigma$ and the set of parameter values $\mathcal{V}$. During numerical evaluation, as shown in Section 3.2, extra symmetries might arise inside the neural network, which can result in different G-orbits in $X$ producing the same function mapping.