# Neural Networks as Model Selection with Incremental MDL Normalization

###### Abstract

If we consider the neural network optimization process as a model selection problem, the implicit space can be constrained by the normalizing factor, the minimum description length of the optimal universal code. Inspired by the adaptation phenomenon of biological neuronal firing, we propose a class of reparameterizations of the activations in a neural network that takes into account the statistical regularity in the implicit space under the Minimum Description Length (MDL) principle. We introduce an incremental method for computing this universal code as normalized maximum likelihood and demonstrate its flexibility to include data priors such as top-down attention and other oracle information, as well as its compatibility with batch normalization and layer normalization. The empirical results show that the proposed method outperforms existing normalization methods in tackling limited and imbalanced data from a non-stationary distribution, benchmarked on computer vision and reinforcement learning tasks. As an unsupervised attention mechanism given input data, this biologically plausible normalization has the potential to deal with other complicated real-world scenarios, as well as reinforcement learning settings where the rewards are sparse and non-uniform. Further research is proposed to discover these scenarios and explore the behaviors of the different variants.

###### Keywords:

Neuronal Adaptation, Minimum Description Length, Model Selection, Universal Code, Normalization Method in Neural Networks

## 1 Introduction

The Minimum Description Length (MDL) principle asserts that the best model given some data is the one that minimizes the combined cost of describing the model and describing the misfit between the model and the data [17], with the goal of maximizing regularity extraction for optimal data compression, prediction and communication [6]. Most unsupervised learning algorithms can be understood using the MDL principle [18], treating the neural network as a system communicating the input to a receiver.

If we consider neural network training as the optimization process of a communication system, each input at each layer of the system can be described as a point in a low-dimensional continuous constraint space [21]. If we consider neural networks as population codes, the constraint space can be subdivided into the input-vector space, the hidden-vector space, and the implicit space, which represents the underlying dimensions of variability in the other two spaces, i.e., a reduced representation of the constraint space. For instance, given an image of an object, a rotated or scaled version of the same image still refers to the same object, so each image instance of the same object can be represented by a position in a 2D implicit space, with one dimension as the orientation and the other as the size of the shape [21]. The relevant information about the implicit space can be constrained to ensure a minimized description length of the neural network.

This type of constraint can also be found in biological systems. In the brains of primates, high-level brain areas are known to send top-down feedback connections to lower-level areas to encourage the selection of the most relevant information in the current input given the current task [4], a process similar to the communication system above. This type of modulation is performed by collecting statistical regularity in a hierarchical encoding process between these brain areas. One feature of neural coding during hierarchical processing is adaptation: in vision neuroscience, neurons adapted to a vertical orientation reduce their firing rates to that orientation after the adaptation [2], while the cell responses to other orientations may increase [5]. These behaviors contradict the Bayesian assumption that the more probable the input, the larger the firing rate should be; instead, they match well the information-theoretic point of view that the most relevant information (saliency), which depends on the statistical regularity, carries higher “information”, just like the firing of the neurons. As [16] hypothesized that the firing rate represents the code length instead of the probability, the more regular the input features are, the lower the activation they should yield. We introduce the minimum description length (MDL), such that the activation of neurons is analogous to the code length of the model (a specific neuron or neuronal population): a shorter code length is assigned to a more regular input (such as after adaptation), and a longer code length to a rarer input or event.

In this paper, we adopt a definition of the implicit space similar to that in [21], but extend it beyond unsupervised learning into a generic neural network optimization problem in both supervised and unsupervised settings. In addition, we consider the formulation and computation of the description length differently, given the neuroscience inspiration described above. Instead of considering neural networks as population codes, we formulate each layer of a neural network during training as a state of model selection. In our setup, the description length is computed not at the scale of the entire neural network, but per layer of the network. Moreover, the optimization objective is not to minimize the description length, but instead to take the minimum description length into account as part of the normalization procedure that reparameterizes the activation of each neuron in each layer. Whereas [21] computes the description length (or model cost) in order to minimize it, we directly compute the minimum description length in each layer not to minimize anything, but to reassign the weights based on statistical regularities. Finally, we compute the description length with an optimal universal code obtained from the batch input distribution in an online incremental fashion.

We begin our presentation in section 2 with a short overview of related work on normalization methods and MDL in neural networks. Section 3 formulates the problem setting in neural networks, where we consider training as a layer-specific model selection process under the MDL principle. In section 4, we introduce the proposed class of incremental MDL normalization methods, its standard formulation (regularity normalization), its implementation, and the online incremental tricks for batch computation. We also present several variants of regularity normalization (RN) that incorporate batch and layer normalization, termed regularity batch normalization (RBN) and regularity layer normalization (RLN), as well as a variant that includes the data prior as a top-down attention mechanism during the training process, termed saliency normalization (SN). In section 5, we present empirical results on the imbalanced MNIST dataset and a reinforcement learning problem to demonstrate that our approach is advantageous over existing normalization methods in different imbalanced scenarios. In the last sections, we discuss our method and point out several future work directions as the next step of this research.

## 2 Related work

### 2.1 Normalization in neural networks

Batch normalization (BN) performs global normalization along the batch dimension such that, for each neuron in a layer, the activation over all the mini-batch training cases follows a standard normal distribution, reducing the internal covariate shift [8]. Similarly, layer normalization (LN) performs global normalization over all the neurons in a layer, and has shown an effective stabilizing effect on the hidden state dynamics in recurrent networks [1]. Weight normalization (WN) applies normalization over the incoming weights, offering computational advantages for reinforcement learning and generative modeling [20]. Like BN and LN, we apply the normalization to the activation of the neurons, but as an element-wise reparameterization (over both the layer and batch dimensions). In section 4.3, we also propose variants of our approach with batch-wise and layer-wise reparameterization: regularity batch normalization (RBN) and regularity layer normalization (RLN).

### 2.2 Description length in neural networks

[7] first introduced the description length to quantify neural network simplicity and developed an optimization method to minimize the amount of information required to communicate the weights of the neural network. [21] considered neural networks as population codes and used MDL to develop highly redundant population codes. They showed that, by assuming the hidden units reside in low-dimensional implicit spaces, an optimization process can be applied to minimize the model cost under the MDL principle. Our proposed method adopts a similar definition of the implicit space, but considers the implicit space as a data-dependent encoding of statistical regularities. Unlike [21] and [7], we consider the description length as an indicator of the data input and assume that the implicit space is constrained when we normalize the activation of each neuron given its statistical regularity. Unlike the implicit approach to computing the model cost, we directly compute the minimum description length with the optimal universal code obtained in an incremental style.

## 3 Problem Setting

### 3.1 Minimum Description Length

Consider a model class $\Theta$ consisting of a finite number of models parameterized by the parameter set $\theta$. Given a data sample $x$, each model in the model class describes a probability $P(x|\theta)$, with the code length computed as $-\log P(x|\theta)$. The minimum code length given any arbitrary $\theta$ would be given by $L(x|\hat{\theta}(x)) = -\log P(x|\hat{\theta}(x))$, with the model $\hat{\theta}(x)$ which compresses data sample $x$ most efficiently and offers the maximum likelihood $P(x|\hat{\theta}(x))$ [6].

However, the compressibility of the model, computed as the minimum code length, can be unattainable for multiple non-i.i.d. data samples as individual inputs, as the probability distribution that most efficiently represents a certain data sample $x$ given a certain model class can vary from sample to sample. The solution relies on the existence of a universal code $P(x)$, defined for a model class $\Theta$, such that for any data sample $x$, the shortest code for $x$ is always $L(x|\hat{\theta}(x))$, as proposed and proven in [19].
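To make the code-length definition concrete, the following minimal sketch evaluates $-\log_2 P(x|\theta)$ for a binary sequence and selects the model with the shortest code. The Bernoulli model class, the candidate parameters in `model_class` and the sample `x` are illustrative assumptions chosen for brevity; they are not part of the method described in this paper.

```python
import math


def code_length(x, theta):
    """Code length -log2 P(x | theta), in bits, of a binary sequence under Bernoulli(theta)."""
    ones = sum(x)
    zeros = len(x) - ones
    return -(ones * math.log2(theta) + zeros * math.log2(1.0 - theta))


def best_model(x, model_class):
    """Return the model in a finite class that assigns x the shortest code (maximum likelihood)."""
    return min(model_class, key=lambda theta: code_length(x, theta))


model_class = [0.1, 0.25, 0.5, 0.75, 0.9]  # hypothetical finite model class
x = [1, 1, 0, 1, 1, 1, 0, 1]               # hypothetical data sample (six ones, two zeros)
theta_hat = best_model(x, model_class)
shortest = code_length(x, theta_hat)
```

Here the selected model is the one whose parameter equals the empirical frequency of ones, and a different sample would in general select a different model, which is the reason a single universal code is needed.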

### 3.2 Normalized Maximum Likelihood

To select a proper optimal universal code, a cautious approach would be to assume a worst-case scenario in order to make “safe” inferences about the unknown world. Formally, the worst-case expected regret is given by $R(p\|\Theta) = \max_q E_q\left[\ln \frac{P(x|\hat{\theta}(x))}{p(x)}\right]$, where the “worst” distribution $q(\cdot)$ is allowed to be any probability distribution. Without referencing the unknown truth, [19] formulated finding the optimal universal distribution as a minimax problem of computing $p^* = \arg\min_p \max_q E_q\left[\ln \frac{P(x|\hat{\theta}(x))}{p(x)}\right]$, the coding scheme that minimizes the worst-case expected regret. Among the optimal universal codes, the normalized maximum likelihood (NML) probability minimizes the worst-case regret and avoids assigning an arbitrary distribution to $\Theta$. The minimax optimal solution is given by [15]:

$$P_{NML}(x) = \frac{P(x|\hat{\theta}(x))}{\sum_{x'} P(x'|\hat{\theta}(x'))} \tag{1}$$

where the summation is over the entire data sample space. Figure 1 describes the optimization problem of finding the optimal model $P(x_i|\hat{\theta}_i)$ given data sample $x_i$ among model class $\Theta$. The models in the class, $P(x|\theta)$, are parameterized by the parameter set $\theta$. $x_i$ are data samples from data $X$. With this distribution, the regret is the same for all data samples $x$, given by [6]:

$$COMP(\Theta) \equiv regret_{NML} \equiv -\log P_{NML}(x) + \log P(x|\hat{\theta}(x)) = \log \sum_{x'} P(x'|\hat{\theta}(x')) \tag{2}$$

which defines the model class complexity, as it indicates how many different data samples can be well explained by the model class $\Theta$.
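As a small worked example of the NML distribution and the model class complexity, the sketch below enumerates every binary sequence of a fixed length, normalizes the maximum likelihoods, and checks that the regret is identical for all samples. The Bernoulli model class and the function names are assumptions made for illustration only; exhaustive enumeration is feasible here solely because the sample space is tiny.

```python
import itertools
import math


def max_likelihood(x):
    """P(x | theta_hat(x)) for a binary tuple x under the Bernoulli model class."""
    n, k = len(x), sum(x)
    if k == 0 or k == n:
        return 1.0  # theta_hat is 0 or 1, so x has probability 1
    p = k / n  # maximum-likelihood estimate of the Bernoulli parameter
    return p ** k * (1.0 - p) ** (n - k)


def nml_distribution(n):
    """Return (P_NML, COMP) over all binary sequences of length n."""
    space = list(itertools.product((0, 1), repeat=n))
    ml = {x: max_likelihood(x) for x in space}
    total = sum(ml.values())  # sum of maximum likelihoods over the sample space
    p_nml = {x: v / total for x, v in ml.items()}
    return p_nml, math.log(total)


p_nml, comp = nml_distribution(4)
# Regret of coding x with NML instead of its own maximum-likelihood model.
regrets = [-math.log(p_nml[x]) + math.log(max_likelihood(x)) for x in p_nml]
```

Every entry of `regrets` equals `comp` up to floating-point error, which is the constant-regret property stated above.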

### 3.3 Neural networks as model selection

In the neural network setting, where the optimization process is performed in batches (as incremental data samples $x_j$, with $j$ denoting the batch $j$), the model selection process is formulated as a partially observable problem (as in Figure 2). To illustrate our approach, we consider a feedforward neural network as an example, without loss of generality to other architectures (such as convolutional layers or recurrent modules). $x_i^j$ refers to the activation at layer $i$ at time point $j$ (batch $j$). $\theta_i^j$ is the set of parameters that describes $x_i^j$ (i.e., the weights for layer $i-1$), optimized after $j-1$ steps (having seen batches $0$ through $j-1$). Because one cannot exhaust the search among all possible $\theta$, we assume that the optimized parameter $\hat{\theta}_i^j$ at time step $j$ (having seen batches $0$ through $j-1$) is the optimal model $P(x_i^j|\hat{\theta}_i^j)$ for data sample $x_i^j$. Therefore, we generalize the optimal universal code with the NML formulation as:

$$P_{NML}(x_i) = \frac{P(x_i|\hat{\theta}_i(x_i))}{\sum_{j=0}^{i} P(x_j|\hat{\theta}_j(x_j))} \tag{3}$$

where, within a given layer, the subscript now indexes the batch, and $\hat{\theta}_i(x_i)$ refers to the model parameter already optimized for $i-1$ steps, having seen the sequential data samples $x_0$ through $x_{i-1}$. This distribution is updated every time a new data sample is given, and can therefore be computed incrementally, as in batch-based training.
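The incremental update can be sketched in a few lines: the denominator is a running total that grows by one term per batch, so no past sample needs to be revisited. This is only an illustration under the assumption that each batch contributes a single scalar likelihood; the function name and the example values are hypothetical.

```python
def incremental_nml(likelihoods):
    """Yield P_i / (P_0 + ... + P_i) for a stream of per-batch likelihoods P_i."""
    running_total = 0.0
    for p in likelihoods:
        running_total += p  # the denominator grows by one term per batch
        yield p / running_total


# Hypothetical per-batch maximum likelihoods, in arrival order.
values = list(incremental_nml([0.5, 0.25, 0.25]))
```

The first sample always receives probability 1 because nothing else has been seen yet; later samples are discounted by everything that preceded them.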

## 4 Incremental MDL Normalization

### 4.1 Standard Formulation

We first introduce the standard formulation of the class of incremental MDL normalization: regularity normalization (RN). Regularity normalization is outlined in Algorithm 1, where the input is the activation of each neuron in a certain layer and batch. The parameters $COMP$ and $\theta$ are updated after each batch, through the incrementation in the normalization and the optimization in the training, respectively. As the numerator of $P_{NML}$ at this step of normalization, the term $P(x_i|\hat{\theta}_i(x_i))$ is computed and stored as the log probability of observing sample $x_i$ in $N(\mu_{i-1}, \sigma_{i-1})$, the normal distribution with the mean and standard deviation of all past data sample history ($x_0, x_1, \dots, x_{i-1}$), with a Gaussian prior for $P(\theta)$. The Gaussian prior is selected based on the assumption that each $x$ is randomly sampled from a Gaussian distribution and that the parameter sets from model class $\Theta$ are Gaussian; further research can explore other possible priors and inference methods for arbitrary priors $P(\theta)$.

As defined in equation 2, $COMP$ is the logarithm of the denominator of $P_{NML}$, so the “increment” function takes in $COMP_{i-1}$, which stores the denominator accumulated over batches $0$ through $i-1$ in log form, and the latest batch term $P(x_i|\hat{\theta}_i(x_i))$ to be added to the denominator, with the result stored as $COMP_i$. The incrementation step involves computing the log sum of two values, which can be easily numerically stabilized with the log-sum-exp trick. (In continuous data streams or time series analysis, the incrementation step can be replaced by integrating over the seen territory of the probability distribution of the data.) The normalization factor is then computed as the shortest code length $L$ given the NML distribution, the universal code distribution in equation 1.
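The steps described above can be sketched as follows. This is a minimal elementwise illustration and not a reproduction of Algorithm 1: the class name, the Welford-style running statistics, the unit-variance fallback before any history exists, and in particular the output rule (scaling each activation by its code length) are assumptions made for the sketch.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log density of the normal distribution N(mu, sigma) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def log_add(a, b):
    """Stable log(exp(a) + exp(b)): the log-sum-exp trick for two values."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


class RegularityNorm:
    """Elementwise sketch: one shared implicit space for all activations."""

    def __init__(self, eps=1e-6):
        self.log_comp = -math.inf  # log of the running NML denominator
        self.count = 0             # number of activations seen so far
        self.mean = 0.0            # running mean of the history
        self.m2 = 0.0              # running sum of squared deviations
        self.eps = eps

    def _sigma(self):
        # Assumed fallback: unit standard deviation until there is enough history.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count) + self.eps

    def code_lengths(self, batch):
        """Return COMP - log P(x) for each activation and update the state."""
        mu, sigma = self.mean, self._sigma()
        log_p = [log_normal_pdf(x, mu, sigma) for x in batch]
        for lp in log_p:  # the "increment" step on the denominator
            self.log_comp = log_add(self.log_comp, lp)
        for x in batch:   # Welford update of the history statistics
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
        return [self.log_comp - lp for lp in log_p]

    def __call__(self, batch):
        """Assumed output rule: scale each activation by its code length."""
        lengths = self.code_lengths(batch)
        return [length * x for length, x in zip(lengths, batch)]


rn = RegularityNorm()
rn([0.0, 0.1, -0.1, 0.05])             # build up a history of "regular" inputs
lengths = rn.code_lengths([0.0, 3.0])  # the rare value 3.0 receives the longer code
```

Because the denominator includes every term seen so far, each code length is non-negative, and a value far from the history statistics receives a much longer code than a typical one.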

### 4.2 Variant: Saliency Normalization

The NML distribution can be modified to also include a data prior function, $s(x)$, given by [22]:

$$P_{NML}(x) = \frac{s(x)\, P(x|\hat{\theta}(x))}{\sum_{x'} s(x')\, P(x'|\hat{\theta}(x'))} \tag{4}$$

where the data prior function $s(x)$ can be anything, ranging from the emphasis of certain inputs, to the cost of certain data, or even top-down attention. For instance, we can introduce prior knowledge of the fraction of labels (say, in an imbalanced data problem where the oracle informs the model of the distribution of each label in the training phase); or we may wish the model to focus specifically on a certain feature of the input, say a certain texture or color (just like a convolution filter); or the definition of the regularity may drift (such as user preferences over the years). In all these possible applications, the normalization procedure can be more strategic given this additional information. Therefore, we formulate this additional functionality into our regularity normalization as saliency normalization (SN), where $P_{NML}$ is computed with the addition of a pre-specified data prior function $s(x)$.
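The effect of a data prior can be illustrated on a tiny discrete sample space. In this sketch the prior reweights each maximum likelihood before normalization; the function name, the three-element sample space and the prior values are hypothetical and only serve to show how probability mass shifts.

```python
def informative_nml(max_likelihoods, data_prior):
    """Normalize prior-weighted maximum likelihoods over a finite sample space."""
    weighted = {x: data_prior(x) * p for x, p in max_likelihoods.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}


# Hypothetical maximum likelihoods P(x | theta_hat(x)) for three samples.
ml = {"a": 0.5, "b": 0.25, "c": 0.25}

plain = informative_nml(ml, lambda x: 1.0)                         # uniform prior: ordinary NML
boosted = informative_nml(ml, lambda x: 2.0 if x == "b" else 1.0)  # emphasize sample "b"
```

With a uniform prior the result coincides with ordinary NML; doubling the prior on `"b"` raises its probability and lowers the others, so its code length shrinks relative to theirs.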

### 4.3 Variant: Beyond Elementwise Normalization

In our current setup, the normalization is computed elementwise, considering the implicit space of the model parameters to be one-dimensional (i.e., all activations across the batch and layer are considered to be represented by the same implicit space). Instead, the implicit space can be defined to be more than one-dimensional to increase the expressiveness of the method, and can also be user-defined. For instance, we can perform the normalization over the batch dimension, such that each neuron in the layer has its own implicit space in which to compute the universal code. We term this variant regularity batch normalization (RBN). Similarly, we can perform regularity normalization over the layer dimension, as regularity layer normalization (RLN). These two variants have the potential to inherit the innate advantages of batch normalization and layer normalization.
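One way to picture the three variants is by which activations share an implicit space. The sketch below groups a batch-by-neuron matrix accordingly; the function name and mode strings are hypothetical, and treating RLN as one group per sample is an assumption made by analogy with layer normalization.

```python
def implicit_space_groups(activations, mode):
    """Group activations (rows = samples, columns = neurons) by shared implicit space."""
    if mode == "rn":   # elementwise: one shared space for everything
        return [[v for row in activations for v in row]]
    if mode == "rbn":  # one space per neuron, pooled over the batch dimension
        return [list(column) for column in zip(*activations)]
    if mode == "rln":  # assumed: one space per sample, pooled over the layer dimension
        return [list(row) for row in activations]
    raise ValueError(f"unknown mode: {mode}")


batch = [[1, 2, 3],
         [4, 5, 6]]  # two samples, three neurons
rn_groups = implicit_space_groups(batch, "rn")
rbn_groups = implicit_space_groups(batch, "rbn")
rln_groups = implicit_space_groups(batch, "rln")
```

Each group would then maintain its own history statistics and its own running normalization term.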

## 5 Empirical Evaluations

| | “Balanced” | “Rare minority” | | |
|---|---|---|---|---|
| baseline | | | | |
| BN | | | | |
| LN | | | | |
| WN | | | | |
| RN | | | | |
| RLN | | | | |
| LN+RN | | | | |
| SN | | | | |

Table 1: Test errors of the imbalanced permutation-invariant MNIST 784-1000-1000-10 task

| | “Highly imbalanced” | “Dominant oligarchy” | | | | |
|---|---|---|---|---|---|---|
| baseline | | | | | | |
| BN | | | | | | |
| LN | | | | | | |
| WN | | | | | | |
| RN | | | | | | |
| RLN | | | | | | |
| LN+RN | | | | | | |
| SN | | | | | | |

Table 2: Test errors of the imbalanced permutation-invariant MNIST 784-1000-1000-10 task

### 5.1 Imbalanced MNIST Problem with Feedforward Neural Network

As a proof of concept, we evaluated our approach on the MNIST dataset [11] and computed the total number of classification errors as a performance metric. As we specifically wish to understand the behavior when the data inputs are non-stationary and highly imbalanced, we created an imbalanced MNIST benchmark to test seven methods: batch normalization (BN), layer normalization (LN), weight normalization (WN), and regularity normalization (RN), as well as three variants: saliency normalization (SN) with the class distribution as the data prior, regularity layer normalization (RLN) where the implicit space is defined to be layer-specific, and a combined approach where RN is applied after LN (LN+RN).

Given the nature of regularity normalization, it should adapt to the regularity of the data distribution better than other methods, tackling the imbalanced data issue by up-weighting the activation of the rare sample features and down-weighting those of the dominant sample features.

To simulate changes in the context (input) distribution, in each epoch we randomly choose classes out of the ten, and set their sampling probability to be (only % of those classes are used in the training). In this way, the training data may trick the models into preferring to classify inputs into the dominant classes. For simplicity, we consider the classical 784-1000-1000-10 feedforward neural network with ReLU activation functions for all seven normalization methods, as well as the baseline neural network without normalization. As we are looking into the short-term sensitivity of the normalization methods on neural network training, only one epoch of training is recorded (all models face the same randomized imbalanced distribution). The training, validation and testing sets are shuffled into 55000, 5000, and 10000 cases. In the testing phase, the data distribution is restored to be balanced, and no model has access to the other testing cases or the data distribution. Stochastic gradient descent is used with learning rate and momentum set to be .
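The construction of the imbalanced training set can be sketched as below. The function name, the synthetic label list and the arguments `n_rare` and `keep_prob` are hypothetical; the specific values passed in the example are illustrative placeholders and not the settings used in the experiments.

```python
import random


def imbalanced_indices(labels, n_rare, keep_prob, num_classes=10, seed=0):
    """Pick n_rare classes at random and keep each of their samples with probability keep_prob."""
    rng = random.Random(seed)
    rare = set(rng.sample(range(num_classes), n_rare))
    kept = [
        i for i, y in enumerate(labels)
        if y not in rare or rng.random() < keep_prob  # dominant classes are always kept
    ]
    return kept, rare


# Synthetic, perfectly balanced labels standing in for the MNIST training labels.
labels = [i % 10 for i in range(10000)]
kept, rare = imbalanced_indices(labels, n_rare=3, keep_prob=0.01)
```

Calling the function once per epoch with a fresh seed would redraw the set of rare classes, which is one way to obtain a non-stationary input distribution.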

The imbalanced degree is defined as follows: when , no classes are downweighted, so we term it the “fully balanced” scenario; when to , a few cases are extremely rare, so we term it the “rare minority” scenario; when to , the multi-class distribution is very uneven, so we term it the “highly imbalanced” scenario; when , there are one or two dominant classes that are 100 times more prevalent than the other classes, so we term it the “dominant oligarchy” scenario. In real life, rare minority and highly imbalanced scenarios are very common, such as predicting the clinical outcomes of a patient when the therapeutic prognosis data are mostly tested on one gender versus the others, or in reinforcement learning settings where certain or most types of rewards are very sparse.

Tables 1 and 2 report the test errors (in %) with their standard errors for the eight methods in 10 training conditions over two heavy-tailed scenarios: labels with under-represented and over-represented minorities. In the balanced scenario, the proposed regularity-based methods do not show clear advantages over existing methods, but still manage to perform the classification task without major deficits. In both the “rare minority” and “highly imbalanced” scenarios, regularity-based methods perform the best in all groups, suggesting that the proposed method successfully constrains the model to allocate learning resources to the “special cases” which are rare and out of the normal range, while BN and WN fail to learn them completely (as seen in the confusion matrices, not shown here). In the “dominant oligarchy” scenario, LN performs the best, dwarfing all other normalization methods. However, as in the case of , LN+RN performs considerably well, with performance within error bounds of that of LN, beating the other normalization methods by over 30 %. It is noted that LN also manages to capture the features of the rare classes reasonably well in the other imbalanced scenarios, compared to BN, WN and the baseline. The hybrid methods RLN and LN+RN both display excellent performance in the imbalanced scenarios, suggesting that combining regularity-based normalization with other methods is advantageous.

These results are mainly in the short-term domain as a proof of concept. Further analysis is needed to fully understand these behaviors in the long term (the converging performance over 100 epochs). However, the major test accuracy differences in the highly imbalanced scenario (RN over BN/WN/baseline by around 20%) in the short term show promise for its ability to learn from extreme regularities.

### 5.2 Reinforcement Learning Problem with Deep Q Network

We further evaluated the benefit of the proposed approach in the game setting of the reinforcement learning problem, where the rewards can be sparse. For simplicity, we consider the classical deep Q network [14] and tested it in OpenAI Gym’s LunarLander-v2 environment [3]. In this game, the agent learns to land on the exact coordinates of the landing pad (0,0) during a free-fall motion, starting from zero speed until landing, with around 100 to 140 actions. The rewards are fully dependent on the location of the lander (as the state vector) on the screen in a non-stationary fashion: moving away from the landing pad loses reward; a crash yields -100; resting on the ground yields +100; each leg ground contact yields +10; firing the main engine costs -0.3 points each frame; fuel is infinite. Four discrete actions are available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine. Five agents (DQN, DQN+LN, DQN+RN, DQN+RLN, DQN+RN+LN) are considered and evaluated on the speed with which they master the game, computed as the final scores over 1000 episodes of training.

The Q networks consist of two hidden layers of 64 neurons. With experience replay [13], the learning of the DQN agents was implemented as an Actor-Critic algorithm [10] with the discount factor , the soft update rate , the learning rate , epsilon-greedy exploration from 1.0 to 0.01 with a decay rate of 0.95, a buffer size of 10,000, a batch size of 50, and the Adam optimization algorithm [9]. To adopt the proposed incremental MDL normalization method, we installed the normalization in both the local and target Q networks.
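Two of the training ingredients listed above, the exploration schedule and the soft update of the target network, can be sketched as follows. The sketch assumes that epsilon is multiplied by the decay rate once per episode and floored at its final value, and that network parameters are represented as flat lists; the function names and the value of `tau` in the example are hypothetical placeholders.

```python
def epsilon_schedule(episodes, start=1.0, end=0.01, decay=0.95):
    """Epsilon per episode: multiplicative decay from start, floored at end (assumed per-episode)."""
    eps = start
    schedule = []
    for _ in range(episodes):
        schedule.append(eps)
        eps = max(end, eps * decay)
    return schedule


def soft_update(target_params, local_params, tau):
    """Move each target parameter a fraction tau of the way toward the local parameter."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]


schedule = epsilon_schedule(200)
# tau=0.5 is only a placeholder to make the arithmetic easy to follow.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```

Under this assumed schedule, exploration reaches its floor after roughly 90 episodes, well before the 1000 training episodes end.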

Figure 3 shows the learning curves of the five competing agents over 1000 episodes of learning with their standard errors. Evaluated by the averaged final scores over 1000 episodes, DQN+RN (76.95 ± 4.44) performs the best among all five agents, followed by DQN+RN+LN (65.82 ± 10.91) and DQN+RLN (49.27 ± 40.35). All three proposed agents beat DQN (37.17 ± 8.82) and DQN+LN (-1.54 ± 39.14) by a large margin. These numerical results suggest that the proposed method has the potential to benefit neural network training in the reinforcement learning setting. On the other hand, certain aspects of these behaviors are worth exploring further. For example, the proposed methods with the highest final scores do not converge as fast as DQN+LN, suggesting that regularity normalization resembles some type of adaptive learning rate which gradually tunes down the learning as the scenario converges to stationarity.

The raw data and code to reproduce the results can be downloaded at https://app.box.com/s/ruycgz8p7rh30taj38d8dkc0h1ptltg1

## 6 Discussion

Empirical results offer a proof of concept for the proposed method. In the image classification and reinforcement learning tasks, our approach empirically outperforms existing normalization methods, showing its advantage in imbalanced, limited, or non-stationary data scenarios as hypothesized. However, several analyses and developments are worth pursuing to further understand these behaviors.

First, the metric used in the MNIST problem is the test error (as commonly used in the balanced case). Although the proposed method is shown to have successfully constrained the model to allocate learning resources to the several imbalanced special cases, other performance metrics specially tailored for these special cases should be evaluated.

Second, the probability inference can be replaced with a fully Bayesian variational approach to include the regularity estimation as part of the optimization process. Moreover, although the results show that the proposed MDL normalization yields an improvement on MNIST, it would be interesting to record the overall loss or probability as the computation of NML performs selection on the model, as a partially observable routing process of representation selection [12].

Last but not least, in traditional model selection problems, MDL can be regarded as an ensemble modeling process and usually involves multiple models. However, in our neural network problem, we assume that the only model trained at each step is the local “best” model learned so far, but local maximum likelihood may not be a globally best approach for model combinations. In other words, the generation of the optimized parameter set for a specific layer currently adopts a greedy approach, such that the model selection could be optimized for each step, but we have not theoretically demonstrated that it is the best global selection.

## 7 Conclusion and Future Work

Inspired by the neural code adaptation of biological brains, we propose a biologically plausible normalization method that takes into account the regularity (or saliency) of the activation distribution in the implicit space, and normalizes it to upweight activations for rarely seen scenarios and downweight activations for commonly seen ones. We introduce the concept from the MDL principle and propose to consider the neural network training process as a model selection problem.

We compute the optimal universal code length by normalized maximum likelihood in an incremental fashion, and show that this implementation can be easily combined with established methods like batch normalization and layer normalization. In addition, we propose saliency normalization, which can introduce top-down attention and data priors to facilitate representation learning. Fundamentally, our implementation uses an incremental update of the normalized maximum likelihood, constraining the implicit space to have a low model complexity and a short universal code length.

One main next direction of this research is the inclusion of top-down attention given by a data prior (such as features extracted from signal processing, or task-dependent information). For instance, the application of top-down attention to modulate the normalization process can vary across scenarios. Further investigation of how different data prior functions behave in different task settings may complete the story of having this method as a top-down meta-learning algorithm potentially advantageous for continual multitask learning.

In concept, the regularity-based normalization can also be considered as an unsupervised attention mechanism imposed on the input data, with the flexibility to directly install top-down attention from either oracle supervision or other meta information. As the next stage, we are currently extending this method to convolutional and recurrent neural networks, and applying it to popular state-of-the-art neural network architectures on datasets of multiple modalities, as well as to more complicated reinforcement learning settings where the rewards can be very sparse and non-uniform.

## References

- [1] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
- [2] Blakemore, C., Campbell, F.W.: Adaptation to spatial stimuli. The Journal of physiology 200(1), 11P–13P (1969)
- [3] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv preprint arXiv:1606.01540 (2016)
- [4] Ding, S., Cueva, C.J., Tsodyks, M., Qian, N.: Visual perception as retrospective bayesian decoding from high-to low-level features. Proceedings of the National Academy of Sciences 114(43), E9115–E9124 (2017)
- [5] Dragoi, V., Sharma, J., Sur, M.: Adaptation-induced plasticity of orientation tuning in adult visual cortex. Neuron 28(1), 287–298 (2000)
- [6] Grünwald, P.D.: The minimum description length principle. MIT press (2007)
- [7] Hinton, G., Van Camp, D.: Keeping neural networks simple by minimizing the description length of the weights. In: Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Citeseer (1993)
- [8] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
- [9] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [10] Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in neural information processing systems. pp. 1008–1014 (2000)
- [11] LeCun, Y.: The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
- [12] Lin, B., Bouneffouf, D., Cecchi, G.A., Rish, I.: Contextual bandit with adaptive feature extraction. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW). pp. 937–944. IEEE (2018)
- [13] Lin, L.J.: Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4), 293–321 (1992)
- [14] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
- [15] Myung, J.I., Navarro, D.J., Pitt, M.A.: Model selection by normalized maximum likelihood. Journal of Mathematical Psychology 50(2), 167–179 (2006)
- [16] Qian, N., Zhang, J.: Neuronal firing rate as code length: a hypothesis. Computational Brain & Behavior pp. 1–20 (2019)
- [17] Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
- [18] Rissanen, J.: Stochastic complexity in statistical inquiry. World Scientific (1989)
- [19] Rissanen, J.: Strong optimality of the normalized ml models as universal codes and information in data. IEEE Transactions on Information Theory 47(5), 1712–1717 (2001)
- [20] Salimans, T., Kingma, D.P.: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems. pp. 901–909 (2016)
- [21] Zemel, R.S., Hinton, G.E.: Learning population codes by minimizing description length. In: Unsupervised learning. pp. 261–276. Bradford Company (1999)
- [22] Zhang, J.: Model selection with informative normalized maximum likelihood: Data prior and model prior. In: Descriptive and normative approaches to human behavior, pp. 303–319. World Scientific (2012)