Informational Neurobayesian Approach to Neural Networks Training. Opportunities and Prospects
Abstract
This paper presents a study of the classification problem in the context of information theory. Current research in the field focuses on optimisation and the Bayesian approach. Although these methods give satisfactory results, they require a vast amount of data and computation for training. The authors propose a new concept, named the Informational Neurobayesian Approach (INA), which solves the same problems but requires significantly less training data and computational power. Experiments were conducted to compare its performance with that of the traditional approach, and the results show that the INA is quite promising.
Keywords: Informational Neurobayesian Approach, text recognition, information quantity, classification, Sequence to Sequence model, Neurobayesian Approach, Gradient Descent, one-shot learning, information emergence, large neural networks, information neural models.
1 Introduction
The classification problem is nowadays usually solved by complex analytical algorithms requiring plenty of data. Another field of research is the neurobayesian approach, which proposes optimisation based on Bayes' rule with some tricks to cope with the complexity of the functions and make them well-defined. The authors introduce a method based on the same concepts which also takes into consideration the informational side of the data, namely its meaning.
Until recently, the problem of determining the quantity of information was based on the Hartley and Shannon approaches. The first represents just the quantity of information; the second is its generalisation, which takes into account different probabilities for classes. Another fundamental concept is the Kharkevich formula, which measures the informational value of an event or message as the logarithm of the ratio of the probability of a class after and before the event. The System Theory of Information (STI) proposed by Lutsenko (footnote 1: Section 4.2 author) [3] [Lutsenko E., 2002] merges these concepts, taking each formula with a corresponding power, which illustrates the emergence of classes and features. The main idea of the article is to develop a method for calculating the weight coefficients of the synapses of an artificial neural network. We propose to move from uninterpreted weights (the trial and error method) to weights representing the amount of information that a feature contains regarding a possible consequence. Thus, a neural network is considered a self-training system that transforms the input data into the amount of information about the possible consequences. The proposed approach makes it possible to train big neuromodels with billions of neurons on ordinary CPUs.
2 Classical Gradient Descent
2.1 Description
Gradient Descent is a widely used approach to solving optimisation problems that are too complex to solve analytically. As the approach proposed in the paper is applied in the context of neural networks, we briefly show the usage of gradient descent for their training.
The problem is formulated in the following way: the model's performance is measured by the error function, which depends on the weights of the neurons and on the observations. So the weights are to be optimised to achieve the minimum of the error function. The gradient descent principle is to push the weights in the direction opposite to the gradient of the error function (all weights are changed simultaneously).
2.2 Formulae
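So, on each iteration the weights are updated according to the following rule:
$$w^{(t+1)} = w^{(t)} - \eta \nabla E\left(w^{(t)}\right),$$
where $\eta$ is an exogenous parameter which determines the learning rate. The most popular type of error function is the sum of squares, so the aim is to minimise the expression:
$$E(w) = \frac{1}{2} \sum_{i=1}^{n} \left(y_i - f(x_i, w)\right)^2,$$
where $f(x_i, w) = \sum_{j=1}^{m} w_j x_{ij}$ is a linear function giving the value at the output of the neuron and depending on the weights of the input vector (instead of a linear function, various others can be used, e.g. tanh, sigmoid, softmax; the general principle remains the same). The derivative with respect to each weight equals:
$$\frac{\partial E}{\partial w_j} = -\sum_{i=1}^{n} \left(y_i - f(x_i, w)\right) x_{ij}.$$
So, the formula for updating the weights is:
$$w_j^{(t+1)} = w_j^{(t)} + \eta \sum_{i=1}^{n} \left(y_i - f\left(x_i, w^{(t)}\right)\right) x_{ij}.$$
Iterations are performed while the changes in the weights exceed some exogenously determined threshold.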
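The update rule of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single linear neuron, the sum-of-squares error with a factor 1/2, and hypothetical names (`gradient_descent`, `eta`, `tol`) and toy data that do not come from the paper.

```python
def gradient_descent(xs, ys, eta=0.05, tol=1e-9, max_iter=100_000):
    """Batch gradient descent for a linear neuron with sum-of-squares error."""
    n_weights = len(xs[0])
    w = [0.0] * n_weights
    for _ in range(max_iter):
        # Residuals y_i - f(x_i, w) for the current weights.
        residuals = [y - sum(wj * xj for wj, xj in zip(w, x))
                     for x, y in zip(xs, ys)]
        # dE/dw_j = -sum_i residual_i * x_ij, with E = 1/2 * sum of squares.
        grad = [-sum(r * x[j] for r, x in zip(residuals, xs))
                for j in range(n_weights)]
        # All weights move simultaneously against the gradient.
        new_w = [wj - eta * gj for wj, gj in zip(w, grad)]
        largest_change = max(abs(a - b) for a, b in zip(new_w, w))
        w = new_w
        # Stop once the weight changes fall below the threshold.
        if largest_change < tol:
            break
    return w


# Toy data generated by y = 2 * x + 1; the constant 1.0 column acts as a bias input.
xs = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
ys = [1.0, 3.0, 5.0, 7.0]
weights = gradient_descent(xs, ys)
```

On this toy data the weights approach the generating values (slope 2, intercept 1); the learning rate has to be small enough for the iteration to stay stable.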
2.3 Advantages and disadvantages
The approach performs well for a very narrow class of problems, though it has advanced modifications such as Adagrad and Adam, and related gradient-based methods such as Gradient Boosting, which show outstanding results but require considerable computational power.
2.4 Computational Complexity
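The complexity of the algorithm for the described case is $O(nm)$ per iteration, where $n$ is the number of observations and $m$ is the number of weights, as on each iteration a sum of $n$ terms is computed $m$ times. The complexity is linear relative to the input data size.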
3 Neurobayesian Approach
3.1 Description
The Bayesian approach means designing a forecast of the distribution of variables and then updating it based on new knowledge about an object. The Bayesian framework, in comparison with the traditional (frequentist) one, considers all data random, makes great use of Bayes' theorem and, instead of ML estimates, maximises the posterior probability. Though it works with any amount of data, even small, the more information about the object is obtained, the more precise the final knowledge will be (Bayesian updating can be applied iteratively as new data becomes available).
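The problem is formulated as follows: let $p(X, T, W)$ be a joint distribution of hidden variables $T$, observed variables $X$ and decision rule parameters $W$. If the training data is $(X_{tr}, T_{tr})$, then the posterior distribution of $W$ is
$$p(W \mid X_{tr}, T_{tr}) = \frac{p(T_{tr} \mid X_{tr}, W)\, p(W)}{\int p(T_{tr} \mid X_{tr}, W)\, p(W)\, dW}.$$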
The analytical approach to that expression is to make use of the delta method and maximise the likelihood function. But the Bayesian approach proposes a less straightforward, yet fruitful step: to regularise the ML estimate using a prior distribution. It helps to cope with overfitting, gives a more thorough picture of the hidden variables and also makes it possible to learn from incomplete data. But the main advantage is the algorithm's ability to obtain highly accurate results with big data. Application of the Bayesian approach to neural networks is based on minimisation of the cross-entropy error function, with the weights updated using existing data.
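The regularising effect of a prior on the ML estimate can be illustrated with the simplest possible model. The sketch below is not taken from the paper: it assumes a Bernoulli success probability with a Beta prior, and the function names and the prior parameters `alpha` and `beta` are hypothetical.

```python
def ml_estimate(successes, trials):
    """Maximum-likelihood estimate of a Bernoulli success probability."""
    return successes / trials


def map_estimate(successes, trials, alpha=2.0, beta=2.0):
    """Posterior mode under a Beta(alpha, beta) prior (needs alpha, beta > 1).

    The posterior is Beta(alpha + successes, beta + trials - successes),
    whose mode is given by the expression below.
    """
    return (successes + alpha - 1.0) / (trials + alpha + beta - 2.0)


# With only three observations the ML estimate is extreme, the MAP estimate is not.
small_ml = ml_estimate(3, 3)
small_map = map_estimate(3, 3)
# With many observations the prior's influence fades.
large_map = map_estimate(300, 300)
```

With three successes out of three trials the ML estimate is 1.0 while the MAP estimate stays at 0.8; with 300 out of 300 the two nearly coincide, which matches the remark that the estimate sharpens as more data arrive.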
3.2 Formulae
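In practice the optimisation procedure goes in the following way [2] [Vetrov D., 2017] (https://www.sdsj.ru/slides/Vetrov.pdf). If only $X$ is known, $Z$ is a vector of hidden variables, $p(X \mid \theta)$ and $p(X, Z \mid \theta)$ are the respective density functions, and $\theta$ is to be determined, then:
$$\log p(X \mid \theta) = \log \int p(X, Z \mid \theta)\, dZ \to \max_{\theta}.$$
The problem is that this expression is not concave in general. To cope with that a trick is introduced:
$$\log p(X \mid \theta) = \mathcal{L}(q, \theta) + KL\left(q(Z)\,\|\,p(Z \mid X, \theta)\right) \ge \mathcal{L}(q, \theta), \qquad \mathcal{L}(q, \theta) = \int q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}\, dZ,$$
where $KL$ is the Kullback-Leibler divergence, which can be interpreted as a distance between distributions and is always non-negative. So we can iteratively maximise $\mathcal{L}(q, \theta)$ instead of the original expression. The procedure is called the EM algorithm and consists of two steps:

E-step: $q(Z) = \arg\max_{q} \mathcal{L}(q, \theta_{old}) = p(Z \mid X, \theta_{old})$, which corresponds to KL-divergence minimisation.

M-step: $\theta_{new} = \arg\max_{\theta} \mathcal{L}(q, \theta) = \arg\max_{\theta} \mathbb{E}_{q(Z)} \log p(X, Z \mid \theta)$,
which for many models is a concave function of $\theta$, so the maximisation is legal.
The procedure is repeated until convergence.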
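The two alternating steps can be seen at work on a deliberately small model. The sketch below is an illustration rather than the procedure from the cited slides: it assumes a one-dimensional mixture of two Gaussians with a known, shared standard deviation, and all names (`em_two_means`, `weight_a`, the toy `data`) are hypothetical.

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_means(data, mean_a, mean_b, weight_a=0.5, sigma=1.0,
                 max_iter=500, tol=1e-12):
    """EM for a two-component 1-D Gaussian mixture with known shared sigma.

    Returns the fitted (mean_a, mean_b, weight_a) and the log-likelihood trace.
    """
    trace = []
    for _ in range(max_iter):
        # E-step: posterior probability that each point belongs to component A,
        # computed with the current parameter estimates.
        dens_a = [weight_a * normal_pdf(x, mean_a, sigma) for x in data]
        dens_b = [(1.0 - weight_a) * normal_pdf(x, mean_b, sigma) for x in data]
        resp_a = [a / (a + b) for a, b in zip(dens_a, dens_b)]
        trace.append(sum(math.log(a + b) for a, b in zip(dens_a, dens_b)))
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # M-step: closed-form maximisers of the expected complete-data
        # log-likelihood for this model.
        total_a = sum(resp_a)
        total_b = len(data) - total_a
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / total_a
        mean_b = sum((1.0 - r) * x for r, x in zip(resp_a, data)) / total_b
        weight_a = total_a / len(data)
    return (mean_a, mean_b, weight_a), trace


# Two well-separated groups of points, centred near 0 and near 5.
data = [-0.3, 0.1, 0.4, -0.2, 0.0, 4.8, 5.1, 5.3, 4.9, 4.9]
params, trace = em_two_means(data, mean_a=0.0, mean_b=1.0)
```

The log-likelihood trace never decreases from one iteration to the next, which is the practical consequence of maximising a lower bound at every step.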
3.3 Advantages and disadvantages
The main advantage of the approach is its scalability to large datasets and its lower complexity in comparison with analytical optimisation solutions, which are quadratic in the data size, while the EM algorithm is linear in the dimensions of the data and of the weight vector. The approach makes it possible to combine models trained on different data (for the same consequences) and thus achieve better results. The drawback is that the approach does not take into consideration the structure of the data and its intrinsic features, which could be exploited to build more realistic a priori distributions and achieve better performance.
Since 2012, a number of studies have been published proposing a new mathematical apparatus that scales Bayesian methods to big data. At the heart of it lies an interesting idea. First, the problem of Bayesian inference (that is, the process of applying Bayes' theorem to data) was formulated as an optimisation problem, and then modern techniques of stochastic optimisation were applied to it, which made it possible to solve extremely large optimisation problems approximately. This allowed Bayesian methods to enter the field of neural networks. Over the past 5 years, a whole class of neurobayesian models has been developed that can solve a wider range of problems than conventional deep neural networks.
4 Informational Neurobayesian Approach
4.1 Theoretical background
Some facts from information theory are required for understanding this article. First, the number of distinct messages that can be composed from an alphabet of $n$ symbols is $N = n^m$, where $m$ is the number of characters in the message. To avoid the exponential dependency, Hartley proposed to represent the quantity of information as the logarithm of the number of all possible sequences:
$$I = \log_2 N = m \log_2 n.$$
Let there be $m$ characters, each taking one of $n$ values (features), and let $p_i$ be the probability of the $i$-th value. Shannon proposed a formula for the average quantity of information in a message with arbitrary probabilities of the characters' values:
$$I = -m \sum_{i=1}^{n} p_i \log_2 p_i.$$
The more uncertain the consequence is, the more information is obtained by learning it. To measure the uncertainty we use entropy, $H = -\sum_{i=1}^{n} p_i \log_2 p_i$, which is the average quantity of information per character.
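A short numerical check makes the relation between the two measures concrete. The sketch assumes base-2 logarithms (information in bits); the function names and the example distributions are illustrative only.

```python
import math


def hartley_information(n_symbols, message_length):
    """Hartley's measure: log2 of the number of possible messages, n ** m."""
    return message_length * math.log2(n_symbols)


def shannon_entropy(probabilities):
    """Average information per character, in bits; zero-probability terms vanish."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
hartley_one_char = hartley_information(4, 1)
```

For the uniform distribution over four symbols the entropy equals Hartley's value for one character, 2 bits; for the skewed distribution it is lower, because the outcome is less uncertain.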
The key task in this context is to determine the amount of valuable information in a message (cause) for the classification of a possible consequence. To solve it, one has to calculate the amount of information contained in feature $i$ and in the fact that the object with this feature belongs to class $j$.
The average amount of information, contained in all features on all classes:
(1) 
is precisely the average of “individual amounts of information” in each feature about each class.
If there are several characters in the message, then the information on belonging to a class can be regarded as the information density and is expressed via Pointwise Mutual Information
The expression above represents the sum of the quantities of relevant information according to Kharkevich and Shannon, and is an interpretation of Bayes' theorem for information theory.
4.2 Foundations of System Theory of Information (STI)
In this paper the system generalisation of Hartley's formula is used; it is expressed through the number of pure conditions of the system and Hartley's emergence coefficient (the level of the system's complexity, which consists of pure conditions). Lutsenko [Lutsenko, 2002] took as an axiom the statement that the system generalisation of Hartley's formula is
(2) 
For every number of system’s elements there is a maximal level of system’s synergy. According to STI, the amount of information should be evaluated as
(3) 
in other words, it consists of classical and synergic parts. The system information that we obtain from an object via the STI approach is actually information on all possible configurations of the system. Prof. Lutsenko [Lutsenko, 2002] discovered a tendency for the share of synergic information to rise as the number of elements increases and proposed to name it the Law of emergence increase, illustrated in Figure 1.
The classical Kharkevich formula is
(4) $I_{ij} = \log_2 \frac{P_{ij}}{P_j}$
where $P_{ij}$ is the probability of attaining target $j$ after getting a message about feature $i$, and $P_j$ is the probability of attaining target (class) $j$ without any information. This concept does not take into account the cardinality of the space of future configurations of the object, although it can be taken into consideration by involving the classical and synergic Hartley formulas; nevertheless, it does not allow the quantity of information to be obtained in bits. To solve that problem, for each feature $i$ and class $j$ we approximate the probabilities $P_{ij}$ and $P_j$ through frequencies:
(5) $P_{ij} = \frac{N_{ij}}{N_i}$
(6) $P_j = \frac{N_j}{N}$
where $N_{ij}$ is the total number of events of the kind "condition $j$ was obtained after feature $i$ acted", $N_i$ is the total number of occurrences of feature $i$ over all conditions, $N_j$ is the total number of features with which condition $j$ was obtained, and $N$ is the total number of occurrences of the various features over all final conditions. After substituting the values of $P_{ij}$ and $P_j$ into (4), we get the expression for the value of information measured as its quantity:
(7) $I_{ij} = \log_2 \frac{N_{ij}\, N}{N_i\, N_j}$
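The frequency-based form of the Kharkevich formula can be computed directly from a table of counts. The sketch below assumes the usual estimates: the probability of class j given feature i is taken as the count of the pair divided by the row total, and the prior probability of class j as the column total divided by the grand total, which gives log2(N_ij * N / (N_i * N_j)). The function name, the use of `None` for unseen pairs and the toy tables are assumptions made for illustration.

```python
import math


def information_matrix(counts):
    """counts[i][j] is the number of times feature i was seen with class j.

    Returns info[i][j] = log2(N_ij * N / (N_i * N_j)), or None where the pair
    was never observed (the logarithm is undefined there).
    """
    row_totals = [sum(row) for row in counts]            # N_i
    col_totals = [sum(col) for col in zip(*counts)]      # N_j
    grand_total = sum(row_totals)                        # N
    info = []
    for i, row in enumerate(counts):
        info.append([
            math.log2(n_ij * grand_total / (row_totals[i] * col_totals[j]))
            if n_ij > 0 else None
            for j, n_ij in enumerate(row)
        ])
    return info


# Two features, two classes: each feature mostly co-occurs with "its" class.
mixed = information_matrix([[8, 2], [2, 8]])

# Four features in one-to-one correspondence with four equally likely classes.
one_to_one = information_matrix([
    [5, 0, 0, 0],
    [0, 5, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 5],
])
```

In the mixed table a feature carries positive information about the class it usually accompanies and negative information about the other. In the one-to-one table every observed pair yields log2(4) = 2 bits, the Hartley value for four equally likely classes.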
As is known, the classical Shannon formula for the quantity of information for events with varied probabilities transforms into Hartley's formula under the condition that all events have the same probability, i.e. it satisfies the basic property of conformance. The same can be applied to the Kharkevich formula, which means that in the marginal case it should become Hartley's formula (the marginal case means that there is a unique feature for every class (object's condition) and vice versa, and all these classes have the same probability). In that case the quantity of information in each feature about belonging to classes is maximal and is equal to the information calculated through Hartley's formula. So, in the case of one-to-one correspondence of features and classes:
The general Lutsenko formula and his emergence coefficient are:
(8) 
(9) 
Taking this into account, the authors suggest adding an emergence coefficient to the modified Kharkevich formula. The resulting expression is named the Lutsenko emergence information quantity formula:
(10) 
where the coefficient is Kharkevich's emergence coefficient, which defines the degree of determination of an object at the given level of the system's organisation, with its pure conditions and the features on which the final condition depends; Z is the maximal complexity calculated for each factor separately; and the remaining quantity is the number of observations made of the system's behaviour. That generalisation can be performed in the following way:
(11) 
(12) 
(13) 
The same calculations for the continuous case:
(14) 
And finally:
(15) 
The generalised Kharkevich formula [1] also satisfies the correspondence principle, i.e. it transforms into Hartley's formula in the marginal case, when every class (condition of the object) corresponds to only one feature and every feature to only one class, and these classes and features have the same probabilities.
4.3 System Emergence Coefficient
Lutsenko's general formula for the Kharkevich emergence coefficient:
(16) 
where .
The coefficient varies from 0 to 1 and determines the degree of determination of the system:

corresponds to absolutely determined system, behaviour of which depends on the minimal number of features.

corresponds to totally random system.

corresponds to system, where there are more features than classes and none of them plays the key role in determining the class.
Professor Lutsenko [4] proposed coefficients representing the degree of determination of the possible state of the object with a set of features at a given level of system organisation. Nevertheless, as this is hard to evaluate, it was decided to omit that concept and take the coefficient to be equal to 1 (the minimum level of complexity).
(17) 
The straightforward way to calculate it is to consider all possible combinations of features and classes and to evaluate the amount of information for each one. Evidently, this would require an enormous amount of calculation.
(18) 
The Lutsenko-Artemov formula for the emergence coefficient of the system:
(19) 
The value obtained is the emergence coefficient of the system for the real level of complexity. The optimal solution is to consider only relevant combinations of features and classes, dividing them into groups by the complexity up to which the combinations will be evaluated. There are two approaches to calculating the complexity Z. The first one takes, for each feature, the number of classes the feature corresponds to (i.e. for which it gives a positive amount of information). This number of classes represents the complexity up to which the combinations of classes will be calculated. The second method is to divide all features and classes into groups with the same number of features and classes. The idea is to take all classes to which the feature corresponds and select the other features which correspond to the same classes. The key point is that this helps to build a one-to-one correspondence between features and classes within the group. Finally we obtain a limited number of features or groups of features which act collectively in the same number of classes. That number of classes gives the maximal complexity for the given features, and we get the final expression for the coefficient of emergence for each group:
(20) 
The coefficient is normalised by the following rule:
(21) 
Formula 21 is the Artemov-Lutsenko emergence coefficient for the system of features, aggregated by the maximal level of the system's complexity.
4.4 Application to neural networks
The proposed approach finds its best application in neural networks. The main idea is that the quantity of information calculated for all feature-class pairs can be expressed in the network's weights, so that they gain a meaning instead of being simply empirical parameters. This explains the name "Informational Neurobayesian" given to the approach. Every weight represents the quantity of information with which an object having a given feature activates a given neuron:
(22) 
where the coefficient is the emergence coefficient of the system for the given feature.
In the scheme above, each input is one feature of the object, and together the inputs form the vector of object features:
(23) 
For every object with a given set of features, the neural network chooses, over all classes, the one with the maximal value of the activation function, represented as an activated sum over all features. The process is illustrated in Figure 3.
(24) 
Bias is a parameter for the activation of a class. It solves two problems: 1. It sets a minimum threshold of neuron activation. 2. It allows integration with other models.
(25) 
(26) 
Here Y is the class chosen for the given features, computed through the activation function. The whole workflow is presented in Figure 2.
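The decision rule described in this subsection (sum the information carried by the active features for every class, add the bias, pick the class with the largest total) can be sketched as follows. This is a simplified reading, not the Brain2 implementation: the activation is taken to be the identity, the emergence coefficient is taken to be 1, and the weights, biases and names are hypothetical values chosen for illustration.

```python
def classify(active_features, weights, bias):
    """Return (best_class, scores) for a set of active features.

    weights[feature][cls] is the amount of information (in bits) that the
    feature carries about the class; bias[cls] is the class activation offset.
    """
    scores = dict(bias)
    for feature in active_features:
        for cls, info in weights.get(feature, {}).items():
            scores[cls] = scores.get(cls, 0.0) + info
    # The chosen class is the one with the maximal accumulated information.
    best_class = max(scores, key=scores.get)
    return best_class, scores


# Hypothetical weights: letters as features, short words as classes.
weights = {
    "c": {"cat": 1.2, "cot": 1.2, "dog": -0.8},
    "a": {"cat": 1.5, "cot": -0.5, "dog": -0.9},
    "o": {"cat": -0.7, "cot": 1.0, "dog": 1.0},
    "t": {"cat": 0.9, "cot": 0.9, "dog": -0.6},
}
bias = {"cat": 0.0, "cot": 0.0, "dog": 0.0}

best, scores = classify({"c", "a", "t"}, weights, bias)
```

Because the weights are amounts of information, adding them is meaningful, and each class score can be read as the total evidence the observed features give for that class.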
4.4.1 Generalised EMAlgorithm for INA
The EM algorithm consists of two steps repeated iteratively. On the E-step, the expected values of the hidden variables are calculated using the current estimate of the parameters. On the M-step, the likelihood is maximised and a new estimate of the parameter vector is obtained based on its current value and the vector of hidden variables.
Step 0: Calculate a frequency matrix for input data.
Unique features and classes are determined in the training data, together with their frequency table, which reflects the distribution of parameters: the number of features for each class. The number of classes in which a group of features is active is also determined during this step.
Step E: Creating the informational neurobayesian model.
During the E-step, the expected value of the hidden variables is evaluated from the current estimate of the parameter vector. The data is represented in Figure 4, and the output is organised in the same way, as presented in Figure 5.
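Step 0 amounts to one counting pass over the training data. The sketch below is an assumption about how such a pass might look, not the Brain2 code: it counts feature-class co-occurrences and, as a simplification, records the number of classes per single feature rather than per group of features. All names and the toy samples are hypothetical.

```python
from collections import Counter, defaultdict


def frequency_table(samples):
    """samples is an iterable of (features, cls) pairs.

    Returns (counts, classes_per_feature), where counts[(feature, cls)] is the
    co-occurrence count and classes_per_feature[feature] is the number of
    distinct classes in which the feature was observed.
    """
    counts = Counter()
    seen_classes = defaultdict(set)
    for features, cls in samples:
        for feature in features:
            counts[(feature, cls)] += 1
            seen_classes[feature].add(cls)
    classes_per_feature = {f: len(c) for f, c in seen_classes.items()}
    return counts, classes_per_feature


# Toy training data: the letters of a word are its features, the word is the class.
samples = [
    ({"c", "a", "t"}, "cat"),
    ({"c", "o", "t"}, "cot"),
    ({"c", "a", "t"}, "cat"),
]
counts, classes_per_feature = frequency_table(samples)
```

A single pass over the rows is enough to fill the table, so this step is linear in the size of the training data.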
(27) 
(28) 
(29) 
Step M: Model maximisation by Van Rijsbergen's effectiveness measure (F-measure).

Calculate
Calculate the average micro measure for each observed class based on the training set, comparing the result with the test dataset for the vector of object features with the same set of features; one symbol stands for the test data and the other for the data the model has been trained on ("model has learnt").
(30) (31) (32) 
For each where , i.e. or
2.1 Determine and  correct class for the set of features , .
2.2 Exclude all objects with the same set of features, but classes
2.3 For the remaining objects, change the quantity of information for each feature acting in the given class, so that
2.4 Calculate
(33)
The updated model is checked for improvement on wrongly classified objects.
2.5 Continue until
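The quality measure that drives the M-step can be sketched as below. The exact averaging used by the authors is not recoverable from the text, so this assumes the standard micro-averaged F1 for single-label classification, where every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class). The names and the toy labels are hypothetical.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over all classes for single-label predictions."""
    true_pos = false_pos = false_neg = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            true_pos += 1
        else:
            false_pos += 1   # counted against the predicted class
            false_neg += 1   # counted against the true class
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Van Rijsbergen's F-measure with equal weight on precision and recall.
    return 2.0 * precision * recall / (precision + recall)


true_labels = ["a", "a", "b", "b", "c"]
predicted_labels = ["a", "b", "b", "b", "c"]
score = micro_f1(true_labels, predicted_labels)
```

Under this assumption the micro-averaged F1 coincides with plain accuracy, here 4 correct predictions out of 5.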
Table 1. Experiment results.
Model | MNIST | House Prices | Emotions in text | Russian words | Russian words
Problem | Handwritten digits recognition | Real estate valuation | Determining emotions (by Ekman) in text | Word recognition by letters and bigrams | Word recognition by letters and bigrams
Features | 2 815 | 809 | 19 121 | 1 042 | 1 042
Classes | 11 | 658 | 8 | 5 099 300 | 5 099 300
Parameters (weights) | 30 965 | 532 322 | 152 968 | 6 313 470 600 | 6 313 470 600
Model size (MB) | - | - | 1.89 | 634.94 | 634.94
Training data size (MB) | 74.9 | 0.53 | 2.85 | 505.43 | 505.43
Training data, rows | 42 000 | 1 460 | 8 682 | 5 099 300 | 5 099 301
Learning type | E (1 epoch) | E (1 epoch) | E (1 epoch) | E (1 epoch) | EM
Brain2 version (footnote 3) | 1.0 | 2.0 | 2.0 | 3.0 | 3.0
Result | CA (footnote 4) = 0.81 | RMSE = 0.42 | F1 = 0.81 | F1 = 0.97 | F1 = 0.98 (footnote 5)
Time (footnote 6) | 20 min | 2 min 33 sec | 6 min 4 sec | 2 h 35 min | 2 h 49 min
Footnote 3: Framework's version, developed by the project's team, including the paper's author.
Footnote 4: This competition is evaluated on the categorisation accuracy of your predictions (the percentage of images you get correct).
Footnote 5: Perplexity of 1.26 per letter and bigram (feature).
Footnote 6: Tests were run on a machine with the following specifications: Intel Xeon E5-1650 v3 Hexa-Core Haswell, 6 cores, 128 GB ECC RAM, 2 x 240 GB 6 Gb/s SATA SSD, 2 x 2 TB SATA3. GPUs were not used for training.




Table 2. Comparison of methods.
Optimisation framework | Backpropagation
Weights interpretation |
Complexity |
Advantages | Wide range for applications | Decreased computation complexity

5 Experiments
For conducting the tests we used Brain2, a software platform implementing the Informational Neurobayesian Approach. The biggest neuromodel built is aimed at word recognition by letters and bigrams in the Russian language for 100 000 words; it consists of 1042 features and 5 099 300 classes, which makes the total number of parameters (weight coefficients) 6.3 billion. An important factor here is that it took only 2 h 35 min to build the model on an average server without a GPU (205 GFlops, 8-core CPU). In comparison, the biggest Google Brain neural network consists of 137 billion parameters, but in terms of the resources needed, training only 4.3 billion parameters took 47 hours with 16-32 Tesla K40 GPUs (1220 GFlops) [8] [Shazeer N., 2017]. On other basic machine learning tasks such as MNIST and House Prices, our models have shown acceptable results: 0.81 on MNIST (Kaggle task), with the best being 1.0, and 0.42 on House Prices, with the best being 0.06628 (for one epoch). It is important to note that we did not set out to achieve the best possible result and trained the model as is. Instead, we concentrated on a more difficult problem connected with natural text, namely the reconstruction of words with errors and finding answers to questions, where we reached accuracy close to 1. Experiment results are presented in Table 1.
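The model sizes quoted above can be checked with simple arithmetic. The sketch assumes that the number of parameters is the product of the number of features and the number of classes; this is an inference from the first three columns of Table 1, not something the text states. The dictionary layout and names are illustrative.

```python
# (features, classes, reported number of parameters), as given in Table 1.
models = {
    "MNIST": (2_815, 11, 30_965),
    "House Prices": (809, 658, 532_322),
    "Emotions in text": (19_121, 8, 152_968),
    "Russian words": (1_042, 5_099_300, 6_313_470_600),
}

# One weight per (feature, class) pair under the stated assumption.
products = {name: features * classes
            for name, (features, classes, _) in models.items()}

# Which reported counts are reproduced exactly by the product?
matches = {name: products[name] == reported
           for name, (_, _, reported) in models.items()}
```

For the first three models the product reproduces the reported parameter count exactly. For the Russian words model the product is 5 313 470 600, which differs from the reported 6 313 470 600 in the leading digit.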
6 Conclusion
This paper described the INA method, which aims to cope with complex classification problems more intelligently than currently used analytical and numerical methods. Its applicability to neural networks and the experiments conducted show opportunities for the suggested method to simplify the training process in various classification problems and achieve at least the same performance at lower computational cost. The results of the experiments are presented in Table 1, and a summary of the comparison with other methods is presented in Table 2.
The additive properties of the amount of information make it possible not only to explain the correctness of adding weights when choosing a class within a single model (layer), but also to link different models (layers), transferring values of the amount of information from one input to another. In addition, every information neuromodel can be trained independently. Eventually this facilitates the construction of super-neural networks: multilevel (and multilayer) neural networks, where each input of a neuron is a neural network of a lower order (in terms of the amount of information). The proposed information approach makes it possible to obtain an acceptable level of model quality (F1 0.6, for 1 epoch) even on relatively small data. And given the low complexity of the learning algorithm, the INA opens new horizons for the training of super-neuromodels on conventional (not super) computers.
In our opinion, the INA combines a general philosophical view presented by such cognitive scientists as D. Chalmers [6] [Chalmers, 1996] (as each layer represents a level of cognition, influencing a higher cognition level) and V. Nalimov (in the works on the probabilistic model of language [9] [Nalimov, 1990]) with the pragmatic concept of meaning, i.e. the amount of information in the cause for a given consequence required for making a managerial decision [5] [Kharkevich, 2009] (Ashby [7] [Ashby, 1961], Wiener, U. Hubbard).
References
 [1] Kharkevich A. On information value. Cybernetics Problems, no. 4, 2009.
 [2] Vetrov D. Bayesian Methods in Machine Learning. http://bit.ly/2eE0Tfj, http://bayesgroup.ru/, 2016.
 [3] Lutsenko E.V. Conceptual principles of the system (emergent) information theory and its application for the cognitive modelling of the active objects (entities). IEEE International Conference on Artificial Intelligence System, 2002.
 [4] Lutsenko E.V. System information theory and non-local interpretable feed-forward neural net. Polytechnical network electronic scientific journal of the Kuban State Agrarian University, 2003.
 [5] Uladzimir Kharkevich, Fausto Giunchiglia, and Ilya Zaihrayeu. Concept Search. Oxford University Press.
 [6] Chalmers D. J. The conscious mind: In search of a fundamental theory. Oxford University Press, 1996.
 [7] Ashby W. R. An introduction to cybernetics. Chapman and Hall Ltd, 1961.
 [8] Shazeer N., Mirhoseini A., Maziarz K., Davis A., Le Q., Hinton G., and Dean J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint, 2017.
 [9] Nalimov V. V. Spontaneity of Consciousness: Probabilistic Theory of Meanings and Semantic Architectonics of Personality. 1990.