Semantic Interpretation of Deep Neural Networks Based on Continuous Logic
Combining deep neural networks with the concepts of continuous logic is desirable to reduce uninterpretability of neural models. Nilpotent logical systems offer an appropriate mathematical framework to obtain continuous logic based neural networks (CL neural networks). We suggest using a differentiable approximation of the cutting function in the nodes of the input layer as well as in the logical operators in the hidden layers. The first experimental results point towards a promising new approach of machine learning.
I Introduction
In recent years, deep learning has been applied to a variety of machine learning problems such as image and speech recognition, natural language processing and machine translation. Artificial neural networks (ANNs) are one type of machine learning model that has become competitive with conventional regression and statistical models. They handle complex problems effectively and efficiently in a wide range of applications, e.g. in medical science, engineering, finance, management and security. One of the greatest research challenges is the increasing need to address the problem of interpretability and to find a general mathematical framework that improves model transparency and performance and provides a deeper understanding of the model. Combining deep neural networks with structured logic rules contributes to flexibility and reduces the uninterpretability of neural models.
Although Boolean units and multilayer perceptrons have a long history, to the best of our knowledge there has been little attempt so far to combine neural networks with continuous logical systems. The basic idea of continuous logic is the replacement of the space of truth values $\{0, 1\}$ by a compact interval such as $[0, 1]$. Quantifiers $\forall x$ and $\exists x$ are replaced by $\sup_x$ and $\inf_x$, and logical connectives are continuous functions. Among other families of many-valued logics, t-norm fuzzy logics are broadly used in applied fuzzy logic and fuzzy set theory as a theoretical basis for approximate reasoning. In fuzzy logic, the membership function of a fuzzy set represents the degree of truth as a generalization of the indicator function of classical sets. Both propositional and first-order (or higher-order) t-norm fuzzy logics, as well as their expansions by modal and other operators, have been studied thoroughly. Important examples of t-norm fuzzy logics are the monoidal t-norm logic MTL of all left-continuous t-norms, the basic logic BL of all continuous t-norms, the product fuzzy logic of the product t-norm, and the nilpotent minimum logic of the nilpotent minimum t-norm. Some independently motivated logics also belong to the t-norm fuzzy logics, for example Łukasiewicz logic (the logic of the Łukasiewicz t-norm) and Gödel-Dummett logic (the logic of the minimum t-norm).
In this work, we suggest combining nilpotent logical systems and neural architecture. Among other preferable properties, the fulfillment of the law of contradiction and the law of excluded middle, and the coincidence of the residual and the S-implication [Dubois; Trillasimpl], make the application of nilpotent operators in logical systems promising. In their pioneering work [bounded], Dombi and Csiszár examined connective systems instead of the operators themselves. In the last few years, the most important operators of general nilpotent systems have been thoroughly examined. In [boundedimpl] and [boundedeq], Dombi and Csiszár examined the implication and equivalence operators in bounded systems. In [aggr], a parametric form of the generated operator was given by using a shifting transformation of the generator function. Here, the parameter has an important semantic meaning as a threshold of expectancy (decision level). This means that nilpotent conjunctive, disjunctive, aggregative (where a high input can compensate for a lower one) and negation operators can be obtained by changing this parameter. Negation operators were also studied thoroughly in [bounded], as they play a significant role in logical systems by building connections between the main operators (De Morgan law) and characterising their basic properties. Dombi and Csiszár introduced possibility and necessity operators in [ijcci] by repeating the arguments of multivariable operators, and in [iwobi] by using double negations. Moreover, as shown in [ijcci], membership functions, which play a substantial role in the overall performance of fuzzy representation, can also be defined by means of a generator function.
In this work we propose a general framework capable of enhancing various types of neural networks with nilpotent logic. The article is organized as follows. After reviewing some related work in Section II and the most relevant results on nilpotent logical systems in Section III, we introduce continuous logic based (CL) neural models in Section IV, and finally show some experimental results in Section V to demonstrate the promising performance of the squashing function as an activation function.
In Section VI, the main results are summarized.
II Related Work
The combination of neural networks and logic rules has been considered in different contexts. Neuro-fuzzy systems [neurofuzzy] have been examined thoroughly in the literature. These hybrid intelligent systems synergize the human-like reasoning style of fuzzy systems with the learning structure of neural networks through the use of fuzzy sets and a linguistic model consisting of a set of IF-THEN fuzzy rules. These models were the first attempts to combine continuous logical elements and neural computation.
Kulkarni et al. [Kulkarni] used a specialized training procedure to obtain an interpretable neural layer of an image network.
In [harness], Hu et al. proposed a general framework capable of enhancing various types of neural networks (e.g., CNNs and RNNs) with declarative first-order logic rules. Specifically, they developed an iterative distillation method that transfers the structured information of logic rules into the weights of neural networks. With a few highly intuitive rules, they obtained substantial improvements and achieved state-of-the-art results, or results comparable to those of the previous best-performing systems.
In [xu], Xu et al. develop a novel methodology for using symbolic knowledge in deep learning by deriving a semantic loss function that bridges between neural output vectors and logical constraints. This loss function captures how close the neural network is to satisfying the constraints on its output.
In [dl2], Fischer et al. present DL2, a system for training and querying neural networks with logical constraints. Using DL2, one can declaratively specify domain knowledge constraints to be enforced during training, as well as pose queries on the model to find inputs that satisfy a set of constraints. DL2 works by translating logical constraints into a loss function with desirable mathematical properties. The loss is then minimized with standard gradient-based methods.
All these promising approaches point towards the desirable mathematical framework that nilpotent logical systems can offer. Our aspiration in this paper is to provide a general mathematical framework in order to benefit from a tight integration of deep learning and continuous logical methods.
III Nilpotent Logical Systems
Next, we recall the basic concept of nilpotent operator systems.
The triple $(c, d, n)$, where $c$ is a t-norm, $d$ is a t-conorm and $n$ is a strong negation, is called a connective system.
A connective system is nilpotent if the conjunction is a nilpotent t-norm, and the disjunction is a nilpotent t-conorm.
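Two connective systems $(c_1, d_1, n_1)$ and $(c_2, d_2, n_2)$ are isomorphic if there exists a bijection $\phi: [0,1] \to [0,1]$ such that
$$\phi^{-1}\big(c_1(\phi(x), \phi(y))\big) = c_2(x, y), \qquad \phi^{-1}\big(d_1(\phi(x), \phi(y))\big) = d_2(x, y), \qquad \phi^{-1}\big(n_1(\phi(x))\big) = n_2(x).$$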
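In the nilpotent case, the generator functions of the disjunction and the conjunction, being determined up to a multiplicative constant, can be normalized in the following way, where $t$ and $s$ denote additive generators of the conjunction and the disjunction, respectively:
$$f_c(x) := \frac{t(x)}{t(0)}, \qquad f_d(x) := \frac{s(x)}{s(1)}.$$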
Note that the normalized generator functions are uniquely defined.
We will use normalized generator functions for conjunctions and disjunctions as well. This means that the normalized generator functions of conjunctions, disjunctions and negations are $f_c, f_d, f_n: [0,1] \to [0,1]$.
We will suppose that $f_c$, $f_d$ and $f_n$ are continuous and strictly monotonic functions.
Two special negations can be generated by the normalized additive generators of the conjunction and the disjunction.
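The negations $n_c$ and $n_d$ generated by $f_c$ and $f_d$ respectively,
$$n_c(x) = f_c^{-1}\big(1 - f_c(x)\big), \qquad n_d(x) = f_d^{-1}\big(1 - f_d(x)\big),$$
are called natural negations of $c$ and $d$.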
This means that for a connective system with normalized generator functions $f_c$, $f_d$ and $f_n$ we can associate three negations, $n_c$, $n_d$ and $n$. In a consistent system, $f_c(x) + f_d(x) \geq 1$ always holds.
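A nilpotent connective system is called a bounded system if
$$f_c(x) + f_d(x) > 1, \quad \text{or equivalently} \quad n_d(x) < n(x) < n_c(x),$$
holds for all $x \in (0,1)$, where $f_c$ and $f_d$ are the normalized generator functions of the conjunction and disjunction, and $n_c$, $n_d$ are the natural negations.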
For examples of consistent bounded systems, see [bounded].
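Let us define the cutting operation $[\,\cdot\,]$ by
$$[x] = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } 0 \leq x \leq 1, \\ 1 & \text{if } 1 < x. \end{cases}$$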
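With the help of the cutting operator, we can write the conjunction and disjunction in the following form, where $f_c$ and $f_d$ are decreasing and increasing normalized generator functions respectively, and $[\,\cdot\,]$ denotes the cutting operation:
$$c(x, y) = f_c^{-1}\big[f_c(x) + f_c(y)\big], \qquad d(x, y) = f_d^{-1}\big[f_d(x) + f_d(y)\big].$$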
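As a minimal illustration of the cutting operation and the generator-based connectives, the following Python sketch assumes the Łukasiewicz generators $f_c(x) = 1 - x$ and $f_d(x) = x$ together with the standard negation $n(x) = 1 - x$. These concrete generators and all function names are illustrative assumptions, not choices prescribed by the text.

```python
def cut(x: float) -> float:
    """Cutting operation: clip x to the unit interval [0, 1]."""
    return min(1.0, max(0.0, x))


def conjunction(x, y, f=lambda t: 1.0 - t, f_inv=lambda t: 1.0 - t):
    """c(x, y) = f_c^{-1}[f_c(x) + f_c(y)] with a decreasing generator f_c.

    The default generator f_c(x) = 1 - x is an assumption (Lukasiewicz case).
    """
    return f_inv(cut(f(x) + f(y)))


def disjunction(x, y, f=lambda t: t, f_inv=lambda t: t):
    """d(x, y) = f_d^{-1}[f_d(x) + f_d(y)] with an increasing generator f_d.

    The default generator f_d(x) = x is an assumption (Lukasiewicz case).
    """
    return f_inv(cut(f(x) + f(y)))


def negation(x):
    """Standard strong negation n(x) = 1 - x (assumed here)."""
    return 1.0 - x


if __name__ == "__main__":
    # Nilpotency: two strictly positive truth values can yield exactly 0.
    print(conjunction(0.3, 0.4))
    # Approximately 0.3, up to floating-point rounding.
    print(conjunction(0.7, 0.6))
    # The sum of the inputs exceeds 1, so the cut saturates at 1.
    print(disjunction(0.7, 0.6))
```

With these generators the connectives reduce to the Łukasiewicz t-norm and t-conorm, and the De Morgan law $n(c(x, y)) = d(n(x), n(y))$ can be checked numerically on a grid.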
All basic operators discussed so far can be handled in a common framework, since they all can be described by the following parametric form.
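Let $w_i > 0$, $C \in \mathbb{R}$, and let $f: [0,1] \to [0,1]$ be a strictly increasing bijection. Let the general parametric operator be
$$o_{\mathbf{w}, C}(\mathbf{x}) := f^{-1}\Big[\sum_{i=1}^{n} w_i f(x_i) + C\Big],$$
where $[\,\cdot\,]$ denotes the cutting operation.
The most commonly used operators for special values of $w_i$ and $C$, also for $f(x) = x$, are listed in Table 1.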
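The common framework can be sketched in a few lines of Python. The sketch assumes the parametric form $o(\mathbf{x}) = f^{-1}\big[\sum_i w_i f(x_i) + C\big]$ with the identity generator $f(x) = x$. The three parameter settings shown (disjunction, conjunction, arithmetic mean) are the standard Łukasiewicz-type choices and are given here as illustrative assumptions, since the contents of Table 1 are not reproduced in the text. All function names are hypothetical.

```python
def cut(x: float) -> float:
    """Cutting operation: clip x to the unit interval [0, 1]."""
    return min(1.0, max(0.0, x))


def general_operator(xs, weights, C, f=lambda t: t, f_inv=lambda t: t):
    """Assumed parametric form: f^{-1}[ sum_i w_i * f(x_i) + C ]."""
    total = sum(w * f(x) for w, x in zip(weights, xs)) + C
    return f_inv(cut(total))


def disjunction(xs):
    # All weights equal to 1 and C = 0 (assumed setting).
    return general_operator(xs, [1.0] * len(xs), 0.0)


def conjunction(xs):
    # All weights equal to 1 and C = -(n - 1) (assumed setting).
    return general_operator(xs, [1.0] * len(xs), -(len(xs) - 1.0))


def arithmetic_mean(xs):
    # All weights equal to 1/n and C = 0 (assumed setting).
    return general_operator(xs, [1.0 / len(xs)] * len(xs), 0.0)


if __name__ == "__main__":
    values = [0.7, 0.6]
    print(disjunction(values), conjunction(values), arithmetic_mean(values))
```

The only difference between the three operators is the choice of weights and of the constant $C$, which is what makes a single neuron-like unit sufficient to represent all of them.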
Now let us focus on the unary (1-variable) case. The unary operators are mainly used to construct modifiers and membership functions from the generator function. The membership functions can be interpreted as modelling an inequality [memeva]. Note that non-symmetrical membership functions can also be constructed by connecting two unary operators with a conjunction [iwobi; ijcci].
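Let $w > 0$, $C \in \mathbb{R}$, and let $f: [0,1] \to [0,1]$ be a strictly increasing bijection. Then
$$o_{w, C}(x) := f^{-1}\big[w f(x) + C\big],$$
where $[\,\cdot\,]$ denotes the cutting operation.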
For special values, see Table 2.
The main drawback of the Łukasiewicz operator family is the lack of differentiability, which would be necessary for numerous practical applications. Although most fuzzy applications (e.g. embedded fuzzy control) use piecewise linear membership functions owing to their easy handling, there are areas where the parameters are learned by a gradient-based optimization method. In this case, the lack of continuous derivatives makes the application impossible. For example, the membership functions have to be differentiable for each input in order to fine-tune a fuzzy control system by a simple gradient-based technique. This problem can be solved by using the so-called squashing function (see Dombi and Gera [Gera]), which provides a continuously differentiable approximation of the cut function.
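The squashing function defined below is a continuously differentiable approximation of the generalized cutting function by means of sigmoid functions (see Figure 2):
$$S_{a,\lambda}^{(\beta)}(x) = \frac{1}{\lambda\beta} \ln \frac{1 + e^{\beta(x - (a - \lambda/2))}}{1 + e^{\beta(x - (a + \lambda/2))}}, \qquad \text{where} \quad \sigma_d^{(\beta)}(x) = \frac{1}{1 + e^{-\beta(x - d)}}$$
denotes the sigmoid function.
By increasing the value of $\beta$, the squashing function approaches the generalized cut function. The parameters $a$ and $\lambda$ determine its center and width.
The error of the approximation can be upper bounded by $c/\beta$ for a constant $c$, which means that by increasing the parameter $\beta$, the error decreases by the same order of magnitude.
The derivatives of the squashing function are easy to calculate and can be expressed by sigmoid functions and the squashing function itself:
$$\frac{\partial S_{a,\lambda}^{(\beta)}(x)}{\partial x} = \frac{1}{\lambda}\Big(\sigma_{a-\lambda/2}^{(\beta)}(x) - \sigma_{a+\lambda/2}^{(\beta)}(x)\Big), \qquad \frac{\partial S_{a,\lambda}^{(\beta)}(x)}{\partial a} = \frac{1}{\lambda}\Big(\sigma_{a+\lambda/2}^{(\beta)}(x) - \sigma_{a-\lambda/2}^{(\beta)}(x)\Big),$$
$$\frac{\partial S_{a,\lambda}^{(\beta)}(x)}{\partial \lambda} = -\frac{1}{\lambda} S_{a,\lambda}^{(\beta)}(x) + \frac{1}{2\lambda}\Big(\sigma_{a+\lambda/2}^{(\beta)}(x) + \sigma_{a-\lambda/2}^{(\beta)}(x)\Big).$$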
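To make the approximation property concrete, here is a small Python sketch. It assumes the squashing function has the form $S_{a,\lambda}^{(\beta)}(x) = \frac{1}{\lambda\beta}\ln\frac{1 + e^{\beta(x - (a - \lambda/2))}}{1 + e^{\beta(x - (a + \lambda/2))}}$ and that the generalized cut function is $[(x - a)/\lambda + 1/2]$. The helper names and the numerically stable softplus formulation are implementation choices made for this illustration.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(x: float, d: float, beta: float) -> float:
    """Sigmoid with center d and steepness beta, evaluated without overflow."""
    z = beta * (x - d)
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def generalized_cut(x: float, a: float, lam: float) -> float:
    """Piecewise linear cut with center a and width lam."""
    return min(1.0, max(0.0, (x - a) / lam + 0.5))


def squashing(x: float, a: float, lam: float, beta: float) -> float:
    """Smooth approximation of generalized_cut (assumed closed form)."""
    low, high = a - lam / 2.0, a + lam / 2.0
    return (softplus(beta * (x - low)) - softplus(beta * (x - high))) / (lam * beta)


def squashing_dx(x: float, a: float, lam: float, beta: float) -> float:
    """Derivative with respect to x, expressed by two sigmoids."""
    return (sigmoid(x, a - lam / 2.0, beta) - sigmoid(x, a + lam / 2.0, beta)) / lam


if __name__ == "__main__":
    # The largest gap to the cut function shrinks as beta grows.
    grid = [i / 100.0 for i in range(-100, 201)]
    for beta in (5.0, 50.0, 500.0):
        gap = max(abs(squashing(x, 0.5, 1.0, beta) - generalized_cut(x, 0.5, 1.0))
                  for x in grid)
        print(beta, gap)
```

Under these assumptions the largest deviation occurs at the two corners $a \pm \lambda/2$ of the cut function and is slightly below $\ln 2 / (\lambda\beta)$, which is consistent with an error bound that is inversely proportional to $\beta$.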
By using squashing functions one can approximate the Łukasiewicz operators and piecewise linear membership functions (i.e. trapezoidal or triangular) by substituting the cut function. The approximated membership functions are called soft and defined as:
The derivatives of a soft trapezoidal are simply the derivatives of the proper squashing function.
In this framework it becomes possible to define all the operators by a single generator function and a few parameters. Examples of the main operators and their soft approximations using the squashing function with different parameter values are shown in Figure 1. Note that asymmetrical membership functions can also be easily defined by connecting two unary operators with a conjunction.
IV Neural Networks Based on Nilpotent Logic
The results on nilpotent logical systems recalled in Section III offer a new approach to construct neural networks using continuous logic.
Neural nets are composed of networks of computational models of neurons called perceptrons. As illustrated in Figure 3, the threshold can be added as an additional weighted input to simplify the computation. Note that the perceptron can be interpreted as a linear classifier modelling the inequality
$$w_1 x_1 + \dots + w_n x_n > t, \tag{13}$$
where $x_i$ are the input values, $w_i$ are the weights and $t$ is the threshold. Also note that, as mentioned above, the membership functions can also be interpreted as modelling an inequality.
It is well known that any Boolean function can be composed using a multi-layer perceptron. As examples, the conjunction, the disjunction and the implication are illustrated in Figures 4 and 5. Note that for the XOR gate, an additional hidden layer is also needed. It can be shown that a network of linear classifiers that fires if the input is in a given area with arbitrarily complex decision boundaries can be constructed with only one hidden layer and a single output. This means that if a neural network learns to separate different regions in the $n$-dimensional space, having $n$ input values, each node in the first layer can separate the space into two half-spaces by drawing one hyperplane, while the nodes in the hidden layers can combine them using logical operators.
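The construction of Boolean functions from perceptrons can be sketched as follows. The weights and thresholds below are one possible choice made for this illustration and are not necessarily those shown in the figures. The unit fires when the weighted sum reaches the threshold, which is also an assumed convention.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def and_gate(x, y):
    return perceptron((x, y), (1, 1), 2)


def or_gate(x, y):
    return perceptron((x, y), (1, 1), 1)


def implies_gate(x, y):
    # x -> y fails only for x = 1, y = 0, that is when -x + y < 0.
    return perceptron((x, y), (-1, 1), 0)


def xor_gate(x, y):
    # XOR is not linearly separable, so a hidden layer is used:
    # the output fires when OR is on and AND is off.
    hidden = (or_gate(x, y), and_gate(x, y))
    return perceptron(hidden, (1, -1), 1)


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, and_gate(x, y), or_gate(x, y), implies_gate(x, y), xor_gate(x, y))
```

Each single gate corresponds to one separating line in the plane, while XOR combines two such lines in a second layer.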
In Figure 6, some basic types of neural networks are shown with two input values, finding different regions of the plane. Generally speaking, each node in the neural net represents one threshold and therefore can draw one line into the picture. The line can be diagonal if that node receives both of the input values. The line has to be horizontal or vertical if the node only receives one of the inputs. The deeper hidden levels are responsible for the logical operations.
Here, we suggest applying the nilpotent logical concept in the neural architecture to obtain nilpotent logic based neural networks (NL models) in the following way. In the first layer, the activation functions in the nodes are membership functions representing the truth value of the inequality (13). The nilpotent logical operators (see Table 1) work in the hidden layers. To ensure differentiability, the cutting function should be approximated by the squashing function from Definition 9. As an example, output values for a triangular domain using nilpotent logic and its continuous approximation for different parameter values are illustrated in Figure 9.
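A minimal Python sketch of such a network for a triangular domain is given below. The triangle $x \geq 0$, $y \geq 0$, $x + y \leq 1$, the membership width `lam`, the steepness `beta` and all function names are assumptions chosen for illustration. The first layer evaluates the truth value of each linear inequality, and the hidden node combines them with the Łukasiewicz conjunction $[\sum_i m_i - (n - 1)]$, once with the hard cut and once with its squashing approximation.

```python
import math


def cut(x: float) -> float:
    """Cutting operation: clip x to [0, 1]."""
    return min(1.0, max(0.0, x))


def softplus(z: float) -> float:
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def squashing(x: float, a: float, lam: float, beta: float) -> float:
    """Smooth approximation of the cut with center a and width lam (assumed form)."""
    low, high = a - lam / 2.0, a + lam / 2.0
    return (softplus(beta * (x - low)) - softplus(beta * (x - high))) / (lam * beta)


def half_plane_values(x: float, y: float):
    """Left-hand sides g_i(x, y) of the inequalities g_i > 0 describing the triangle."""
    return (x, y, 1.0 - x - y)


def triangle_hard(x: float, y: float, lam: float = 0.2) -> float:
    # First layer: piecewise linear memberships of the three inequalities.
    truths = [cut(g / lam + 0.5) for g in half_plane_values(x, y)]
    # Hidden node: nilpotent (Lukasiewicz) conjunction of the truth values.
    return cut(sum(truths) - (len(truths) - 1))


def triangle_soft(x: float, y: float, lam: float = 0.2, beta: float = 50.0) -> float:
    # Same architecture, with every cut replaced by a squashing function.
    truths = [squashing(g, 0.0, lam, beta) for g in half_plane_values(x, y)]
    return squashing(sum(truths) - (len(truths) - 1), 0.5, 1.0, beta)


if __name__ == "__main__":
    for point in ((0.25, 0.25), (0.5, 0.5), (1.0, 1.0)):
        print(point, triangle_hard(*point), triangle_soft(*point))
```

The soft version is differentiable everywhere, so its weights, thresholds and widths could in principle be tuned by gradient-based training, while larger values of `beta` bring its output closer to the hard version.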
Furthermore, taking into account that the area inside or outside a circle is described by an inequality containing the squares of the input values, it is also possible to construct a novel type of unit by adding the square of each input into the input layer (see Figure 7). This way, the polygon approximation of the circle can be eliminated. For an illustration see Figure 8. Note that by changing the weights an arbitrary conic section can also be described.
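The idea of adding squared inputs can be sketched with a single threshold unit. The sketch below assumes a unit that receives $(x, y, x^2, y^2)$ and fires when the weighted sum reaches the threshold. The weight formula follows from expanding $(x - c_x)^2 + (y - c_y)^2 \leq r^2$, and the function names are hypothetical.

```python
def circle_unit_parameters(cx: float, cy: float, r: float):
    """Weights and threshold so that the unit fires inside the given circle.

    (x - cx)^2 + (y - cy)^2 <= r^2 is equivalent to
    2*cx*x + 2*cy*y - x^2 - y^2 >= cx^2 + cy^2 - r^2.
    """
    weights = (2.0 * cx, 2.0 * cy, -1.0, -1.0)
    threshold = cx * cx + cy * cy - r * r
    return weights, threshold


def squared_input_unit(x: float, y: float, weights, threshold) -> int:
    """Threshold unit whose inputs are x, y and their squares."""
    features = (x, y, x * x, y * y)
    total = sum(w * f for w, f in zip(weights, features))
    return 1 if total >= threshold else 0


if __name__ == "__main__":
    weights, threshold = circle_unit_parameters(1.0, 2.0, 1.0)
    for point in ((1.0, 2.0), (1.5, 2.0), (3.0, 2.0)):
        print(point, squared_input_unit(point[0], point[1], weights, threshold))
```

Choosing different weights for the two squared inputs would describe an ellipse instead of a circle, in line with the remark on conic sections.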
V Experimental Results: Activation Function Performance
Choosing the right activation function for each layer is crucial and may have a significant impact on metric scores and the training speed of the neural model.
In the NL model introduced in Section IV, the smooth approximation of the cutting function is a natural choice for the activation function in the first layer as well as in the hidden layers, where the logical operators work.
Although a vast number of activation functions have been considered in the literature (e.g. linear, sigmoid, $\tanh$, or the more recently introduced rectified linear unit (ReLU) [relu], exponential linear unit (ELU) [elu] and sigmoid-weighted linear unit (SiLU) [silu]), most of them were introduced on the basis of some desired properties, without any theoretical background. The parameters are usually fitted only on the basis of experimental results. The squashing function (also called soft cutting or soft clipping function) introduced above stands out from the other candidates by having a theoretical background, thanks to the nilpotent logic that lies behind it.
In [softclip], Klimek and Perelstein presented a neural network (NN) algorithm optimized for Monte Carlo methods, which are widely used in particle physics to integrate and sample probability distributions on multi-dimensional phase spaces. The algorithm has been applied to several examples of direct relevance for particle physics, including situations with non-trivial features such as sharp resonances and soft/collinear enhancements. In this algorithm, each node in a hidden layer of the NN takes a linear combination of the outputs of the nodes in the previous layer and applies an activation function. The nodes in the final layer again take a linear combination of the values in the next-to-last layer, but then apply another function, the output function, which is chosen to map onto the set of possible outcomes for the given situation. For their purpose, an output function that maps onto the unit interval was needed, since phase space is described as a unit hypercube. Sigmoids approach the boundary values of 0 and 1 very slowly, which makes it difficult for the NN to populate the edges of the hypercube. Therefore the exponential linear unit (ELU) was used as the activation function and, since the ELU does not generate exponentially large values, a special case of the squashing function defined above (see Definition 9), also called the soft cutting or soft clipping function, was introduced and used as the output function. This soft clipping function is approximately linear within the unit interval and approaches its asymptotes very quickly outside that range. It is controlled by a single parameter, which determines how close to linear the central region is and how sharply the linear region turns to the asymptotic values.
Excellent performance has been demonstrated in all examples in favor of the squashing function (see Figures 8 and 9 in [softclip]).
In our experiments, we compared the performance of the squashing function (SQ) with the sigmoid function and the sigmoid-weighted linear unit (SiLU) using the Fashion MNIST dataset, which consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels and represents the article of clothing. The rest of the columns contain the pixel-values of the associated image.
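The data layout described above can be read with a few lines of standard-library Python. This is only a sketch of the parsing step: the presence of a header row, the scaling of pixel-values to $[0, 1]$ and the function names are assumptions, and no particular file name or training pipeline is implied.

```python
import csv
import io

IMAGE_SIDE = 28
N_COLUMNS = 1 + IMAGE_SIDE * IMAGE_SIDE  # one label column plus 784 pixel columns


def parse_row(row):
    """Split one CSV row into (label, image), with pixels scaled to [0, 1]."""
    if len(row) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(row)}")
    label = int(row[0])
    pixels = [int(value) / 255.0 for value in row[1:]]
    image = [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]
    return label, image


def read_examples(handle, has_header=True):
    """Read all examples from an open text handle in the 785-column layout."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the column names (assumed to be present)
    return [parse_row(row) for row in reader]


if __name__ == "__main__":
    # Synthetic one-row example, not real data.
    header = ",".join(["label"] + [f"pixel{i}" for i in range(IMAGE_SIDE * IMAGE_SIDE)])
    row = ",".join(["3", "255"] + ["0"] * (IMAGE_SIDE * IMAGE_SIDE - 1))
    examples = read_examples(io.StringIO(header + "\n" + row + "\n"))
    print(len(examples), examples[0][0], examples[0][1][0][:3])
```

Scaling the pixel-values to the unit interval is convenient here because the first-layer nodes are interpreted as membership functions with truth values in $[0, 1]$.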
The results shown in Figure 10 demonstrate the excellent performance of the squashing function.
VI Conclusion
In this work we suggested combining deep neural networks with the concepts of nilpotent logical systems to reduce uninterpretability of neural models by proposing a general mathematical framework.
In our model, the activation functions in the nodes of the first layer are membership functions representing inequalities, and the nilpotent logical operators work in the hidden layers. To ensure differentiability, the cutting function (in the membership functions as well as in the logical operators) is approximated by the squashing function. All the operators in the model can be defined by a single generator function and a few parameters.
A novel type of neural unit was also introduced by adding the square of each input into the input layer (see Figure 7) to describe the inside / outside of a circle without polygon approximation.
Finally, we showed that the squashing function as an activation function not only stands out from the other candidates considered in the literature by having a theoretical background, but also performs well in the first experiments on the Fashion MNIST dataset.
This research was partially supported by grant TUDFO/47138-1/2019-ITM of the Ministry for Innovation and Technology, Hungary.
- (1) D. Clevert, T. Unterthiner, S. Hochreiter, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), arXiv:1511.07289, 2015.
- (2) O. Csiszár, J. Dombi, Generator-based Modifiers and Membership Functions in Nilpotent Operator Systems, IEEE International Work Conference on Bioinspired Intelligence (iwobi 2019), July 3-5, 2019, Budapest, Hungary, 2019.
- (3) J. Dombi, Membership function as an evaluation, Fuzzy Sets Syst., 35, 1-21, 1990.
- (4) J. Dombi, O. Csiszár, The general nilpotent operator system, Fuzzy Sets Syst., 261, 1-19, 2015.
- (5) J. Dombi, O. Csiszár, Implications in bounded systems, Inform. Sciences, 283, 229-240, 2014.
- (6) J. Dombi, O. Csiszár, Equivalence operators in nilpotent systems, Fuzzy Sets Syst., doi:10.1016/j.fss.2015.08.012, available online, 2015.
- (7) J. Dombi, O. Csiszár, Self-dual operators and a general framework for weighted nilpotent operators, Int J Approx Reason, 81, 115-127, 2017.
- (8) J. Dombi, O. Csiszár, Operator-dependent Modifiers in Nilpotent Logical Systems, In Proceedings of the 10th International Joint Conference on Computational Intelligence (IJCCI 2018), 126-134, 2018.
- (9) D. Dubois and H. Prade, Fuzzy sets in approximate reasoning. Part 1: Inference with possibility distributions, Fuzzy Sets and Syst., 40, 143-202, 1991.
- (10) S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Networks 107, 3-11, 2018.
- (11) M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang and M. Vechev, DL2: Training and Querying Neural Networks with Logic, Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.
- (12) M. V. Franca, G. Zaverucha and A. S. d. Garcez, Fast relational learning using bottom clause propositionalization with artificial neural networks, Machine learning, 94(1):81-104, 2014.
- (13) A. S. d. Garcez, K. Broda, and D. M. Gabbay, Neural-symbolic learning systems: foundations and applications, Springer Science & Business Media, 2012.
- (14) J. Dombi, Zs. Gera, Fuzzy rule based classifier construction using squashing functions. J. Intell. Fuzzy Syst. 19, 3-8, 2008.
- (15) J. Dombi, Zs. Gera, The approximation of piecewise linear membership functions and Łukasiewicz operators, Fuzzy Sets Syst., 154, 275-286, 2005.
- (16) Z. Hu, X. Ma, Z. Liu, E. Hovy, E. P. Xing, Harnessing Deep Neural Networks with Logic Rules, arXiv:1603.06318v5
- (17) M. D. Klimek, M. Perelstein, Neural Network-Based Approach to Phase Space Integration, arXiv:1810.11509v1
- (18) T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, Deep convolutional inverse graphics network, In Proc. of NIPS, 2530-2538, 2015.
- (19) C. T. Lin, C.S.G. Lee, Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems, Upper Saddle River, NJ: Prentice Hall, 1996.
- (20) A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier Nonlinearities Improve Neural Network Acoustic Models, 2014.
- (21) G. G. Towell, J. W. Shavlik and M. O. Noordewier, Refinement of approximate domain theories by knowledge-based neural networks, in Proceedings of the eighth National Conference on Artificial Intelligence, Boston, MA, 861-866, 1990.
- (22) E. Trillas and L. Valverde, On some functionally expressable implications for fuzzy set theory, Proc. of the 3rd International Seminar on Fuzzy Set Theory, Linz, Austria, 173-190, 1981.
- (23) J. Xu, Z. Zhang, T. Friedman, Y. Liang, and G. V. den Broeck, A semantic loss function for deep learning with symbolic knowledge, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Volume 80, 5498–5507, 2018.