Probabilistic Approximate Logic
and its Implementation
in the Logical Imagination Engine
In spite of the rapidly increasing number of applications of machine learning in various domains, a principled and systematic approach to the incorporation of domain knowledge in the engineering process is still lacking. Ad hoc solutions that are difficult to validate are still the norm in practice, which is of growing concern, and not only in mission-critical applications. While AI has a long history of developing logics for knowledge representation, reasoning, and verification, we believe that in spite of rapid advances in both fields (cognitive and symbolic AI) there is a fundamental mismatch of technologies and foundations that is preventing unified solutions from emerging.
In this note, we introduce Probabilistic Approximate Logic (PALO) as a logic based on the notion of mean approximate probability to overcome conceptual and computational difficulties inherent to strictly probabilistic logics. The logic is approximate in several dimensions. Logical independence assumptions are used to obtain approximate probabilities, but by averaging over many instances of formulas a useful estimate of mean probability with known confidence can usually be obtained. To enable efficient computational inference, the logic has a continuous semantics that reflects only a subset of the structural properties of classical logic, but this imprecision can be partly compensated by richer theories obtained by classical inference or other means. Computational inference, which refers to the construction of models and validation of logical properties, is based on Stochastic Gradient Descent (SGD) and Markov Chain Monte Carlo (MCMC) techniques and is hence another dimension in which approximations are involved. The propositional core of PALO is based on ingredients from Hajek’s Product logic, Łukasiewicz logic, and Gödel logic, but a non-standard semantics of quantifiers and theories is the key to a probabilistic interpretation and a practical implementation.
We also present the Logical Imagination Engine (LIME), a prototypical implementation of PALO based on TensorFlow. Albeit not limited to the biological domain, we illustrate its operation in a quite substantial bioinformatics machine learning application concerned with network synthesis and analysis in a recent DARPA project.
Data and knowledge can both be seen as instances of uncertain information, even if the source of uncertainty may not be the same. Much research in machine learning naturally adopts a data-centric view, while research in logics is primarily concerned with knowledge and its manipulation. Models and uncertainty are key concepts that lie at the intersection of both fields and hence might be good starting points to lay the foundations for a more unified treatment that could serve as a basis for a better theoretical and technological integration. In practice, applications of machine learning should benefit from systematic methods to incorporate domain knowledge into the system engineering process. Conversely, logics and their implementations (e.g., validation/verification systems) should benefit from models that are learned from potentially large amounts of data in the context of other domain knowledge.
At the intersection, we have the concept of uncertainty that already has a long tradition in mathematics, engineering, computer science, and especially in the field of Artificial Intelligence. Often uncertainty is considered a nuisance or a challenge and pure qualitative and quantitative models, formalisms, and systems are extended to deal with uncertainty, sometimes in an ad hoc fashion viewing uncertainty as an add-on feature and an obstacle to a formally clean and computationally efficient treatment. A different approach is to think of uncertainty as an opportunity. Models and systems can be inherently based on uncertainty (very much like biological systems) and take advantage of uncertainty for conceptual and computational benefits. This seems to be the direction in which most of the research in machine learning is heading, especially with the probabilistic/approximate modeling that underpins many advances in the foundations of deep learning. In this note we would also like to use it as a motivation for PALO, our Probabilistic Approximate Logic, which we first put into context with some traditional and related work.
Building on the work of Bacchus, Halpern considered two classical probabilistic first-order logics and their combination. The first logic assumes that the domain of each variable constitutes a probability distribution, while the second uses a semantics where formulas are interpreted w.r.t. a set of worlds which is equipped with a probability distribution. PALO is not a classical logic but rather a soft logic with a continuous interpretation of propositions as real numbers in the interval [0, 1]. Domains are not equipped with probability distributions, but propositional formulas are interpreted as approximate probabilities. Such probabilities are lifted to quantified formulas by averaging to compensate for their approximate nature. Probabilities are further lifted to models of theories in a way that compensates for the natural uncertainty in the weight/relevance of the axioms. This last point is essential because, as a substructural logic, a soft logic is necessarily incomplete w.r.t. a classical semantics, which is more intuitive for the user to reason with, and hence an important reference influencing the design of PALO and its use.
Structure of this Paper
After discussing some important related work in the following section, we introduce the syntax of the PALO core language in Section 3 together with primarily two flavors of semantics (soft and classical logic interpretations) to support different types of inference. In Section 4, we discuss our current prototype implementation of PALO in what we call the Logical Imagination Engine (LIME). This prototype will be used in Section 5 to illustrate the application of PALO to a network synthesis problem in the bioinformatics domain. This is the problem that originally motivated the development of PALO and its implementation. A number of extensions and opportunities for future work in the context of PALO and LIME will then be discussed in Section 6, followed by a slightly broader view of a potential role of PALO for addressing some key limitations of deep learning architectures in the conclusion.
2 Related Work
The idea of viewing probabilities as generalized truth values goes back to Reichenbach but has remained unsatisfactory due to the non-extensional (in other words, non-truth-functional or non-denotational) nature of this interpretation. Generally, the probability of a compound formula such as a conjunction cannot be defined as a function of the probabilities of its components, and assumptions need to be made to achieve a strict correspondence to a meaningful probability. Our approach is conceptually related to a solution proposed by Gaines. He defines what he calls a Standard Uncertainty Logic (SUL) that by simple axiomatic extensions yields either Stochastic Logic or Fuzzy Logic. Both logics can be equipped with a population semantics (not necessarily limited to human individuals), where in the case of Stochastic Logic conjunction can be interpreted as a product with a suitable independence assumption that can be realized “externally” by “choosing a number of different individuals at random to answer each question involved in evaluating a compound”. Gaines considered only propositional logic, but in the more general first-order setting of PALO we also need to define the semantics of quantifiers, which helps us to realize his idea in a quite natural fashion. Another difference from Gaines’ work is that as a soft logic PALO does not satisfy the strict absorption (and hence idempotence) axioms of SUL and the additional distributivity and excluded middle axiom of Stochastic Logic (which is still a classical logic). It does so, however, with increasing precision when approaching the classical limit case where interpretations are constrained to {0, 1}.
It is noteworthy that our approach of combining selected operators from Hajek’s Product logic, Łukasiewicz logic, and Gödel logic in a non-standard fashion needs to be differentiated from work in the area of fuzzy logics, which is not aiming at a probabilistic interpretation but an orthogonal notion of truthiness (see also  for his population-based interpretation of Fuzzy Logic). For example,  investigates a propositional fuzzy logic that contains Product logic, Łukasiewicz logic, and Gödel logic as sublogics and the focus is on identifying a suitable axiomatization and a class of models so that soundness and completeness can be established. In contrast, our approach with PALO is purely semantic and motivated by computational feasibility. We do not attempt to establish an axiomatic system for symbolic inference in soft logic, but rather maintain a connection to classical logic for which symbolic methods and technologies are well developed.
Incidence Calculus is another approach to overcome the fact that a probabilistic interpretation of formulas is not truth-functional by using a less abstract semantics that interprets each formula as the set of assignments for which it holds so that conjunction becomes a simple intersection. Although this is an elegant solution, with our mean probability semantics that includes lower and upper bounds, it turns out that the bounds are sufficiently tight so that replacing our approximate by a strict probabilistic interpretation is unnecessary for the data-rich applications we are targeting. Two other practical difficulties with an exact probabilistic semantics are that dependencies between subformulas referring to external data are often unknown and that, even if all known dependencies were taken into account, doing so would lead to an unacceptably high computational complexity in the context of model generation and learning.
Soft logics have found renewed interest in the machine learning community, because of their potential to incorporate logical knowledge into the learning process. Most notably, Real Logic [33, 34] is the culmination of state-of-the-art efforts [31, 32, 35, 7, 13, 12] to develop soft logic with distributional (i.e., feature-based) semantics that can be directly compiled into neural networks, specifically a subclass called Logic Tensor Networks (LTNs). Real Logic and its associated LTNs comprise the first framework of this kind that supports full first-order logic with functions (as opposed to relations only) without being subject to the closed-world assumption. A key innovative idea of Neural Tensor Networks (NTNs), which is incorporated into LTNs and the Real Logic semantics, is the use of a family of efficiently learnable continuous predicates (and functions in LTNs) represented as neural networks. This is the key to leveraging deep learning technologies for model construction and enabling a seamless integration. Real Logic considers truth values as degrees of satisfaction and does not come with a probabilistic semantics. (More precisely, Real Logic and LTNs are parameterized by a t-norm, which is flexible enough to represent a whole family of soft logics, including (variations of) Product logic, Łukasiewicz logic, and Gödel logic.) However, PALO can be regarded as a probabilistic variation of Real Logic, and we generalized its implementation in terms of LTNs to serve as a suitable basis for the PALO prototype.
A large body of research in the intersection of machine learning and logic has been conducted in the context of Markov Logic Networks  which are equipped with a semantics in terms of Markov Random Fields. They are also the basis for SRI’s Probabilistic Consistency Engine (PCE) [25, 27, 10]. Model sampling and counting are used to assess the degree of satisfaction of theories with weighted rules, where weights can be learned from data . Limitations of these approaches include the inherent closed-world assumption (which is too limiting for general knowledge representation), lack of expressiveness and extensibility (e.g., no explicit probabilities, no functions, no equations), computationally expensive MCMC model sampling and weight learning (the model space is discrete leading to problems with state space explosion), and incompatibility with deep learning technologies (e.g., difficult to integrate with black-box deep learning components and not clear how to take advantage of massively parallel computing technologies such as GPUs).
A fairly recent improvement is Probabilistic Soft Logic with Hinge-Loss Markov Random Fields as models . The study  shows that the relation between Markov Logic and Probabilistic Soft Logic is analogous to the relation between Classical (Boolean) Logic and Fuzzy Logic. As in Markov Logic Networks, given a logical theory, the space of models is defined by a parameterized family of Gibbs distributions using potentials derived from (weighted) logical rules. Thanks to the use of a soft logic (specifically Łukasiewicz logic), the maximum a posteriori distribution can be efficiently computed by solving a convex optimization problem. While this approach is mathematically and computationally appealing, it is still based on a closed-world assumption, the use of Gibbs distributions is a significant limitation, and the relation between the distribution and the logical axioms is not as direct as desirable for a probabilistic logic. The potentials are based on Łukasiewicz logic, which does not have a standard probabilistic interpretation, and the precise impact of weights that are associated with the axioms is difficult to predict. PALO uses a more direct and intuitive mean probability semantics for logical theories. The tradeoff is that it does not constrain the model distribution to a known well-behaved family, and hence we are paying the price of dealing with non-convex albeit continuous optimization problems. Fortunately, we are in a position to take advantage of a broad range of algorithms for Stochastic Gradient Descent (SGD) optimization (such as ) and Markov Chain Monte Carlo (MCMC) sampling methods leveraging SGD (such as Bayesian learning via Stochastic Gradient Langevin Dynamics [40, 21]) that are available and still being further advanced due to the rapidly growing demands of deep learning architectures (see, e.g., , which shows how SGD can be viewed as approximate Bayesian inference).
3 Syntax and Semantics of PALO
PALO is still work in progress. For clarity, we focus on the current design for the core language and semantics of PALO in this section. The full semantics of PALO combines a mean probability interpretation of formulas with upper and lower bounds and will be introduced in several stages. In addition, two types of classical semantics with different degrees of abstraction will be defined to highlight, quantify, and take advantage of the connection to classical logic. A number of practically important extensions will be discussed in Section 6, and for the most part our definitions should be sufficiently general to accommodate these extensions with minor modifications.
3.1 Syntax of the Core Logic
A type signature is defined by a finite set of data types denoted by and the distinct propositional type (the type of formulas).
The set of types over , denoted by , is inductively defined: contains all data types in , all (Cartesian) product types of the form for , all function types of the form , and all predicate types of the form , where and . The subsets of product types, function types, and predicate types are denoted by , , and , respectively. Note that product types are defined over data types and not nested (not an essential limitation) and include all data types as a subset, i.e., data types are identified with single component product types.
Given a type signature and a dimensionality specification we inductively define the interpretation of types : (1) (interval of denoting truth values); (2) for all ; (3) ; (4) where denotes the space of continuous functions. It should be noted that there are no types with discrete interpretations, such as the type of natural numbers, in the current version of PALO, although extensions of PALO with such more traditional types are clearly conceivable. In the current version, such discrete types need to be embedded into continuous domains.
To simplify notation, we extend the notion of a type signature so that a dimensionality specification is part of for all of the following, and we simply use to refer to it in the context of . We further extend the type signature by a complexity specification , that will later be used to limit the complexity of the interpretation for symbols of predicate types so that they can be efficiently learned from data.
A signature extends a type signature by the following components so that all its components are pairwise disjoint: (1) a countable set of constant symbols for each data type , (2) a countable set of function symbols for each function type , (3) a countable set of propositional constant symbols, and (4) a countable set of predicate symbols for each predicate type . The pairwise disjointness requirement ensures that we do not have overloading, which simplifies the presentation of the semantics, but could be relaxed in future practical versions of PALO. To simplify notation, the dimensionality and complexity specifications are lifted to symbols through their types.
A signature with sorts extends a signature with a countable set of sort symbols for each product type such that all components remain pairwise disjoint. In the present version of PALO, the main role of sorts is to refer to (external) data sets inside the logic (see notion of sort binding below). (The use of two semantic levels, types and sorts, is quite similar to the use of kinds and sorts in membership equational logic  which underlies SRI’s Maude system .) A signature with variables extends a signature with a countable set of variable symbols for each such that all its components remain pairwise disjoint. We write to denote an extended signature with a fresh added to . Note that each sort, constant, function, predicate, and variable symbol has a unique type in this setting. Note also that while constants are associated with data types only, variables can be more generally associated with Cartesian product types, which is important for the expressiveness of quantifiers (introduced below).
Given a signature with sorts and variables we define the set of terms over and their types inductively as follows: (1) a variable is a term of data type ; (2) a constant is a term of data type ; (3) a tuple is a term of product type if is a term of data type ; (4) a function application is a term of type for if is a term of product type ; (5) a propositional constant is a term of type Prop; (6) the propositions (false) and (true) are terms of type Prop; (7) a predicate application is a term of type Prop for if is a term of product type ; (8) logical conjunction , disjunction , negation , implication , and equivalence are terms of type Prop if and are terms of type Prop; (9) quantifications (universal quantification), (existential quantification), and (mean quantification) are terms of type Prop if is a sort of product type and is a term of type Prop over ; and (10) tensor abstraction is a term of type if is a sort of product type and t is a term of type over . Terms of type Prop are also called propositions or formulas. All other terms are also called proper terms. The set of terms over of type is denoted as . In the quantifiers defined by (9) the variable is bound in . We use standard definitions of bound/free variables and closed terms/formulas and we identify terms that are equivalent modulo renaming of bound variables. Note that in the current version of PALO each term has a unique type, which is not essential but mainly done to simplify the semantics.
3.2 Approximate Probability Semantics
Given a signature , we define a -algebra to consist of the following components: (1) for ; (2) for for ; (3) for and for ; (4) for ; and (5) for . Note that our current description focuses on the core logic. If the signature is extended with built-in symbols, their interpretations should be added here.
Given a signature with sorts and a -algebra , a sort binding for is a family of functions (implicitly) indexed by product types of the form , such that and is finite for . Note that to establish a semantic connection to real world data sets we use an interpretation of sorts as finite, albeit typically large sets. Given a signature with variables and a -algebra , a variable binding for is a family of functions (implicitly) indexed by product types such that for . Given a binding , a variable , and , the updated binding is defined by and if .
Let be a signature with sorts and variables , and a family of model parameters for that will be made precise incrementally. For the presentation of the semantics, it will be useful to fix some variable naming conventions. We use to range over , to range over , to range over , to range over , to range over , and to range over , for all suitable types and . We will also use and to range over proper terms over and , to range over propositions over .
A -algebra is a -algebra if constant, function, and predicate symbols are interpreted as follows:
where ranges over all vectors consistent with the corresponding function or predicate type, is the Sigmoid function, and is applied elementwise, thereby lifted to vectors. We identify with its flattened representation, that is if is of type the length of is and analogously for predicates . We use to denote transposition and try to stick close to the notation used in [33, 34, 35]. Furthermore, , , , and , , , are constant, function, and predicate model parameters given by . Recall that is the complexity specification that is part of a type signature. Assuming that is of type , is a vector of length . Assuming is of type , is a vector of dimension , and is a matrix of dimension . The parameter is a scalar in . Analogously to the functional case and assuming is of type , the parameter is a vector of dimension , and the parameter is a matrix of dimension . The parameter denotes a -indexed family of matrices of dimension , and the expression in the above definition denotes a lifted version of the bilinear tensor product resulting in a -dimensional vector defined by for . Finally, the parameter is a vector of dimension . Such a -algebra is uniquely defined by and and will be denoted by or simply .
The above parametric class of efficiently learnable, continuous, and differentiable predicates (with their natural representation as neural networks) is the same as in LTNs [33, 34] and a direct generalization of the representation of binary predicates in NTNs . The power of this representation is that it can capture complex non-linear interactions between the predicate arguments. One may think of as a hyperparameter limiting the number of such interactions and hence together with the dimensionality the complexity of the class of learnable interpretations. However, there is no reason to claim that this family is sufficiently rich for all applications. Similarly, while linear functions are an important subclass for many applications, it will be too restrictive for others and does not match the power of predicates. Other families of learnable interpretations are clearly conceivable (see Section 6), and we should rather think of the above choice motivated by LTNs as a particular instantiation of PALO.
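As a rough illustration of the predicate family described above, the following sketch evaluates a predicate in the form published for LTNs: a sigmoid applied to a weighted sum of tanh units, each combining a bilinear and a linear term of the flattened argument vector. That tanh is the elementwise nonlinearity and that functions are plain affine maps are assumptions carried over from the LTN papers, not something confirmed by the text here. The names ltn_predicate and affine_function and all parameter names are hypothetical, and this is not LIME's actual code.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def ltn_predicate(v, W, V, b, u):
    """Truth value of a predicate applied to the flattened argument vector v.

    Assumed LTN-style shape (hypothetical parameter names):
      W : k matrices of size n x n (bilinear interactions)
      V : k x n matrix (linear part)
      b : bias vector of length k
      u : output weight vector of length k
    Here n = len(v) and k is the complexity of the predicate.
    """
    n, k = len(v), len(b)
    hidden = []
    for i in range(k):
        bilinear = sum(v[r] * W[i][r][c] * v[c]
                       for r in range(n) for c in range(n))
        linear = sum(V[i][c] * v[c] for c in range(n))
        hidden.append(math.tanh(bilinear + linear + b[i]))
    return sigmoid(sum(u[i] * hidden[i] for i in range(k)))


def affine_function(v, M, N):
    """Affine interpretation of a function symbol: M v + N (assumed form)."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) + N[r]
            for r in range(len(M))]
```

With k = 1 the predicate has a single interaction unit; increasing k adds further tanh units and hence more learnable interactions between the arguments, at the cost of more parameters.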
For the following, we assume a signature with sorts and variables , a -algebra , and bindings for and for .
With that we inductively define the approximate probability interpretation of terms (including proper terms and propositions) as follows:
Note that while conjunction is defined as in Hajek’s Product logic , the involutive negation is defined as in Łukasiewicz logic and implication has the standard classical definition (also called S-implication), which is different from Product logic’s R-implication. For equivalence, we use the probabilistically more accurate definition consistent with  rather than defining it as a derived operator. The meaning of conjunction is an exact probability if its subformulas are statistically independent, and should otherwise be seen as a best guess or approximate probability in absence of information about their dependence. (A more accurate explanation is that there is an implicit assumption that the subformulas are semantically sufficiently diverse so that a potential dependence is negligible relative to the diversity caused by different instantiations.) Its (average) precision will be quantified in the full semantics, which contains lower and upper bounds.
Disjunction is defined using De Morgan’s laws, establishing full symmetry between conjunction and disjunction. Associativity holds for both. However, the substructural (specifically linear) nature of this soft logic manifests itself by the fact that idempotence, and consequently absorption and distributivity, are not valid. (It is noteworthy that already in  it is pointed out that a case can be made for weaker systems of his Standard Uncertainty Logic (SUL) without idempotence, which is the only reason for the lack of these properties in PALO.) The law of the excluded middle and the law of contradiction are equivalent, because holds just as in , but neither is valid in PALO. However, idempotence and hence all of these properties are valid in the classical limit case where formulas are interpreted in {0, 1}, which will be the case in the alternative classical semantics for PALO. A key property identified in  that also holds in PALO in spite of the lack of idempotence is
which allows a limited form of modus ponens in the sense that it enforces a lower bound for given and , but as we will see this is only one form of inference that can take place in PALO which unlike most deductive systems does not favor any particular direction of execution.
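The propositional operators described above can be sketched in a few lines: product conjunction, involutive negation, disjunction via De Morgan, and S-implication. Equivalence is omitted because its exact definition is not recoverable from the text. The function names are hypothetical. The inequality a + (a → b) − 1 ≤ b used in modus_ponens_lower_bound follows from these operator definitions; that it is the exact form of the bound intended in the text is an assumption.

```python
def p_not(a):
    """Involutive negation (Lukasiewicz style)."""
    return 1.0 - a


def p_and(a, b):
    """Product conjunction: exact if a and b are statistically independent."""
    return a * b


def p_or(a, b):
    """Disjunction obtained from conjunction via De Morgan's laws."""
    return p_not(p_and(p_not(a), p_not(b)))


def p_implies(a, b):
    """S-implication: (not a) or b."""
    return p_or(p_not(a), b)


def modus_ponens_lower_bound(a, a_implies_b):
    """Lower bound on b given the values of a and (a -> b); assumed form."""
    return max(0.0, a + a_implies_b - 1.0)


# Idempotence and excluded middle fail away from the classical limit ...
half = 0.5
print(p_and(half, half))          # 0.25, not 0.5
print(p_or(half, p_not(half)))    # 0.75, not 1.0

# ... but hold when truth values are restricted to {0, 1}.
for a in (0.0, 1.0):
    assert p_and(a, a) == a
    assert p_or(a, p_not(a)) == 1.0
```

For these operators the sum of the values of a conjunction and the corresponding disjunction equals the sum of the values of their arguments, which is what makes the lower bound above hold even without idempotence.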
The existential quantifier, on the other hand, is defined in terms of the maximum, which is consistent with Skolemization (the most typical and intuitive classical interpretation) if the set of functions were sufficiently rich (which is not the case in our basic semantics, and another motivation for considering larger classes of functions). The universal quantifier is interpreted as a particular geometric mean, which is consistent with a (normalized) product interpretation of the quantifier viewed as a large conjunction over a batch of data. Here we need the assumption of (approximate) independence for the elements within each batch, which in practice limits the maximum batch size (see next subsection). Clearly, universal and existential quantifiers are not duals of each other, but for each quantifier we could formally define its dual counterpart by and giving rise to four different quantifiers, thereby reestablishing a formal symmetry. (More experience is needed to determine if the dual quantifiers are of any practical use.) The mean quantifier is quantitatively between universal and existential quantifiers and most directly captures the arithmetic mean probability of a formula.
Also note that the tensor abstraction could be used to define all three quantifiers, but for clarity we have used an explicit definition here. Hence, in the core logic we are presenting, the tensor abstraction cannot appear as an argument, but with suitable built-in (higher order) function symbols that would be possible in an extension of the core logic.
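The three quantifiers discussed above can be illustrated over a batch of truth values, one per element of the sort. The sketch assumes that the "particular geometric mean" of the universal quantifier is the plain geometric mean of the batch; the precise normalization used in PALO may differ. Function names are hypothetical.

```python
import math


def q_exists(values):
    """Existential quantifier: maximum truth value in the batch."""
    return max(values)


def q_forall(values):
    """Universal quantifier: geometric mean (normalized product), assumed form.

    Computed in log space to avoid underflow for large batches.
    """
    if any(v == 0.0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def q_mean(values):
    """Mean quantifier: arithmetic mean of the truth values in the batch."""
    return sum(values) / len(values)


batch = [0.9, 0.8, 0.5]
# The mean quantifier lies between the universal and existential quantifiers.
assert q_forall(batch) <= q_mean(batch) <= q_exists(batch)
```

The ordering checked at the end is an instance of the inequality between geometric mean, arithmetic mean, and maximum, and in the classical limit with values in {0, 1} the universal and existential quantifiers reduce to their Boolean counterparts.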
3.3 Mean Probability Semantics
Given a signature with sorts , a -algebra , and a sort binding for , a batch cover is a function with and for each sort and implies for all . A possible choice is the set of all subsets of of a fixed size. Lifting this concept from sorts to sort bindings, we define the batch cover as a set of all sort bindings such that for each . Using to generically denote any set of sort bindings for under (typically it will be ), this gives rise to the mean probability interpretation of closed formulas :
The mean probability semantics can be seen as an approximation of the population semantics of the Stochastic Logic in  for formulas that exhibit enough statistical diversity. Two important distinctions, however, are that we have to deal with quantifiers and that we use a soft logic rather than a classical one.
While we average over all possible combinations of batches for all sorts in the above definition, this is usually not feasible in practice, and hence a natural place where an implementation would approximate the sum by using a random subset of , where each sort is interpreted by batches of a fixed sample size. The determination of a suitable batch size for the application is left as a topic for further investigation. A rough although incomplete guide can be the concept of effective sample size from statistics.
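The sampling approximation described above can be sketched for the simplest case of a single sort: draw a random subset of fixed-size batches from the data and average the interpretation of a closed formula over them. The data, the batch size, and the names sample_batches and mean_probability are hypothetical, and a formula is represented here simply as a function from a batch to a truth value.

```python
import random


def sample_batches(data, batch_size, num_batches, rng):
    """Draw random batches of a fixed size (no repetition within a batch)."""
    return [rng.sample(data, batch_size) for _ in range(num_batches)]


def mean_probability(formula, batches):
    """Average the approximate probability of a closed formula over batches."""
    return sum(formula(batch) for batch in batches) / len(batches)


# Hypothetical data: each element already carries the truth value of P(x).
data = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75]


def mean_of_p(batch):
    """Interpretation of 'mean x in S. P(x)' on one batch."""
    return sum(batch) / len(batch)


rng = random.Random(0)  # fixed seed for reproducibility of the sketch
batches = sample_batches(data, batch_size=4, num_batches=50, rng=rng)
estimate = mean_probability(mean_of_p, batches)
```

With several sorts, each sampled sort binding would assign one batch to every sort, and the same averaging applies over the sampled bindings.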
For the following, we assume a signature with sorts, a -algebra , and a set of sort bindings under (constructed from a batch cover as above). For a closed formula we say that satisfies or is a model of with lower mean probability and upper mean probability (denoted by ) iff . In the following we lift this notion to theories.
3.4 Probabilistic Theories and their Semantics
A probabilistic theory is a set of triples , where is a closed formula over , and with define a probability interval for the truth value of . We say -algebra satisfies or is a model of (denoted by ) iff for all . We write if for some ,, and call an axiom of .
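The satisfaction condition just defined amounts to an interval check per axiom. The following sketch uses a toy propositional model in which formulas are functions of the model, so that the mean probability of a formula is simply its value; the model, the axioms, and all names are hypothetical.

```python
def satisfies_axiom(mean_prob, lower, upper):
    """An axiom holds if its mean probability lies in [lower, upper]."""
    return lower <= mean_prob <= upper


def satisfies_theory(theory, interpret):
    """A model satisfies a theory if it satisfies every (formula, l, u) triple."""
    return all(satisfies_axiom(interpret(phi), lower, upper)
               for phi, lower, upper in theory)


# Toy model: truth values of two propositional constants (hypothetical).
model = {"rain": 0.3, "wet": 0.9}


def interpret(formula):
    """Interpretation of a closed formula in the toy model."""
    return formula(model)


theory = [
    (lambda m: m["wet"], 0.8, 1.0),
    # rain -> wet, written as S-implication with product conjunction
    (lambda m: 1.0 - m["rain"] * (1.0 - m["wet"]), 0.9, 1.0),
]

print(satisfies_theory(theory, interpret))  # True
```

Adding an axiom whose interval excludes the model's value, for example requiring "rain" to lie in [0.5, 1.0], makes the check fail for this model.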
We extend the interpretation to probabilistic theories as follows:
where if else .
While in the previous definition the -algebra was unrestricted, we now return to the class of algebras generated by model parameters , where denotes the set of all model parameter instantiations. Similarly, we assume that is of the form , meaning that it is generated by a batch cover and a sort binding representing the data.
A probabilistic theory is satisfiable if for some . For a satisfiable probabilistic theory , the maximum likelihood model is defined by the following optimization problem:
The simple product formulation above, however, is too inflexible to account for the complexity of logical theories used in practical applications that by their very nature often involve complex dependencies that cannot be eliminated or sufficiently reduced by exploiting randomization or diversity. Hence, this formulation only serves as a stepping stone to a more general approach that introduces flexibility at the level of theories, which in a sense is dual to and complements the flexibility at the level of models that we already take for granted.
A flexible probabilistic theory is a family of probabilistic theories parameterized by such that is a set of triples of the form where is a parameter in specific to , also called a (flexible) weight for . Together, all parameters are called theory parameters to distinguish them from the model parameters defined previously. From now on all instantiations for both types of parameters will be included in .
We now extend the interpretation to flexible probabilistic theories as follows:
where if else . Note that the constraints are on rather than , which means that the weights are irrelevant for the probabilistic interpretation and constraints for individual axioms, but rather control how axioms are composed. With flexible weights PALO tries to (partially) compensate for complex dependencies between axioms. We will see later that such dependencies will be unavoidable and in fact essential in a context where classical logic is our reference for approximation. (This is in contrast to attempts to base theories on minimal axiomatizations, which is beneficial for inductive arguments in the proof theory of logics.)
A flexible probabilistic theory is satisfiable if for some . For a satisfiable flexible probabilistic theory the maximum likelihood theory and model is defined by the following optimization problem:
In general, however, the interpretation of a theory is a non-convex function admitting many local maxima, some of which may be caused by symmetries in the theory or in the interpretation of the symbols. [Footnote 8: More precisely, there is no guarantee of consistency for the likelihood function associated with (flexible) theories.] We might also have prior knowledge about the distribution of parameters or enforce certain types of regularization, e.g., to balance model complexity and the amount of available data. In such cases, a Bayesian treatment is more appropriate, where a flexible probabilistic theory induces a conditional probability (proportional to the likelihood) [Footnote 9: We are dealing here with an unnormalized likelihood because, like complex systems in statistical physics, our interpretation of theories contains an unknown normalization constant whose computation is infeasible in practice but fortunately irrelevant for the semantics.]
and the joint probability factorizes as
if is an assumed prior for the parameters .
Given a flexible theory and a prior for all parameters , we can now define three flavors of Bayesian semantics (which have to be approximated in practice). The posterior mode semantics is the set of all triples such that is a local maximum w.r.t. . The maximum a posteriori semantics is the set of all triples such that is a global maximum w.r.t. . Note that we allow a set of global optima instead of only a single one to account for symmetries. Finally, the most general posterior distribution semantics is the set of all triples equipped with a probability distribution .
Depending on the intended applications, several variations are possible in this framework such as the following staged hybrid semantics. A posterior model distribution semantics is the set of all triples with a probability distribution where are the model parameters and the theory parameters are determined by the maximum a posteriori semantics.
Any form of statistical inference related to the different flavors of posterior semantics will also be called (approximate) computational inference to distinguish it from (exact) symbolic inference as it is traditionally used in logics.
3.5 Lower and Upper Probability Semantics
The approximate probability semantics gives an exact probability only if the subformulas of composite formulas are independent, which is rarely the case in practice. Consider, for example, the extreme case of an atomic formula with . The approximate semantics yields even though classically the formula is equivalent to . Another extreme example exploiting the lack of idempotence is . A semantics that captures this propositional imprecision by interpreting formulas in terms of probability intervals can be defined as follows.
A lower and upper probability semantics (which we also call Fréchet semantics) can be obtained by using Fréchet bounds as follows:
All other equations from the definition of are duplicated for and without changes. The same holds for the mean probability interpretation , which is duplicated for and . The independence assumptions for batches are still required under this semantics.
Note that while defines conjunction as in Hajek’s Product logic and disjunction using involutive negation, and are defined as (strong) conjunction and disjunction in Łukasiewicz logic, while and correspond to Gödel logic (also called weak conjunction and disjunction in Łukasiewicz logic).
Since lower and upper probabilities are essential to understand the precision of the approximate semantics, it is natural to extend and to the combined semantics based on triples:
Applying this extended semantics to the extreme example we obtain . While this does not give us a better estimate, it alerts us about the high degree of imprecision in the approximate probability. Typically, formulas appear in the context of quantifiers, resulting in reasonably narrow bounds for the approximate mean probability. However, if the interval is still too large, it can be a sign of possible inherent dependencies between subformulas and a reformulation of the formula would be a natural response, for example using classical reasoning, which would yield in our example.
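The combined semantics can be illustrated with a minimal Python sketch that propagates triples (lower, approximate, upper) through the connectives named above: the product for the approximate value, and Łukasiewicz and Gödel operations for the Fréchet bounds. The class name `Triple`, the helper names, the atom probability 0.5, and the treatment of negation (complementing and swapping the bounds) are assumptions made for illustration; they are not taken from the LIME implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Lower bound, approximate probability, and upper bound of a formula."""
    lower: float
    approx: float
    upper: float


def atom(p: float) -> Triple:
    # An atomic formula with known probability p has a degenerate interval.
    return Triple(p, p, p)


def neg(t: Triple) -> Triple:
    # Assumption: involutive negation 1 - x, with the two bounds swapped.
    return Triple(1.0 - t.upper, 1.0 - t.approx, 1.0 - t.lower)


def conj(a: Triple, b: Triple) -> Triple:
    return Triple(
        max(0.0, a.lower + b.lower - 1.0),  # Lukasiewicz strong conjunction
        a.approx * b.approx,                # product (independence assumption)
        min(a.upper, b.upper),              # Goedel (weak) conjunction
    )


def disj(a: Triple, b: Triple) -> Triple:
    return Triple(
        max(a.lower, b.lower),                       # Goedel (weak) disjunction
        a.approx + b.approx - a.approx * b.approx,   # dual of the product
        min(1.0, a.upper + b.upper),                 # Lukasiewicz strong disjunction
    )


p = atom(0.5)                   # illustrative probability, not from the paper
contradiction = conj(p, neg(p))  # Triple(lower=0.0, approx=0.25, upper=0.5)
tautology = disj(p, neg(p))      # Triple(lower=0.5, approx=0.75, upper=1.0)
```

In this sketch the approximate value of the contradiction is 0.25 although its classical probability is 0, and the wide interval from 0 to 0.5 is what signals the dependency between the two subformulas.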
3.6 Abstract Classical Semantics for Reasoning
An abstract classical semantics (for the fragment without the mean quantifier) can be obtained by restricting to the Boolean set and using a trivial batch cover . The latter means that the semantics of quantifiers is exact rather than approximate, which suggests that computationally this semantics will not be useful in most practical cases. However, the use of a simple abstract semantics justifies the use of symbolic deduction. Although formally a special case, we use and instead of and to make clear that we are using the classical semantics.
Let be an arbitrary -algebra and a closed formula not containing the mean quantifier. We say that satisfies or is a model of (denoted by ) iff . We say satisfies or is a model of (denoted by ) iff for all . We say is a tautology of , written , iff for all -algebras such that . Note that the classical semantics uses all -algebras , not a parameterized subset such as . Classical tautologies can be established symbolically, e.g., as theorems generated by a sound and complete proof system for first-order logic, but the particular method is not relevant for this discussion. [Footnote 10: It should be noted that completeness does not imply completeness for our approximate semantics, because the class of predicates and functions is unrestricted in the abstract classical semantics. Hence soundness is the more important property here.]
A (flexible) probabilistic theory should be regarded as an approximation of a classical theory in the sense that starting from a core theory we can generate potentially infinite chains of theories using symbolic deduction
that are all classically equivalent but not necessarily equivalent under the probabilistic approximate semantics. [Footnote 11: Our argument naturally extends to a chain of embeddings, which allows us to introduce new (auxiliary) functions and predicates.]
Due to the inherent limitations of a soft logic to mimic exact inference, instead of using as the basis for probabilistic approximate inference the use of for some can lead to substantial improvements in efficiency and precision in approximating classical reasoning. An equivalent perspective is that in addition to the axiomatic domain knowledge additional classical theorems can be made available to the probabilistic engine. The determination of a suitable set of theorems is an interesting problem by itself and most likely related to a tradeoff between computational efficiency and precision that should be further investigated. Note that the classical semantics completely abstracts from the probabilities, which means both the approximate as well as the classical semantics are limited in their own ways, and it would be inappropriate to consider one superior over the other.
The complementary nature of computational inference using the approximate semantics and symbolic inference using the classical semantics is an interesting topic by itself that, although beyond the scope of this paper, leads to the idea of hybrid neural-symbolic architectures (to be briefly discussed in Section 6) that integrate both forms of reasoning in a synergistic way. The advantage of computational inference is its non-sequential, non-directed, and fuzzy nature that can benefit from today’s massively parallel hardware architectures. Symbolic inference, on the other hand, can maintain logical precision over many reasoning steps and at the same time work with templates of formulas or entire classes of models, essentially leading to a logical form of parallelism by exploiting symmetries and abstractions.
3.7 Concrete Classical Semantics for Validation
A less abstract probabilistic classical semantics can be obtained by crispification. [Footnote 12: The term crispification is inspired by , albeit in that reference it is enforced as an axiom, while we define it at the semantic level.] This can be useful if a soft-logic model has already been identified, and we would like to reinterpret the model from a classical viewpoint.
To this end, we define a crispification operator , where is a fixed threshold, by if and otherwise. Now is inserted in each equation for dealing with a term of propositional type. The resulting semantics is the mean probability interpretation for a -algebra and batch cover , which we denote as or simply using the default value for .
While crispification itself leads to a loss of precision regarding the model, the use of classical logic, on the other hand, increases logical precision (by avoiding the incompleteness of the soft-logic approximation). This is a tradeoff that may give some insights into the structure of a given model and motivate new hypotheses and extensions of the domain theory. It is also a reference to validate the standard semantics against, e.g., to quantify the degree of incompleteness in the context of specific theories and applications.
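The effect of inserting the crispification operator at every propositional term can be sketched in a few lines of Python. The function names, the default threshold of 0.5, and the choice of a non-strict comparison at the threshold are assumptions for illustration only; the text above fixes neither the default nor the boundary case.

```python
def crisp(p: float, theta: float = 0.5) -> float:
    """Map a soft truth value to 0.0 or 1.0 (threshold and >= are assumed)."""
    return 1.0 if p >= theta else 0.0


def soft_and(a: float, b: float) -> float:
    # Product conjunction of the approximate semantics.
    return a * b


def crisp_and(a: float, b: float, theta: float = 0.5) -> float:
    # Crispify every propositional term: both arguments and the result.
    # On {0.0, 1.0} the product coincides with Boolean conjunction.
    return crisp(soft_and(crisp(a, theta), crisp(b, theta)), theta)


# Two moderately likely atoms: the soft conjunction drops below the threshold,
# while the crispified semantics treats both atoms as classically true.
soft_value = soft_and(0.6, 0.6)
classical_value = crisp_and(0.6, 0.6)
```

Here `soft_value` falls below the threshold while `classical_value` is 1.0, which shows the two kinds of imprecision side by side: the soft logic loses logical precision, and crispification loses information about the model.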
4 The Logical Imagination Engine
A partial prototype of the Logical Imagination Engine (LIME) has been developed using a generalized and extended version of Logic Tensor Networks (LTNs), which are implemented as a layer on top of TensorFlow . Currently it implements the posterior mode semantics (including lower and upper probabilities) using model sampling with Adam  and also the concrete classical semantics for a given model. We expect that implementing the full posterior distribution semantics would be straightforward using SGD/Langevin MCMC sampling  that is already available in TensorFlow. Our LIME prototype operates within JupyterFlow [36, 39], our notebook-based distributed workflow framework for Python/TensorFlow that transparently takes advantage of clusters of heterogeneous machines (e.g., with varying numbers of CPUs and GPUs). [Footnote 13: Our latest version also supports virtual Kubernetes clusters and takes advantage of special features in the Google cloud to efficiently share large amounts of data.] We currently use a cluster of machines to parallelize model sampling and other application-specific tasks (such as graph synthesis in our bioinformatics application discussed in the next section).
There are some minor limitations of our prototype that do not lead to severe restrictions in practice. For example, no type checking is implemented yet and each variable is associated with a unique fixed sort (i.e., we can view the variable name as the sort). Also, a temporary limitation is that axioms have to be in a particular form (essentially negation is pushed down to the level of atomic propositions) to make use of the upper and lower bounds semantics. Finally, we use a restricted family of batch covers (used in the sampling semantics for quantifiers) that are parameterized by sample size (number of random sort-bindings) and a sort-specific batch size that should be sufficient for most applications. One limitation that is quite significant is that the current prototype is inheriting the linear semantics for functions from LTNs, leading to a mismatch with the quite rich interpretation of predicates, and it does not allow for type-dependency of the complexity specification. This will be easy to rectify, but needs to be carefully evaluated in the context of an application that makes use of more complex functions such as multivariate polynomials (our current bioinformatics application is using predicates only).
With these limitations in mind, LIME implements the following functionality. [Footnote 14: It is based on an extension of the LTN API, which is well-structured and easy to use.] A signature can be defined by listing symbols for constants, predicates, and functions with their type (currently only their dimension needs to be specified). A theory is defined by listing the axioms, where each axiom can be equipped with lower and upper mean probability constraints. The constraints are taken into account by a suitable extension of the TensorFlow objective function, which without constraints is simply given by the likelihood (defined by the mean probability semantics). Note that the mean probability semantics is dependent on model and theory parameters, which are translated into TensorFlow variables behind the scenes. [Footnote 15: To represent flexible theories, flexible axioms are used, but for experimentation we also support the option of using fixed weight axioms, in which case the weight has to be specified explicitly. We do not recommend its use, however, due to the non-intuitive behavior of weights.]
Given a signature and a theory, LIME supports model synthesis (also called learning or training) and model analysis (also called querying, but we distinguish between model validation and model evaluation). [Footnote 16: Model validation is similar to model checking, which verifies if a given model satisfies a given property, but in PALO models are learned from a logical theory and data.] Model sampling is the (repeated) use of model synthesis to generate a distribution of models, which is strongly biased towards maximum likelihood in the posterior mode semantics. All functions are parameterized by a binding of sorts to their associated domains (e.g., data sets for training or validation), which defines the set of sample sort-bindings and forms the basis of our semantics.
Model synthesis has additional parameters such as the maximum number of training epochs (maximum number of sample sort-bindings used for training), a patience parameter for early stopping if no progress is made for the specified number of epochs (to reduce overfitting or overthinking as it might be called in this context of a logical theory), a minimum likelihood threshold to discard models with lower likelihood, and a maximum number of trials to find a model above this threshold. If successful, model synthesis results in an implicitly stored model that can be subjected to further model analysis.
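The interplay of the four synthesis parameters can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the LIME API: the function `synthesize`, the callable `new_trial` (which is assumed to yield one likelihood value per training epoch for a freshly initialized model), and the toy likelihood curves are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def synthesize(
    new_trial: Callable[[int], Iterable[float]],
    max_epochs: int,
    patience: int,
    min_likelihood: float,
    max_trials: int,
) -> Optional[Tuple[int, float]]:
    """Return (trial index, best likelihood seen) of the first accepted trial."""
    for trial in range(max_trials):
        best = float("-inf")
        stale = 0  # epochs since the last improvement
        for epoch, likelihood in enumerate(new_trial(trial)):
            if epoch >= max_epochs:
                break
            if likelihood > best:
                best, stale = likelihood, 0
            else:
                stale += 1
                if stale >= patience:  # early stopping: no recent progress
                    break
        if best >= min_likelihood:     # discard models below the threshold
            return trial, best
    return None                        # no acceptable model within max_trials


# Hypothetical likelihood curves standing in for two training runs.
curves = {0: [0.1, 0.2, 0.2, 0.2, 0.9], 1: [0.3, 0.6, 0.8, 0.8]}
result = synthesize(lambda t: iter(curves[t]), max_epochs=10, patience=2,
                    min_likelihood=0.7, max_trials=2)
```

In this toy run the first trial stalls and is stopped by the patience rule before it reaches the threshold, so it is discarded and the second trial is accepted.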
Model analysis comes in two flavors. In both cases, the currently active model is an implicit parameter. Model validation (based on incomplete sampling) computes the mean probability semantics of a formula (using sampling for the quantifiers). A sort binding and a sample size (that is, the number of sample sort-bindings over which the mean is taken) are passed as parameters. Model evaluation (based on exhaustive iteration) considers all free variables of the given formula implicitly bound by the tensor abstraction and hence, under a given sort binding, results in a tensor with one dimension for each free variable. This provides a natural way to extract detailed information from the model, e.g., an exhaustive enumeration of the probabilistic interpretation of a predicate for a finite set of arguments. Note that model evaluation does not use the mean probability semantics. Instead of using to generate sample sort-bindings , it directly uses to perform the evaluation, which is typically exhaustive for the domains of interest.
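The difference between the two flavors of model analysis can be sketched in plain Python. Everything here is a hypothetical stand-in: `evaluate`, `validate`, the toy predicate `close`, and the uniform sampling of batches are assumptions for illustration and do not reproduce the LIME functions or its batch covers.

```python
import math
import random
from typing import Callable, List, Sequence


def evaluate(pred: Callable[[float, float], float],
             domain: Sequence[float]) -> List[List[float]]:
    """Exhaustive evaluation: one table axis per free variable (here x and y)."""
    return [[pred(x, y) for y in domain] for x in domain]


def validate(formula: Callable[[Sequence[float]], float],
             domain: Sequence[float],
             sample_size: int, batch_size: int, seed: int = 0) -> float:
    """Sampling-based validation: mean of a closed formula over random batches."""
    rng = random.Random(seed)
    values = []
    for _ in range(sample_size):
        batch = rng.sample(list(domain), batch_size)  # one sample sort-binding
        values.append(formula(batch))
    return sum(values) / len(values)


def close(x: float, y: float) -> float:
    # Toy learned predicate with values in (0, 1].
    return math.exp(-abs(x - y))


genes = [0.0, 0.5, 1.0, 1.5, 2.0]  # stand-in for a finite sort binding

table = evaluate(close, genes)      # 5 x 5 table of probabilities


def reflexive(batch: Sequence[float]) -> float:
    # Mean over the batch of close(x, x): a closed formula with a mean quantifier.
    return sum(close(x, x) for x in batch) / len(batch)


score = validate(reflexive, genes, sample_size=100, batch_size=3)
```

Evaluation returns the full table, which is what one would use to extract a relation as a graph, whereas validation returns a single number summarizing a property over sampled batches.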
It should be noted that the combination with an expressive logic naturally leads to a generalization of the traditional notion of validation in machine learning (e.g., the simple notion based on a separation of training and test sets). With PALO we are concerned with two orthogonal dimensions of generalization: the generalization of a property from the training to the test set, and the generalization of an axiom (which has been used during training) to another property (which is used during testing only). LIME’s model validation functionality is sufficiently general to support both notions and a combination of these. In addition, there is another dimension of validation offered by the flexible semantics and its configuration at runtime (see below).
In the following we summarize some of the generalizations/extensions of the LTN library that were necessary for the implementation of the LIME prototype. The LTN syntax was extended with a mean operator, as it needs to be clearly distinguished from universal and existential quantifiers. The parameterization has been extended to include PALO as a logic together with a definition of its non-standard semantics. As part of the PALO semantics, we added sampling, leading to new implementations of model synthesis and analysis with the parameters mentioned above. Furthermore, the selection of the semantics has been made dynamic, so that existing models can be viewed or reinterpreted under different semantics, simply by switching the semantics at runtime. In this configurable framework, we also added the lower/upper probability semantics and the concrete classical semantics based on crispification.
As an experimental feature we have implemented an alternative approximation of the posterior mode semantics. It exploits the new capability of switching the semantics at runtime, which is supported even during the training process. Inspired by the notion of curriculum learning , which proceeds in stages of increasing complexity of the training data, we consider our semantics as a limit case of a chain of semantics that differ in the interpretation of the existential quantifier. A staged training schedule with increasing semantic complexity can avoid the computationally fragile maximum interpretation of the existential quantifier early in the training process [Footnote 17: In fact, training starts with an interpretation that matches that has been earlier mentioned as a dual of and hence might shed some light on its role.] and has the potential to improve stability and efficiency of model synthesis, but more experience with applications is needed and a detailed evaluation has to remain a topic for future work.
5 Sample Bioinformatics Application
In the DARPA Rapid Threat Assessment (RTA) program we have been developing data analysis, machine-learning, and logic-based techniques to support biologists in understanding the so-called mechanism of action (MoA) that is triggered when an (unknown) drug or toxic substance hits a human cell. From relatively short windows of time after the event in question (e.g., 48 hours) our algorithms generate graphs representing potential causality between compounds. [Footnote 18: In spite of the use of perturbations, it should be noted that this abstract notion of causality is based on observational data with its known limitations (e.g., confounding effects), and might be better called causality modulo observational equivalence. This is in contrast to, for example, knock-out studies for individual genes, which, however, due to higher cost cannot compete with the sheer data volume and coverage typical for observational studies.] The basis is time series of typically high-dimensional data, e.g., transcriptomics (gene expression), proteomics, and metabolomics data. We have also developed algorithms for anomaly detection that highlight certain nodes in such graphs as potentially impacted and allow the biologist to narrow down the mechanism of action. The algorithms developed use a variety of models including Gaussian processes (on non-linear time scales) and other linear and non-linear models, ranging from principal component analysis and various types of clustering to a broad range of neural network models. For more details about the project and some initial results we refer to .
More recent algorithms that we developed include anomaly detection using convolutional autoencoders, autoencoder-based causality detection and network graph synthesis, predictive deep neural networks for temporal evolution and their visualization as graphs, generative adversarial networks for synthetic modeling and detection of typical vs. unusual behavior, and Siamese (twin) neural networks for probabilistic causality detection (validated using a dynamic gene expression model taking advantage of our original Gaussian process model). An informal presentation of our causal network synthesis algorithms and some sample results for the RTA data can be found in .
One challenge that we encountered is that each type of algorithm has its own representation of biological assumptions. For example, autoencoder-based causal network synthesis makes some assumptions about the nature of biological causality, which are hardwired into the algorithm. This is not only unsatisfactory from an engineering point of view but also leads to limitations and inflexibility regarding the kind of domain knowledge that can be represented.
In the latest generation of algorithms we used PALO to represent biological domain knowledge as a logical theory. A domain theory of causality specific to the biological domain is used as background knowledge during learning, resulting in an entire distribution of models that are probabilistically consistent with the theory. The biologist can select and explore suitable models in their graph representation and further evolve the domain theory as more knowledge becomes available. Any hypothesis that should be tested can also be formalized as part of the domain theory.
5.1 Specification using PALO
Our formalization of the domain theory combines a generic theory of physical causality with axioms [Footnote 19: Our theory of causality is loosely inspired by the theory of concurrency and causality that underlies Petri nets  but greatly simplified.] taking into account observational evidence and some limited biological domain knowledge. The source of observational evidence is another neural network model that has been trained and validated to detect the existence of causality between genes solely based on their expression time profiles (modeled as Gaussian Processes) but without determining its direction. The details of this Siamese neural network model can be found in , but are not essential for the following formalization, [Footnote 20: A useful feature of Siamese networks is that they allow us to represent certain structural properties of relations, e.g., symmetry, directly in terms of the network structure. This is part of a general symmetry theme in our (equational) logic-based view discussed in Section 6.] which can be employed as long as we can obtain approximate probabilities for causal relations to start with. [Footnote 21: In the RTA project we also developed models to predict direction, but we are intentionally using the simpler model here to avoid introducing additional uncertainty. An extension of this approach which incorporates both undirected and directed probabilistic causality is possible and has recently been implemented in the RTA workflow as well.] Utilizing a synthetic gene expression model, it is trained to determine the probability of a causal relationship between any two genes. The same model can be used to approximate independence, which we define as the probability of a causal relation being low.
As a basis we use a gene expression data set obtained by treating human cells with what was called during a DARPA challenge  and later revealed to be a common drug (atorvastatin) that regulates cholesterol biosynthesis. We use PCA-based dimensionality reduction to generate an embedding of all protein-coding genes with significant perturbations (comparing treated against control time series), for which interesting causal relationships can be expected, in a 10-dimensional Euclidean space. [Footnote 22: We use 10 for our gene dimension specification and 100 as a universal complexity specification (defining the family of learnable predicates). These hyperparameters were experimentally determined and reflect a tradeoff between computational resources and modeling precision.] The full space is represented by a type of dimensionality 10, and the sort is bound to the relevant data set, namely the finite subset of all protein-coding genes with significant changes (approx. out of protein-coding genes).
One important application-dependent choice is the definition of the batch cover used to limit the batch size in the sampling semantics of quantifiers. Here we used a rough correlation-based analysis to establish that the effective sample size is larger than , which should then be a reasonable choice for the batch size to maintain independence between genes in a single batch.
Two other sorts are needed to establish the link to experimental data: contains all pairs of genes for which causality can be detected with high probability (e.g., yielding pairs) and contains all pairs for which absence of causality is detected with high probability (e.g., yielding pairs). It should be noted that in RTA, the data sets and most of these constants are determined by parameters, and the analysis is part of a larger workflow, but we are trying to keep things simple here to convey the basic idea.
Our domain theory is formalized by four binary relations, , , , and (which can all be visualized as graphs), which means they are of type . The relations and stand for undirected causality and independence (concurrency). The relation stands for directed causality and for immediate causality (also directed). In the following we list and briefly motivate the axioms of the domain theory.
The basic axioms formalize irreflexivity of causality and reflexivity of independence (a useful convention although other formalizations are possible). The two symmetry axioms reflect the undirected nature of these basic relations, and the last two axioms express that these concepts are mutually exclusive and complementary.
Consistency with experimental data:
Here we can interpret the geometric mean semantics of the universal quantifiers roughly as an approximation of a minimum probability that is robust to outliers. While the lower bound on the probability predicted by our causality detector model for both causally dependent and independent pairs is , we use the intervals (that is ) and (that is ), respectively, to account for some uncertainty (using a larger interval for independence due to the increased chance that long range causality can be mistaken as independence). [Footnote 23: Our validation studies indicated that the causality detector model tends to slightly underestimate higher probabilities, but there are also some biological assumptions underlying that model that are reflected by our relatively large uncertainty intervals.]
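The robustness of the geometric mean compared to a strict minimum can be seen in a small numerical sketch. The probabilities, the interval, and the helper names below are invented for illustration and are not the values used in the paper.

```python
import math
from typing import Sequence


def geometric_mean(probs: Sequence[float]) -> float:
    """Geometric mean of strictly positive probabilities, computed in log space."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def within(value: float, lower: float, upper: float) -> bool:
    """Check a mean probability against an interval constraint of an axiom."""
    return lower <= value <= upper


# 99 well-supported pairs and a single outlier (illustrative numbers only).
probs = [0.9] * 99 + [0.01]

strict_minimum = min(probs)            # dominated entirely by the outlier
robust_value = geometric_mean(probs)   # only mildly affected by the outlier
arithmetic = sum(probs) / len(probs)

satisfied = within(robust_value, 0.8, 1.0)  # hypothetical interval constraint
```

A single outlier pulls the minimum down to its own value, while the geometric mean stays close to the bulk of the data yet remains below the arithmetic mean, so it still penalizes low-probability instances more strongly than plain averaging would.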
Axioms for directed causality:
Directed causality is formalized as a partial order that implies and generates undirected causality (first and second axiom). The subsequent axioms are simply the strict partial order axioms (irreflexivity, asymmetry, and transitivity). Note that asymmetry is an example of an axiom that can be classically derived, but it is stated explicitly due to its importance. The partial order formalizes global consistency of the causal direction in the larger context but does not introduce a directional bias (that is at least two directions are possible as in microscopic physics). Also it can be easily verified that the remaining axioms maintain this time reversal invariance.
Axioms for immediate causality:
Immediate causality is a subset of directed causality (third axiom) such that an intermediate causal element does not exist (fourth axiom). The other axioms are key properties that can be classically derived (theorems). They are intended to make the soft-logic theory more precise and the approximate computational inference more efficient. In effect, the last three are local consistency rules for causality and independence.
Estimates about density and degree of causality:
In addition to the basic physical and experimental data axioms for causality, we can use some domain knowledge to further narrow down the biologically plausible models. For instance, from curated gene expression networks we can use estimates for the expected density of immediate causality (first axiom) and for the mean in- and out-degree of typical networks (last two axioms), which is low, and most likely much less than out of due to the approximately scale-free nature of these graphs. [Footnote 24: There is some experimental evidence for an asymmetry between in- and out-degree that could be reflected by using more precise intervals.] These examples also show how the mean probability quantifier can serve a useful purpose.
Our probability intervals for experimental data and domain expertise are fairly wide to avoid inadvertently excluding feasible models, but it turns out that even such wide intervals are sufficient to narrow down the set of the most likely models to a plausible subset (often with narrow ranges on the questionable parameters thanks to the multitude of other constraints and the large amount of data) that can be further analyzed quantitatively and inspected by a biologist.
5.2 Sample Results
In the following we visualize a sample model as a graph that depicts , the immediate causality relation that can be extracted from the model using the model evaluation functionality described in Section 4. The graph has been simplified by removing edges with a probability below and by removing isolated nodes, that is, genes that do not exhibit any highly probable immediate causal connections. An automatically generated layout of the resulting graph with a subgraph that contains genes relevant to the mechanism of action of the drug is shown in Fig. 1.
It is important to understand that this model represents one sample model in the posterior mode semantics, which is biased towards models with (locally) high likelihood. To understand the broader range of possibilities, for each instantiation of parameters, we typically generate 100 sample models in our automatic parallelized workflow on a cluster of GPU servers. In contrast to traditional deep learning, where a single model is usually sufficient and it has been argued that the non-convexity does not matter, our model space is strongly non-convex in a way that (partly) matters for the result. For instance, as already indicated above, for each model of our theory the inverse model is equally likely, always leading to a complementary mode in the distribution.
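The graph simplification described above (dropping low-probability edges, then dropping isolated nodes) can be sketched as follows. The function name, the dictionary representation of the evaluated relation, the placeholder gene names, and the threshold of 0.5 are assumptions for illustration; the actual threshold used for Fig. 1 is not reproduced here.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def simplify(prob: Dict[Edge, float],
             threshold: float) -> Tuple[Set[str], Dict[Edge, float]]:
    """Keep edges with probability >= threshold and the nodes they touch."""
    edges = {edge: p for edge, p in prob.items() if p >= threshold}
    # Nodes without any remaining edge are isolated and therefore dropped.
    nodes = {gene for edge in edges for gene in edge}
    return nodes, edges


# Hypothetical output of model evaluation for the immediate causality relation.
relation = {
    ("g1", "g2"): 0.93,
    ("g2", "g3"): 0.71,
    ("g3", "g1"): 0.12,
    ("g4", "g1"): 0.05,
}

nodes, edges = simplify(relation, threshold=0.5)
```

In this toy input the node `g4` disappears because its only edge falls below the threshold, while `g3` is kept because it still has an incoming edge.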
The graphs underlying , , and are too large to show, but using model evaluation together with sampling we can show abstractions, such as the histograms in Fig. 2, which can often shed light on the convergence of the model synthesis process. Comparing the histograms for and , we can clearly see how their complementary nature also shows up at the statistical level. Furthermore, the fact that is approximately the symmetric closure of is consistent with the similarity in the shape of their histograms. Finally, is a very sparse subrelation of and , which manifests itself in a highly asymmetric shape.
While we used the model evaluation functionality to obtain Fig. 1, we now use the model validation functionality to quantify the mean probability of the properties of interest. For the sake of brevity, we focus on the axioms of our theory, even if in practice we may verify other implied and non-implied properties using the same functionality. In Fig. 3 we list each validated property together with three numbers. The first is the relative importance (that we also called the flexible weight) for each axiom. Recall that these are theory parameters that have been inferred during model synthesis together with all model parameters. The second number is the mean probability of the quantified property. The last number is the mean probability of the formula under the quantifier (in other words, the top-level quantifier is replaced by a mean quantifier), which is often more intuitive for the user. The validated properties are ranked by their mean probabilities.
An interesting observation is that the importance/weight is not a very intuitive number and hence not necessarily a parameter that should be exposed to the normal user. For example, the axiom involving is satisfied in the model with an adequate (normalized) probability of more than (quite consistent with the capability of our causality detector which claims a lower bound of ), but the weight is relatively low, which intuitively might suggest that the axiom has not been heavily used to achieve/maintain this result (presumably partly because we have another axiom involving that is not independent and there are other axioms to infer undirected causality). A similar observation holds for the axiom with the existential quantifier that helps to generate immediate causality . It should be noted, however, that in the PALO semantics, implication does not have a preferred direction so that any axiom involving can potentially contribute to the generation of new probabilistic pairs. Apart from this omnidirectional inference, another factor that complicates the understanding of the approximate reasoning that takes place during model synthesis is that, by being intertwined with learning, the notions of generalization and reasoning by similarity have a major impact on the result and on how the axioms are used. For example, a property established for one gene may automatically transfer to similar genes, albeit to varying degrees.
Finally, we would like to gain an understanding of the precision of our approximate validation for the given model. One might expect high imprecision, as the approximate probability does not account for logical dependencies. To assess this, we use the full semantics that includes lower and upper mean probabilities. The results of model validation using this semantics are shown in Fig. 4. For a better intuition, we validate the mean probabilities of all axioms without the universal quantifier. It turns out that even when we add confidence intervals, the bounds are very tight, in spite of the fact that we used only 100 samples of sort-bindings (of modest batches of size for genes and for pairs) to compute the mean. This shows that our data sets contain enough diversity and that our theory is suitably structured to achieve very good precision in the relevant mean probabilities.
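To make the role of the sample count concrete, the following minimal Python sketch estimates a mean probability from a fixed number of randomly sampled batches and attaches a confidence interval to it. It is only an illustration under stated assumptions: the function names, the uniformly random toy truth values, the batch size of 32, and the normal-approximation interval are all hypothetical and are not taken from LIME. It also does not model the lower and upper mean probabilities of the full semantics; it only shows why 100 batch samples can already yield a narrow interval around a mean.

```python
import math
import random
import statistics


def batch_means(values, num_samples, batch_size, rng):
    """Draw num_samples random batches (without replacement within a batch)
    and return the mean value of each batch."""
    return [
        statistics.fmean(rng.sample(values, batch_size))
        for _ in range(num_samples)
    ]


def mean_with_confidence(samples, z=1.96):
    """Return (mean, lower, upper) using a normal approximation.

    z=1.96 corresponds to a roughly 95% interval. The bounds are clipped
    to [0, 1] because the samples are interpreted as probabilities.
    This interval construction is an assumption, not LIME's method.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate the spread")
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, max(0.0, mean - z * sem), min(1.0, mean + z * sem)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for the truth values of one formula over all bindings.
    truth_values = [rng.random() for _ in range(1000)]
    samples = batch_means(truth_values, num_samples=100, batch_size=32, rng=rng)
    mean, lower, upper = mean_with_confidence(samples)
    print(f"mean={mean:.3f}, interval=[{lower:.3f}, {upper:.3f}]")
```

In this sketch the interval width shrinks with the square root of the number of samples and with the spread of the batch means, so larger batches or less variable truth values tighten the bounds further.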
It is noteworthy that sampling-based model evaluation can be used not only to extract relations but also to evaluate any term in the logic, in particular propositional terms, that is, formulas. This can be used to obtain more details about the satisfaction of an axiom or any other property in a given model, as illustrated in Fig. 5.
We have illustrated approximate computational inference using the primary semantics of PALO, namely the mean probability semantics with lower and upper bounds, but we would like to point out that the results are based on a formalization of the domain theory that is sufficiently explicit and hence computationally efficient for our purposes. Deriving such richer theories from a small set of basic axioms is an interesting topic in itself, and a place where the abstract classical semantics is essential. Our logical theory was simple enough to verify the derived axioms manually under this semantics, but more complex domain theories may benefit from automatic symbolic inference (see Section 6 on possible extensions of LIME).
Finally, the concrete classical semantics can be used to evaluate the imprecision introduced by our soft logic approximation against a classical semantics, which necessarily suffers from a very different type of imprecision caused by crispification. The results are shown in Fig. 6 and are quite acceptable for this type of application, which involves many other sources of uncertainty.
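The contrast between a soft evaluation and a crispified one can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the connectives are the standard product t-norm, probabilistic sum, and involutive negation rather than PALO's actual connectives, the formula and the threshold of 0.5 are hypothetical, and the helper names do not correspond to anything in LIME. The sketch evaluates the same formula once on graded truth values and once on thresholded ones, and reports the gap between the two means.

```python
import statistics


def soft_and(a, b):
    """Product t-norm."""
    return a * b


def soft_or(a, b):
    """Probabilistic sum, the dual of the product t-norm."""
    return a + b - a * b


def soft_not(a):
    """Involutive negation 1 - a (an assumption for this sketch)."""
    return 1.0 - a


def crispify(p, threshold=0.5):
    """Map a graded truth value to 0.0 or 1.0 (hypothetical threshold)."""
    return 1.0 if p >= threshold else 0.0


def formula(a, b, c):
    """Toy formula (a AND b) OR (NOT c)."""
    return soft_or(soft_and(a, b), soft_not(c))


def soft_vs_crisp(bindings, threshold=0.5):
    """Return (soft mean, crisp mean, absolute gap) over all bindings.

    On inputs restricted to {0, 1} the connectives above coincide with
    the Boolean ones, so the crisp value is obtained by crispifying the
    inputs and reusing the same formula.
    """
    soft = statistics.fmean([formula(a, b, c) for a, b, c in bindings])
    crisp = statistics.fmean(
        [formula(*(crispify(x, threshold) for x in t)) for t in bindings]
    )
    return soft, crisp, abs(soft - crisp)


if __name__ == "__main__":
    example = [(0.9, 0.8, 0.1), (0.6, 0.6, 1.0), (0.2, 0.7, 0.4)]
    print(soft_vs_crisp(example))
```

Bindings whose truth values lie close to 0 or 1 contribute little to the gap, while values near the threshold can be flipped entirely by crispification, which is the kind of imprecision the classical evaluation introduces.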
We conclude this section with another caveat regarding our formalization. Immediate causality, which is the basis for our biological network graphs, is relative to the chosen level of abstraction, which is a subset of genes in our example. The reality is far more complex, as some protein-coding genes encode transcription factors, that is, proteins that in turn regulate gene expression in the context of other transcription factors in a complex fashion that can favor up- or down-regulation. More complete networks with proteins and positive and negative dependencies have been studied in the RTA project as well. We do not expect that applying PALO and LIME to such networks would require fundamental changes in the theory. On the other hand, a logical treatment of our more abstract cluster-based graph synthesis would lead to some modifications and could be an interesting topic for future work.
The time reversal invariance of