Conditions for Unnecessary Logical Constraints in Kernel Machines
A main property of support vector machines is that only a small portion of the training data, the so-called support vectors, is significant in determining the maximum margin separating hyperplane in the feature space. Similarly, in the general scheme of learning from constraints, where possibly several constraints are considered, some of them may turn out to be unnecessary with respect to the learning optimization, even if they are active for a given optimal solution. In this paper we extend the definition of support vector to support constraint and we provide some criteria to determine which constraints can be removed from the learning problem while still yielding the same optimal solutions. In particular, we discuss the case of logical constraints expressed by Łukasiewicz logic, where both inferential and algebraic arguments can be considered. Some theoretical results that characterize the concept of unnecessary constraint are proved and explained by means of examples.
Keywords: Support Vectors · First–Order Logic · Kernel Machines.
1 Introduction
Support vector machines (SVMs) are a class of kernel methods originally conceived by Vapnik and Chervonenkis. One of the main advantages of this approach is the capacity to create nonlinear classifiers by applying the kernel trick to maximum–margin hyperplanes [1, 3]. This property derives from the implicit definition of a (possibly infinite) high–dimensional feature representation of data determined by the chosen kernel. In the supervised case, the learning strategy consists in the optimization of an objective function, given by a regularization term, subject to a set of constraints that enforce the membership of the example points to the positive or negative class, as specified by the provided targets. The satisfaction of these constraints can also be obtained by minimizing a hinge loss function that does not penalize output values “beyond” the target. As a consequence, the solution of the optimization problem depends only on a subset of the given training data, namely those points that contribute to the definition of the maximum–margin hyperplane separating the two classes in the feature space. In fact, if we approach the problem in the framework of constrained optimization, these points correspond to the active constraints in the Lagrangian formulation. This means that we can split the training examples into two categories: the support vectors, which completely determine the optimal solution of the problem, and the straw vectors. When solving the Lagrangian dual of the optimization problem, the support vectors are those supervised examples corresponding to constraints whose Lagrange multiplier is nonzero. In this paper we extend this paradigm to a class of semi–supervised learning problems where logical constraints are enforced on the available samples.
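To make the role of the hinge loss concrete, the following Python sketch evaluates it on a handful of points. The scores and the targets in {-1, +1} are hypothetical values chosen only for illustration; they are not taken from the experiments of this paper.

```python
import numpy as np

def hinge_loss(scores, targets):
    """Hinge loss max(0, 1 - y * f(x)) for targets y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - targets * scores)

# Hypothetical classifier outputs f(x) and their targets (assumed values).
scores = np.array([2.5, 1.0, 0.3, -0.2, -1.7])
targets = np.array([1.0, 1.0, 1.0, -1.0, -1.0])

losses = hinge_loss(scores, targets)

# Points strictly beyond the margin (y * f(x) > 1) are not penalized and
# cannot be support vectors; the remaining points are the only candidates.
candidates = np.flatnonzero(targets * scores <= 1.0)
```

Only the points on or inside the margin can carry a nonzero multiplier, which is the sense in which the remaining examples behave as straw vectors.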
Learning from constraints has been proposed in the framework of kernel machines as an approach to combine prior knowledge and learning from examples. In particular, some techniques to exploit knowledge expressed in a description logic language and by means of first-order logic (FOL) rules have been proposed in the literature [15, 5]. In general, these techniques assume a multi–task learning paradigm where the functions to be learnt are subject to a set of logical constraints, which provide an expressive and formally well–defined representation for abstract knowledge. For instance, logical formulas may be translated into continuous functions by means of t-norm theory. This mapping allows the definition of an optimization problem that integrates supervised examples and the enforcement of logical constraints on a set of available groundings. In general, the resulting optimization problem is not guaranteed to be convex as in the original SVM framework, due to the contribution of the constraints. However, it turns out to be convex when considering formulas expressed with a fragment of the Łukasiewicz logic. In this case, the problem can be formulated as quadratic optimization since the constraints are convex piecewise linear functions. Other related methods to embed logical rules into learning schemes have been considered, such as [19, 18], where a framework called Logic Tensor Networks has been proposed, and , where logic rules are combined with neural network learning.
The notion of support constraints has been proposed in [10, 11] to provide an extension of the concept of support vector when dealing with learning from constraints. The idea is based on the definition of entailment relations among constraints and the possibility of constraint checking on the data distribution. In this paper, we provide a formal definition of unnecessary constraints that refines the concept of support constraint and we provide some theoretical results characterizing the presence of such constraints. These results are illustrated by examples that show in practice how the conditions are verified. The main idea is that unnecessary constraints can be removed from a learning problem without modifying the set of optimal solutions. Similarly, with the specific goal of defining algorithms that accelerate the search for solutions in optimization problems, it is worth mentioning the works in the Constraint Reduction (CR) field. In particular, in  it is shown how to reduce the computational burden in a convex optimization problem by considering at each iteration the subset of the constraints that contains only the most critical (or necessary) ones. In this sense, our approach allows us to determine theoretically which constraints are unnecessary, as well as to highlight their logical relations with the other constraints.
The paper is organized as follows. In Section 2 we introduce the notation and the problem formulation. Then, Section 3 analyzes the structure of the optimal solutions, providing the conditions to determine the presence of unnecessary constraints. The formal definition of unnecessary constraint and the related theorems are reported in Section 3.2. In Section 4 we show how the proposed method is applied by means of some examples and finally, some conclusions and future directions are discussed in Section 5.
2 Learning from Constraints in Kernel Machines
We consider a multi–task learning problem with denoting a set of functions to be learned. We assume that each belongs to a Reproducing Kernel Hilbert Space (RKHS)  and it is expressed as
where is a function that maps the input space into a feature space (possibly having infinite dimensions), such that , where is the -th kernel function. The notation is quite general to take into account the fact that predicates (e.g. unary or binary) can be defined on different domains and approximated by different kernel functions (predicates sharing the same domain may be approximated in the same RKHS by using the same kernel function).
We assume a semi-supervised scheme in which each is trained on two datasets, containing the supervised examples and containing the unsupervised ones, while all the available input samples for are collected in , as follows
In the following, whenever we write , we assume . Functions in P are assumed to be predicates subject to some prior knowledge expressed by a set of First–Order Logic (FOL) formulas with in a knowledge base , and evaluated on the available samples for each predicate.
The learning problem is formulated to require the satisfaction of three classes of constraints, defined as follows.
Consistency constraints derive from the need to restrict the values of the predicates to [0, 1], in order to be consistent with the logical operators:
Pointwise constraints derive from the supervisions by requiring the output of the functions to be 1 for target and 0 for :
Logical constraints are obtained by mapping each formula in KB into a continuous real-valued function according to the operations of a certain t-norm fuzzy logic (see Tab. 1 for the Łukasiewicz fuzzy logic; see e.g.  for more details on fuzzy logics) and then forcing their satisfaction by
where for any , is the vector of the evaluations (groundings) of the -th predicate on the samples in and is the concatenation of the groundings of all the predicates.
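As an illustration of this mapping, the sketch below implements the standard Łukasiewicz connectives on truth values in [0, 1] and evaluates the violation of a constraint on a few groundings. Two points are assumptions of the sketch rather than statements of the paper: the constraint is taken in the form 1 - f ≤ 0, where f is the truth degree of the formula, and the grounding vectors `a` and `b` are hypothetical.

```python
import numpy as np

def l_not(x):
    """Lukasiewicz negation."""
    return 1.0 - x

def l_and(x, y):
    """Lukasiewicz t-norm (strong conjunction)."""
    return np.maximum(0.0, x + y - 1.0)

def l_or(x, y):
    """Lukasiewicz t-conorm (strong disjunction)."""
    return np.minimum(1.0, x + y)

def l_implies(x, y):
    """Lukasiewicz residuum (implication)."""
    return np.minimum(1.0, 1.0 - x + y)

def violation(truth):
    """Assumed constraint form: a formula is satisfied when 1 - truth <= 0."""
    return 1.0 - truth

# Hypothetical groundings of two unary predicates on three samples.
a = np.array([0.2, 0.9, 0.6])
b = np.array([0.5, 0.4, 0.6])

# Pointwise violation of a(x) -> b(x): positive only where a exceeds b.
viol = violation(l_implies(a, b))
```

The implication is violated only on the second sample, where the premise is truer than the conclusion.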
2.2 Optimization Problem
Given the previously defined constraints, the learning problem can be formulated as the following primal optimization problem:
This problem was shown to be solvable by quadratic optimization provided the formulas in belong to the convex Łukasiewicz fragment (i.e. formulas exploiting only the operators in Tab. 1) and, in the following, we keep this assumption. As a consequence, the functional constraints are both convex and piecewise linear functions, hence they can be expressed as the max of a set of affine functions (see Theorem 2.49 in ), where the number of linear pieces depends on both the formula and the number of groundings used in that formula:
where is a vector defining the -th linear piece depending on the structure of the -th formula, and . Basically any weighs the contribution of the -th sample in for the -th predicate in the -th linear piece deriving from the Łukasiewicz formula of the -th logic constraint. The matrix , obtained concatenating all the by row, may have several null elements, as shown in the examples reported in the following.
Example 1. Let be a unary and a binary predicate, respectively, evaluated on and so that , denote their grounding vectors. Given the formula , according to the convex Łukasiewicz operators, its corresponding functional constraint can be rewritten as the max of a set of affine functions, i.e. , that can be made explicit with respect to the grounding vectors of and by:
In this case , and, for instance, and .
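The max-of-affine representation can be checked numerically. The sketch below does so for a hypothetical formula, ∀x a(x) → b(x) with two unary predicates, which is simpler than the formula of the example above. It assumes that the universal quantifier is translated as the minimum over the groundings and that the constraint has the form 1 - f ≤ 0; both are assumptions of the sketch.

```python
import numpy as np

def implication_pieces(n):
    """Affine pieces of max(0, max_i (a_i - b_i)).

    The grounding vector is p = [a_1..a_n, b_1..b_n]. Row k of A together
    with q[k] defines the affine piece A[k] @ p + q[k].
    """
    A = np.zeros((n + 1, 2 * n))
    for i in range(n):
        A[i + 1, i] = 1.0        # coefficient of a_i
        A[i + 1, n + i] = -1.0   # coefficient of b_i
    q = np.zeros(n + 1)          # row 0 is the constant piece 0
    return A, q

def violation_direct(a, b):
    """1 - min_i min(1, 1 - a_i + b_i), computed from the connectives."""
    truth = np.min(np.minimum(1.0, 1.0 - a + b))
    return 1.0 - truth

# Hypothetical groundings on two samples.
a = np.array([0.2, 0.9])
b = np.array([0.5, 0.4])

A, q = implication_pieces(len(a))
p = np.concatenate([a, b])
violation_pieces = np.max(A @ p + q)
```

Most entries of `A` are zero, in line with the remark that the matrix of linear pieces may have several null elements.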
According to eq. (1), any logical constraint for can be replaced by linear constraints , so that Problem 1 can be reformulated as a quadratic program. Hence, assuming that the associated KKT–conditions are satisfied and that the feasible set of solutions is not empty, for any the optimal solution obtained by differentiating the Lagrangian function of Problem 1 (see ) is computed as:
Each solution can be written as an expansion of the -th kernel with respect to the three different types of constraints on the corresponding sample points. As in classical SVMs, we may study the constraints whose optimal Lagrange multipliers are not null, namely the support (active) constraints.
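A kernel expansion of this kind can be evaluated as in the following sketch. The polynomial kernel parameters, the training points and the coefficients `alpha` are hypothetical, and a possible bias term is omitted; the coefficients stand for the aggregated contribution of the multipliers on each sample.

```python
import numpy as np

def polynomial_kernel(X, Z, degree=2, coef0=1.0):
    """Polynomial kernel (x . z + coef0) ** degree between the rows of X and Z."""
    return (X @ Z.T + coef0) ** degree

def evaluate_expansion(alpha, X_train, X_new, kernel=polynomial_kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x) on the rows of X_new."""
    return kernel(X_new, X_train) @ alpha

# Hypothetical sample points and expansion coefficients.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.array([0.5, -0.25, 0.0])   # the third point has a null coefficient
X_new = np.array([[1.0, 1.0]])

values = evaluate_expansion(alpha, X_train, X_new)

# Dropping the point with a null coefficient leaves the function unchanged.
values_reduced = evaluate_expansion(alpha[:2], X_train[:2], X_new)
```

A sample whose overall coefficient is zero plays no role in the learned function, which is the behaviour studied below at the level of constraints.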
3 Unnecessary Constraints
The optimal solution of Problem 1 is determined only by the support constraints. The problem is convex if the Gram matrix of the chosen kernel is positive-semidefinite and strictly convex if it is positive-definite. The solution is guaranteed to be unique only in this second case. In both cases, different multiplier vectors , may yield an optimal solution for the Lagrangian function associated to the problem (see e.g. Example 3).
In this study, we are interested in constraints that are not necessary for the optimization, even if they may turn out to be active for a certain solution. The main results of this paper establish some criteria to discover unnecessary constraints and their relationship with the underlying consequence relation among formulas in Łukasiewicz logic.
3.1 About Multipliers for Logical Constraints
By construction, pointwise and consistency constraints are both related to a single sample for a given predicate. This means that the contribution of the active constraints of this type in any point is weighted by a specific multiplier, as expressed by the first and third summations in eq. . Conversely, each logical constraint involves in general several predicates, possibly evaluated on different points (each Lagrange multiplier in the second summation of eq. is associated to a set of samples). Hence we may wonder whether there exists a vector of Lagrange multipliers yielding the same contribution to the solution for each point, for which as many multipliers as possible are null.
For simplicity, eq. can be rewritten more compactly by grouping the terms with respect to any sample as
where denote the vectors of optimal coefficients (depending on optimal Lagrange multipliers) of the kernel expansion for pointwise, logical and consistency constraints respectively. In particular, the term for the logical constraints is defined as
Since (4) corresponds to the overall contribution of the logical constraints to the -th optimal solution in its -th point, we are interested in the case where we obtain the same term with different values for the multipliers . In particular, we would like to verify if there exists such that for every and for every , it is possible to compute such that
This condition yields the same solution to the original problem but without any direct dependence on the -th constraint. This case can be determined as defined in the following Problem 2, where a matrix formulation is considered, and then by looking for a solution (if one exists) with null components for the -th constraint.
Problem 2. Given an optimal solution for Problem 1, find such that
where and .
Let be an orthonormal basis of the space generated by , such that any solution can be expressed as
for some . We have the following cases:
if then the system allows the unique solution ;
if then there exist infinitely many solutions.
In the first case, the only constraints whose multipliers give null contribution to the optimal solution are the original straw constraints. In the second case, instead, we look for a solution (if one exists) where for any for some . Indeed in such a case, we can replace with by transferring the contribution of the -th constraint to the other constraints, still obtaining the same optimal solution for the predicates. This is carried out by solving the linear system with equations and variables .
In the following, we will say that a vector is a solution of Problem 2 with respect to , if it is a solution and for every .
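The search described above can be sketched numerically as follows. The matrix `M` and the multiplier vector `lam_star` are hypothetical: each column of `M` stands for the contribution of one logical constraint to the groundings, loosely modelled on three implications between three predicates on a single point. The sketch also assumes that admissible multipliers must be nonnegative, as required by the KKT-conditions for inequality constraints. It uses a plain least-squares solve, so when the reduced system has several solutions it only tests the minimum-norm one; an exhaustive search would need a linear program or a nonnegative least-squares solver.

```python
import numpy as np

def null_space(M, tol=1e-10):
    """Orthonormal basis (as columns) of Ker(M), computed from the SVD."""
    _, s, vt = np.linalg.svd(M)
    rank = int(np.sum(s > tol))
    return vt[rank:].T

def solve_without(M, lam_star, drop, tol=1e-8):
    """Look for lam >= 0 with M @ lam == M @ lam_star and lam[drop] == 0.

    Returns the multiplier vector, or None if the least-squares candidate
    does not reproduce M @ lam_star or has negative components.
    """
    keep = np.setdiff1d(np.arange(M.shape[1]), drop)
    target = M @ lam_star
    sol, *_ = np.linalg.lstsq(M[:, keep], target, rcond=None)
    lam = np.zeros(M.shape[1])
    lam[keep] = sol
    same = np.allclose(M @ lam, target, atol=tol)
    nonneg = np.all(lam >= -tol)
    return lam if same and nonneg else None

# Hypothetical data: the third column is the sum of the first two.
M = np.array([[1.0, 0.0, 1.0],
              [-1.0, 1.0, 0.0],
              [0.0, -1.0, -1.0]])
lam_star = np.array([0.2, 0.1, 0.3])

kernel_basis = null_space(M)                      # one direction: infinitely many solutions
lam_no_third = solve_without(M, lam_star, [2])    # contribution moved to the other two
lam_no_first = solve_without(M, lam_star, [0])    # no nonnegative solution in this case
```

With these values the third constraint can be deactivated by transferring its multiplier to the first two, while the first one cannot be deactivated without making a multiplier negative.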
3.2 Unnecessary Hard–Constraints
Roughly speaking, we say that a given constraint is unnecessary for a certain optimization problem if its enforcement does not affect the solution of the problem. The main idea is that if we consider two problems (defined on the same sample sets and with the same loss), one with and one without the considered constraint, both have the same optimal solutions. The relation between logical inference and deducible constraints arises naturally in this setting, since logical deductive systems involve truth-preserving inference. In addition, logical constraints are general enough to include both pointwise and consistency constraints. A supervision for a predicate can be expressed by if and by if , while the consistency constraints by . We note that in this uniform view, Problem 2 applies to all the constraints.
Definition 1. Let us consider the learnable functions in P evaluated on a sample and . We say that is unnecessary for if the optimal solutions of problems and coincide, where
and is the Gram matrix of .
If and are the feasible sets of and respectively, we have , however in general they are not the same set.
Since all the considered constraints correspond to logical formulas, we can also exploit some consequence relation among formulas in Łukasiewicz logic. In the following, we will write , where is a set of propositional formulas, to express the truth-preserving logical consequence in Ł, stating that has to be evaluated as true for any assignment satisfying all the formulas in .
Proposition 1. If then is unnecessary for .
Proof. By hypothesis, any solution satisfying the constraints of satisfies the constraints of as well, namely we have . The conclusion easily follows since the two problems have the same loss function with the same feasible set.
One advantage of this approach is that it provides some criteria to determine the constraints that are not necessary for a learning problem. Indeed, in the presence of a large number of logical rules, Proposition 1 guarantees that we can remove all the deducible constraints, simplifying the optimization while still getting the same solutions.
The converse of Proposition 1 does not hold in general, since the logical consequence has to hold for every assignment. The notion of unnecessary constraint is local to a given dataset, since the available sample is limited and fixed in general. However, if a constraint is unnecessary then the optimal solutions with or without it coincide, and such a constraint is satisfied whenever the other ones are satisfied by any optimal assignment. Such a consequence among constraints, taking into account only the assignments leading to the best solutions on a given dataset, provides an equivalence with the notion of unnecessary constraint. It is interesting to notice that a slightly different version of this consequence has already been considered in .
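The premise of Proposition 1 can be probed numerically on small propositional cases. The sketch below checks, on a finite grid of truth values, whether every assignment that gives truth degree 1 to the premises also gives truth degree 1 to the conclusion. A grid check is only a sanity test under the chosen resolution, not a proof of logical consequence; the function names and the grid size are choices of the sketch.

```python
import itertools
import numpy as np

def implies(x, y):
    """Lukasiewicz implication on truth values in [0, 1]."""
    return min(1.0, 1.0 - x + y)

def entails_on_grid(premises, conclusion, n_vars, steps=11, tol=1e-9):
    """True if no grid assignment satisfies all premises but not the conclusion."""
    grid = np.linspace(0.0, 1.0, steps)
    for values in itertools.product(grid, repeat=n_vars):
        if all(f(*values) >= 1.0 - tol for f in premises):
            if conclusion(*values) < 1.0 - tol:
                return False
    return True

# Premises a -> b and b -> c over three propositional variables.
premises = [lambda a, b, c: implies(a, b),
            lambda a, b, c: implies(b, c)]
transitive = lambda a, b, c: implies(a, c)   # expected to follow
converse = lambda a, b, c: implies(c, a)     # expected not to follow

follows = entails_on_grid(premises, transitive, n_vars=3)
does_not_follow = entails_on_grid(premises, converse, n_vars=3)
```

For the transitive law the check succeeds, so the constraint associated with the conclusion is a candidate for removal in the sense of Proposition 1, whereas the converse implication fails already on the assignment (0, 0, 1).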
3.3 Towards an Algebraic Characterization
In Sec. 3.1 we introduced a criterion to discover if a given constraint can be deactivated solving Problem 2. The method consists in finding a vector of Lagrange multipliers with null components corresponding to . We are now interested in discovering the relation between this criterion and the notion of unnecessary constraint. Some results are stated by the following propositions.
Proposition 2. If is unnecessary for then for any optimal solution of this problem there exists a -solution of Problem 2 with respect to .
Proof. If is unnecessary then and have the same optimal solutions. Let us consider one of them, say , where for the two problems with respect to some multiplier vectors and . Since the two vectors of multipliers yield the same optimal solution, we can define for every a solution still satisfying the KKT-conditions (also called a KKT-solution) of Problem 2 as:
This has to be thought of as a necessary condition to discover which logical constraints can be removed from while still preserving its optimal solutions. The converse does hold when either or has a unique solution , but in general we can only prove a weaker result.
Proposition 3. If there exists a -solution of Problem 2 with respect to (for a certain optimal solution of ), then the set of optimal solutions of is included in the set of optimal solutions of .
Proof. Given any optimal solution of , since the problem is (at least) convex, we have . At this point, we note that is also feasible for and that the restriction of on components is a vector of Lagrange multipliers for satisfying the KKT-conditions. The convexity of the problem guarantees that the KKT-conditions are sufficient as well. This means that is also an optimal solution for , hence its loss value is a global minimum and the same holds for .
In this case we cannot conclude that any optimal solution of is an optimal solution for HP, because in general this solution might not be feasible for this problem. However, as we pointed out above, we have the following result.
Corollary 1. If either or equivalently has a unique solution then the premise of Proposition 3 is also sufficient.
Proof. The solution is unique if the Gram matrix K, which is the same in both problems, is positive-definite. Hence, requiring the uniqueness of the solution for the two problems is equivalent and the claim follows trivially from Proposition 3.
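Whether the uniqueness condition holds for a given sample can be checked directly on the Gram matrix. The sketch below tests positive-definiteness through the smallest eigenvalue; the sample points and the kernel parameters are hypothetical, and the tolerance is an arbitrary numerical threshold.

```python
import numpy as np

def gram_matrix(X, degree=1, coef0=0.0):
    """Gram matrix of the kernel (x . z + coef0) ** degree; linear by default."""
    return (X @ X.T + coef0) ** degree

def is_positive_definite(K, tol=1e-10):
    """True if the smallest eigenvalue of the symmetric matrix K exceeds tol."""
    return bool(np.min(np.linalg.eigvalsh(K)) > tol)

# Two linearly independent points: the linear-kernel Gram matrix is positive-definite.
X2 = np.array([[1.0, 0.0], [0.0, 1.0]])
pd_two_points = is_positive_definite(gram_matrix(X2))

# Three points in the plane: the linear-kernel Gram matrix has rank at most 2.
X3 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pd_three_linear = is_positive_definite(gram_matrix(X3))

# The same three points with a degree-2 polynomial kernel.
pd_three_poly = is_positive_definite(gram_matrix(X3, degree=2, coef0=1.0))
```

With the linear kernel, adding a third point in the plane makes the Gram matrix only positive-semidefinite, so uniqueness is no longer guaranteed by this argument, while the degree-2 polynomial kernel restores positive-definiteness on the same points.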
4 Some Examples
Here we illustrate, by means of some cases solved in MATLAB with the interior-point-convex algorithm, how the method works, and we discuss the results to clarify what has been described so far. In particular, we use the transitive law as an example to illustrate how the presented theoretical results apply.
Example 2. We are given the predicates subject to , , . Given a common evaluation dataset , the logical formulas can be translated into the following linear constraints
and yield the following terms for the Lagrangian associated to Problem 1,
First we solve the optimization problem where, to avoid trivial solutions, we provide a few supervisions for the predicates and we use a polynomial kernel. To keep things clear, we consider only two points defined in , . Hence, given the solution (uniqueness holds) of Problem 1 (see Fig. 1), where , we have
In this case all the solutions of Problem 2 are given for any by
where the pair of vectors is a basis for . From this, we get that the only way to obtain the same while nullifying the contribution of the third constraint is taking , namely taking . It is worth noticing that we can also decide to nullify the contribution of the first or of the second constraint by taking or . In these cases we get , , but the third one is a support constraint.
Although it is easy to see that the third constraint is deducible from the other ones, Problem 1 may give a different perspective in terms of support constraints.
Example 3. Given the same problem as Example 2 with the additional point in , we get , hence the third constraint turns out to be initially supporting. However we may wonder if there is another solution of Problem 2 where the components of the third constraint are null. The matrix is obtained from by adding three rows and three columns corresponding to the additional grounding of the predicates and to the components for the logical constraints on the new point.
In this case, the dimension of is increased exactly by one, as is the number of affine components of any involved logical constraint. This means that we can try to find a in which a certain constraint has null values. For instance, the vector = is a solution of Problem 2 with respect to the third constraint. However, as in Example 2, it is the only KKT-solution allowing us to remove the contribution of a constraint.
4.1 From Support to Necessary Constraints
Combining pointwise and consistency constraints forces any optimal solution to evaluate exactly to 0 or 1 on any supervised sample and all the corresponding Lagrange multipliers to be different from zero, namely they will turn out to be support constraints. However, they could be unnecessary constraints for the problem and we could actually remove them from the optimization.
Example 4. We consider the same problem as Example 2 where is labelled as negative for and positive for both and . We express the pointwise and the consistency constraints in logical form. All the constraints are obtained by requiring the following linear functions to be less than or equal to zero:
Exploiting the complementary slackness and the condition for the Lagrange multipliers given by Problem 2, we can provide several combinations of values for the multipliers yielding the same solution. The Gram matrix is positive-definite () and the solution provided by a linear kernel is unique. For this simple example we have only two possible KKT-solutions of Problem 2 minimizing the number of necessary constraints; they are and . This may be easily shown since the complementary slackness forces and, multiplying by the remaining multipliers, they have to satisfy:
Since HP has a unique solution, from Corollary 1, we have two different minimal optimization problems. One with only and as necessary constraints and the other with only and once again.
5 Conclusions
In general, in learning from constraints, several constraints are combined into an optimization scheme and it is often quite difficult to identify the contribution of each of them. In particular, some constraints could turn out to be not necessary for finding a solution. In this paper, we propose a formal definition of unnecessary constraint as well as a method to determine which constraints are unnecessary for a learning process in a multi-task problem. The necessity of a certain constraint is related to the notion of consequence among the other constraints that are enforced at the same time. This is why we assume to deal with logical constraints, which are general enough to include both pointwise and consistency constraints. The logical consequence among formulas is a sufficient condition to conclude that a constraint, corresponding to a certain formula, is unnecessary. However, we also provide an algebraic necessary condition that turns out to be sufficient in case the Gram matrices associated to the kernel functions are positive-definite.
References
1. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory. pp. 144–152. ACM (1992)
2. Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press (2004)
3. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (Sep 1995). https://doi.org/10.1023/A:1022627411411
4. Cumby, C.M., Roth, D.: On kernel methods for relational learning. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03). pp. 107–114 (2003)
5. Diligenti, M., Gori, M., Maggini, M., Rigutini, L.: Bridging logic and kernel machines. Machine Learning 86(1), 57–88 (2012)
6. Diligenti, M., Gori, M., Saccà, C.: Semantic-based regularization for learning and inference. Artificial Intelligence (2015)
7. Giannini, F., Diligenti, M., Gori, M., Maggini, M.: Learning Łukasiewicz logic fragments by quadratic programming. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 410–426. Springer (2017)
8. Giannini, F., Diligenti, M., Gori, M., Maggini, M.: On a convex logic fragment for learning and reasoning. IEEE Transactions on Fuzzy Systems (2018)
9. Gnecco, G., Gori, M., Melacci, S., Sanguineti, M.: Foundations of support constraint machines. Neural Computation 27(2), 388–480 (2015)
10. Gori, M., Melacci, S.: Support constraint machines. In: Lu, B.L., Zhang, L., Kwok, J. (eds.) Neural Information Processing. pp. 28–37. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
11. Gori, M., Melacci, S.: Constraint verification with kernel machines. IEEE Transactions on Neural Networks and Learning Systems 24(5), 825–831 (2013)
12. Hájek, P.: Metamathematics of fuzzy logic, vol. 4. Springer Science & Business Media (1998)
13. Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318 (2016)
14. Jung, J.H., O’Leary, D.P., Tits, A.L.: Adaptive constraint reduction for convex quadratic programming. Computational Optimization and Applications 51(1), 125–157 (2012)
15. Muggleton, S., Lodhi, H., Amini, A., Sternberg, M.J.: Support vector inductive logic programming. In: Discovery Science. vol. 3735, pp. 163–175. Springer (2005)
16. Paulsen, V.I., Raghupathi, M.: An introduction to the theory of reproducing kernel Hilbert spaces, vol. 152. Cambridge University Press (2016)
17. Rockafellar, R.T., Wets, R.J.B.: Variational analysis, vol. 317. Springer Science & Business Media (2009)
18. Serafini, L., Garcez, A.d.: Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422 (2016)
19. Serafini, L., Garcez, A.S.d.: Learning and reasoning with logic tensor networks. In: AI*IA. pp. 334–348 (2016)