# Teaching the Old Dog New Tricks: Supervised Learning with Constraints

## Abstract

Methods for taking into account external knowledge in Machine Learning models have the potential to address outstanding issues in data-driven AI methods, such as improving safety and fairness, and can simplify training in the presence of scarce data. We propose a simple, but effective, method for injecting constraints at training time in supervised learning, based on decomposition and bi-level optimization: a master step is in charge of enforcing the constraints, while a learner step takes care of training the model. The process leads to approximate constraint satisfaction. The method is applicable to any ML approach for which the concept of label (or target) is well defined (most regression and classification scenarios), and allows existing training algorithms to be reused with no modifications. We require no assumption on the constraints, although their properties affect the shape and complexity of the master problem. Convergence guarantees are hard to provide, but we found that the approach performs well on ML tasks with fairness constraints and on classical datasets with synthetic constraints.

## 1 Introduction

Techniques to deal with constraints in Machine Learning (ML) have the potential to address outstanding issues in data-driven AI methods. Constraints representing (e.g.) physical laws can be employed to improve generalization; constraints may encode negative patterns (e.g. excluded classes) and relational information (e.g. involving multiple examples); constraints can ensure the satisfaction of desired properties, such as fairness, safety, or lawfulness; they can even be used to extract symbolic information from data.

To the best of the authors' knowledge, the vast majority of approaches for taking into account external knowledge in ML make assumptions that restrict the type of constraints (e.g. differentiability, no relational information) or the type of models (e.g. only Decision Trees, only differentiable approaches), and often require modifications to the employed training algorithms (e.g. specialized loss terms).

We propose a decomposition-based method, referred to as Moving Targets, to augment supervised learning with constraints. A master step “teaches” constraint satisfaction to a learning algorithm by iteratively adjusting the sample labels. The master and learner have no direct knowledge of each other, meaning that: 1) any ML method can be used for the learner, with no modifications; 2) the master can be defined via techniques such as Mathematical or Constraint Programming, to support discrete values or non-differentiable constraints. Our method is also well suited to deal with relational constraints over large populations (e.g. fairness indicators). Moving Targets subsumes the few existing techniques – such as the one by Kamiran and Calders (2009) – capable of offering the same degree of versatility.

When constraints conflict with the data, the approach prioritizes constraint satisfaction over accuracy. For this reason, it is not well suited to deal with fuzzy information. Moreover, due to our open setting, it is hard to determine convergence properties. Despite this, we found that the approach performs well (compared to state-of-the-art methods) on classification and regression tasks with fairness constraints, and on classification problems with balance constraints.

Due to its combination of simplicity, generality, and the observed empirical performance, Moving Targets can represent a valuable addition to the arsenal of techniques for dealing with constraints in Machine Learning. The paper is organized as follows: in Section 2 we briefly survey related works on the integration of constraints in ML; in Section 3 we present our method and in Section 4 our empirical evaluation. Concluding remarks are in Section 5.

## 2 Related Works

Here we provide an overview of representative approaches for integrating constraints in ML, and we discuss their differences with our method.

Most approaches in the literature build over just a few key ideas. One of them is using the constraints to adjust the output of a trained ML model. This is done in DeepProbLog Manhaeve et al. (2018), where Neural Networks with probabilistic output (mostly classifiers) are treated as predicates. Rocktäschel and Riedel (2017) presents a Neural Theorem Prover using differentiable predicates and the Prolog backward chaining algorithm. The original Markov Logic Networks Richardson and Domingos (2006) rely instead on Markov Fields defined over First Order Logic formulas. As a drawback, with these approaches the constraints have no effect on the model parameters, which complicates the analysis of feature importances. Moreover, dealing with relational constraints (e.g. fairness) requires access at prediction time either to a representative population or to its distribution Hardt et al. (2016); Fish et al. (2016).

A second group of approaches operates by using constraint-based expressions as regularization terms at training time. In Semantic Based Regularization Diligenti et al. (2017b) constraints are expressed as fuzzy logical formulas over differentiable predicates. Logic Tensor Networks Serafini and Garcez (2016) focus on Neural Networks and replace the entire loss function with a fuzzy formula. Differentiable Reasoning Van Krieken et al. (2019) similarly uses relational background knowledge to benefit from unlabeled data. In the context of fairness constraints, this approach has been taken in Aghaei et al. (2019); Dwork et al. (2012); Zemel et al. (2013); Calders and Verwer (2010); Kamiran et al. (2010). Methods in this class account for the constraints by adjusting the model parameters, and can therefore be used to analyze feature importance. They can deal with relational constraints without additional examples at prediction time; however, they ideally require simultaneous access at training time to all the examples linked by relational constraints (which can be problematic when using mini-batches). They often require properties on the constraints (e.g. differentiability), which may force approximations; they may also be susceptible to numerical issues.

A third idea consists in enforcing constraint satisfaction in the data via a pre-processing step. This is proposed in the context of fairness constraints by Kamiran and Calders (2009, 2011); Luong et al. (2011). The approach enables the use of standard ML methods with no modification, and can deal with relational constraints on large sets of examples. As a main drawback, bias in the model or the training algorithm may prevent getting close to the pre-processed labels.

Multiple ideas can be combined: domain knowledge has been introduced in differentiable Machine Learning (e.g. Deep Networks) by designing their structure, rather than the loss function: examples include Deep Structured Models in Lin et al. (2016) and Ma and Hovy (2016). These approaches can use constraints to support both training and inference.

Though less related to our approach, constraints can be used to extract symbolic knowledge from data, for example by allowing the training algorithm to adjust the regularizer weights. This approach is considered (e.g.) in Lippi and Frasconi (2009); Marra et al. (2019); Daniele and Serafini (2019).

Our approach is closely related to the idea of enforcing constraints by altering the data, and shares the same advantages (versatility, support for relational constraints and feature importance analysis, no differentiability assumptions). We counter the main drawbacks mentioned above by using an iterative algorithm rather than a single pre-processing step.

Table 1: Common loss functions and the corresponding label spaces.

| Loss Function | Expression | Label Space |
|---|---|---|
| Mean Squared Error | | |
| Hamming Distance | | |
| Cross Entropy | | |

## 3 Moving Targets

In this section we present our method, discuss its properties, draw connections with related algorithms and provide some convergence considerations.

### The Algorithm

Our goal is to adjust the parameters of a ML model so as to minimize a loss function with clearly defined labels, under a set of generic constraints. We acknowledge that any constrained learning problem must trade prediction mistakes for a better level of constraint satisfaction, and we attempt to control this process by carefully selecting which mistakes should be made. This is similar to Kamiran and Calders (2009, 2011); Luong et al. (2011), but: 1) we consider generic constraints rather than focusing on fairness; and 2) we rely on an iterative process (which alternates “master” and “learner” steps) to improve the results.

Let $L(y, y^*)$ be the loss function, where $y$ is the prediction vector and $y^*$ is the label vector. We make the (non-restrictive) assumption that the loss is a pre-metric, i.e. $L(y, y^*) \geq 0$ and $L(y, y^*) = 0$ iff $y = y^*$. Examples of how to treat common loss functions can be found in Table 1.
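As a concrete illustration of the pre-metric assumption, the following minimal sketch implements two of the losses named in Table 1 (Mean Squared Error and Hamming distance) on plain Python lists and checks the two defining properties. The function names and the averaging over examples are illustrative assumptions, not taken from the authors' code.

```python
def mse(y, y_star):
    """Mean Squared Error between predictions y and labels y_star."""
    return sum((a - b) ** 2 for a, b in zip(y, y_star)) / len(y)


def hamming(y, y_star):
    """Fraction of examples whose predicted class differs from the label."""
    return sum(a != b for a, b in zip(y, y_star)) / len(y)


labels = [0.0, 1.0, 2.0]
# Pre-metric property 1: the loss is never negative.
assert mse([0.5, 1.0, 1.0], labels) >= 0
# Pre-metric property 2: the loss is zero when predictions equal the labels...
assert mse(labels, labels) == 0
assert hamming([1, 2, 2], [1, 2, 2]) == 0
# ...and strictly positive as soon as one prediction disagrees.
assert hamming([1, 2, 3], [1, 2, 2]) > 0
```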

We then want to solve, in an exact or approximate fashion, the following constrained optimization problem:

$$\arg\min_{\theta} \left\{ L(y, y^*) \mid y = f(X; \theta),\ y \in C \right\} \tag{1}$$

where $f$ is the ML model and $\theta$ its parameter vector. With some abuse of notation we refer to $f(X; \theta)$ as the vector of predictions for the examples in the training set $X$. Since the model input at training time is known, constraints can be represented as a feasible set $C$ for the sole predictions $y$.

The problem can be rewritten without loss of generality by introducing a second set $B$ corresponding to the ML model bias. This leads to a formulation in pure label space:

$$\arg\min_{y} \left\{ L(y, y^*) \mid y \in B \cap C \right\} \tag{2}$$

where $B = \{ y \mid \exists \theta : y = f(X; \theta) \}$.

The Moving Targets method is described in Algorithm 1, and starts with a learner step w.r.t. the original label vector $y^*$ (pretraining). Each learner step, given a label vector $z$ as input, solves approximately or exactly the problem:

$$l(z) = \arg\min_{y} \left\{ L(y, z) \mid y \in B \right\} \tag{3}$$

Note that this is a traditional unconstrained learning problem, since $B$ is just the model/algorithm bias. The result of the first learner step gives an initial vector of predictions $y^1$.

Next comes a master step to adjust the label vector: this can take two forms, depending on the current predictions. In case of an infeasibility, i.e. $y \notin C$, we solve:

$$m_\alpha(y) = \arg\min_{z} \left\{ L(z, y^*) + \frac{1}{\alpha} L(z, y) \mid z \in C \right\} \tag{4}$$

Intuitively, we try to find a feasible label vector $z$ that is close (in terms of loss function value) to both the original labels $y^*$ and the current prediction $y$. A parameter $\alpha \in (0, \infty)$ controls which of the two should be preferred. If the input vector is feasible we instead solve:

$$m_\beta(y) = \arg\min_{z} \left\{ L(z, y^*) \mid L(z, y) \leq \beta,\ z \in C \right\} \tag{5}$$

i.e. we look for a feasible label vector $z$ that is 1) not too far from the current predictions (in the ball defined by $L(z, y) \leq \beta$) and 2) closer (in terms of loss) to the true labels $y^*$. Here, we are seeking an accuracy improvement.

We then make a learner step trying to reach the adjusted labels; the new predictions will be adjusted at the next iteration and so on. In case of convergence, the predictions and the adjusted labels become stationary (but not necessarily identical). An example run, for a Mean Squared Error loss and convex constraints and bias, is in Figure 1.
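The alternation between learner and master steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the learner is a 1-D least-squares fit, the loss is the MSE, the constraint set is the hypothetical box `z_i <= cap`, and the infeasible-case master step is assumed to weight the loss to the predictions by `1 / alpha`, so that for the MSE it reduces to projecting a weighted average of labels and predictions. The feasible-case step is solved only approximately, by moving along the segment towards the projected labels until the ball of radius `beta` is hit.

```python
def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def learner(x, z):
    """Learner step: least-squares fit of z ~ w * x + b (1-D input)."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    var = sum((xi - mx) ** 2 for xi in x)
    w = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / var
    return [w * xi + (mz - w * mx) for xi in x]


def project(z, cap):
    """Projection on C = {z : z_i <= cap} (hypothetical constraint)."""
    return [min(zi, cap) for zi in z]


def master_alpha(y, y_star, alpha, cap):
    """Infeasible case: for the MSE, project the alpha-weighted average on C."""
    avg = [(alpha * s + p) / (alpha + 1) for s, p in zip(y_star, y)]
    return project(avg, cap)


def master_beta(y, y_star, beta, cap):
    """Feasible case (approximate): move from y towards the projection of
    y_star on C, stopping at the boundary of the ball mse(z, y) <= beta."""
    target = project(y_star, cap)
    dist = mse(target, y)
    t = 1.0 if dist <= beta else (beta / dist) ** 0.5
    return [p + t * (q - p) for p, q in zip(y, target)]


def moving_targets(x, y_star, alpha, beta, cap, n_iter):
    y = learner(x, y_star)                       # pretraining
    for _ in range(n_iter):
        if max(y) > cap:                         # predictions violate C
            z = master_alpha(y, y_star, alpha, cap)
        else:                                    # predictions are feasible
            z = master_beta(y, y_star, beta, cap)
        y = learner(x, z)                        # learner chases the adjusted labels
    return y


x = [0.0, 0.25, 0.5, 0.75, 1.0]
y_star = [2 * xi for xi in x]                    # true labels exceed the cap
pretrained = learner(x, y_star)
final = moving_targets(x, y_star, alpha=1.0, beta=0.1, cap=1.0, n_iter=10)
print(max(pretrained) - 1.0, max(final) - 1.0)   # max violation before / after
```

On this toy instance the learner never sees the constraint, yet the maximum violation of its predictions drops well below the pretraining value; smaller values of `alpha` pull the adjusted labels closer to the projected predictions.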

### Properties

The learner is not directly affected by the constraints, thus enabling the use of arbitrary ML approaches. The master problems do not depend on the ML model, often leading to clean structures that are easier to solve. Since we make no explicit use of mini-batches, we can deal well with relational constraints on large groups (e.g. fairness). The master step can be addressed via any suitable solver, so that discrete variables and non-differentiable constraints can be tackled via (e.g.) Mathematical Programming, Constraint Programming, or SAT Modulo Theories. Depending on the constraints, the loss functions, and the label space (e.g. numeric vs discrete) the master problems may be NP-hard. Even in this case, their clean structure may allow for exact solutions for datasets of practical size. Moreover, for separable loss functions (e.g. all those from Table 1), the master problems can be defined over only the constrained examples, with a possibly significant size reduction. If scalability is still a concern, the master step can be solved in an approximate fashion: this may lead to a lower accuracy and a higher level of constraint violation; however, such issues are partly inevitable, due to algorithm/model bias, and since constraint satisfaction on the training data does not necessarily transfer to unseen examples.

### Analysis and Convergence

Due to its open nature and minimal assumptions, establishing the convergence of our method is hard. Here we provide some considerations and connect the approach to existing algorithms. We will make the simplifying assumption that the learner problem from Equation 3 can be solved exactly. This holds for a few cases (e.g. convex ML models trained to close optimality), but in general the assumption will not be strictly satisfied.

We start by observing that Equation 2 is simply the Best Approximation Problem (BAP), which involves finding a point in the intersection of two sets (say $B$ and $C$) that is as close as possible to a reference point (say $y^*$). For convex sets in Hilbert spaces, the BAP can be solved optimally via Douglas-Rachford splits or the method from Artacho and Campoy (2018), both relying on projection operators. Since our learner is essentially a projection operator on $B$, it would seem sensible to apply these methods in our setting. Unfortunately: 1) we cannot reliably assume convexity; and 2) in a discrete space, the Douglas-Rachford splits may lead to “label” vectors that are meaningless for the learner.

We therefore chose a design that is less elegant, but also less sensitive to which assumptions are valid. In particular: 1) in the $m_\alpha$ steps, we use a modification of a suboptimal BAP algorithm to find a feasible prediction vector; then 2) in $m_\beta$ we apply a modified proximal operator to improve its distance (in terms of loss) w.r.t. the original labels $y^*$. The basis for our $m_\alpha$ step is the Alternating Projections (AP) method, discussed e.g. in Boyd and Dattorro (2003). The AP simply alternates projection operators on the two sets, which never generates vectors outside of the label space.

Indeed, for $\alpha \to 0$ our $m_\alpha$ step becomes a projection of the predictions $y^k$ on $C$. With this setup we recover the AP behavior, and its convergence to a feasible point for convex sets. For $\alpha \to \infty$ we obtain essentially the pre-processing method from Kamiran and Calders (2009): $m_\alpha$ becomes a projection of the true labels $y^*$ on $C$, and subsequent iterations have no further effect; convergence to a feasible point is achieved only if the pre-processed labels are in $B$, which may not be the case (e.g. a quadratic distribution for a linear model). For reasonable $\alpha$ values, our $m_\alpha$ step balances the distance (loss) from both the predictions and the targets $y^*$. Convergence in this case is an open question, but especially in a non-convex or discrete setting (where multiple projections may exist) this modification helps escape local minima and accelerate progress.
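The two limiting cases can be made explicit for a Mean Squared Error loss. The derivation below is a sketch that assumes the infeasible-case master objective has the form $L(z, y^*) + \frac{1}{\alpha} L(z, y^k)$ (the $1/\alpha$ weighting is an assumption of this example) and writes $\Pi_C$ for the Euclidean projection on $C$:

```latex
\begin{aligned}
\|z - y^*\|_2^2 + \tfrac{1}{\alpha}\,\|z - y^k\|_2^2
  &= \left(1 + \tfrac{1}{\alpha}\right) \|z - \bar{y}\|_2^2 + \text{const},
  \qquad
  \bar{y} = \frac{\alpha\, y^* + y^k}{\alpha + 1}, \\[4pt]
\arg\min_{z \in C} \left\{ \|z - y^*\|_2^2 + \tfrac{1}{\alpha}\,\|z - y^k\|_2^2 \right\}
  &= \Pi_C(\bar{y}), \\[4pt]
\lim_{\alpha \to 0} \bar{y} = y^k
  \quad &\text{and} \quad
\lim_{\alpha \to \infty} \bar{y} = y^* .
\end{aligned}
```

Under these assumptions the master step projects a convex combination of labels and predictions: a vanishing $\alpha$ projects the predictions (Alternating Projections), while a very large $\alpha$ projects the true labels (one-shot pre-processing).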

When a feasible prediction vector is obtained, our method switches to the $m_\beta$ step; we then search for a point in $C$ that is closer to the true labels, but also not too far from the predictions. This is related to the Proximal Gradient method, discussed e.g. in Parikh and Boyd (2014), but we limit the distance via a constraint rather than a squared norm, and we search in a ball rather than on a line. As in the proximal gradient, a too large search radius prevents convergence: for $\beta \to \infty$ the $m_\beta$ step always returns the same adjusted labels, corresponding to the projection of $y^*$ on the set $C$.

## 4 Empirical Evaluation

Our experimentation is designed around a few main questions: 1) How does the Moving Targets method work on a variety of constraints, tasks, and datasets? 2) What is the effect of the parameters? 3) How does the approach scale? 4) How different is the behavior with different ML models? 5) How does the method compare with alternative approaches? We proceed to describe our setup and the experiments we performed to address such questions.

### Tasks and Constraints

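We experiment on three case studies, covering two types of ML tasks and two types of constraints. First, we consider a classification problem augmented with a balance constraint, which forces the distribution over the classes to be approximately uniform. The Hamming distance is the loss function and the label space is $\{1, \dots, c\}$. The $m_\alpha$ problem is defined as a Mixed Integer Linear Program with binary variables $z_{ij}$ such that $z_{ij} = 1$ iff the adjusted class for the $i$-th example is $j$. Formally:

$$
\begin{aligned}
\arg\min_{z}\ & \frac{1}{n} \sum_{i=1}^{n} \left(1 - z_{i, y_i^*}\right) + \frac{1}{\alpha n} \sum_{i=1}^{n} \left(1 - z_{i, y_i^k}\right) && (6) \\
\text{s.t.}\ & \sum_{j=1}^{c} z_{ij} = 1 \quad \forall i \in \{1, \dots, n\} && (7) \\
& \sum_{i=1}^{n} z_{ij} \leq \left\lceil (1 + \xi)\, \frac{n}{c} \right\rceil \quad \forall j \in \{1, \dots, c\} && (8) \\
& z_{ij} \in \{0, 1\} \quad \forall i, j && (9)
\end{aligned}
$$

where $n$ is the number of training examples and $c$ the number of classes. The summations in Equation 6 encode the Hamming distance w.r.t. the true labels $y^*$ and the predictions $y^k$. Equation 7 prevents assigning two classes to the same example. Equation 8 requires an equal count for each class, with tolerance defined by $\xi$ (fixed to the same value in all our experiments); the balance constraint is stated in exact form, thanks to the discrete labels. The $m_\alpha$ formulation generalizes the knapsack problem and is hence NP-hard; since all examples appear in Equation 8, no problem size reduction is possible. The $m_\beta$ problem can be derived from $m_\alpha$ by changing the objective function and by adding the ball constraint as in Equation 5.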
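To give a feel for the balance master step, the sketch below solves a small instance with a greedy heuristic in plain Python. It is not the Mixed Integer Linear Program solved in the paper: it is an approximate master step, and the function name, the per-class capacity `cap` (standing in for the tolerance-adjusted class count), and the `1 / alpha` weighting of the prediction term are all assumptions made for illustration.

```python
def balance_master(y_true, y_pred, n_classes, alpha, cap):
    """Greedy, approximate master step for the balance constraint.

    Each class may receive at most `cap` examples (hypothetical parameter).
    """
    n = len(y_true)
    if cap * n_classes < n:
        raise ValueError("capacities too small to place every example")

    def cost(i, j):
        # Hamming terms: disagreement with the true label and, weighted by
        # 1/alpha, disagreement with the current prediction.
        return (j != y_true[i]) + (j != y_pred[i]) / alpha

    # Start from the cheapest class for every example.
    z = [min(range(n_classes), key=lambda j: cost(i, j)) for i in range(n)]
    counts = [z.count(j) for j in range(n_classes)]

    while max(counts) > cap:
        src = counts.index(max(counts))          # an over-full class
        best = None
        for i in range(n):
            if z[i] != src:
                continue
            for j in range(n_classes):
                if counts[j] < cap:              # only move where room is left
                    delta = cost(i, j) - cost(i, src)
                    if best is None or delta < best[0]:
                        best = (delta, i, j)
        _, i, j = best                           # cheapest single relabeling
        z[i] = j
        counts[src] -= 1
        counts[j] += 1
    return z


adjusted = balance_master([0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 1],
                          n_classes=2, alpha=1.0, cap=3)
print(adjusted, [adjusted.count(j) for j in range(2)])
```

The heuristic relabels first the examples for which the learner already disagrees with the ground truth, since moving them adds the least loss. An exact solver may find cheaper assignments on harder instances.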

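Our second use case is a classification problem with realistic fairness constraints, based on the DIDI indicator from Aghaei et al. (2019):

$$\mathrm{DIDI}_c(X, y) = \sum_{p \in P} \sum_{v \in D_p} \sum_{j=1}^{c} \left| \frac{1}{n} \sum_{i=1}^{n} I[y_i = j] - \frac{1}{|X_{p,v}|} \sum_{i \in X_{p,v}} I[y_i = j] \right| \tag{10}$$

where $P$ contains the indices of “protected features” (e.g. ethnicity, gender, etc.), $D_p$ is the set of possible values for the $p$-th feature, and $X_{p,v}$ is the set of examples having value $v$ for the $p$-th feature; $n$ is the number of examples, $c$ the number of classes, and $I[\cdot]$ the indicator function. The $m_\alpha$ problem can be defined via the following Mathematical Program: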

(11) |

s.t. | ||||

(12) | ||||

(13) | ||||

(14) | ||||

(15) |

where Equation 12 is the constraint on the DIDI value and Equations 13 and 14 linearize the absolute values in its definition. The DIDI scales with the number of examples and has an intrinsic value due to the discrimination in the data. Therefore, we compute the DIDI of the training set and, in our experiments, define the constraint threshold relative to that value. This is again an NP-hard problem defined over all training examples. The $m_\beta$ formulation can be derived as in the previous case.
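For intuition, a disparate-impact indicator of this kind can be computed directly from the labels. The sketch below is an illustration under an explicit assumption about the formula: for every protected feature, every value of that feature, and every class, it sums the absolute gap between the overall class frequency and the class frequency within the group. The function name and the column-wise input layout are hypothetical.

```python
def didi_classification(protected, y, classes):
    """Sum of absolute gaps between overall and per-group class frequencies.

    `protected` is a list of columns, one per protected feature, each
    holding the feature value of every example.
    """
    n = len(y)
    total = 0.0
    for column in protected:
        for value in set(column):
            # Labels of the examples taking this value of the feature.
            group = [y[i] for i in range(n) if column[i] == value]
            for c in classes:
                overall = sum(label == c for label in y) / n
                within = sum(label == c for label in group) / len(group)
                total += abs(overall - within)
    return total


race = [0, 0, 1, 1]                       # one protected feature, two groups
print(didi_classification([race], [1, 1, 0, 0], classes=[0, 1]))
print(didi_classification([race], [0, 1, 0, 1], classes=[0, 1]))
```

The first call uses labels that are fully determined by the protected feature, the second uses labels that are distributed identically in both groups, so the indicator is larger in the first case and vanishes in the second.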

Our third case study is a regression problem with fairness constraints, based on a specialized DIDI version from Aghaei et al. (2019):

(16) | |||

(17) |

In this case, we use the Mean Squared Error (MSE) as a loss function, and the label space is $\mathbb{R}$. The $m_\alpha$ problem can be defined via the following Mathematical Program:

(18) | ||||

s.t. | (19) | |||

(20) | ||||

(21) | ||||

(22) |

This is a linearly constrained, convex, Quadratic Programming problem that can be solved (unlike our classification examples) in polynomial time. The $m_\beta$ problem can be derived as in the previous cases: while still convex, it is in this case a quadratically constrained problem.

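Table 2: Effect of the $\alpha$ and $\beta$ parameters with the NN learner (S: score, C: constraint violation).

| Dataset | | NN ($\alpha$, $\beta$) | | | | | Ptr | Ideal case |
|---|---|---|---|---|---|---|---|---|
| Iris | S | | | | | | | |
| | C | | | | | | | |
| Redwine | S | | | | | | | |
| | C | | | | | | | |
| Whitewine | S | | | | | | | |
| | C | | | | | | | |
| Shuttle | S | | | | | | | |
| | C | | | | | | | |
| Dota2 | S | | | | | | | |
| | C | | | | | | | |
| Adult | S | | | | | | | |
| | C | | | | | | | |
| Crime | S | | | | | | | |
| | C | - |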

### Datasets, Preparation, and General Setup

We test our method on seven datasets from the UCI Machine Learning repository Dua and Graff (2017), namely iris (150 examples), redwine (1,599), crime (2,215), whitewine (4,898), adult (32,561), shuttle (43,500), and dota2 (92,650). We use adult for the classification/fairness case study, crime for regression/fairness, and the remaining datasets for the classification/balance case study.

For each experiment, we perform a 5-fold cross validation (with a fixed seed for random reshuffling) to account for noise due to sampling and in the training process. Hence, the training set for each fold will include 80% of the data. All our experiments are run on an Intel Core i7 laptop with 16GB RAM, and we use Cplex 12.8 to solve the master problems. Our code and datasets are publicly available.^{1}

All the datasets for the classification/balance case study are prepared by standardizing all input features (on the training folds) to have mean 0 and unit variance. The iris and dota2 datasets are very balanced (the constraint is easily satisfied), while the remaining datasets are quite unbalanced. In the adult (also known as “Census Income”) dataset the target is “income” and the protected attribute is “race”. We remove the features “education” (duplicated) and “native country” and use a one-hot encoding on all categorical features. Features are normalized between 0 and 1. Our crime dataset is the “Communities and Crime Unnormalized” table. The target is “violentPerPop” and the protected feature is “race”. We remove features that are empty almost everywhere and features trivially related to the target (“murders”, “robberies”, etc.). Features are normalized between 0 and 1 and we select the top 15 ones according to the SelectKBest method of scikit-learn (excluding “race”). The protected feature is then reintroduced.

### Parameter tuning

We know that extreme choices for $\alpha$ and $\beta$ can dramatically alter the method behavior, but not what effect can be expected for more reasonable values. To find out, we run the algorithm for 15 iterations (used in all experiments) with different $\alpha$ and $\beta$ values. As an ML model, we use a fully-connected, feed-forward Neural Network (NN) with two hidden layers of 32 Rectified Linear Units each. The last layer has either a SoftMax activation (for classification) or a linear one (for regression). The loss function is respectively the categorical cross-entropy or the MSE. The network is trained with 100 epochs of RMSProp in Keras/Tensorflow 2.0 (default parameters and batch size 64).

The results are in Table 2. We report a score (row S, higher is better) and a level of constraint violation (row C, lower is better). The S score is the accuracy for classification and the R2 coefficient for regression. For the balance constraint, the C score is the standard deviation of the class frequencies; in the fairness case studies, we use the ratio between the DIDI of the predictions and that of the training data. Each cell reports mean and standard deviation for the 5 runs. Near feasible values are marked with a ; accuracy comparisons are fair only for similar constraint violation scores.
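The C score for the balance constraint is simple to reproduce. The sketch below is a minimal illustration that assumes the population standard deviation is meant (the text does not say whether the population or sample version is used); the function name is hypothetical.

```python
from statistics import pstdev


def balance_violation(y_pred, n_classes):
    """Standard deviation of the class frequencies (0 = perfectly balanced)."""
    n = len(y_pred)
    freqs = [sum(label == j for label in y_pred) / n for j in range(n_classes)]
    return pstdev(freqs)


print(balance_violation([0, 1, 0, 1], n_classes=2))   # balanced predictions
print(balance_violation([0, 0, 0, 0], n_classes=2))   # a single class predicted
```

A perfectly uniform class distribution gives a score of zero, and the score grows as the predictions concentrate on fewer classes.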

All columns labeled with $\alpha$ and $\beta$ values refer to our method with the corresponding parameters. We explore different values of $\beta$ (for a fixed $\alpha$), corresponding to different ball radii in the $m_\beta$ step; we also explore different values of $\alpha$, corresponding to a behavior of the $m_\alpha$ step progressively closer to that of the Alternating Projections method. The ideal case column refers to a simple projection of the true labels $y^*$ on the feasible space $C$: this corresponds (on the training data) to the best possible accuracy that can be achieved if the constraints are fully satisfied. The ptr column reports the results of the pretraining step.

The Moving Targets algorithm can significantly improve the satisfaction of non-trivial constraints: this is evident for the (very) unbalanced datasets redwine, whitewine, and shuttle and all fairness use cases, for which feasible (or close) results are almost always obtained. Satisfying very tight constraints (e.g. in the unbalanced datasets) generally comes at a significant cost in terms of accuracy. When the constraints are less demanding, however, we often observe accuracy improvements w.r.t. the pretraining results (even substantial ones for adult and crime): this is not simply a side effect of the larger number of training epochs, since we reset the NN weights at each iteration. Rather, this seems to be a positive side effect of using an iterative method to guide training (which may simplify escaping from local optima): further investigation is needed to characterize this phenomenon. Finally, reasonable parameter choices have only a mild effect on the algorithm behavior, thus simplifying its configuration. Empirically, one of the tested $(\alpha, \beta)$ configurations seems to work well and is used for all subsequent experiments.

### Scalability

We next turn to investigating the method scalability: from this perspective our examples are worst cases, since they must be defined on all the training data, and in some cases involve NP-hard problems. We report the average time for a master step in Figure 2. The average time for a learner step (100 epochs of our NN) is shown as a reference. At least in our experimentation, the time for a master step is always very reasonable, even for the dota2 dataset for which we solve NP-hard problems on 74,120 examples. This is mostly due to the clean structure of the $m_\alpha$ and $m_\beta$ problems. Of course, for sufficiently large training sets, exact solutions will become impractical.

### Setup of Alternative Approaches

Here we describe how we set up the alternative approaches that will be used for comparison. Namely, we consider the regularized linear approach from Berk et al. (2017), referred to as RLR, and a Neural Network with Semantic Based Regularization Diligenti et al. (2017a), referred to as SBR. Both approaches are based on the idea of introducing constraints as regularizers at training time. Hence, their loss function is in the form:

(23) |

The regularization term must be differentiable. We use SBR only for the case studies with the balance constraint, which we are forced to approximate to obtain differentiability:

(24) |

i.e. we use the sums of the NN output neurons to approximate the class counts and the maximum as a penalty; this proved superior to other attempts in preliminary tests. The loss term is the categorical cross-entropy.
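The approximation described above can be sketched in plain Python. This is an illustrative reading of the prose, not the exact regularizer used in the experiments: per-class sums of the output probabilities stand in for the class counts, and their maximum is the penalty. The names `balance_penalty`, `regularized_loss`, and `weight` are hypothetical.

```python
def balance_penalty(probs):
    """Largest per-class sum of output probabilities (soft class count)."""
    n_classes = len(probs[0])
    soft_counts = [sum(row[j] for row in probs) for j in range(n_classes)]
    return max(soft_counts)


def regularized_loss(loss_value, probs, weight):
    """Loss term plus the weighted balance penalty."""
    return loss_value + weight * balance_penalty(probs)


confident = [[1.0, 0.0], [1.0, 0.0]]      # every example assigned to class 0
uniform = [[0.5, 0.5], [0.5, 0.5]]        # equal output for all classes
print(balance_penalty(confident), balance_penalty(uniform))
```

Because the soft counts always sum to the number of examples, the penalty is smallest when all classes receive the same total probability, which uniform outputs achieve trivially. This is why the loss term is needed to keep the predictions informative.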

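Table 3: Results of the regularized approaches (RLR and SBR) for different values of the regularization weight (S: score, C: constraint violation).

| Dataset | | | | |
|---|---|---|---|---|
| Iris | S | | | |
| | C | | | |
| Redwine | S | | | |
| | C | | | |
| Whitewine | S | | | |
| | C | | | |
| Shuttle | S | | | |
| | C | | | |
| Dota2 | S | | | |
| | C | | | |
| Adult | S | | | |
| | C | | | |
| Crime | S | | | |
| | C | | | |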

Our SBR approach relies on the NN model from the previous paragraphs. Since access to the network structure is needed to differentiate the regularizer, SBR works best when all the examples linked by relational constraints can be included in the same batch. When this is not viable the regularizer can be treated stochastically (via subsets of examples), at the cost of one additional approximation. We use a batch size of 2,048 as a compromise between memory usage and noise. The SBR method is trained for 1,600 epochs.

The RLR approach relies on linear models (Logistic or Linear Regression), which are simple enough to consider large groups of examples simultaneously. We use this approach for the fairness use cases. In the crime (regression) dataset the loss term is the MSE and the regularizer is simply Equation 17. In the adult (classification) dataset the loss term is the cross-entropy; the regularizer is Equation 10, where the indicator terms are replaced by the linear (pre-sigmoid) output of the model.

This is an approximation obtained according to Berk et al. (2017) by disregarding the sigmoid in the Logistic Regressor to preserve convexity. We train this approach to convergence using the CVXPY 1.1 library (with default configuration). In RLR and SBR classification, the introduced approximations make it possible to satisfy the constraints by having equal output for all classes, i.e. completely random predictions. This undesirable behavior is countered by the loss term.

There is no simple recipe for choosing the value of the regularization weight in Equation 23; therefore, we performed experiments with different values to characterize its effect. The results are reported in Table 3. In most cases, larger values tend as expected to result in better constraint satisfaction, with a few notable exceptions for classification tasks (iris, dota2, and adult). The issue is likely due to the approximations introduced in the regularizers, since it arises even on small datasets that fit in a single mini-batch (iris). Further analysis will be needed to confirm this intuition. The accuracy decreases for a larger weight, as expected, but at a rather rapid pace. In the subsequent experiments, we will use for each dataset the RLR and SBR configurations that offer the best accuracy while being as close to feasible as possible (the “ideal case” column from Table 2): these are the cells in bold font in Table 3.

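Table 4: Comparison of Moving Targets with different ML models against the alternative approaches (S: score, C: constraint violation).

| Dataset | | Regularized methods | NN | LR | Ensemble trees | NN |
|---|---|---|---|---|---|---|
| Iris | S | | | | | |
| | C | | | | | |
| Redwine | S | | | | | |
| | C | | | | | |
| Whitewine | S | | | | | |
| | C | | | | | |
| Shuttle | S | | | | | |
| | C | | | | | |
| Dota2 | S | | | | | |
| | C | | | | | |
| Adult | S | | | | | |
| | C | | | | | |
| Crime | S | | | | | |
| | C | | | | | |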

### Alternative Approaches and ML Models

We can now compare the performance of Moving Targets using different ML models with that of the alternative approaches presented above (with ), plus a pre-processing approach adapted from Kamiran and Calders (2009), referred to as and obtained by setting in our method.

For our method, we consider the following ML models: 1) the NN from the previous section; 2a) a Random Forest Classifier with 50 estimators and maximum depth of 5 (used for all classification case studies); 2b) a Gradient Boosted Trees model, with 50 estimators, maximum depth 4, and a minimum threshold of samples per leaf of 5 (for the regression case study); 3a) a Logistic Regression model (for classification); 3b) a Linear Regression model (for regression). All models except the NN are implemented using scikit-learn Pedregosa et al. (2011). In the table, the tree ensemble methods are reported in a single column, while another column (LR) groups Logistic and Linear Regression.

Our algorithm seems to work well with all the considered ML models: tree ensembles and the NN have generally better constraint satisfaction (and higher accuracy at a comparable level of constraint satisfaction) than linear models, thanks to their larger variance. The preprocessing approach is effective when constraints are easy to satisfy (iris and dota2) and on all the fairness case studies, though less so on the remaining datasets. All Moving Targets approaches tend to perform better and more reliably than RLR and SBR. The case of RLR and LR is especially interesting, as they differ only in the mechanism used to enforce constraint satisfaction. The gap is partly due to the approximations in the regularizer (and the sampling noise for SBR), but there is at least one more factor at play. On the crime dataset, RLR uses an exact regularizer and is trained to (near) optimality: the only possible reason for the gap is that the optimum for our regularized formulation does not correspond to an optimum for the original constrained problem (and in fact may be quite far). This is a potentially serious issue that deserves further investigation.

### Generalization

Since our main contribution is an optimization algorithm, we have focused so far on evaluating its performance on the training data, as this simplifies the analysis. Now that the properties of our method are clearer, we can assess the performance of the models we trained on the test data. The results of this evaluation are reported in Table 5, in the form of the average ratio between the scores and the level of constraint satisfaction in the test and the train data. With a few exceptions (e.g. satisfiability in iris), the models (especially the tree ensembles and LR) generalize well in terms of both accuracy and constraint satisfaction. Given the tightness of some of the original constraints and the degree to which the labels were altered, this is a remarkable result.

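Table 5: Ratio between test and train results for the models trained with Moving Targets.

| Dataset | NN | Ensemble Trees | LR | |
|---|---|---|---|---|
| Iris | | | | |
| Redwine | | | | |
| Whitewine | | | | |
| Shuttle | | | | |
| Dota2 | | | | |
| Adult | | | | |
| Crime | | | | |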

## 5 Conclusion

In this paper we have introduced Moving Targets, a decomposition approach to augment a generic supervised learning algorithm with constraints, by iteratively adjusting the example labels. The method is designed to prioritize constraint satisfaction over accuracy, and proved to behave very well on a selection of tasks, constraints, and datasets. Its relative simplicity, reasonable scalability, and the ability to handle any classical ML model make it well suited for use in real world settings.

Many open questions remain: we highlighted limitations of regularization based techniques that deserve a much deeper analysis. The convergence properties of our method still need to be formally characterized (even in simpler, convex, scenarios). The method scalability should be tested on larger datasets, for which using approximate master steps will be necessary. Given the good performance of the pre-processing approach in Table 4, it may be interesting to skip the pretraining step in our method. Improvements may be possible by using specialized algorithms in specific settings: Douglas-Rachford splits could be applied in a numeric setting, or probabilistic predictions could be employed (when available) to refine the projection operators.

### Footnotes

- git repository after the reviewing process
