RUN-CSP: Unsupervised Learning of Message Passing Networks for Binary Constraint Satisfaction Problems
Constraint satisfaction problems form an important and wide class of combinatorial search and optimization problems with many applications in AI and other areas. We introduce a recurrent neural network architecture RUN-CSP (Recurrent Unsupervised Neural Network for Constraint Satisfaction Problems) to train message passing networks solving binary constraint satisfaction problems (CSPs) or their optimization versions (binary Max-CSP).
The architecture is universal in the sense that it works for all binary CSPs: depending on the constraint language, we can automatically design a loss function, which is then used to train generic neural nets. In this paper, we experimentally evaluate our approach for the 3-colorability problem (3-Col) and its optimization version (Max-3-Col) and for the maximum 2-satisfiability problem (Max-2-Sat). We also extend the framework to work for related optimization problems such as the maximum independent set problem (Max-IS).
Training is unsupervised: we train the network on arbitrary (unlabeled) instances of the problems. Moreover, we experimentally show that it suffices to train on relatively small instances; the resulting message passing network will perform well on much larger instances (at least 10 times larger).
RWTH Aachen University
Constraint satisfaction is a general framework for casting combinatorial search and optimization problems; many well known NP-complete problems, for example, -colorability, Boolean satisfiability and optimization problems like maximum cut can be modeled as constraint satisfaction problems (CSPs). There is a long tradition of designing exact and heuristic algorithms for solving all kinds of CSPs. Our focus is on solving the optimization version of CSPs, where the objective is to satisfy as many constraints of a given instance as possible. Our work should be seen in the context of a recently renewed interest in heuristics for NP-hard combinatorial problems based on neural networks, mostly graph neural networks (for example, [DBLP:conf/iclr/SelsamLBLMD19, lemos2019graph, prates2019learning]).
We present a generic neural network based architecture called RUN-CSP (Recurrent Unsupervised Neural Network for Constraint Satisfaction Problems) which can be used for a variety of CSPs. The key features of our architecture are:
- Unsupervised Learning: Training is completely unsupervised and just requires a set of instances of the problem at hand.
- Scalability: Even if training is carried out on small instances, the resulting message passing network performs well on much larger (more than 10 times larger) instances.
- Generality: The architecture is completely generic. We can automatically generate a loss function from the constraint language (that is, the types of constraints appearing in the instances of the CSP we want to solve) and use it to train a RUN-CSP message passing network solving the maximization version of this CSP.
In this paper, we focus on binary CSPs, where each constraint involves two variables, but the approach can easily be adapted to constraint languages of arbitrary arity.
We solve CSPs by message passing networks with nodes for all variables of the given instance. Associated with each node is a state, which is a vector of fixed length. Associated with each constraint is an edge (or rather two directed edges) between the nodes corresponding to the two variables appearing in the constraint. The messages are linear functions of the states. These linear functions, represented by a matrix, only depend on the type of constraint, and not on the specific variables. We update the states using an LSTM (Long Short-Term Memory) cell for each variable. We extract the value for each variable of the CSP from the state associated with the variable using a linear function with softmax activation. The parameters of the linear functions, the LSTMs, and the softmax layer are learned. Note that the parameters are shared for all variables and for all message passing functions corresponding to constraints of the same type. This allows us to scale the message passing networks to instances of arbitrary size. The loss functions we use are derived from the constraint language in a straightforward way; similar loss functions have already been used for Hopfield networks [Dahl1987NeuralNA, takefuji1991artificial]. Effectively, using these loss functions we train our networks to satisfy the maximum number of constraints. However, when given a satisfiable instance, the network often finds a satisfying assignment.
It is our focus on solving the maximization problem that allows us to train the networks completely unsupervised. This distinguishes our work from recent neural approaches to Boolean satisfiability [DBLP:conf/iclr/SelsamLBLMD19] and the coloring problem [lemos2019graph]. These approaches require supervised training, but as opposed to our framework, they also attempt to predict if an instance is unsatisfiable, which we do not. Instead, our networks simply return the best solution they can find (which for satisfiable instances often is a satisfying assignment).
We remark that the computations of RUN-CSP are very fast; in the range of mid-sized problem instances that we consider it scales linearly with the problem size.
We experimentally evaluate our approach on the following problems: the 3-colorability problem (3-Col), which asks for a 3-coloring of the vertices of a given graph such that the two endvertices of each edge have distinct colors; the maximization version Max-3-Col of 3-Col, which asks for a coloring maximizing the number of edges whose two endvertices have distinct colors; the maximum 2-satisfiability problem (Max-2-Sat), which asks for an assignment maximizing the number of satisfied clauses for a given Boolean formula in 2-conjunctive normal form. We also consider the maximum independent set problem Max-IS, which asks for an independent set of maximum cardinality in a given graph. Max-IS is a problem of a slightly different nature than Max-3-Col and Max-2-Sat, because the objective is not to maximize the number of satisfied constraints, but rather to satisfy all constraints and maximize the number of variables with a certain value. The reason we include this problem is to show that our approach can easily be adapted to this type of problem by adjusting the loss function to favor assignments making the independent set large.
We demonstrate that our simple generic approach works well for finding approximate solutions to various binary Max-CSP on small to medium sized instances (up to 1600 variables). Moreover, networks trained on small instances (of 100 variables) still work well on much larger instances (1000 or more variables). We do not claim that our method is competitive with state-of-the-art solvers for the specific CSPs or with highly optimized SAT solvers (which we can use to solve the decision problem 3-Col) or integer programming tools (which we can use to solve the maximization problems). However, we clearly demonstrate that our approach is competitive with or better than other neural approaches for solving CSPs (such as [lemos2019graph]) or simple greedy heuristics, even if they are designed for specific CSPs, whereas our approach is completely generic.
1.1 Related Work
The related work can be split into two main parts. The first group of papers dates back to the 1980s and the Hopfield networks introduced by [hopfield1985neural] to solve TSP using neural networks. In this pioneering work, Hopfield and Tank used a single-layer neural network with sigmoid activation together with gradient descent and a well-chosen loss function as an approximation algorithm for TSP. Their loss function takes soft assignments for the positions of the cities and returns the length of the tour plus an extra term that penalizes incorrect tours. Their approach is unsupervised in the sense that there is no learning involved. Instead, gradient descent is used to directly minimize the loss function for a given instance. This approach has been applied in [Dahl1987NeuralNA, takefuji1991artificial, harmanani2010neural, gassen1993graph] to the -colorability problem. adorf1990discrete adapted Hopfield’s approach to further CSPs, and anderson1988neural used mean field theory to increase its scalability.
The second group of papers involves modern machine learning techniques and is often based on graph neural networks (GNNs). The learned message passing network of DBLP:conf/iclr/SelsamLBLMD19 for predicting satisfiability, first announced in 2018, reignited the interest in solving NP-complete problems using modern machine learning tools.
In prates2019learning the authors used GNNs to learn TSP. They use instances of the form asking whether the graph contains a Hamiltonian route of weight at most . They trained using examples and achieved good results on instances up to nodes. Based on the same technique, lemos2019graph learned to predict -colorability of graphs scaling to larger graphs and chromatic numbers than seen during training and achieving better results than a number of greedy heuristics. amizadeh2018learning; yao2019experimental used GNNs on SAT and Max-CUT. In both cases the loss function was chosen to encourage the network to maximize either the number of clauses or the size of the cut. For the #P-hard weighted model counting problem for DNF formulas, abboud2019learning achieved good results using a GNN-based message passing approach.
Pointer networks introduced by vinyals2015pointer (vinyals2015pointer) are based on the idea of sequence to sequence learning and attention networks. They used a supervised learning approach and were able to approximate TSP well up to about nodes. Based on those pointer networks, DBLP:conf/iclr/BelloPL0B17 (DBLP:conf/iclr/BelloPL0B17) introduced an unsupervised learning algorithm based on reinforcement learning scaling to TSP instances using up to nodes.
There are several approaches for large problem instances of more than nodes such as li2018combinatorial combining GNNs with tree search for Max-IS, khalil2017learning choosing the best heuristic for TSP by reinforcement learning, or huang2019coloring combining reinforcement learning and Monte Carlo tree search to improve a given greedy heuristic for 3-Col.
In this section, we describe our RUN-CSP architecture for training message passing networks for Max-CSPs.
Formally, a CSP-instance is a triple , where is a set of variables, is a domain, and is a set of constraints of the form for some . We only consider binary constraints (with ) in this paper. A constraint language is a finite set of relations over some fixed domain , and is an instance of if for all constraints . An assignment satisfies a constraint if , and it satisfies the instance if it satisfies all constraints in . Now is the problem of deciding whether a given instance has a satisfying assignment and finding such an assignment if there is one, and is the problem of finding an assignment that satisfies the maximum number of constraints.
For example, an instance of 3-Col has a variable for each vertex of the input graph, domain , and a constraint for each edge of the graph. Here is the inequality relation on . Thus 3-Col is .
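As an illustration of these definitions, the following minimal sketch encodes a graph as a 3-Col instance and counts the constraints satisfied by an assignment, which is the quantity the maximization version tries to maximize. The encoding (constraints as `(x, y, relation_name)` triples, relations as sets of allowed value pairs) and all function names are assumptions made for this example, not part of the RUN-CSP implementation.

```python
from itertools import product

# Hypothetical encoding: a relation is the set of allowed value pairs,
# a binary constraint is a triple (x, y, relation_name).
DOMAIN = (0, 1, 2)
NEQ = {(a, b) for a, b in product(DOMAIN, repeat=2) if a != b}
RELATIONS = {"neq": NEQ}

def graph_to_3col_instance(edges):
    """One variable per vertex, one inequality constraint per edge."""
    variables = sorted({v for e in edges for v in e})
    constraints = [(u, v, "neq") for u, v in edges]
    return variables, DOMAIN, constraints

def count_satisfied(assignment, constraints):
    """Number of constraints (x, y, R) with (assignment[x], assignment[y]) in R."""
    return sum((assignment[x], assignment[y]) in RELATIONS[r]
               for x, y, r in constraints)

# A triangle with one extra edge attached to vertex 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
variables, domain, constraints = graph_to_3col_instance(edges)

proper = {0: 0, 1: 1, 2: 2, 3: 0}     # a valid 3-coloring
improper = {0: 0, 1: 0, 2: 2, 3: 2}   # violates edges (0, 1) and (2, 3)
print(count_satisfied(proper, constraints))    # 4
print(count_satisfied(improper, constraints))  # 2
```

An assignment satisfies the instance exactly when the count equals the number of constraints.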
We use a randomized recurrent neural network architecture to evaluate a given problem instance using message passing. Intuitively, our network can be viewed as a trainable communication protocol. Since every message passing step uses the same set of weights, we are free to choose the number of iterations for which RUN-CSP runs on a given problem instance . This number may or may not be identical to the number of iterations used for training. In every iteration , a -dimensional state is associated with each variable . The size of the internal state and the number of iterations used for training and evaluations are the main hyperparameters of our network. The initial state for every variable is drawn from a normal distribution with mean and variance , that is for every . The updates of the internal state are performed based on the mean of all messages received for each variable. All variables that co-occur in a constraint can exchange messages based on their current states and . For every relation we have a different message generation function which is fed with the internal states of both endpoints and creates two messages, one for and the other for . This allows messages to depend on the target’s state as well as on the sender’s state enabling the network to send different messages whenever the states correlate to a satisfying or unsatisfying assignment for the constraint. This process of message passing and updating the internal state is repeated times.
For every variable the network produces a soft assignment in each iteration from the internal state . To obtain this assignment from the states, a trainable linear function is applied to each variable state to reduce the dimensionality from to , where . The output of this linear transformation is then passed through a softmax function to obtain a stochastic vector which we call a soft assignment. This soft assignment is the output of the algorithm and can be interpreted to contain probabilities of receiving a certain value for every . To obtain a hard variable assignment from the output, we assign the value with the highest estimated probability in for each variable.
The network is composed of multiple trainable functions. Messages are generated using a simple linear transformation from the internal states. For every relation the messaging function is defined by a trainable weight matrix with
This function takes both internal states as input and creates a pair of messages of length which are then used to update the internal states and , respectively. While the function to create those messages might be an arbitrary neural network, we found that a simple linear architecture yields at least as good results as more complicated nonlinear variants while being efficient and stable during training.
For symmetric relations we modify to enforce that it is a symmetric function. In this case is defined by a matrix such that:
We update every internal state using an update function with information about the old state and the messages from the variable’s neighbors (i.e. variables appearing in the same constraint ). The update of for a variable which received the messages is given by
For the update we average over all messages before applying . In our implementation we chose to be an LSTM cell where the long-term memory has been initialized by for every .
The soft assignment which is the output of our model is created by with . If the domain contains only two values, we make a small modification to simplify this architecture. Instead of choosing , we use and the sigmoid function (instead of softmax) to map the states to a scalar probability . The soft variable assignment is then defined as . Algorithm 1 describes the architecture in pseudocode.
The network’s output depends heavily on the random initialization of the states for every since those are the basis for all messages sent during inference. By applying the network multiple times to the same input and choosing the best solution, we can therefore boost the performance.
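The message passing scheme described in this section can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' TensorFlow implementation: the weights are random rather than trained, the update function is a plain tanh recurrent cell standing in for the LSTM cell, and the parameter names (`M`, `W_s`, `W_m`, `W_out`) and the weight scaling are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(relations, k, d, rng):
    """Random (untrained) weights; one message matrix per relation."""
    scale = 1.0 / np.sqrt(k)
    return {
        "M": {r: rng.normal(scale=scale, size=(2 * k, 2 * k)) for r in relations},
        "W_s": rng.normal(scale=scale, size=(k, k)),
        "W_m": rng.normal(scale=scale, size=(k, k)),
        "W_out": rng.normal(scale=scale, size=(k, d)),
    }

def run_message_passing(n_vars, constraints, params, k, t_max, rng):
    """constraints: list of (x, y, relation_name) with integer variable ids.
    Returns the soft assignment (n_vars x d) produced in every iteration."""
    states = rng.normal(size=(n_vars, k))          # random initial states
    outputs = []
    for _ in range(t_max):
        msg_sum = np.zeros((n_vars, k))
        msg_cnt = np.zeros(n_vars)
        for x, y, r in constraints:
            # Linear message function of both endpoint states; the result
            # is split into one message for x and one for y.
            m = params["M"][r] @ np.concatenate([states[x], states[y]])
            msg_sum[x] += m[:k]
            msg_cnt[x] += 1
            msg_sum[y] += m[k:]
            msg_cnt[y] += 1
        mean_msg = msg_sum / np.maximum(msg_cnt, 1)[:, None]
        # Stand-in update (assumption): tanh recurrent cell instead of an LSTM.
        states = np.tanh(states @ params["W_s"] + mean_msg @ params["W_m"])
        outputs.append(softmax(states @ params["W_out"]))
    return outputs

rng = np.random.default_rng(0)
constraints = [(0, 1, "neq"), (1, 2, "neq"), (0, 2, "neq")]   # a triangle
params = init_params(["neq"], k=8, d=3, rng=rng)
soft = run_message_passing(3, constraints, params, k=8, t_max=5, rng=rng)
hard = soft[-1].argmax(axis=1)   # value with the highest estimated probability
print(hard)
```

Because the same weights are reused in every iteration and for every variable, `t_max` and the instance size can be changed freely without touching the parameters.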
2.2 Loss Function
In the following we describe how to derive the loss function. Let be a CSP-instance, where without loss of generality we assume that for a positive integer . Given , our network will produce a soft variable assignment , where is a stochastic vector for every . We could obtain a hard assignment by independently sampling a value for each variable from the distribution specified by . In this case, the probability that any given constraint is satisfied by can be expressed by
where is the characteristic matrix of the relation with . Our training then aims to minimize the combined negative log-likelihood over all constraints:
For training, we combine the loss function (Equation (5)) with a discount factor to get our training objective which we minimize using the Adam optimizer:
This loss does not depend on any ground-truth variable assignments, so we can train the network without resorting to optimal solutions. Computing optimal solutions of larger instances for supervised training can easily turn out to be prohibitive. Our approach avoids such computations.
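A minimal NumPy sketch of such a loss is given below. It assumes that the probability of satisfying a constraint on variables x and y with relation R is the bilinear form of the two soft assignments with the characteristic matrix of R, that the per-constraint negative log-likelihoods are averaged, and that later iterations receive larger weights in the discounted sum. The exact normalization and direction of the discount, the small constant `eps`, and all names are assumptions of this example.

```python
import math
import numpy as np

def characteristic_matrix(relation, d):
    """0/1 matrix A with A[a, b] = 1 iff the pair (a, b) is in the relation."""
    A = np.zeros((d, d))
    for a, b in relation:
        A[a, b] = 1.0
    return A

def csp_loss(phi, constraints, char_mats, eps=1e-9):
    """Mean negative log-probability that a constraint is satisfied when every
    variable is sampled independently from its soft assignment phi[x]."""
    probs = np.array([phi[x] @ char_mats[r] @ phi[y] for x, y, r in constraints])
    return float(np.mean(-np.log(probs + eps)))

def discounted_loss(phis, constraints, char_mats, discount):
    """Combine per-iteration losses; the last iteration has weight 1 (assumption)."""
    t_max = len(phis)
    return sum(discount ** (t_max - 1 - t) * csp_loss(phi, constraints, char_mats)
               for t, phi in enumerate(phis))

neq = {(a, b) for a in range(3) for b in range(3) if a != b}
char_mats = {"neq": characteristic_matrix(neq, 3)}
triangle = [(0, 1, "neq"), (1, 2, "neq"), (0, 2, "neq")]

uniform = np.full((3, 3), 1.0 / 3.0)   # every color equally likely
one_hot = np.eye(3)                    # vertex i gets color i, a proper coloring
loss_uniform = csp_loss(uniform, triangle, char_mats)
loss_proper = csp_loss(one_hot, triangle, char_mats)
print(loss_uniform, loss_proper)
```

For the uniform soft assignment each inequality constraint is satisfied with probability 2/3, so the loss is close to log(3/2); for the one-hot proper coloring it is close to zero.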
(1) When training a RUN-CSP network, we always focus on a specific CSP specified by its constraint language . For example, we consider 3-Col (or Max-3-Col) with the constraint language . Then we only need to find parameter matrices for relations , and the loss function only depends on the characteristic matrices for .
(2) In this paper, we focus on binary CSPs. To extend the approach to -ary CSPs for some (for example 3-SAT), we set up the message passing networks slightly differently: we introduce an additional node for each constraint and edges between the node for and the nodes corresponding to the variables . We also associate states with the constraint nodes and update these in a similar fashion as the variable nodes. It remains future work to experimentally evaluate this generalized setup.
(3) It may also be possible to extend the framework to the weighted version of Max-CSP, where a weight is associated with each constraint. To do this, we need to replace the averages in the loss function and message collection steps by weighted averages.
To validate our method empirically, we performed experiments for Max-2-Sat, Max-3-Col and Max-IS. Our TensorFlow implementation of RUN-CSP is available at https://github.com/toenshoff/RUN-CSP. For all experiments, we chose the size of the internal states to be . We set the number of iterations to during training and for evaluation. During evaluation, we use parallel runs for every instance and use the best result, which gives us a boost in accuracy. We trained on relatively small randomly generated instances (- variables) and used training sets of size - since larger sets and larger instances did not improve performance. The learning rate was initially set to and decayed with a factor of after every training steps. We trained for epochs using a batch size of . In the loss function we used a decay rate of .
3.1 Maximum 2-SAT
We view Max-2-Sat as a binary CSP with domain and a constraint language consisting of three relations (for clauses with two negated literals), (one negated literal), (no negated literals). For example, is the set of satisfying assignments for a clause .
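The three relations can be written down explicitly. In the sketch below the names `R_nn`, `R_np` and `R_pp` are hypothetical labels chosen for this example, and the convention that the negated literal comes first in the mixed case is an assumption.

```python
from itertools import product

def clause_relation(neg_x, neg_y):
    """Satisfying assignments (a, b) of a 2-clause over variables x and y.
    neg_x / neg_y are 1 if the corresponding literal is negated, else 0.
    A literal is true iff the value of its variable differs from its negation flag."""
    return {(a, b) for a, b in product((0, 1), repeat=2)
            if a != neg_x or b != neg_y}

R_nn = clause_relation(1, 1)   # (not x) or (not y)
R_np = clause_relation(1, 0)   # (not x) or y
R_pp = clause_relation(0, 0)   # x or y

print(sorted(R_nn))  # [(0, 0), (0, 1), (1, 0)]
print(sorted(R_np))  # [(0, 0), (0, 1), (1, 1)]
print(sorted(R_pp))  # [(0, 1), (1, 0), (1, 1)]
```

Under this convention a clause of the form x or (not y) needs no fourth relation: it is the constraint `(y, x, R_np)` with the two variables swapped.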
For training RUN-CSP, we computed a dataset of random formulas where each formula has variables and clauses. The Max-2-Sat formulas are chosen at random in the following way. At first, we uniformly draw two distinct variables for every clause such that no clause is a tautology. Then, we independently negate the two literals with probability .
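The sampling procedure just described could look as follows. The function name, the clause encoding as pairs of `(variable, negated)` literals, and the parameter values in the usage line are assumptions for illustration; in particular, the negation probability is left as a required argument.

```python
import random

def random_2cnf(n_vars, n_clauses, p_negate, seed=None):
    """Random 2-CNF formula as a list of clauses ((x, neg_x), (y, neg_y))."""
    rng = random.Random(seed)
    formula = []
    for _ in range(n_clauses):
        # Two distinct variables per clause, so no clause is a tautology.
        x, y = rng.sample(range(n_vars), 2)
        # Negate each literal independently with probability p_negate.
        formula.append(((x, rng.random() < p_negate),
                        (y, rng.random() < p_negate)))
    return formula

# Illustrative values only, chosen arbitrarily for this example.
formula = random_2cnf(n_vars=10, n_clauses=30, p_negate=0.5, seed=1)
print(formula[:3])
```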
To test the performance and generalization ability of the trained model, we evaluated it on random -CNF datasets with a varying number of variables and clauses. Figure 1 shows the number of violated clauses in the solutions found by the trained model over different sizes of formulas and different ratios of clauses per variable. Each color corresponds to a distinct number of variables. We plot the number of violated clauses found by RUN-CSP (solid lines) and optimal values for them (dashed lines) over the number of clauses per variable. Each data point in the plot corresponds to the mean number of violated clauses across random formulas. Optimal solutions have been determined using the LMHS Max-SAT solver LMHS. Due to the hard nature of Max-SAT, we only computed the optimal solutions for formulas where this was feasible and cut off computations after 5 hours on each formula. We show each curve up to the point where optimal solutions became infeasible. We observe that RUN-CSP performs well on these instances and returns results close to the optimum over a wide range of formulas. This even holds for formulas with variables, that are times as large as the formulas used for training. As the size and number of constraints of the formulas increase, the gap between the curves also grows, i.e., RUN-CSP satisfies fewer clauses than the optimum. This indicates that the quality of RUN-CSP’s approximations decreases for denser formulas.
Figure 2 depicts the same experiment as Figure 1, but instead of absolute numbers the plot shows the ratio of unsatisfied clauses. We omit the optimal solutions and provide the results for up to 6 clauses per variable. We observe that the ratio of unsatisfied clauses is practically independent of the number of variables. This is another indication that RUN-CSP generalizes well to larger formulas and more constraints than used for training.
We have already seen how to model 3-Col as a CSP with domain using a constraint language consisting of the inequality relation . As the constraint language consists of a single relation, we only need a single messaging function in RUN-CSP and, as is symmetric, we use the special case for symmetric relations described in the architecture section.
Max-3-Col with Hard Instances
We were interested in the behavior of RUN-CSP on graphs that are particularly challenging instances for Max-3-Col. We randomly generated “hard” satisfiable instances of 3-Col with the property that adding a single edge makes them unsatisfiable. We did this by initializing a graph with vertices and randomly sampled edges. After this initialization, we iteratively added more random edges one by one until the graph was no longer 3-colorable. If the initial graph was not 3-colorable, we repeated the initialization with fewer edges. To speed up convergence, we introduced new edges only between nodes with the same color in the current computed coloring. Using a SAT solver, we were able to generate such graphs with up to 400 nodes resulting in an average degree of around . A similar class of graphs has also been proposed as a candidate for hard instances in lemos2019graph. We created training datasets based on those hard instances, each containing graphs with nodes. Hard-Pos contains 3-colorable graphs, for which one additional edge would prohibit 3-colorability. Hard-Neg contains the corresponding non-3-colorable graphs with the additional edge. Hard-Mix contains 3-colorable and non-3-colorable instances. We generated datasets of each type and trained one RUN-CSP model on each of them. For evaluation, we generated hard 3-colorable instances for several graph sizes and let each trained model predict colorings for these graphs. Table 1 contains the average percentage of conflicting edges in the predicted colorings. We report the mean and the standard deviation across the 5 models for each dataset. The percentage of conflicting edges remained below one percent across all trained models and tested graph sizes. We observe that the percentage of conflicting edges increases with the size of the graphs. Table 1 shows that RUN-CSP performs best if trained only on 3-colorable instances.
However, in contrast to amizadeh2018learning, the models trained on strictly non-3-colorable graphs still perform reasonably well.
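The generation procedure for hard instances can be illustrated on small graphs with the sketch below. It swaps the SAT solver for a simple backtracking 3-coloring search, which is only practical for a handful of vertices, and the function names and the parameter values in the usage lines are assumptions of this example.

```python
import random

def build_adj(n, edges):
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def three_coloring(n, adj):
    """Backtracking search; returns a list of colors or None if none exists."""
    colors = [-1] * n
    def solve(v):
        if v == n:
            return True
        for c in range(3):
            if all(colors[u] != c for u in adj[v]):
                colors[v] = c
                if solve(v + 1):
                    return True
        colors[v] = -1
        return False
    return colors[:] if solve(0) else None

def hard_instance(n, n_init_edges, seed=None):
    """Returns (edges, critical_edge): a 3-colorable graph on n >= 4 vertices
    that becomes non-3-colorable once critical_edge is added."""
    rng = random.Random(seed)
    n_init_edges = min(n_init_edges, n * (n - 1) // 2)
    while True:
        edges = set()
        while len(edges) < n_init_edges:
            u, v = rng.sample(range(n), 2)
            edges.add((min(u, v), max(u, v)))
        adj = build_adj(n, edges)
        coloring = three_coloring(n, adj)
        if coloring is not None:
            break
        n_init_edges -= 1          # not 3-colorable: retry with fewer edges
    while True:
        # Only consider new edges between vertices sharing a color in the
        # current coloring; any other edge keeps that coloring valid.
        candidates = [(u, v) for u in range(n) for v in range(u + 1, n)
                      if coloring[u] == coloring[v]]
        u, v = rng.choice(candidates)
        adj[u].add(v)
        adj[v].add(u)
        new_coloring = three_coloring(n, adj)
        if new_coloring is None:
            return sorted(edges), (u, v)
        edges.add((u, v))
        coloring = new_coloring

edges, critical = hard_instance(n=12, n_init_edges=15, seed=0)
print(len(edges), critical)
```

The returned edge list plays the role of a positive instance; adding `critical` to it gives the corresponding negative instance.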
While the aim of RUN-CSP is to produce approximate solutions, we found that its accuracy is sufficient to produce colorings without conflicting edges for many instances. Table 2 provides the percentage of graphs that were optimally colored by the RUN-CSP models trained on Hard-Pos. Again, we provide the mean and standard deviation across the 5 separate models. Additionally, we compare RUN-CSP with the performance of classical -coloring heuristics. We provide the results for a simple greedy heuristic using the DSATUR strategy as well as a state-of-the-art heuristic called HybridEA lewis2015guide.
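Table 1: Percentage of conflicting edges, mean (standard deviation) across the 5 models for each training dataset.
| Nodes | Hard-Pos (%) | Hard-Neg (%) | Hard-Mix (%) |
| 50 | 0.02 (0.01) | 0.04 (0.02) | 0.02 (0.01) |
| 100 | 0.25 (0.03) | 0.31 (0.04) | 0.31 (0.02) |
| 200 | 0.51 (0.02) | 0.60 (0.06) | 0.56 (0.03) |
| 300 | 0.66 (0.03) | 0.77 (0.08) | 0.73 (0.03) |
| 400 | 0.79 (0.03) | 0.90 (0.09) | 0.86 (0.05) |

Table 2: Percentage of graphs that were optimally colored.
| Nodes | RUN-CSP (%) | Greedy (%) | HybridEA (%) |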
We observe that RUN-CSP finds optimal 3-colorings for () of the instances with 50 nodes. This accuracy exceeds the one achieved by lemos2019graph (lemos2019graph) on a similar class of 3-colorable graphs. The fraction of optimal colorings declines for larger instances. While almost half of the graphs with 100 nodes are colored without conflict, the ratio drops below 1% for instances with more than 200 nodes. RUN-CSP performs significantly better than the greedy heuristic but worse than HybridEA across all tested graph sizes. The number of graphs that HybridEA colors with three colors also declines substantially for larger graphs. This indicates that our generated graphs are indeed difficult Max-3-Col instances, even for state-of-the-art heuristics.
3-Col with Erdős-Rényi Graphs
To test the performance of RUN-CSP for Max-3-Col on larger instances, we applied our method to Erdős-Rényi- random graphs. mulet2002coloring (mulet2002coloring) have shown that for this type of random graphs the phase transition from -colorable to non--colorable occurs at an average degree of approximately . We varied the average degree from to in steps of . For each node count and degree we generated random graphs. An instance of RUN-CSP trained on Hard-Pos was used to predict 3-colorings for these random graphs. Figure 3 depicts the fraction of graphs for which the network produced a valid, conflict-free 3-coloring. We also report this value for the HybridEA heuristic on the same graphs. We observe that up to a degree of almost all graphs across all tested sizes are colored without conflicts. For instances with nodes the performance of RUN-CSP closely matches that of HybridEA. For larger graphs the threshold at which our model stops producing conflict free colorings shifts towards a smaller average degree. We remark that training on larger graphs did not improve the quality of solution of RUN-CSP on larger graphs. This coincides with our previous observation that our learned approximation function struggles to find optimal solutions for larger instances. For these graphs, HybridEA does find valid 3-colorings for significantly more graphs. However, the model trained on graphs with 100 nodes was able to produce optimal 3-colorings on graphs that are 16-times larger, up to a certain edge density.
3.3 Maximum Independent Set
Finally, we experimented with the maximum independent set problem Max-IS. We can view independent set as a CSP with domain and a constraint language consisting of a single relation . As for colorability, the CSP instance corresponding to a graph has a variable for every vertex of and a constraint for every edge . Value for a variable indicates that belongs to the independent set. Max-IS is not the maximization version of this CSP aiming to satisfy as many constraints as possible (we can always satisfy all constraints by setting all variables to ). Instead, the objective of Max-IS is to set as many variables as possible to value subject to all constraints being satisfied.
To model this in our RUN-CSP framework, we modify the loss function. For a graph and a soft assignment , we define
Observe that is the usual RUN-CSP loss function for independent set as a constraint satisfaction problem. favors larger independent sets. A naive weighted sum of both terms turned out to be unstable during training and yielded poor results, whereas the product in (7) worked well.
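To make the idea of a product-form loss concrete, the sketch below multiplies a constraint term by a size term. The exact form of Equation (7) is not reproduced here: the additive offset `kappa`, the averaging, the constant `eps`, and the function name are assumptions of this example, chosen so that neither factor can vanish and hide the other.

```python
import numpy as np

def mis_loss(p_in, edges, kappa=1.0, eps=1e-9):
    """p_in[v]: soft probability that vertex v belongs to the independent set."""
    p_in = np.asarray(p_in, dtype=float)
    u, v = np.array(edges).T
    # An edge constraint is violated only if both endpoints are in the set,
    # so it is satisfied with probability 1 - p_in[u] * p_in[v].
    loss_csp = np.mean(-np.log(1.0 - p_in[u] * p_in[v] + eps))
    # The size term shrinks as more vertices are (softly) put into the set.
    loss_size = np.mean(1.0 - p_in)
    return float((kappa + loss_csp) * (kappa + loss_size))

path = [(0, 1), (1, 2)]                  # path on three vertices
good = mis_loss([1.0, 0.0, 1.0], path)   # independent set {0, 2}
empty = mis_loss([0.0, 0.0, 0.0], path)  # valid but empty set
bad = mis_loss([1.0, 1.0, 1.0], path)    # every edge is in conflict
print(good, empty, bad)
```

In this toy example the maximum independent set receives the smallest loss, the empty set a larger one, and the conflicting assignment by far the largest.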
For training, the loss was combined across iterations with a discount factor as for the standard RUN-CSP architecture. We trained the network for epochs with a batch size of 32 on random graphs with nodes and edges and chose the network producing the lowest number of conflicts during training. This is not necessarily the one producing the largest independent set. Nevertheless, especially on denser graphs, the predictions tended to contain a small number of conflicting edges. To address this issue, we applied a simple post-processing step. For each conflicting edge, we removed one of the endpoints from the predicted set making it independent. There are smarter approaches to eliminate conflicts which may lead to larger independent sets. We decided to use the simplest way to focus on the performance of RUN-CSP.
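The post-processing step is simple enough to state in a few lines. In this sketch, which endpoint of a conflicting edge is removed and the order in which edges are visited are arbitrary choices made for the example.

```python
def remove_conflicts(selected, edges):
    """Drop one endpoint of every edge whose endpoints are both selected."""
    independent = set(selected)
    for u, v in edges:
        if u in independent and v in independent:
            independent.discard(v)   # arbitrary choice of endpoint (assumption)
    return independent

path = [(0, 1), (1, 2), (2, 3)]
print(sorted(remove_conflicts({0, 1, 2, 3}, path)))  # [0, 2]
```

Every edge is checked once, and vertices are only ever removed, so the result is always an independent set, although not necessarily the largest one obtainable from the prediction.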
We evaluated the performance across random graphs of different sizes and densities. The average degree was varied from to in steps of and the number of nodes was chosen as , , and . For each combination of node counts and degrees, we generated Erdős-Rényi graphs with the appropriate number of edges. Figure 4 depicts the average sizes of the independent sets computed by RUN-CSP and the average sizes of the optima. We used the LMHS Solver LMHS to compute optimal solutions, with a time limit of five hours per instance. The figure shows that the network’s approximations tend to be close to the optimum. Again, however, the quality of the approximation decreases as the graphs get larger and denser.
In Figure 5 we plot the average number of conflicting edges in the output of the network before post-processing over the graphs from Figure 4. The number of conflicting edges increases as the size and the density of the graphs increase. Without post-processing the model would produce mostly invalid independent sets for larger and denser graphs. When applying post-processing, the network appears to generalize across a wide variety of graph sizes and densities.
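A random graph with a prescribed average degree can be sampled by fixing the number of edges, since a graph with n nodes and m edges has average degree 2m/n. The sketch below assumes this fixed-edge-count reading of the Erdős-Rényi model; the function name and the values in the usage line are chosen for illustration only.

```python
import random

def random_graph_with_degree(n, avg_degree, seed=None):
    """Uniformly random simple graph on n nodes with round(n * avg_degree / 2) edges."""
    m = round(n * avg_degree / 2)
    if m > n * (n - 1) // 2:
        raise ValueError("average degree too large for a simple graph")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)        # two distinct endpoints
        edges.add((min(u, v), max(u, v)))     # store each edge once
    return sorted(edges)

graph = random_graph_with_degree(n=100, avg_degree=4.0, seed=0)
print(len(graph))  # 200
```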
We have presented a universal approach for solving binary Max-CSPs with recurrent unsupervised neural networks. Our experiments on the optimization problems Max-2-Sat, Max-3-Col and Max-IS show that RUN-CSP computes good approximations which are close to the optimum. We showed that the learned message passing functions generalize to instances with significantly more variables and constraints than the instances seen during training. For the decision variant 3-Col we could even compute optimal solutions for many instances. On this problem, RUN-CSP performed better than current neural methods and simple heuristics (e.g. greedy) in terms of accuracy and size of the instances. RUN-CSP does not outperform state-of-the-art heuristics, but does match their performance for relatively small instances of 3-Col.
Even though RUN-CSP is a general approach for solving binary Max-CSPs, in practice it will not work for CSPs with a large number of different relations, since we need a separate messaging function for each type of relation. If we trained an instance of RUN-CSP for all possible relations, the size of the network would be exponential in the domain size. All in all, RUN-CSP is a promising framework for approximating binary Max-CSPs.
We plan to extend RUN-CSP to CSPs of arbitrary arity and to weighted CSPs. It will be interesting to see, for example, how it performs on 3-SAT and its maximization variant.
There are also interesting theoretical questions regarding the expressiveness of our message passing network. While we cannot hope to solve NP-complete problems with networks running in polynomial time, we could ask if our network can solve CSPs that are solvable in polynomial time, for example 2-colorability or 2-satisfiability. Another question is whether the network can solve NP-complete CSPs if we allow vectors of arbitrary, possibly exponential, length as states, or if we allow the network to run for an arbitrary number of iterations.