Differentiable Fixed-Point Iteration Layer
Recently, several studies have proposed methods to utilize certain restricted classes of optimization problems as layers of deep neural networks. However, these methods are still in their infancy and require special treatment, e.g., analysis of the KKT conditions, to derive the backpropagation formula. Instead, in this paper, we propose a method to utilize fixed-point iteration (FPI), a generalization of many types of numerical algorithms, as a network layer. We show that the derivative of an FPI layer depends only on the fixed point, and then we present a method to calculate it efficiently using another FPI, which we call the backward FPI. The proposed method can be easily implemented based on the autograd functionalities of existing deep learning tools. Since FPI covers a vast range of numerical algorithms in machine learning and other fields, it has many potential applications. In the experiments, the differentiable FPI layer is applied to two scenarios, i.e., gradient descent iterations for differentiable optimization problems and FPI with arbitrary neural network modules, and the results demonstrate its simplicity and effectiveness.
Fixed-point iteration (FPI) has long been one of the most important building blocks in many areas of machine learning. It usually appears in the form of a numerical optimization scheme for finding the desired solution of a given problem. Examples abound in the literature: support vector machines, expectation-maximization, compressed sensing, value/policy iteration in reinforcement learning, and so on. All of these are popular methods that shaped the field over the past few decades.
However, such existing techniques are becoming less used in the current deep learning era, due to the superior performance of deep neural networks. Even when a well-defined traditional algorithm already exists, using it directly as part of a new machine learning algorithm with a deep network can be tricky. Unless the algorithm gives a closed-form solution or is used separately for pre- or post-processing, incorporating it in a deep network is not straightforward because the iterative process makes backpropagation difficult. A naïve way to resolve this issue is to regard each iteration as a node in the computational graph, but this requires a lot of computational resources. As a result, it is no exaggeration to say that much current work re-implements a large part of the existing methodology with deep networks.
Recently, several papers have proposed using certain types of optimization problems as layers in deep networks Belanger and McCallum (2016); Amos and Kolter (2017); Amos et al. (2017). In these approaches, the input or the weights of the layer define the cost function of an optimization problem, and the solution of the problem becomes the layer's output. For certain classes of optimization problems, these layers are differentiable. Such methods can introduce a prior into a deep network and offer a way to bridge the gap between deep learning and the traditional methods above. However, they are still immature and require non-trivial effort to implement in actual applications. In particular, the backpropagation formula has to be derived explicitly for each different formulation based on criteria such as the KKT conditions. This limits the practicality of these approaches, since there can be numerous different optimization formulations depending on the actual problem.
In this paper, we instead focus on FPI which is the basis of many numerical algorithms including most gradient-based optimizations. In the proposed FPI layer, the layer’s input or its weights are used for defining an update equation, and the output of the layer is the fixed point of the update equation. Under mild conditions, the FPI layer is differentiable and the derivative depends only on the fixed point, which is much more efficient than adding all the individual iterations to the computational graph. However, an even more important advantage of the proposed FPI layer is that the derivative can be easily calculated based on another independent computational graph that describes a single iteration of the update equation. In other words, we do not need a separate derivation for the backpropagation formula and can utilize the autograd functionalities in existing deep learning tools. This makes the proposed method very simple and practical to use in various applications.
Specifically, the derivative of the FPI layer in its original form requires calculating the Jacobian of the update equation. Since this can be a computational burden, we show that the computation can be transformed into another equivalent FPI, which we call the backward FPI. We also show that if the aforementioned conditions for the FPI layer hold, then the backward FPI also converges. In summary, both the forward and backward passes of the proposed method are composed of FPIs.
Since FPI covers many other types of numerical algorithms as well as optimization problems, there are a lot of potential applications for the proposed method. Contributions of the paper are summarized as follows.
We propose a novel layer based on FPI. The FPI layer can perform similar functionalities as existing layers based on differentiable optimization problems, but the implementation is much simpler and the backpropagation formula can be universally derived.
We show that under mild conditions, the FPI layer is differentiable and the derivative depends only on the fixed point, which eliminates the necessity of constructing a large computational graph.
To reduce the memory consumption in backpropagation, we derive the backward FPI that can calculate the derivative with a reduced memory requirement. Under the same conditions mentioned in the previous item, it is guaranteed that the backward FPI converges.
2 Background and Related Work
Fixed-point iteration: For a given function $f$, the fixed-point iteration is defined based on the following update equation:

$x_{n+1} = f(x_n)$, (1)
where $\{x_n\}$ is a sequence of vectors. If the sequence converges to some $\hat{x}$, then $\hat{x} = f(\hat{x})$ and $\hat{x}$ is called a fixed point of $f$. The gradient descent method ($f(x) = x - \gamma \nabla g(x)$) and Newton's method ($f(x) = x - g(x)/g'(x)$ for a scalar sequence) are popular examples of fixed-point iteration. Many numerical algorithms are based on fixed-point iteration, and there are also many popular examples in machine learning.
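As an illustrative sketch (not from the paper), the two classical examples above can be written as instances of one generic fixed-point iteration; the function `fixed_point_iteration`, the step size, and the target functions are hypothetical choices:

```python
import math

def fixed_point_iteration(f, x0, tol=1e-10, max_iter=1000):
    """Iterate x_{n+1} = f(x_n) until successive iterates differ by < tol."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Newton's method for g(x) = x^2 - 2 as an FPI: x_{n+1} = x - g(x)/g'(x).
newton_sqrt2 = fixed_point_iteration(lambda x: x - (x * x - 2.0) / (2.0 * x), 1.0)

# Gradient descent on g(x) = (x - 3)^2 with step 0.1: x_{n+1} = x - 0.1 * g'(x).
gd_min = fixed_point_iteration(lambda x: x - 0.1 * 2.0 * (x - 3.0), 0.0)
```

Both iterations converge to the fixed point of their update map, which is the root of $g$ (Newton) or the minimizer of $g$ (gradient descent).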
Here are some important concepts about fixed-point iteration.
Definition 1 (Contraction mapping). Khamsi and Kirk (2011) On a metric space $(X, d)$, a function $f: X \to X$ is a contraction mapping if there is a real number $0 \le k < 1$ that satisfies the following inequality for all $x$ and $y$ in $X$:

$d(f(x), f(y)) \le k \, d(x, y)$. (2)
The smallest $k$ that satisfies the above condition is called the Lipschitz constant of $f$. We use the most common distance metric, the Euclidean ($L_2$) distance, in this paper.
Based on the above definition, the Banach fixed-point theorem Banach (1922) states the following.
Theorem 1 (Banach fixed-point theorem). A contraction mapping has exactly one fixed point and it can be found by starting with any initial point and iterating the update equation until convergence.
Therefore, if $f$ is a contraction mapping, the iteration converges to a unique point $\hat{x}$ regardless of the starting point $x_0$. The above concepts are important in deriving the proposed FPI layer in this paper.
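A minimal sketch of the theorem's practical consequence: iterating a contraction from very different starting points reaches the same fixed point. The function and starting points below are arbitrary illustrations:

```python
def iterate(f, x0, n=60):
    """Apply f to x0 n times."""
    x = x0
    for _ in range(n):
        x = f(x)
    return x

# f is a contraction on R with Lipschitz constant 0.5; its unique fixed point is 2,
# since 2 = 0.5 * 2 + 1.
f = lambda x: 0.5 * x + 1.0
results = [iterate(f, x0) for x0 in (-100.0, 0.0, 7.5, 1e6)]
```

The error shrinks by a factor of at least $k = 0.5$ per step, so even a starting point of $10^6$ reaches the fixed point to machine precision within 60 iterations.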
Energy function networks: Scalar-valued networks that estimate energy (or error) functions have generated considerable recent research interest. These energy networks have a different structure from general feed-forward neural networks, and the concept was first proposed in LeCun et al. (2006). They predict the answer as the input that minimizes the network's output.
The structured prediction energy networks (SPENs) Belanger and McCallum (2016) perform gradient descent on an energy function network to find the solution, and an SSVM Tsochantaridis et al. (2004) loss is defined based on the solution to train the network. The input convex neural networks (ICNNs) Amos et al. (2017) are defined in a specific way so that the networks are convex with respect to (w.r.t.) the input, and learning and inference are performed by the bundle entropy method, which is derived from the KKT optimality conditions. The deep value networks Gygli et al. (2017) and the IoU-Net Jiang et al. (2018) directly learn a loss metric, such as the intersection over union (IoU) of bounding boxes, and then perform inference by gradient-based optimization. However, these methods generally require complex learning processes, and each method is specialized to a limited range of applications. They also require various approximations, relaxations, or constraints such as boundedness conditions.
On the other hand, the end-to-end SPENs Belanger et al. (2017) directly backpropagate through the whole gradient-based inference process, which has a fixed number of gradient steps. Although this alleviates the need for a complicated backpropagation derivation, the memory requirement grows with the number of steps, since all the steps are placed in a computational graph and the gradient of each step must be calculated. If the number of steps is small, there is a high chance of not converging to the optimal solution, which is not ideal from the perspective of energy-based learning.
Although the above approaches provide novel ways of utilizing neural networks in optimization frameworks, there is no apparent way to combine them with other existing deep networks. Moreover, they are mostly limited to certain types of problems and require complicated learning processes. The proposed method, in contrast, can be applied in broader situations, and these approaches can be equivalently implemented with the proposed method once the update equation for the optimization problem is derived.
Differentiable optimization layers:
Recently, a few papers have proposed using optimization problems as layers of deep learning architectures. Such a structure can encode more complicated behavior in one layer than a feed-forward layer, and can potentially reduce the depth of the network.
OptNet Amos and Kolter (2017) showed how to use a quadratic program (QP) as a layer of a neural network, using the KKT conditions to compute the derivative of the QP solution. Agrawal et al. (2019a) proposed an approach to differentiate disciplined convex programs, a subclass of convex optimization problems. A few other studies differentiate through optimization problems such as submodular models Djolonga and Krause (2017), cone programs Agrawal et al. (2019b), semidefinite programs Wang et al. (2019), and so on. However, most of them have limited applications, and users need to adapt their problems to rigid problem settings. On the other hand, our method makes it easy to use a large class of iterative algorithms as a network layer, which also includes differentiable optimization problems.
3 Proposed Method
The fixed-point iteration formula encompasses a wide variety of forms and applies to most iterative algorithms. Section 3.1 describes the basic structure and principles of the FPI layer. Sections 3.2 and 3.3 explain the differentiation of the layer for backpropagation. Section 3.4 discusses the convergence of the backward FPI, and Section 3.5 presents some example applications.
3.1 Structure of the fixed-point iteration layer
Here we describe the basic operation of the FPI layer. Let $f_\theta(x, z)$ be a parametric function, where $x$ and $z$ are vectors of real numbers and $\theta$ is the parameter. We assume that $f_\theta$ is differentiable and has a Lipschitz constant less than one w.r.t. $z$, so that the following fixed-point iteration converges to a unique point according to the Banach fixed-point theorem:

$z_{n+1} = f_\theta(x, z_n)$. (3)
In practice, $f_\theta$ could be a neural network or an algorithm such as gradient descent. The FPI layer is defined based on the above relations: it takes data or the output of the previous layer as its input $x$, and yields the fixed point $\hat{z}$ of $f_\theta(x, \cdot)$ as the layer's output: $\hat{z} = \lim_{n \to \infty} z_n$.
Here, $\theta$ acts as the parameter of the layer. Note that the layer also receives the initial point $z_0$, but its actual value does not matter in the training procedure because $f_\theta(x, \cdot)$ has a unique fixed point. This will also be confirmed later in the derivation of backpropagation. Hence, $z_0$ can be set to any predetermined value. Accordingly, we will often express the output as a function of $x$ and $\theta$, i.e., $\hat{z}(x, \theta)$.
When using an FPI layer, (3) is actually repeated until convergence to find the output . We may use some acceleration techniques such as the Anderson acceleration Peng et al. (2018) if it takes too long to converge.
In multi-layer networks, the output $x$ of the previous layer is passed to the FPI layer, and its output $\hat{z}$ can be passed to another layer to continue the feed-forward process. Note that there is no necessary relation between the shapes of $x$ and $z$; hence the sizes of the input and output of an FPI layer can differ.
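To make the forward pass concrete, here is a minimal sketch of an FPI layer, assuming a hypothetical update $f_\theta(x, z) = \tanh(Wz + Ux + b)$ whose weights are small enough that it is a contraction in $z$ (tanh is 1-Lipschitz, so the Lipschitz constant in $z$ is bounded by the norm of $W$):

```python
import math

# Hypothetical layer parameters theta = (W, U, b); W is small so f is a contraction in z.
W = [[0.2, -0.1], [0.05, 0.3]]
U = [[1.0, 0.0], [0.0, 1.0]]
b = [0.1, -0.2]

def f_theta(x, z):
    """One FPI step: z_next = tanh(W z + U x + b)."""
    return [math.tanh(sum(W[i][j] * z[j] for j in range(2))
                      + sum(U[i][j] * x[j] for j in range(2)) + b[i])
            for i in range(2)]

def fpi_layer(x, z0=(0.0, 0.0), tol=1e-9, max_iter=500):
    """Forward pass: iterate until successive iterates are within tol."""
    z = list(z0)
    for _ in range(max_iter):
        z_next = f_theta(x, z)
        if max(abs(a - c) for a, c in zip(z_next, z)) < tol:
            return z_next
        z = z_next
    return z

z_hat = fpi_layer([0.5, -1.0])
```

The output satisfies the fixed-point equation $\hat{z} = f_\theta(x, \hat{z})$ up to the tolerance, and (as stated above) different choices of $z_0$ give the same $\hat{z}$.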
3.2 Differentiation of the FPI layer
As with other network layers, learning is performed by updating $\theta$ based on backpropagation, which requires the derivatives of the FPI layer. One simple way to compute the gradients is to construct a computational graph for all the iterations up to the fixed point $\hat{z}$. For example, if the iteration converges in $N$ steps ($\hat{z} = z_N$), all the derivatives from $z_N$ back to $z_1$ can be calculated by the chain rule. However, this method is not only time consuming but also requires a lot of memory.
In this section, we show that the derivative of the entire FPI layer depends only on the fixed point $\hat{z}$. In other words, none of the intermediate iterates $z_n$ before convergence are needed to compute the derivatives. Hence, we can retain only the value of $\hat{z}$ for backpropagation and treat the entire layer as a single node in the computational graph. Here, we provide the derivation of $\partial \hat{z} / \partial \theta$; the derivation of $\partial \hat{z} / \partial x$ (which is needed for the backpropagation of the layers before the FPI layer) is mostly similar.
Note that the following equation is satisfied at the fixed point $\hat{z}$:

$\hat{z} = f_\theta(x, \hat{z})$. (6)

If we differentiate both sides of the above equation w.r.t. $\theta$, we have

$\dfrac{\partial \hat{z}}{\partial \theta} = \dfrac{\partial f_\theta}{\partial \theta}(x, \hat{z}) + \dfrac{\partial f_\theta}{\partial z}(x, \hat{z}) \, \dfrac{\partial \hat{z}}{\partial \theta}$. (7)

Here, $x$ is not differentiated because $x$ and $\theta$ are independent. Rearranging the above equation gives

$\dfrac{\partial \hat{z}}{\partial \theta} = \Big(I - \dfrac{\partial f_\theta}{\partial z}(x, \hat{z})\Big)^{-1} \dfrac{\partial f_\theta}{\partial \theta}(x, \hat{z})$, (8)

which confirms that the derivative of the output of the FPI layer depends only on the value of $\hat{z}$.
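This implicit-differentiation result can be checked numerically on a toy scalar map. Here $f_\theta(x, z) = 0.5\sin(z) + \theta x$ is a hypothetical example (Lipschitz constant 0.5 in $z$); the closed-form derivative from the rearranged equation is compared against a finite-difference derivative taken through the full iteration:

```python
import math

def solve_fpi(theta, x, z0=0.0, n=200):
    """Iterate z <- f_theta(x, z) = 0.5*sin(z) + theta*x to the fixed point."""
    z = z0
    for _ in range(n):
        z = 0.5 * math.sin(z) + theta * x
    return z

theta, x = 0.7, 1.3
z_hat = solve_fpi(theta, x)

# Implicit formula: dz/dtheta = (1 - df/dz)^(-1) * df/dtheta, evaluated at z_hat.
df_dz = 0.5 * math.cos(z_hat)
df_dtheta = x
dz_dtheta = df_dtheta / (1.0 - df_dz)

# Finite-difference check through the full iteration (uses no intermediate iterates
# of the original run -- only the converged values).
eps = 1e-6
fd = (solve_fpi(theta + eps, x) - solve_fpi(theta - eps, x)) / (2 * eps)
```

The two values agree to high precision, illustrating that the derivative can be computed from the fixed point alone.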
A nice property of the above derivation is that it only requires the derivatives of a single iteration $f_\theta$. These can be easily calculated with the autograd functionalities of existing deep learning tools, and no separate derivation is needed. However, care should be taken that the differentiations w.r.t. $\theta$ and $z$ are partial differentiations. $\theta$ and $z$ might depend on each other in a usual autograd framework, hence we have to create an independent computational graph that has $x$, $z$, and $\theta$ as leaf variables cloned from the nodes in the original computational graph. In this way, the partial differentiation can be performed accurately.
One downside of the above derivation is that it requires calculating the Jacobians of $f_\theta$, which may need a lot of memory. In the next section, we provide an efficient way to resolve this issue with another FPI.
3.3 Backward fixed-point iteration
Here, we assume that the output $\hat{z}$ of the FPI layer is passed to some other layers for further processing, followed by a loss function at the end of the network. If we summarize all the layers and the loss function after the FPI layer into a single function $\ell(\hat{z})$, then what we need for backpropagation are $\partial \ell / \partial \theta$ and $\partial \ell / \partial x$. As in the previous section, we only provide the derivation of $\partial \ell / \partial \theta$.
According to (8), we have

$\dfrac{\partial \ell}{\partial \theta} = \dfrac{\partial \ell}{\partial \hat{z}} \dfrac{\partial \hat{z}}{\partial \theta} = \dfrac{\partial \ell}{\partial \hat{z}} \Big(I - \dfrac{\partial f_\theta}{\partial z}\Big)^{-1} \dfrac{\partial f_\theta}{\partial \theta}$. (9)

This section describes how to calculate the above expression efficiently. (9) can be divided into two steps as follows:

$v^\top = \dfrac{\partial \ell}{\partial \hat{z}} \Big(I - \dfrac{\partial f_\theta}{\partial z}\Big)^{-1}$, (10)

$\dfrac{\partial \ell}{\partial \theta} = v^\top \dfrac{\partial f_\theta}{\partial \theta}$. (11)

Rearranging (10) yields the following result:

$v^\top = \dfrac{\partial \ell}{\partial \hat{z}} + v^\top \dfrac{\partial f_\theta}{\partial z}$. (12)

Note that the above equation has the form of an FPI, hence we perform the following iteration, which we call the backward FPI:

$v_{n+1}^\top = \dfrac{\partial \ell}{\partial \hat{z}} + v_n^\top \dfrac{\partial f_\theta}{\partial z}(x, \hat{z})$. (13)
$v$ can be obtained by initializing $v_0$ to an arbitrary value and repeating the above update until convergence. If the forward iteration $f_\theta$ is a contraction mapping, we can prove that the backward FPI is also a contraction mapping, which guarantees convergence to a unique point. The proof of convergence is discussed in detail in the next section.
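As a sanity check (with illustrative values, not from the paper), the backward FPI can be compared against the direct linear solve of $v^\top (I - J) = \partial\ell/\partial\hat{z}$ for a small fixed Jacobian $J$:

```python
# Backward FPI sketch: iterate v <- g + v J (row vector times matrix) for a
# fixed 2x2 Jacobian J with spectral radius < 1, then compare with the direct
# solve v = g (I - J)^(-1). J stands in for df/dz at the fixed point and
# g for dl/dz_hat; both are illustrative.
J = [[0.3, -0.2], [0.1, 0.25]]
g = [1.0, -2.0]

def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i M_ij."""
    return [sum(v[i] * M[i][j] for i in range(2)) for j in range(2)]

v = [0.0, 0.0]
for _ in range(200):                  # backward FPI until convergence
    v = [g[j] + vj for j, vj in enumerate(vec_mat(v, J))]

# Direct solve of v (I - J) = g via the 2x2 inverse of A = I - J.
A = [[1 - J[0][0], -J[0][1]], [-J[1][0], 1 - J[1][1]]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
v_direct = vec_mat(g, A_inv)
```

The iteration reproduces the direct solution without ever inverting $I - J$, which is the point of the backward FPI when the Jacobian is too large to form.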
In the above update equation, $v_n^\top \frac{\partial f_\theta}{\partial z}$ can be calculated without explicitly forming the Jacobian. If we define a new function $h$ as

$h(x, z, \theta; v) = v^\top f_\theta(x, z)$, (14)

then (13) becomes

$v_{n+1}^\top = \dfrac{\partial \ell}{\partial \hat{z}} + \dfrac{\partial h}{\partial z}(x, \hat{z}, \theta; v_n)$. (15)

Note that the output of $h$ is a scalar. Here, we can consider $h$ as another small network containing only one step of the FPI ($f_\theta$), so the gradient of $h$ can also be easily computed with autograd functionalities. Similarly, the second step (11) can be expressed using $h$:

$\dfrac{\partial \ell}{\partial \theta} = \dfrac{\partial h}{\partial \theta}(x, \hat{z}, \theta; v)$. (16)

In this way, we can compute $\partial \ell / \partial \theta$ without any memory-intensive operations.
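The scalar-function trick can be sketched numerically: with $h(z; v) = v^\top f(z)$, the gradient of $h$ w.r.t. $z$ equals $v^\top J$ without ever forming $J$. The map $f$ and the finite-difference gradient below are illustrative stand-ins for an autograd framework:

```python
import math

def f(z):
    """A smooth map R^2 -> R^2 (illustrative stand-in for one FPI step)."""
    return [math.tanh(0.3 * z[0] - 0.2 * z[1]),
            math.tanh(0.1 * z[0] + 0.25 * z[1])]

def h(z, v):
    """Scalar function h(z; v) = v . f(z)."""
    return sum(vi * fi for vi, fi in zip(v, f(z)))

def grad_h(z, v, eps=1e-6):
    """Gradient of h w.r.t. z by central differences (stand-in for autograd)."""
    g = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        g.append((h(zp, v) - h(zm, v)) / (2 * eps))
    return g

# Explicit v^T J for comparison, using tanh'(u) = 1 - tanh(u)^2.
z, v = [0.4, -0.3], [1.0, 2.0]
u1, u2 = 0.3 * z[0] - 0.2 * z[1], 0.1 * z[0] + 0.25 * z[1]
d1, d2 = 1 - math.tanh(u1) ** 2, 1 - math.tanh(u2) ** 2
vJ = [v[0] * d1 * 0.3 + v[1] * d2 * 0.1,
      v[0] * d1 * (-0.2) + v[1] * d2 * 0.25]
```

The gradient of the scalar $h$ matches the vector-Jacobian product $v^\top J$, which is why a single autograd call on $h$ suffices in each backward FPI step.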
3.4 Convergence of the fixed-point iteration layer
The forward pass of the FPI layer converges if the bounded Lipschitz assumption holds. For example, to make a fully connected layer a contraction mapping, it suffices to divide the weight matrix by any number greater than its maximum singular value. In practice, we empirically found that initializing the weights $\theta$ to small values is enough to keep $f_\theta$ a contraction mapping throughout the training procedure.
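The singular-value scaling mentioned above can be sketched as follows; the weight matrix and the safety factor 1.1 are arbitrary illustrations:

```python
import math
import random

W = [[1.5, -0.8], [0.6, 2.0]]   # an arbitrary 2x2 weight matrix (illustrative)

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

# Largest singular value of W from the eigenvalues of G = W^T W (2x2 closed form).
G = [[sum(W[k][i] * W[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
tr, det = G[0][0] + G[1][1], G[0][0] * G[1][1] - G[0][1] * G[1][0]
sigma_max = math.sqrt((tr + math.sqrt(tr * tr - 4 * det)) / 2)

scale = 1.1 * sigma_max          # divide by anything greater than sigma_max
W_c = [[w / scale for w in row] for row in W]

# The scaled linear map is a contraction: ||W_c z1 - W_c z2|| <= k ||z1 - z2||, k < 1.
random.seed(0)
def norm(v):
    return math.sqrt(sum(a * a for a in v))
ratios = []
for _ in range(100):
    z1 = [random.uniform(-5, 5) for _ in range(2)]
    z2 = [random.uniform(-5, 5) for _ in range(2)]
    d = [a - b for a, b in zip(z1, z2)]
    ratios.append(norm(mat_vec(W_c, d)) / norm(d))
```

Every ratio is bounded by $1/1.1 < 1$, the operator norm of the scaled matrix.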
Convergence of the backward FPI. The backward FPI is a linear mapping based on the Jacobian of $f_\theta$ at the fixed point $\hat{z}$. Its convergence is confirmed by the following proposition.
Proposition 1. If $f_\theta(x, \cdot)$ is a contraction mapping, the backward FPI (Eq. (13)) converges to a unique point.
Proof. For simplicity, we omit $x$ and $\theta$ and write $f(z)$ for $f_\theta(x, z)$. By the definition of the contraction mapping, and since we assume the metric is the $L_2$ distance,

$\| f(z_1) - f(z_2) \| \le k \, \| z_1 - z_2 \|$ (17)

for all $z_1$ and $z_2$ ($0 \le k < 1$). For a unit vector $u$ and a scalar $\epsilon > 0$, let $z_2 = z_1 + \epsilon u$. Then, Eq. (17) becomes $\| f(z_1 + \epsilon u) - f(z_1) \| \le k \epsilon$. For another unit vector $w$,

$w^\top \big( f(z_1 + \epsilon u) - f(z_1) \big) \le k \epsilon$,

which indicates that

$\lim_{\epsilon \to 0^+} w^\top \dfrac{f(z_1 + \epsilon u) - f(z_1)}{\epsilon} = w^\top J u \le k$,

where $J$ is the Jacobian of $f$ at $z_1$, by the definition of the directional derivative. Since this holds for all unit vectors $u$ and $w$, we have $\| v^\top J \| \le k \| v \|$ for every $v$. Let $v_\Delta = v_1 - v_2$ for arbitrary $v_1$ and $v_2$; then $\| v_1^\top J - v_2^\top J \| = \| v_\Delta^\top J \| \le k \| v_\Delta \|$, which means the linear mapping by $J$ in Eq. (13) is a contraction mapping. By the Banach fixed-point theorem, the backward FPI converges to a unique fixed point. ∎
3.5 Example applications

Gradient descent FPI layer
A perfect example of the FPI layer is the gradient descent method. Energy function networks are scalar-valued networks that estimate energy (or error) functions. Unlike a typical network, which obtains the prediction directly as its output ($\hat{y} = f(x)$), an energy network obtains the answer by optimizing over the input variables of the network ($\hat{y} = \arg\min_y E_\theta(x, y)$). This minimization can be performed by gradient descent:

$y_{n+1} = y_n - \gamma \nabla_y E_\theta(x, y_n)$.

This is a form of FPI, and the fixed point is a stationary (optimal) point of $E_\theta(x, \cdot)$.
In the case of a single FPI-layer network, the loss function is expressed as $L(\theta) = \mathcal{L}(\hat{y}(x, \theta), t)$, where $t$ is the ground truth. This exactly matches the objectives of energy function networks such as ICNN. Therefore, energy function networks can be effectively learned with the proposed FPI layer. Moreover, we can build larger networks by attaching different layers before and after the FPI layer, which yields a large network containing an energy network as a component. Of course, using multiple FPI layers in a single network is also possible.
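A minimal sketch of a gradient descent FPI layer, using a toy quadratic energy in place of an energy function network (the energy, step size, and parameter values are hypothetical; real use would substitute a scalar-valued network and autograd for the gradient):

```python
# Toy energy: E_theta(x, y) = (y - theta * x)^2, minimized at y = theta * x.
def grad_E(theta, x, y):
    """dE/dy for the toy energy (stand-in for autograd on an energy network)."""
    return 2.0 * (y - theta * x)

def gd_fpi_layer(theta, x, y0=0.0, gamma=0.1, tol=1e-10, max_iter=10000):
    """Gradient descent as an FPI: y <- y - gamma * dE/dy until convergence."""
    y = y0
    for _ in range(max_iter):
        y_next = y - gamma * grad_E(theta, x, y)   # one gradient step = one FPI update
        if abs(y_next - y) < tol:
            return y_next
        y = y_next
    return y

y_hat = gd_fpi_layer(theta=2.0, x=1.5)   # minimizer is y = theta * x = 3.0
```

The update map $y \mapsto y - 0.2(y - \theta x)$ has Lipschitz constant 0.8, so the iteration is a contraction and the layer's output is the energy minimizer.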
Neural net FPI layer
It is also possible for $f_\theta$ itself to be a neural network module. The input variable recursively enters the same network, and this is repeated until convergence. This can be much faster than the gradient descent FPI layer since it computes the output directly, and it allows the FPI layer to perform more complicated behaviors than using $f_\theta$ directly without any FPI.
Figure 1 shows an example multi-layer network for a mini-batch of inputs. $\hat{y}$ is the prediction for the input data and $t$ is the ground truth. As can be seen in the figure, the FPI layer is composed of a neural network module.
4 Experiments

We present four experiments showing that the FPI layer can be used in various settings. In the toy example and image denoising problems, we compare the FPI layer to a non-FPI network that has the same structure as $f_\theta$. In the optical flow problem, a relatively small FPI layer is attached at the end of FlowNet Fischer et al. (2015) to show its effectiveness. Results on the multi-label classification problem show that the FPI layer performs favorably against existing state-of-the-art algorithms.
4.1 Toy example: a constrained problem
We show the feasibility of our algorithm by learning a constrained optimization problem with a box constraint, where the inequalities are element-wise. The goal is to learn the functional relation between the problem's input $x$ and its solution based on training samples $(x, t)$, where $t$ is the ground-truth solution of the problem.
4.2 Image denoising
Here we compare image denoising performance on grayscale images perturbed by Gaussian noise. To generate the images, we crop the Flying Chairs dataset Fischer et al. (2015) and convert it to grayscale (400 images for training and 100 for testing). We use the neural network FPI layer and a feedforward network with the same structure, consisting of convolution and ReLU layers. The peak signal-to-noise ratio (PSNR), the most widely used measure of denoising quality, is reported in Table 1.
Table 1 shows that the FPI layer outperforms the feedforward net in all experiments. Note that the performance gap between the two is larger for more difficult tasks, which suggests that the FPI layer is well suited to complex problems.
4.3 Optical flow
Optical flow is a major research area of computer vision that aims to recover motion by matching pixels between two images. We show the benefit of the FPI layer through a simple experiment that attaches it at the end of FlowNet Fischer et al. (2015), where the FPI layer acts as post-processing. We attach a very simple FPI layer consisting of one conv/deconv layer and record the average end-point error (EPE) per epoch in Figure 5.
Although the computation time is nearly the same and the number of added variables is extremely small, the FPI layer yields a noticeable performance improvement.
4.4 Multi label classification
The multi-label text classification dataset (Bibtex) was introduced by Katakis et al. (2008). The goal of the task is to find the correlation between the data and the multi-label features. Both data and features are binary, with 1836 indicators and 159 labels, respectively. The number of active indicators and labels differs for each sample, and the number of labels is not known during evaluation. We follow the same train and test split as Katakis et al. (2008) and report the F1 scores.
Here we use a single FPI-layer network for both the gradient descent FPI (FPI_GD) and the neural network FPI (FPI_NN), each with only one fully-connected hidden layer and ReLU activation. To format the output, FPI_NN has a 159-dimensional fully-connected output layer (equal to the number of labels), and FPI_GD has a mean-square term to make the output scalar. For the experiments, we only need to specify the threshold of the convergence condition; there are no data-specific hyperparameters.
Method | F1 score
MLP Belanger and McCallum (2016) | 38.9
Feedforward net Amos et al. (2017) | 39.6
SPEN Belanger and McCallum (2016) | 42.2
ICNN Amos et al. (2017) | 41.5
DVN (Adversarial) Gygli et al. (2017) | 44.7
FPI_GD layer (Ours) | 43.2
FPI_NN layer (Ours) | 43.4
Table 2 shows the F1 scores on the Bibtex multi-label classification dataset (higher is better). DVN (Adversarial) uses data augmentation to generate adversarial samples. Despite its simple structure, our algorithm performs best among the algorithms that use only the training data and their ground truth (GT).
5 Conclusion

This paper proposed a novel architecture that uses fixed-point iteration as a layer of a neural network. We derived the differentiation of the FPI layer using only derivatives at the convergence (fixed) point, and presented a method to compute it efficiently with another fixed-point iteration called the backward FPI. Applications for two representative cases of FPI (gradient descent and neural network modules) were introduced, and various other applications are possible depending on the form of the FPI layer. Experiments show the potential of the FPI layer compared to feedforward networks with the same structure. They also show that the FPI layer can be easily applied to various fields, with performance comparable to state-of-the-art algorithms without the need for data-specific hyperparameters.
References

- Agrawal et al. (2019a). Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570.
- Agrawal et al. (2019b). Differentiating through a conic program. arXiv preprint arXiv:1904.09043.
- Amos and Kolter (2017). OptNet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 136–145.
- Amos et al. (2017). Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 146–155.
- Banach (1922). Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae 3(1), pp. 133–181.
- Belanger and McCallum (2016). Structured prediction energy networks. In International Conference on Machine Learning, pp. 983–992.
- Belanger et al. (2017). End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 429–439.
- Djolonga and Krause (2017). Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, pp. 1013–1023.
- Fischer et al. (2015). FlowNet: learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852.
- Gygli et al. (2017). Deep value networks learn to evaluate and iteratively refine structured outputs. In Proceedings of the 34th International Conference on Machine Learning, pp. 1341–1351.
- Jiang et al. (2018). Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–799.
- Katakis et al. (2008). Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD Discovery Challenge, Vol. 18, pp. 5.
- Khamsi and Kirk (2011). An introduction to metric spaces and fixed point theory. Vol. 53, John Wiley & Sons.
- LeCun et al. (2006). A tutorial on energy-based learning. Predicting Structured Data 1(0).
- Peng et al. (2018). Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics (TOG) 37(4), pp. 1–14.
- Tsochantaridis et al. (2004). Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 104.
- Wang et al. (2019). SATNet: bridging deep learning and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149.