Deep Algorithms: designs for networks
A new design methodology for neural networks that is guided by traditional algorithm design is presented. To prove our point, we present two heuristics and demonstrate an algorithmic technique for incorporating additional weights in their signal-flow graphs. We show that with training the performance of these networks can not only exceed the performance of the initial network, but can match the performance of more-traditional neural network architectures. A key feature of our approach is that these networks are initialized with parameters that provide a known performance threshold for the architecture on a given task.
Abhejit Rajagopal Dept. of Electrical & Computer Engineering University of California, Santa Barbara 93106 firstname.lastname@example.org Shivkumar Chandrasekaran Dept. of Electrical & Computer Engineering University of California, Santa Barbara 93106 email@example.com Hrushikesh Mhaskar Institute of Mathematical Sciences Claremont Graduate University firstname.lastname@example.org
Preprint. Work in progress.
The original interest in neural networks arose from their connection with biology. Even though the input and output of biological networks are continuous, much of the current interest lies in static inputs and outputs. For such feed-forward networks there has existed a strong theoretical foundation based on approximation theory since the pioneering work of Cybenko, Mhaskar, and others [1, 2, 3]. The recent resurgence of interest arises from the practical success of deep feed-forward networks on image classification problems, which in turn seems to be tied to several factors: the availability of massive training datasets, the availability of large amounts of cheap computing power, and the arrival of practical training algorithms [4, 5]. While approximation theorists have shown the existence of good, but relatively large and shallow, networks, practitioners have found success primarily with deep networks composed of several standard layers (convolutional, max-pooling, etc.) [6, 7]. In particular, practitioners have primarily worked with a Lego-block style of design, in which they successively add and remove standard layers of varying tensor dimensions until a sufficiently good design is arrived at. Each iteration requires careful tweaking of the learning parameters and a carefully calibrated sense of when to pull the plug on a slowly converging network and try a new design. There are many papers devoted to this art, with many case studies. Recently this approach to network design has been ported to other, non-classification problems.
We propose a more systematic approach to the design of networks. In particular we claim that the design of a problem-specific (rather than data-specific) heuristic algorithm is key to the design of the network. We illustrate this design principle by several examples drawn from both classical classification problems and other less traditional machine learning areas.
Machine learning is normally used when the problem specification is so complicated as to defy a compact mathematical presentation. Typical in this area are image and video classification problems, where no precise mathematical formalism exists, but rather the problem is presented as a large corpus of ground-truth data. However, machine learning is also a valid approach when the mathematical problem is so difficult as to defeat the best efforts of human algorithm designers to come up with a well-performing algorithm (both in terms of accuracy and speed). There are many classical problems that easily fall into the latter category. A simple example that immediately springs to mind is root finding for systems of polynomial equations. The literature in this area is classical and vast, and yet one can safely say that there exists no reliable practical algorithm that can be used in a black-box manner. So, whether the problem is specified by data or by mathematical formulas, one thing in common is that the human algorithm designer has reason to believe that the problem is solvable and even has ideas on how to solve it. We refer to these ideas as heuristics and will assume that they are presented as algorithms (or programs) that work reasonably well.
Our contention is that in many cases these heuristic algorithms can be viewed as special cases of a very large family of algorithms that can be parameterized by many real numbers. The initial heuristic itself can be viewed as a particular choice of these numerical parameters.
For instance, when evaluating the similarity of two feature vectors $x, y \in \mathbb{R}^n$, it is natural to compare their distance in some norm. Usually, in the absence of other information, the algorithm designer is likely to pick a familiar norm like the Euclidean distance:
$$d(x, y) = \sqrt{(x - y)^T (x - y)}.$$
Being aware that this might not be the best choice, designers usually generalize to the Mahalanobis distance instead:
$$d_M(x, y) = \sqrt{(x - y)^T M (x - y)},$$
where $M$ is a symmetric positive semidefinite matrix. Notice that the original distance measure is recovered for the choice $M = I$. We first observe that this is not the only such generalization. For example, one might embed this computation into an even larger computation graph, as:
where the original distance could be recovered for some suitable choice of (e.g. average of the trace) and .
In particular, note that the original heuristic distance is recovered in every case by carefully selecting the numbers in a larger matrix. This corresponds to the insertion of additional edges with trivial weights into the first computational graph. One can generalize this observation in another direction too. For example, by picking the weight matrix to be a Toeplitz matrix we get a convolutional layer, and if we pick it to be a Toeplitz-block-Toeplitz matrix we get a 2D convolutional layer. We can also choose it to be the product of Toeplitz matrices, in which case we would get several convolutional layers, and so on. Of course, choosing it to be a fully dense matrix would give us a fully connected layer at that stage of the computational graph.
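The link between Toeplitz matrices and convolutional layers can be checked on a toy example. The sketch below is illustrative only: the helper names (`toeplitz_from_kernel`, `sliding_window`) and the sample kernel and signal are our own assumptions, and "convolution" is taken in the deep-learning sense of an unflipped sliding window over the "valid" region.

```python
def toeplitz_from_kernel(kernel, n):
    """Build the (n - k + 1) x n Toeplitz matrix whose rows are shifted copies of `kernel`."""
    k = len(kernel)
    rows = []
    for i in range(n - k + 1):
        rows.append([0.0] * i + list(kernel) + [0.0] * (n - k - i))
    return rows


def matvec(matrix, vector):
    """Plain matrix-vector product on lists of floats."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sliding_window(kernel, signal):
    """'Valid' sliding-window correlation, as computed by a 1D convolutional layer."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


kernel = [1.0, -2.0, 1.0]                       # hypothetical second-difference filter
signal = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]       # samples of t**2
T = toeplitz_from_kernel(kernel, len(signal))
toeplitz_out = matvec(T, signal)
conv_out = sliding_window(kernel, signal)
print(toeplitz_out, conv_out)
```

Both computations give the same vector, so constraining a dense weight matrix to Toeplitz form is one way to read a convolutional layer as a special case of a fully connected one.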
We call this general process of adding more weights “tensorization”.
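As a minimal sketch of tensorization, the snippet below embeds the Euclidean distance into a matrix-weighted distance and checks that the identity matrix recovers the original heuristic. The function names and sample vectors are our own assumptions, chosen only for illustration.

```python
import math


def weighted_distance(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a square matrix M given as a list of rows."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * mi for di, mi in zip(d, Md)))


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Trivial weights: the enlarged computation reproduces the original heuristic.
M0 = identity(3)
baseline = weighted_distance(x, y, M0)

# Any other (positive semidefinite) choice is a different member of the family;
# training would move M away from the identity.
M1 = [row[:] for row in M0]
M1[0][0] = 4.0
tuned = weighted_distance(x, y, M1)
print(euclid, baseline, tuned)
```

Starting training from `M0` means the network begins at exactly the performance of the hand-chosen metric.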
Heuristics as trainable networks
We observe that when these heuristic networks are parametrized by real numbers, they can be tuned or calibrated by special training algorithms. For example, if one wishes to use the currently popular deep neural network (DNN) training algorithms in TensorFlow or PyTorch, one could convert the heuristics (either by hand or with special-purpose compilers) into classical-looking networks to achieve a guaranteed baseline performance, and then improve further by training.
To convert the heuristic into a network suitable for TensorFlow/PyTorch, we note that any finite sequence of code that only utilizes floating-point arithmetic operators is easily encoded as a classical network by just writing out its data-flow (signal-flow) graph.
The first non-traditional construct would be if-else statements based on the truth values of numerical expressions involving inequality operators. In the network each of these truth values is encoded in a real weight as either a 1 or a 0. Then each piece of the heuristic guarded by if-else statements is split off into its own computational path in the network, and finally all the computational paths are added at the end of the if-else statement using the corresponding weights of the tests in the guards. We note that these can also be viewed as additional weights that will be tuned during the training phase.
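The if-else encoding described above can be sketched as follows. The toy heuristic, the function names, and the extra `scale_true` / `scale_false` weights are hypothetical; they only illustrate how both branches become parallel paths that are summed with 0/1 gate weights.

```python
def heuristic(x):
    """Original branching code."""
    if x > 0.0:
        return 2.0 * x
    else:
        return -0.5 * x


def gated_network(x, scale_true=1.0, scale_false=1.0):
    """Same computation as a signal-flow graph: both paths run, the gate selects."""
    gate = 1.0 if x > 0.0 else 0.0          # truth value encoded as a real weight
    path_true = 2.0 * x                     # body of the `if` branch
    path_false = -0.5 * x                   # body of the `else` branch
    # Paths are added using the gate weights; the scales are extra trainable
    # weights whose trivial value 1.0 recovers the heuristic exactly.
    return scale_true * gate * path_true + scale_false * (1.0 - gate) * path_false


samples = [-3.0, -1.0, 0.0, 0.5, 2.0]
outputs = [(heuristic(x), gated_network(x)) for x in samples]
print(outputs)
```

With the default scales the two functions agree on every input, which is the "known performance threshold" the text refers to.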
The second non-traditional construct would be a for loop. If the number of loop executions is data independent then the for loop can be unrolled into a long sequence of statements. The number of layers that this generates in the network will be proportional to the number of loop executions. We conjecture that this is one of the primary reasons for the current crop of deep networks in classification problems.
If the for loop has a data dependent number of loop executions there are two possible strategies to follow. The simplest strategy is to choose a sufficiently large number of loop executions and just unroll the loop as before. In this case, care has to be taken to let the variables that are updated in the loop settle to their correct values by inserting suitable if statements. The second strategy would be to keep the data-dependent number of loop executions and use a simple adaptation in the DNN training algorithm instead. This strategy requires more space to explain and will be presented elsewhere.
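The first strategy (unroll a generous fixed number of iterations and let the variables settle) might look like the sketch below. The example uses Newton's iteration for a square root as a stand-in heuristic; the function names, tolerance, and layer count are our own assumptions.

```python
import math


def sqrt_while(a, tol=1e-9):
    """Reference heuristic: Newton's iteration on x*x - a with a data-dependent loop."""
    x = a
    while abs(x * x - a) > tol:
        x = 0.5 * (x + a / x)
    return x


def sqrt_unrolled(a, num_layers=20, tol=1e-9):
    """Same heuristic with a fixed number of layers and a 0/1 'settled' guard."""
    x = a
    for _ in range(num_layers):
        settled = 1.0 if abs(x * x - a) <= tol else 0.0   # guard as a gate weight
        candidate = 0.5 * (x + a / x)
        # Once settled, the layer passes x through unchanged.
        x = settled * x + (1.0 - settled) * candidate
    return x


for a in (0.25, 2.0, 9.0, 100.0):
    print(a, sqrt_while(a), sqrt_unrolled(a))
```

The unrolled version has a fixed depth regardless of the input, so it can be written out as a static graph; the price is that `num_layers` must be large enough for the slowest input of interest.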
Now there are several questions that arise: Are there examples where this strategy actually shows improvement over the base heuristic? How does this strategy do in traditional image classification problems? When this strategy works, does it produce networks that do not look like current networks for the same problem? Is it possible to embed the heuristic network in a larger network that looks like the existing networks? Are good heuristics always deep or can they be shallow too?
In this paper we present some early numerical evidence that begins to answer some of these questions.
1.1 Related Work
It is worthwhile at this point to note that similar ideas have been expressed before outside the DNN literature. The classical ATLAS BLAS software, for example, would tune itself during installation by experimenting with a variety of integer parameters like block size. Similarly, the award-winning FFTW code would self-tune several discrete parameters that determined which flavor of FFT would be used at each stage of the recursion for different problem sizes. However, there was no emphasis on real parameters, and no systematic use of a descent-based learning algorithm or a large corpus of ground-truth data.
Within the DNN literature itself, one can see a growing trend of composing classical convolutional layers, pooling layers, and fully connected layers, with some sort of meta-heuristics being used to justify the design [13, 14, 15]. However, as far as we know, there has been no attempt to fully express how a heuristic leads precisely to a neural network (deep or otherwise).
2 Designs for Networks
We now present two heuristics and demonstrate the incorporation of additional weights in their signal-flow graphs. We show that with training, the performance of these networks exceeds the performance of the initial network, and can also match the performance of more-traditional neural network architectures. We note that for improvement over the initial heuristic, it is sufficient that the gradient of the residual or error function with respect to the new parameters is non-zero, which is likely to be the case, and shows that human designers are very unlikely to produce optimal heuristics for a particular dataset or problem.
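The sufficiency claim can be made explicit with a first-order expansion. The notation here is our own: $E$ denotes the residual or error function (assumed twice continuously differentiable), $\theta_0$ the initial weights that reproduce the heuristic, and $\eta > 0$ a step size.

```latex
E\bigl(\theta_0 - \eta \nabla E(\theta_0)\bigr)
  = E(\theta_0) - \eta \,\lVert \nabla E(\theta_0) \rVert_2^2 + O(\eta^2)
```

If $\nabla E(\theta_0) \neq 0$, the linear term is strictly negative and dominates for sufficiently small $\eta$, so a single small gradient step already improves on the heuristic's error.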
The Newton heuristic for solving systems of polynomial equations
Our first example is a root-finder for a system of polynomial equations in variables:
where is the unknown. It is well known that this is an extremely difficult problem in floating-point arithmetic. One approach is to convert it into a polynomial equation in a single variable, but the price to pay is an exponential growth in the degree of the polynomial and potentially in the size of the coefficients. Another popular approach is to use a continuation technique. However, the latter is notoriously difficult to implement well in floating-point arithmetic, and slow to boot. Therefore the most popular approach is to use a locally convergent method like Newton, possibly backed up with some kind of line search or back-stepping technique. (Footnote 1: However, it is difficult to get algorithms of this form to compute more than one root reliably, as they need some kind of deflation technique to mask already computed roots.) As one can see, all of these methods fall into the category of what we call heuristics.
If we represent the polynomial system as , then the simple Newton heuristic could be expressed as follows:
where is a step length to be chosen and is a number to be provided by the user.
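Since the displayed algorithm is not reproduced here, the following is a minimal sketch of a damped Newton heuristic for a 2D polynomial system, under our own assumptions: the example system, the names `alpha` (step length) and `num_iters` (user-supplied iteration count), and the explicit 2x2 solve are all illustrative choices.

```python
def F(v):
    """Hypothetical polynomial system: x^2 + y^2 = 4 and x*y = 1."""
    x, y = v
    return [x * x + y * y - 4.0, x * y - 1.0]


def J(v):
    """Jacobian of F with respect to (x, y)."""
    x, y = v
    return [[2.0 * x, 2.0 * y], [y, x]]


def solve2(A, b):
    """Solve the 2x2 linear system A s = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def newton(v0, alpha=1.0, num_iters=10):
    """Fixed-length Newton iteration: v <- v - alpha * J(v)^{-1} F(v)."""
    v = list(v0)
    for _ in range(num_iters):
        step = solve2(J(v), F(v))
        v = [v[0] - alpha * step[0], v[1] - alpha * step[1]]
    return v


root = newton([2.0, 0.5])
residual = max(abs(r) for r in F(root))
print(root, residual)
```

Because the iteration count is fixed in advance, this loop is data independent and unrolls directly into a network with one layer per iteration.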
First note that there are some obvious numerical parameters that have to be chosen, namely and . However, we note that we can easily make many more. For example we could make a matrix that depends on the iteration number: . Another possibility is to borrow from accelerated gradient methods and other higher order methods and look for an iteration of the form:
where . We can go one step further and allow multiple choices per step and choose the best one:
When we generalize so much it is good to observe that the original trusted Newton method can be recovered for special choices of the new weight matrices and . This is an important observation as when this heuristic is unrolled and trained via one’s favorite machine learning framework (Figure 1) we are assured of good starting weights with a known performance threshold.
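The recovery property can be illustrated in one dimension. In this sketch the per-iteration step weights, momentum weights, and the "pick the candidate with the smallest residual" rule are our own simplified reading of the generalization; the polynomial and all names are hypothetical.

```python
def p(x):
    return x ** 3 - 2.0 * x - 5.0


def dp(x):
    return 3.0 * x ** 2 - 2.0


def classic_newton(x0, num_iters):
    x = x0
    for _ in range(num_iters):
        x = x - p(x) / dp(x)
    return x


def generalized_newton(x0, step_weights, momentum_weights):
    """Each layer tries every (step, momentum) pair and keeps the best candidate."""
    x_prev, x = x0, x0
    for a_k, b_k in zip(step_weights, momentum_weights):
        candidates = [x - a * p(x) / dp(x) + b * (x - x_prev)
                      for a in a_k for b in b_k]
        x_prev, x = x, min(candidates, key=lambda c: abs(p(c)))
    return x


K = 8
# Trivial weights: step 1, no momentum, one choice per layer -> plain Newton.
trivial = generalized_newton(2.0, [[1.0]] * K, [[0.0]] * K)
reference = classic_newton(2.0, K)

# A wider family: several step lengths and momentum factors per layer.
wide = generalized_newton(2.0, [[0.5, 1.0, 1.5]] * K, [[0.0, 0.3]] * K)
print(reference, trivial, wide)
```

The trivially weighted network reproduces the classic iteration, so training the wider family starts from the trusted method rather than from random weights.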
Note also that rather than just setting we could also use a more complicated expression like
or some more intelligent class of heuristics based on root localization theorems .
A simple heuristic for image classification
We now present a simple heuristic for the classical MNIST and Fashion MNIST data classification problems.
The MNIST dataset consists of example images of handwritten digits from 0 to 9. The goal is to come up with an algorithm that is capable of recognizing similar types of handwritten digits. As can be seen, there is no precise mathematical formulation of the problem other than what is specifiable via the ground-truth data in the collection. Nevertheless as humans we have some pre-conceived notions of how handwritten digits could be recognized.
For example, we use the common heuristic:
1. Choose a representative number of samples from the training data for each digit.
2. Given a new query image, compare the query to the set of representative samples using some appropriate metric.
3. Based on the distance to the representative samples, make a decision on which digit the query corresponds to.
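The three steps above can be sketched as a nearest-representative classifier. The tiny four-pixel "images", the labels, and the squared Euclidean metric are placeholders of our own; the text goes on to replace the metric with a more tolerant one.

```python
def squared_distance(a, b):
    """Placeholder metric: squared Euclidean distance between flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, representatives, metric=squared_distance):
    """Return the label of the representative closest to `query`.

    `representatives` is a list of (sample, label) pairs chosen from training data.
    """
    _, best_label = min(representatives, key=lambda pair: metric(query, pair[0]))
    return best_label


# Step 1: a few representative samples per class (toy data).
representatives = [
    ([0.0, 0.0, 1.0, 1.0], "A"),
    ([1.0, 1.0, 0.0, 0.0], "B"),
]

# Steps 2 and 3: compare the query to every representative and decide.
query = [0.1, 0.0, 0.9, 1.0]
predicted = classify(query, representatives)
print(predicted)
```

In this form the representatives and the metric are the natural places to introduce trainable weights.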
The performance of this heuristic depends crucially on the image metric that is used. It is well-known that classical norm-based distance functions tend to perform poorly and many alternatives have been proposed in the literature . Now we describe the heuristic we used for the image metric.
In MNIST the images are all of the same size and shape with the digit roughly in the middle. To compare two images we proceed as follows. Let and
where denotes rotation by an angle , denotes a linear operator, and denotes the standard -norm. Note that measures the distance between two images of the same size by considering the minimum over all rotations in the range from to . We are assuming here that the images and have -dimensional pixels.
where denotes translation by the vector . Let be defined such that
Let denote the sub-image of of size centered at with pixels outside the region set to 0. Then let
where is a weight function. We take as our measure of the distance between image and . In spite of its messy appearance, the image metric is quite simple: it computes the distance between the pixels in and by comparing patches of size , and when it compares patches it allows a little bit of translation and rotation, picking the closest match in each case. Then it looks at the induced optical flow and further penalizes those distances where the Laplacian of the flow is large. We believe that this type of heuristic is not uncommon in the literature.
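A heavily simplified sketch of the "allow a little translation and pick the closest match" idea is given below. It works on whole images rather than patches and omits rotation, the weight function, and the optical-flow penalty; the function names and the `max_shift` parameter are our own assumptions.

```python
def shift(image, dy, dx):
    """Translate a 2D image (list of rows) by (dy, dx), filling with zeros."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r - dy, c - dx
            if 0 <= rr < h and 0 <= cc < w:
                out[r][c] = image[rr][cc]
    return out


def frob_sq(a, b):
    """Squared Frobenius norm of the difference of two images."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def shift_tolerant_distance(a, b, max_shift=1):
    """Minimum squared distance over all translations of `a` within max_shift pixels."""
    shifts = range(-max_shift, max_shift + 1)
    return min(frob_sq(shift(a, dy, dx), b) for dy in shifts for dx in shifts)


a = [[0.0] * 4 for _ in range(4)]
b = [[0.0] * 4 for _ in range(4)]
a[1][1] = 1.0   # a single bright pixel
b[1][2] = 1.0   # the same pixel moved one column to the right

plain = frob_sq(a, b)
tolerant = shift_tolerant_distance(a, b)
print(plain, tolerant)
```

The plain norm treats the two images as maximally different for their content, while the shift-tolerant version recognizes them as the same pattern.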
This heuristic can now be unrolled into a network and more parameters introduced as needed.
In this section, we provide some initial numerical evidence to substantiate our network design methodology on the previously mentioned problems of polynomial root finding and image classification. In particular, we demonstrate how simple heuristics for these problems can be discretized and compiled (currently, manually) into neural network algorithms, and trained for better results using one’s favorite machine learning framework. For more detail, we refer readers to the Supplementary Information.
DeepNewtonNet for 1D and 2D polynomial systems
The -iterations of the previously described Newton heuristic can be unrolled and written as a simple -layer network, by computing at each layer:
where represents the coefficients of a polynomial system in some basis (e.g. monomial), represents the evaluation of each of the polynomials in this system at the points , and represents the generalized inverse of the Jacobian of with respect to . As mentioned, the parameters of this algorithm are the tensorized weights (e.g. matrices) representing possible step-lengths, hysteresis, and momentum factors at each iteration; these possibilities are generated in the network by taking the Cartesian product of the sets , , and . Notice here that , , and are themselves polynomial networks, composed of purely additions and multiplications (e.g. computed via Horner’s rule, or similar network).
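For a single polynomial, the claim that evaluation and differentiation need only additions and multiplications can be sketched with Horner's rule. The coefficient ordering (highest degree first) and the function names are our own conventions; the final division in `newton_layer` stands in for the generalized inverse of the Jacobian.

```python
def horner(coeffs, x):
    """Evaluate p(x) and p'(x) using only adds and multiplies.

    `coeffs` lists the monomial coefficients from highest degree to constant term.
    """
    value, deriv = 0.0, 0.0
    for c in coeffs:
        deriv = deriv * x + value
        value = value * x + c
    return value, deriv


def newton_layer(coeffs, x, step=1.0):
    """One unrolled layer: x <- x - step * p(x) / p'(x)."""
    value, deriv = horner(coeffs, x)
    return x - step * value / deriv


coeffs = [1.0, 0.0, -2.0, -5.0]          # x^3 - 2x - 5
value, deriv = horner(coeffs, 2.0)
x1 = newton_layer(coeffs, 2.0)
print(value, deriv, x1)
```

Stacking `newton_layer` a fixed number of times, each with its own `step`, gives a small instance of the layered network described above.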
For simplicity, in our initial tests we fix the degree and number of iterations , initialize parameters as , , , compute a baseline performance using these weights, and continue training on the residual error (i.e. without ground-truth data). A summary of our initial results on a small class of real 1D and 2D polynomial systems is displayed in Table 1, while the behavior of the method for an even smaller subclass is depicted in Fig. 2. Note that in these tests we intentionally picked , , and all weights as real numbers so the network would only produce real-valued root estimates (even though complex roots exist in the general case). (Footnote 2: We refer readers to the supplementary information for additional notes about these choices, network specifics, and the test metrics that were used in the evaluation.)
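Training on the residual alone can be illustrated on a deliberately easy case. This is a sketch under our own assumptions, not the experimental setup of the paper: one polynomial with a double root, three unrolled layers with one step length each, a squared-residual loss, and finite-difference gradients in place of a framework's automatic differentiation.

```python
def p(x):
    return (x - 1.0) ** 2        # double root at x = 1, where plain Newton is slow


def dp(x):
    return 2.0 * (x - 1.0)


def unrolled_newton(x0, steps):
    """One layer per entry of `steps`; each layer is a damped Newton update."""
    x = x0
    for a in steps:
        x = x - a * p(x) / dp(x)
    return x


def residual_loss(steps, x0=3.0):
    """Squared residual at the network output; no true root is needed."""
    return p(unrolled_newton(x0, steps)) ** 2


def finite_difference_grad(steps, h=1e-6):
    grad = []
    for i in range(len(steps)):
        up = steps[:i] + [steps[i] + h] + steps[i + 1:]
        down = steps[:i] + [steps[i] - h] + steps[i + 1:]
        grad.append((residual_loss(up) - residual_loss(down)) / (2.0 * h))
    return grad


steps = [1.0, 1.0, 1.0]                      # trivial weights: classic Newton
baseline_loss = residual_loss(steps)

learning_rate = 5.0
for _ in range(100):                         # plain gradient descent
    grad = finite_difference_grad(steps)
    steps = [a - learning_rate * g for a, g in zip(steps, grad)]

trained_loss = residual_loss(steps)
print(baseline_loss, trained_loss, steps)
```

The step lengths drift above 1, which is the expected direction for a double root, and the residual after three layers drops below the baseline.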
ClusterNet for Image Classification
Similarly, one possible discretization of the previously described heuristic for the classification of RGB-K images can be realized by computing:
where is the query image, and are free parameters initialized as selected cluster centers (e.g. random samples from each class) and their label vectors, represents a free weight mask that is initialized conformally to , represents point-wise multiplication (Schur-product), represents a translation-shift operator (with fill-value 0), represents convolution over the “valid” region, represents a discrete Laplace operator (e.g. computed using simple 1st-order finite-differences), represents a free weight on the magnitude of the Laplacian, and we have used the squared Frobenius norm inside the network as a natural replacement for the 2-norm distance for operation on images. Finally, we note the use of a softmax-layer to produce vector , which is used to compute a weighted-average of the cluster labels after multiplication with (i.e. a fully-connected layer with linear activation). can be initialized trivially as identity, or with the eigenvectors of the generalized Gramian matrix of the cluster centers; this latter initialization corresponds to an interpretation of the algorithm as a version of spectral clustering, or non-linear PCA, with respect to the heuristic distance function defined by [20, 21, 22].
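The final stage described above (a softmax over the cluster distances followed by a weighted average of the label vectors) can be sketched as follows. The `sharpness` parameter and function names are our own assumptions, and the intermediate linear layer is left at its trivial identity initialization, so it is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_vote(distances, label_vectors, sharpness=1.0):
    """Weighted average of cluster labels; closer clusters get larger weights."""
    weights = softmax([-sharpness * d for d in distances])
    num_classes = len(label_vectors[0])
    return [sum(w * label[c] for w, label in zip(weights, label_vectors))
            for c in range(num_classes)]


# Three cluster centers: one from class 0, two from class 1 (one-hot labels).
label_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
distances = [0.1, 5.0, 4.0]      # hypothetical heuristic distances to the query

scores = cluster_vote(distances, label_vectors)
predicted_class = max(range(len(scores)), key=lambda c: scores[c])
print(scores, predicted_class)
```

Because every operation here is differentiable, the cluster centers, label vectors, and any weights inside the distance can all be adjusted by gradient-based training.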
This algorithm’s corresponding signal-flow graph is shown below in Fig. 3.
We tried various choices for the structural parameters, , which control the width of the network in different ways (e.g. if loops are allowed in the network, then controls the number of internal loop-iterations), and evaluated the baseline classification performance of these methods on the aforementioned datasets using a sensible initialization for the remaining parameters. Once a reasonable baseline was achieved, we trained the network using stochastic gradient descent and observed strictly better performance. In general, we observed an increase in the baseline training and testing accuracy as the total number of cluster centers increased, but this improvement is somewhat incremental even with training. A thorough analysis of this result will be presented elsewhere, but we comment that this behavior is common in clustering algorithms, and could be alleviated by inclusion of a cluster-repulsion term in the objective as in .
The confusion matrices corresponding to ClusterNet performance on the MNIST and Fashion-MNIST testing data are reported in Tables 2-2 and 3-3, respectively. Each network was initialized with a reasonable number of cluster centers (10 and 25 per class, respectively), relative to the number of parameters used in other state-of-the-art solutions, and trained for sufficient time to show improvement ( epochs for MNIST, epochs for Fashion-MNIST). The point here is not to show that our networks are more “optimal” than conventional networks; rather, we demonstrate that well-loved heuristics can (in some cases) perform as well as DNNs when their graphs are sufficiently parametrized and, most importantly, are allowed to train with respect to a corpus of data.
As seen in Table 2, our heuristic performs quite well on the MNIST data ( test accuracy) initialized with just 10 examples from each class, and no other training. However, even with a modest amount of training (Table 2), we are able to match the performance of DNN approaches (). To achieve around the same level of initial accuracy in the Fashion-MNIST dataset, around 20-30 examples were needed from each class (Table 3); however, the testing accuracy in these cases would sometimes saturate after training for -epochs to values not substantially better than the result that was achieved after training with just 10 examples from each class (Table 3). Since the purpose of this paper is not to demonstrate an effective training strategy for the proposed style of algorithmic networks, but instead to show that these networks are viable in terms of capacity, we have only depicted the results for the 10-example-per-class case, which was achieved using only naive optimization strategies (e.g. stochastic gradient descent on a mean-square error residual). We believe further experimentation using special training algorithms and longer training periods can produce even better results.
We do not claim that the algorithms and networks presented in this paper are optimal for solving the presented problems. While in this case our networks achieved a desirable performance, it is possible that parametrizations based on human intuition of the problem can make the task of finding the optimal weights more challenging than is necessary, though this might be compensated by providing better initial weights . With that being said, we note that our DeepNewton and ClusterNet implementations did not require any sigmoidal or ReLU-like activation functions in the network, and instead rely on a purely “polynomial” implementation composed of adds and multiplies.
Moreover, by embedding the heuristic algorithmic networks into larger DNNs, we have a methodical strategy for picking initial weights with a known performance threshold. Once embedded into a larger network, the weights can be tuned using standard techniques, such as backpropagation. In a sense, this can be thought of as transfer learning, where the model weights and architecture are adopted from a heuristic that the algorithm designer believes would perform reasonably, even if it may not be optimal, for the given task.
In general, we believe that our design methodology will extend well to heuristics used in other computer vision tasks. For example, in video processing and understanding, it was common for motion features to be hand-crafted. When viewed as a heuristic, these features can be naturally initialized, generalized, and trained in an end-to-end algorithm. We believe this viewpoint will help practitioners reconsider leveraging algorithms (heuristics) that historically have shown promise but have been largely discarded in the wake of deep learning. This will also aid in the design of networks with understandable and predictable performance, and with better composability properties.
We note that we do not address the problem of algorithm (or network) synthesis or concerns related to differentiable interpreters.
- [1] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- [2] Hrushikesh N Mhaskar and Charles A Micchelli. Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics, 13(3):350–373, 1992.
- [3] Charles K Chui and Hrushikesh Narhar Mhaskar. Deep nets for local manifold learning. arXiv preprint arXiv:1607.07110, 2016.
- [4] DE Rumelhart, GE Hinton, and RJ Williams. Learning internal representations by error propagation. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, pages 318–362. MIT Press, 1986.
- [5] Graham W Taylor, Rob Fergus, Yann LeCun, and Christoph Bregler. Convolutional learning of spatio-temporal features. In European Conference on Computer Vision, pages 140–153. Springer, 2010.
- [6] Shin Suzuki. Constructive function-approximation by three-layer artificial neural networks. Neural Networks, 11(6):1049–1058, 1998.
- [7] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
- [8] Simon Haykin. A comprehensive foundation. Neural Networks, 2(2004):41, 2004.
- [9] Hyeji Kim, Yihan Jiang, Ranvir Rana, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Communication algorithms via deep learning. In The International Zurich Seminar on Information and Communication (IZS 2018) Proceedings, pages 48–50. ETH Zurich, 2018.
- [10] Alexis Mignon and Frédéric Jurie. PCCA: A new approach for distance learning from sparse pairwise constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2666–2672. IEEE, 2012.
- [11] Jim Demmel, Jack Dongarra, Victor Eijkhout, Erika Fuentes, Antoine Petitet, Rich Vuduc, R Clint Whaley, and Katherine Yelick. Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, 93(2):293–312, 2005.
- [12] Matteo Frigo and Steven G Johnson. FFTW: An adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 3, pages 1381–1384. IEEE, 1998.
- [13] Kristen Grauman, Fei Sha, and Sung Ju Hwang. Learning a tree of metrics with disjoint visual features. In Advances in Neural Information Processing Systems, pages 621–629, 2011.
- [14] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3279–3286. IEEE, 2015.
- [15] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
- [16] Eugene L Allgower and Kurt Georg. Numerical continuation methods: an introduction, volume 13. Springer Science & Business Media, 2012.
- [17] Dario Andrea Bini. Numerical computation of polynomial zeros by means of Aberth’s method. Numerical Algorithms, 13(2):179–200, 1996.
- [18] Patrice Simard, Yann LeCun, and John S Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems, pages 50–58, 1993.
- [19] Chyuan-Huei Thomas Yang, Shang-Hong Lai, and Long-Wen Chang. Hybrid image matching combining Hausdorff distance with normalized gradient matching. Pattern Recognition, 40(4):1173–1181, 2007.
- [20] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
- [21] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
- [22] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
- [23] S Chandrasekaran and A Rajagopal. Fast indefinite multi-point (IMP) clustering. Calcolo, 54(1):401–421, 2017.
- [24] Waseem Rawat and Zenghui Wang. Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9):2352–2449, 2017.
- [25] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.