On the Computational Power of Online Gradient Descent


Vaggos Chatziafratis, Tim Roughgarden, and Joshua R. Wang
Department of Computer Science, Stanford University
Abstract

We prove that the evolution of weight vectors in online gradient descent can encode arbitrary polynomial-space computations, even in the special case of soft-margin support vector machines. Our results imply that, under weak complexity-theoretic assumptions, it is impossible to reason efficiently about the fine-grained behavior of online gradient descent.

1 Introduction

In online convex optimization (OCO), an online algorithm picks a sequence of points $x_1, x_2, \ldots, x_T$ from a compact convex set $K \subseteq \mathbb{R}^d$ while an adversary chooses a sequence of convex cost functions $f_1, f_2, \ldots, f_T$ (from $K$ to $\mathbb{R}$). The online algorithm can choose $x_t$ based on the previously seen functions $f_1, \ldots, f_{t-1}$ but not the later ones; the adversary can choose $f_t$ based on $x_1, \ldots, x_t$. The algorithm incurs a cost of $f_t(x_t)$ at time $t$. Canonically, in a machine learning context, $K$ is the set of allowable weight vectors or hypotheses (e.g., vectors with bounded $\ell_2$-norm), and $f_t$ is induced by a data point $a_t$, a label $y_t \in \{-1, +1\}$, and a loss function $\ell$ (e.g., absolute, hinge, or squared loss) via $f_t(w) = \ell(w; a_t, y_t)$.

One of the most well-studied algorithms for OCO is online gradient descent (OGD), which always chooses the point $x_{t+1} = x_t - \eta \nabla f_t(x_t)$ (Zinkevich, 2003), projecting back to $K$ if necessary. This algorithm enjoys good guarantees for OCO problems, such as vanishing regret (see, e.g., Hazan (2016)).

Soft-margin support vector machines (SVMs) for binary classification (see Section 2), with their strongly convex loss functions, would seem to be a particularly benign setting in which to apply OGD. And yet:

  • OGD captures arbitrary polynomial-space computations,
    even in the case of soft-margin SVMs.

A bit more precisely: for every polynomial-space computation, there is a sequence of data points of polynomial bit complexity such that, if these data points are fed to OGD (specialized to soft-margin SVMs, with all-zero initial weights) in this order over and over again, the consequent sequence of weight vectors simulates the given computation. Figure 1 gives a cartoon view of what such a simulation looks like.[1]

[1] Our actual simulation in Section 3 and Section 4 is similar in spirit to, but more complicated than, the picture in Figure 1. For example, we use a constant number of OGD updates to simulate each circuit gate (not just one), and each weight can take on up to a polynomial number of different values.

Figure 1: Cartoon view of simulating a computation using a sequence of weight vectors. On the left, the evaluation of a Boolean circuit on a specific input (with “T” and “F” indicating which inputs and gates evaluate to true and false, respectively). On the right, a corresponding sequence of weight vectors (with updates triggered by a carefully chosen data set), with each vector evaluating one more gate of the circuit than the previous one. Weights of $+1$, $-1$, and $0$ indicate that an input has been assigned true, has been assigned false, or has not yet been assigned a value, respectively.

Our simulation implies that, under weak complexity-theoretic assumptions, it is impossible to reason efficiently about the fine-grained behavior of OGD. For example, the following problem is PSPACE-hard: given a sequence of data points, to be fed into OGD over and over again (in the same order), with all-zero initial weights, does any weight vector produced by OGD (with soft-margin SVM updates) have a positive first coordinate?[2]

[2] PSPACE is the set of decision problems decidable by a Turing machine that uses space at most polynomial in the input size, and it contains problems that are believed to be very hard (much harder than NP-complete problems). For example, the problem of deciding which player has a winning strategy in chess (for a suitable asymptotic generalization of chess) belongs to (and is complete for) PSPACE (Storer, 1983).

In the instances produced by our reduction, the optimal point in hindsight converges over time to a single point $w^*$ (the regularized ERM solution for the initial data set), and the well-known regret guarantees for OGD imply that its iterates grow close to $w^*$ (in objective function value and, by strong convexity, in distance as well). Viewed from afar, OGD is nearly converging; viewed up close, it exhibits astonishing complexity.

Our results have similar implications for a common-in-practice variant of stochastic gradient descent (SGD), where every epoch performs a single pass over the data points, in a fixed but arbitrary order. Our work implies that this variant of SGD can also simulate arbitrary computations (when the data points and their ordering can be chosen adversarially).

1.1 Related Work

There are a number of excellent sources for further background on OCO, OGD, and SVMs; see e.g. Hazan (2016); Shalev-Shwartz & Ben-David (2014). We use only classical concepts from complexity theory, covered e.g. in Sipser (2006).

This paper is most closely related to a line of work showing that certain widely used algorithms inadvertently solve much harder problems than what they were originally designed for. For example, Disser and Skutella (2015) showed how to efficiently embed an instance of the (NP-complete) Partition problem into a linear program so that the trajectory of the simplex method immediately reveals the answer to the instance. This line of work was expanded on by Adler et al. (2014) and Fearnley and Savani (2015), who exhibited pivot rules with which the simplex method can solve PSPACE-complete problems. Roughgarden and Wang (2016) proved an analogous PSPACE-completeness result for Lloyd’s $k$-means algorithm. Similar PSPACE-completeness results were proved for computing the final outcome of local search (Johnson et al., 1988), other path-following-type algorithms (Goldberg et al., 2013), and certain dynamical systems (Papadimitriou & Vishnoi, 2016).

More distantly related are previous works that treat stochastic gradient descent as a dynamical system and then show that the system is complex in some sense. Examples include Van Den Doel & Ascher (2012), who provide empirical evidence of chaotic behavior, and Chaudhari & Soatto (2018), who show that, for DNN training, SGD can converge to stable limit cycles. We are not aware of any previous works that take a computational complexity-based approach to the problem.

2 Preliminaries

2.1 Soft-Margin SVMs

We consider the following special case of OCO. For some fixed regularization parameter $\lambda \geq 0$, every cost function will have the form

$$f_t(w) = \frac{\lambda}{2}\,\|w\|_2^2 + \mathrm{hinge}(w; a_t, y_t)$$

for some data point $a_t \in \mathbb{R}^d$ and label $y_t \in \{-1, +1\}$, where the hinge loss is defined as $\mathrm{hinge}(w; a, y) = \max\{0,\, 1 - y\langle w, a\rangle\}$.[3] In this case, the weight updates in OGD have a special form (where $\eta$ is the step size):

$$w_{t+1} = (1 - \eta\lambda)\, w_t + \begin{cases} \eta\, y_t\, a_t & \text{if } y_t \langle w_t, a_t\rangle < 1, \\ 0 & \text{otherwise.} \end{cases}$$

[3] For simplicity, we have omitted the bias term here; see also Section 5.1.
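To make the update rule concrete, here is a minimal Python sketch of a single OGD step under this loss. The helper name ogd_step and its defaults are our own illustration, not part of the paper's construction; later examples in this document reuse it.

```python
import numpy as np

def ogd_step(w, a, y, eta=1.0, lam=0.0):
    # One OGD update for f(w) = (lam/2)*||w||^2 + max(0, 1 - y*<w, a>).
    # The hinge term contributes the subgradient -y*a exactly when the
    # margin constraint y*<w, a> < 1 is violated at the current weights.
    hinge_active = y * np.dot(w, a) < 1
    w_next = (1 - eta * lam) * w          # shrinkage from the regularizer
    if hinge_active:
        w_next = w_next + eta * y * a     # subgradient step on the hinge term
    return w_next
```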

2.2 Complexity Theory Background

A decision problem $L$ is in the class PSPACE if and only if there exists a Turing machine $M$ and a polynomial $p$ such that, for every $n$-bit string $x$, $M$ correctly decides whether or not $x$ is in $L$ while using space at most $p(n)$.

PSPACE is obviously at least as big as P, the class of polynomial-time-decidable decision problems (it takes at least $s$ operations to use up $s$ tape cells). It also contains every problem in NP (just try all possible polynomial-length witnesses, reusing space for each computation), co-NP (for the same reason), the entire polynomial hierarchy, and more. A problem is PSPACE-hard if every problem in PSPACE polynomial-time reduces to it, and PSPACE-complete if in addition it belongs to PSPACE. While the current state of knowledge does not rule out P = PSPACE (which would be even more surprising than P = NP), the widespread belief is that PSPACE contains many problems that are intrinsically computationally difficult (like the aforementioned chess example). Thus a problem that is complete (or hard) for PSPACE would seem to be very hard indeed.

Our main reduction is from the Circuit-Path problem. In this problem, the input is (an encoding of) a Boolean circuit $C$ with $n$ inputs, $n$ outputs, and fan-in 2, together with a target $n$-bit string $z$. The goal is to decide whether or not the repeated application of $C$ to the all-false string ever produces the output $z$. This problem is PSPACE-complete (see Adler et al. (2014)), and in this sense every polynomial-space computation is just a thinly disguised instance of Circuit-Path.
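To illustrate the semantics of Circuit-Path (not our reduction), here is a hypothetical brute-force decider; it runs in exponential time, since after $2^n$ applications the deterministic iteration must have entered a cycle.

```python
def circuit_path(C, z, n):
    # C: a function mapping an n-bit tuple to an n-bit tuple (the circuit).
    # z: the target n-bit tuple. Decides whether iterating C from the
    # all-false string ever produces z, by checking the first 2**n iterates.
    x = (False,) * n
    for _ in range(2 ** n):
        x = C(x)
        if x == z:
            return True
    return False
```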

3 PSPACE-Hardness Reduction

In this section, we present our main reduction from the Circuit-Path problem. Our reduction uses several types of gadgets, which are organized into an API in Subsection 3.2.

The implementation of two gadgets is given in Section 4 and the remaining implementations can be found in Appendix A. After presenting the API, this section concludes by showing how the reduction can be performed using the API.

3.1 Simplifying Assumptions

For this section, we make a few simplifying assumptions to showcase the main technical ideas used in our proof. We later show how to remove these assumptions in Section 5. Our simplifying assumptions are:

  1. There is no bias term, i.e. $b$ is fixed to $0$.

  2. The learning rate is fixed to $\eta = 1$.

  3. The loss function is not regularized, i.e. $\lambda = 0$.

3.2 API for Reduction Gadgets

We use a number of gadgets to encode an instance of Circuit-Path into training examples for OGD. The high-level plan is to use the weights to encode Boolean values in our circuit. A weight of $+1$ will represent a true bit, while a weight of $-1$ will represent a false bit. Additionally, we use a weight of $0$ to represent a bit that we have not yet computed (which we refer to as “unset”). For example, our simplest gadget is reset(i), which takes the index $i$ of a weight that is set to either $+1$ or $-1$, and provides a sequence of training examples that causes that weight to update to $0$ (thus unsetting the bit).

It is well known that every Boolean circuit can be efficiently converted into a circuit that only has NAND gates (where the output is 0 if both inputs are 1, and 1 otherwise), and so we focus on such circuits. We would like a gadget that takes two true/false bits and an unset bit and writes the NAND of the first two into the third. Unfortunately, the nature of the weight updates makes it difficult to implement NAND directly. As a result, we instead use two smaller gadgets that can together be used to compute a NAND. The bulk of the work is done by destructive_nand(i, j, k), which writes the NAND of the bits at indices $i$ and $j$ into index $k$ but has the unfortunate side effect of unsetting the first two bits. As a result, we need a way to increase the number of copies we have of a Boolean value. The copy(i, j) gadget takes a true/false bit at index $i$ and an unset bit at index $j$ and writes the former into the latter. Taken together, we can compute NAND by copying our two bits of interest and then using the copies to compute the NAND.

Our next gadget allows the starting weights to be the all-zeroes vector. The input_false(i) gadget takes an index $i$ whose weight may correspond either to a true/false bit or to an unset bit. If the weight is already true/false, it does nothing. Otherwise, it takes the unset bit and writes false into it.

Finally, we have a simple gadget for the purpose of presenting a concrete PSPACE-hard decision problem about the OGD process. The question we aim for is: does any weight vector produced by OGD (with soft-margin SVM updates) have a positive first coordinate? Correspondingly, the set_if_true(i, j) gadget takes a true/false bit at index $i$ and a zero-weight coordinate $j$ (intended to be the first coordinate). If the first bit is true, this gadget gives the zero-weight coordinate a weight of $+1$. If the first bit is false, this gadget leaves the zero-weight coordinate completely untouched, even in intermediate steps between its training examples. This property is not present in the implementations of our other gadgets, so this will be the only gadget that we use to modify the first coordinate.

This API is formally specified in Table 1.

Function | Precondition(s) | Description
reset(i) | $w_i \in \{-1, +1\}$ | $w_i \to 0$
copy(i, j) | $w_i \in \{-1, +1\}$, $w_j = 0$ | $w_j \to w_i$ ($w_i$ preserved)
destructive_nand(i, j, k) | $w_i, w_j \in \{-1, +1\}$, $w_k = 0$ | $w_k \to \mathrm{NAND}(w_i, w_j)$; $w_i, w_j \to 0$
input_false(i) | $w_i \in \{-1, 0, +1\}$ | If $w_i = 0$, $w_i \to -1$; otherwise $w_i$ unchanged
set_if_true(i, j) | $w_i \in \{-1, +1\}$, $w_j = 0$ | If $w_i = +1$, $w_j \to +1$; if $w_i = -1$, $w_j$ remains at $0$ (including in intermediate steps)
Table 1: Public API
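The following Python stubs give one hypothetical rendering of this API; the signatures (including the dimension parameter d) are our own, and the bodies, which would return each gadget's training examples, are elided (see Section 4 and Appendix A).

```python
from typing import List, Tuple
import numpy as np

Example = Tuple[np.ndarray, int]     # (data point a, label y)

def reset(i: int, d: int) -> List[Example]:
    """w_i in {-1, +1}  ->  w_i = 0."""
    return []  # training examples elided; see Section 4.2

def copy(i: int, j: int, d: int) -> List[Example]:
    """w_i in {-1, +1}, w_j = 0  ->  w_j = w_i (w_i preserved)."""
    return []  # see Appendix A.1

def destructive_nand(i: int, j: int, k: int, d: int) -> List[Example]:
    """w_k = NAND(w_i, w_j); side effect: w_i and w_j are unset to 0."""
    return []  # see Appendix A.2

def input_false(i: int, d: int) -> List[Example]:
    """If w_i = 0, set w_i = -1; if w_i is already +1 or -1, leave it."""
    return []  # see Appendix A.3

def set_if_true(i: int, j: int, d: int) -> List[Example]:
    """If w_i = +1, set w_j = +1; if w_i = -1, w_j stays 0 throughout."""
    return []  # see Appendix A.4
```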

3.3 Performing the Reduction using the API

We now show how to use our API to transform an instance of the Circuit-Path problem into a set of training examples for a soft-margin SVM that is being optimized by OGD.

Theorem 3.1.

There is a reduction which, given a circuit $C$ and a target binary string $z$, produces a set of training examples for OGD (with soft-margin SVM updates) such that the repeated application of $C$ to the all-false string eventually produces the string $z$ if and only if OGD, beginning with the all-zeroes weight vector and repeatedly fed this set of training examples (in the same order), eventually produces a weight vector with a positive first coordinate.

Proof.

Our reduction begins by converting $C$ into a more complex circuit $C'$. First, we assume that $C$ has only NAND gates (see above). Next, we augment our circuit with an additional input/output bit, intended to track whether the current output is $z$. The circuit $C'$ ignores its additional input bit, and its additional output bit is true if the original output bits are $z$ and false otherwise. These transformations keep the size of $C'$ polynomial in the input/output size.

Let $n$ denote the input/output size of $C'$ and let $m$ denote the number of gates in $C'$. Our reduction produces training examples for an SVM with a $d$-dimensional weight vector, where $d = n + m + 3$. We denote the first three indices for this weight vector by $s$, $t_1$, and $t_2$: notably, $s$ denotes the first coordinate, whose weight should remain zero unless the input to the Circuit-Path problem should be accepted, while $t_1$ and $t_2$ serve as scratch coordinates. We denote the next $n$ indices by $u_1, \ldots, u_n$ and associate each with an input bit. We denote the last $m$ indices by $g_1, \ldots, g_m$ and associate them with the gates of $C'$, in some topological order.

We begin with an empty training set. Each time we call a function from our API (which can be found in Table 1), we append its training examples to the end of our training set. We now give the construction, and then finish the proof by proving the resulting set of training examples has the desired property. Our construction proceeds in five phases.

In the first phase of our reduction, we set the starting input for the Circuit-Path problem. We iterate in order through $i = 1, \ldots, n$. In iteration $i$, we call input_false($u_i$).

In the second phase of our reduction, we simulate the computation of the circuit $C'$. We iterate in order through $i = 1, \ldots, m$. In iteration $i$, we examine the NAND gate in $C'$ associated with $g_i$. Suppose its inputs are associated with indices $a$ and $b$. We call copy($a$, $t_1$), copy($b$, $t_2$), destructive_nand($t_1$, $t_2$, $g_i$), in that order.

In the third phase of our reduction, we check whether we have found $z$. Let the additional output bit of $C'$ be computed by the gate associated with index $g_o$. We call set_if_true($g_o$, $s$).

In the fourth phase of our reduction, we copy the output of the circuit back to the input. We iterate in order through $i = 1, \ldots, n$. In iteration $i$, let the $i$-th output bit of $C'$ correspond to the gate associated with index $g_{o_i}$. We call reset($u_i$) and copy($g_{o_i}$, $u_i$), in that order.

In the fifth phase of our reduction, we reset the circuit for the next round of computation. We iterate in order through $i = 1, \ldots, m$. In iteration $i$, we call reset($g_i$).

We now explain why the resulting training data has the desired property. Let's consider what OGD does in (i) the first pass over the training data and (ii) later passes over the data. We begin with case (i). Before the first phase of our reduction, all weights are zero, corresponding to unset bits. The first phase of our reduction hence sets the weights at indices $u_1, \ldots, u_n$ to correspond to an all-false input. The second phase of our reduction then computes the appropriate output for each gate and sets it. Note that it is important that we proceeded in topological order, so that the inputs of a NAND gate are set before we attempt to compute its output. The third phase of our reduction checks whether we have found $z$; if the weight at index $s$ gets set to a positive value, this implies that $C'$ immediately produced $z$ when applied to the all-false string. The fourth phase of our reduction unsets the weights at indices $u_1, \ldots, u_n$ and then copies the output of $C'$ into them. The fifth phase of our reduction then unsets the weights at indices $g_1, \ldots, g_m$.

If we are continuing after this first pass, then the weights at indices $t_1$, $t_2$, and $g_1, \ldots, g_m$ are unset while the weights at indices $u_1, \ldots, u_n$ are set to the next circuit input. We now analyze case (ii), assuming each pass also leaves the weights in this state. In the first phase of our reduction, nothing happens because the input is already set. The second through fifth phases of our reduction then proceed exactly as in case (i): computing the circuit based on this input, checking whether we found $z$, copying the output to the input, and resetting the circuit for another round of computation. As a result, we again arrive at a state where the weights at indices $t_1$, $t_2$, and $g_1, \ldots, g_m$ are unset while the weights at indices $u_1, \ldots, u_n$ are set to the next circuit input.

In other words, repeatedly passing over our training data causes OGD to simulate the repeated application of $C'$, as desired. By construction, our first coordinate has a positive weight if and only if our simulated computation manages to find $z$. This completes the proof. ∎
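For concreteness, here is a hypothetical sketch of the five phases, written against the Python stubs from Section 3.2; the helper name build_training_set, the arguments out_gates and flag_gate, and the exact index layout are our own illustration.

```python
def build_training_set(gates, out_gates, flag_gate, n, m):
    # gates: for each of the m NAND gates (in topological order), the pair
    #        of weight indices feeding it. out_gates: indices of the gates
    #        holding the n output bits of C'. flag_gate: index of the gate
    #        holding the extra "output equals z" bit.
    d = n + m + 3
    s, t1, t2 = 0, 1, 2                       # signal and scratch coordinates
    u = list(range(3, 3 + n))                 # input-bit coordinates
    T = []
    for i in range(n):                        # phase 1: all-false input
        T += input_false(u[i], d)
    for i, (p, q) in enumerate(gates):        # phase 2: evaluate the gates
        T += copy(p, t1, d) + copy(q, t2, d)
        T += destructive_nand(t1, t2, 3 + n + i, d)  # also unsets t1, t2
    T += set_if_true(flag_gate, s, d)         # phase 3: flag if output was z
    for i in range(n):                        # phase 4: output -> next input
        T += reset(u[i], d) + copy(out_gates[i], u[i], d)
    for i in range(m):                        # phase 5: unset the gates
        T += reset(3 + n + i, d)
    return T
```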

Remark 1.

Although our decision question about OGD asked whether the first coordinate ever became positive, our reduction technique is flexible enough to support many possible decision questions. For example, we might ask whether OGD, after a single complete pass over the training examples, winds up producing the same weight vector that it had produced immediately preceding that pass (since $C'$ may be rewired so that its only stationary point is $z$). As another example, with a simple modification of our set_if_true gadget to place a high value into the first coordinate, we could ask whether OGD ever produces a weight vector with norm above some threshold.

4 API Implementation

Now that we have described at a high level how to simulate the circuit computation using OGD updates, we proceed by giving the technical details of the implementation of each gadget operation on the circuit bits: reset, copy, destructive_nand, input_false, and set_if_true. Note that in all of our constructions the training examples required are extremely sparse, having only a constant number of non-zero coordinates.

Function | Precondition(s) | Description
add(i, c) | — | $w_i \to w_i + c$
Table 2: Private API

4.1 Implementation of add

Recall that for an index $i$ and a given constant $c$, add($i$, $c$) has the effect of adding $c$ to the weight coordinate indexed by $i$. This is the only operation in our private API (see Table 2) and it is used in the implementation of all other operations.

How can one add a constant to a weight coordinate $w_i$? Recall that the only operations performed during OGD are updates based on the gradient of the loss function, so we will leverage exactly this. For a datapoint $a$ with label $y$, the hinge loss function is $\max\{0,\, 1 - y\langle w, a\rangle\}$ and the update is:

$$w_{t+1} = \begin{cases} w_t + y\, a & \text{if } y\langle w_t, a\rangle < 1, \\ w_t & \text{otherwise} \end{cases}$$

(recall that $\eta = 1$ under our simplifying assumptions).

For our reduction, we have the freedom to select the training datapoint $a$ along with its label $y$. We pick the $i$-th coordinate of $a$ to be a small constant $\epsilon$ of the same sign as $c$, all other coordinates to be $0$, and the label to be $y = +1$. As long as $y\langle w, a\rangle = \epsilon\, w_i < 1$, the weight after the update is $w_i + \epsilon$. The result of the first update can also be seen in the first row of Table 3. Repeating the same training example a constant number of times ($|c|/\epsilon$ repetitions) results in the desired outcome of adding $c$ to the weight coordinate $w_i$.

Training example | Effect on $w_i$
$y = +1$; $a_i = \epsilon$, all other coordinates $0$ | $w_i \to w_i + \epsilon$ (as long as $\epsilon\, w_i < 1$)
Table 3: Training data for add.

Training example | Effect on $w_i$
$y = -1$; $a_i = 2$, all other coordinates $0$ | $+1 \to -1$; $-1 \to -1$
add($i$, $+1$) | $-1 \to 0$
Table 4: Training data for reset.
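Under the Section 3.1 assumptions ($\eta = 1$, $\lambda = 0$), the add gadget can be checked numerically with the hypothetical ogd_step helper from Section 2.1; here we take $\epsilon = 1/2$:

```python
# add(i, c) with c = 1: repeat the example (a_i = 1/2, y = +1) twice.
d, i = 4, 2
w = np.zeros(d)
a = np.zeros(d); a[i] = 0.5
for _ in range(2):
    # the margin check y*<w, a> = 0.5*w_i < 1 holds at both steps,
    # so each update adds 0.5 to w_i
    w = ogd_step(w, a, +1)
assert w[i] == 1.0
```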

4.2 Implementation of reset

The reset gadget (see Table 4) takes as input one index $i$ and resets the corresponding weight coordinate $w_i$ to 0, independent of what this coordinate used to be (either $+1$ or $-1$). If we knew the value of $w_i$, things would be straightforward given that we have implemented the add gadget: in the case where $w_i = +1$ we would just do add($i$, $-1$), and in the case $w_i = -1$ we would just do add($i$, $+1$).

Now that we don’t know , the idea is to first map it to and perform add to finish. What is an appropriate training example that will allow us to map and ? Let the training example be labeled with and have and 0 on the rest of its coordinates.

  • In the case of $w_i = -1$, we have $y\langle w, a\rangle = 2 \geq 1$ and so there is no update, since the gradient of the hinge loss is 0; hence $w_i$ remains $-1$.

  • If $w_i = +1$, we have $y\langle w, a\rangle = -2 < 1$, so after the update we get $w_i = w_i + y\, a_i = 1 - 2 = -1$, as desired.

The final step is to call add($i$, $+1$), which makes the weight vector's $i$-th coordinate 0.
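Again using the hypothetical ogd_step helper (with $\eta = 1$, $\lambda = 0$), we can verify that the reset gadget maps both $+1$ and $-1$ to $0$:

```python
d, i = 4, 2
for start in (+1.0, -1.0):
    w = np.zeros(d); w[i] = start
    a = np.zeros(d); a[i] = 2.0
    w = ogd_step(w, a, -1)      # maps +1 -> -1 (update); keeps -1 (no update)
    step = np.zeros(d); step[i] = 0.5
    for _ in range(2):          # add(i, +1), again via two eps = 1/2 updates
        w = ogd_step(w, step, +1)
    assert w[i] == 0.0
```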

The technical details on how to implement copy, destructive_nand, input_false, and set_if_true are provided in Appendix A.

5 Extensions

In this section, we give extensions to our proof techniques to remove the assumptions we made in Section 3.

5.1 Handling a Bias Term

In this subsection, we show how to remove assumption (i) and handle an SVM bias term. With the bias term added back in, the loss function is now:

$$f(w, b) = \max\{0,\, 1 - y(\langle w, a\rangle + b)\}$$

Using a standard trick, we can simulate this bias term by adding an extra dimension and insisting that the corresponding entry of $a$ is $1$ for every training point; the corresponding weight entry then plays the role of $b$. We now explain how to modify the reduction to respect this restriction for every training point.

The key insight is that if we can ensure that the value of this bias weight is $0$ immediately preceding every training example from the base construction, then the effective bias will remain $0$ and the base construction will proceed as before. The problem is that whenever a base construction training example is in the first case for the derivative (namely $y(\langle w, a\rangle + b) < 1$), this will result in an update to the bias weight. Since every base construction training example now has a bias entry of $1$, we know the first case causes the bias weight to be updated from $0$ to $y$. We need to insert an additional training example to correct it back to $0$. To complicate matters further, we sometimes don't know whether we are in the first or second case for the derivative, so we don't know whether the bias weight has remained at $0$ or has been altered to $y$. We need to provide a gadget such that in either case, the bias weight is corrected to $0$.

In order to avoid falling on the border of the hinge loss function (where $y(\langle w, a\rangle + b) = 1$), we will be using two mirrored bias terms. In other words, we add two extra dimensions, $b_1$ and $b_2$, and insist that $a_{b_1} = a_{b_2} = 1$ for every training point. We ensure that $w_{b_1} = w_{b_2} = 0$ before every base construction training example. Since they always have the same weight, the two coordinates always receive the same update, and the situation is now that either (i) they both remained at $0$ or (ii) they both were altered to $y$. We would like to correct them both back to $0$.

Table 5: Training data to correct the bias term.

The two training examples that implement this behavior can be found in Table 5. The first training example combines the cases by transforming case (i) into case (ii) while resulting in no update when in case (ii). The second training example then resets both values to $0$. To fix the base construction, we insert this gadget immediately after every base training example. As stated previously, this guarantees that $w_{b_1} = w_{b_2} = 0$ immediately before every base construction training example, which thus proceeds in the same fashion.

5.2 Handling a Fixed Learning Rate

In this subsection, we show how to remove our assumption that the learning rate $\eta = 1$. Suppose we have some other step size $\eta$, possibly a function of $T$, the total number of steps for which OGD is run. We perform our reduction from Circuit-Path as before, pretending that $\eta = 1$. This yields a value for $T$, which we can then use to determine $\eta$.

We then scale all training vectors $a$ (but not labels $y$) by $1/\sqrt{\eta}$. We claim that our analysis holds when the weight vectors are scaled by $\sqrt{\eta}$. To see why, we reconsider the updates performed by OGD. First, consider the gradient terms:

$$\nabla_w \max\{0,\, 1 - y\langle w, a\rangle\} = \begin{cases} -y\, a & \text{if } y\langle w, a\rangle < 1, \\ 0 & \text{otherwise.} \end{cases}$$

Notice that the scaling of $a$ (by $1/\sqrt{\eta}$) and the scaling of $w$ (by $\sqrt{\eta}$) cancel out when computing $y\langle w, a\rangle$, so we stay in the same case. Since $a$ was scaled by $1/\sqrt{\eta}$, our gradients scale by that amount as well. However, since the updates performed are $\eta$ times the new gradient, the net scaling of the updates to $w$ is by a factor of $\eta/\sqrt{\eta} = \sqrt{\eta}$. Since our analysis of $w$ is scaled up by exactly this amount as well, $w$ is updated as we previously reasoned.
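A quick numeric sanity check of this rescaling, using the hypothetical ogd_step helper from Section 2.1: running OGD with step size $\eta$ on examples scaled by $1/\sqrt{\eta}$ tracks the $\eta = 1$ trajectory scaled by $\sqrt{\eta}$.

```python
rng = np.random.default_rng(0)
eta, d, T = 0.3, 4, 50
examples = [(rng.standard_normal(d), rng.choice([-1, 1])) for _ in range(T)]

w1 = np.zeros(d)                           # eta = 1 on the original data
w2 = np.zeros(d)                           # step size eta on rescaled data
for a, y in examples:
    w1 = ogd_step(w1, a, y, eta=1.0)
    w2 = ogd_step(w2, a / np.sqrt(eta), y, eta=eta)
assert np.allclose(w2, np.sqrt(eta) * w1)  # same trajectory, scaled
```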

5.3 Handling a Regularizer

In this subsection, we discuss how to handle a regularization parameter $\lambda$ which is not too large. Consider the hinge loss objective with a regularizer:

$$f_t(w) = \frac{\lambda}{2}\,\|w\|_2^2 + \max\{0,\, 1 - y_t\langle w, a_t\rangle\}$$

Conceptually, the regularizer causes our weights to slowly decay over time. In particular, this new term in the gradient means that weights decay by a factor of $1 - \eta\lambda$ at each step. We assume that this decay rate is not too fast, in the sense that $\eta\lambda$ is bounded well away from $1$. Due to this decay, we will no longer be able to maintain the association that a true bit is $+1$, a false bit is $-1$, and an unset bit is $0$. Instead, for each weight index $i$ the reduction will need to maintain a counter $M_i$ which represents the current magnitude of any true/false bit being stored in that weight variable $w_i$. A true bit will be $+M_i$, a false bit will be $-M_i$, and an unset bit will still be $0$. After each training example it adds, the reduction should multiply each counter by $1 - \eta\lambda$.
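This bookkeeping can be sketched with the hypothetical ogd_step helper: a stored bit decays by a factor of $1 - \eta\lambda$ per training example, and the counter tracks it exactly, as long as each example leaves that coordinate alone.

```python
eta, lam, d = 1.0, 0.01, 5
w = np.zeros(d); w[3] = 1.0     # a "true" bit stored at index 3
M = 1.0                         # its tracked magnitude
for _ in range(10):
    a = np.zeros(d)             # any example with a_3 = 0 leaves the bit alone
    w = ogd_step(w, a, +1, eta=eta, lam=lam)
    M *= 1 - eta * lam          # the counter decays in lockstep
assert np.isclose(w[3], M)      # true bit is now +M; a false bit would be -M
```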

Correspondingly, our API will need to grow more complex as well. The new API, the modified reduction which uses it, and the formal implementation can all be found in Appendix B.

References

  • Adler, Ilan, Papadimitriou, Christos, & Rubinstein, Aviad. 2014. On simplex pivoting rules and complexity theory. Pages 13–24 of: International Conference on Integer Programming and Combinatorial Optimization. Springer.
  • Chaudhari, Pratik, & Soatto, Stefano. 2018. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In: International Conference on Learning Representations.
  • Disser, Yann, & Skutella, Martin. 2015. The simplex algorithm is NP-mighty. Pages 858–872 of: Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics.
  • Fearnley, John, & Savani, Rahul. 2015. The complexity of the simplex method. Pages 201–208 of: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM.
  • Goldberg, Paul W., Papadimitriou, Christos H., & Savani, Rahul. 2013. The complexity of the homotopy method, equilibrium selection, and Lemke-Howson solutions. ACM Transactions on Economics and Computation, 1(2), 9.
  • Hazan, Elad. 2016. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2(3-4), 157–325.
  • Johnson, David S., Papadimitriou, Christos H., & Yannakakis, Mihalis. 1988. How easy is local search? Journal of Computer and System Sciences, 37(1), 79–100.
  • Papadimitriou, Christos H., & Vishnoi, Nisheeth K. 2016. On the computational complexity of limit cycles in dynamical systems. Pages 403–403 of: ITCS '16: Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science. Association for Computing Machinery.
  • Roughgarden, Tim, & Wang, Joshua R. 2016. The complexity of the k-means method. In: LIPIcs-Leibniz International Proceedings in Informatics, vol. 57. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  • Shalev-Shwartz, Shai, & Ben-David, Shai. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • Sipser, Michael. 2006. Introduction to the Theory of Computation. Vol. 2. Thomson Course Technology.
  • Storer, James A. 1983. On the complexity of chess. Journal of Computer and System Sciences, 27(1), 77–100.
  • Van Den Doel, Kees, & Ascher, Uri. 2012. The chaotic nature of faster gradient descent methods. Journal of Scientific Computing, 51(3), 560–581.
  • Zinkevich, Martin. 2003. Online convex programming and generalized infinitesimal gradient ascent. Pages 928–936 of: Proceedings of the 20th International Conference on Machine Learning (ICML-03).

Appendix A API Implementation (Continued)

A.1 Implementation of copy

Suppose we want to copy the $i$-th coordinate of the weight vector to its $j$-th coordinate. How can we do that using only gradient updates? Again, the difficulty in performing the copy operation stems from the fact that we don't know the values of the coordinates; if we did know them (for example, if $w_i = +1$ and $w_j = 0$), then in order to perform copy($i$, $j$) we could just do add($j$, $+1$).

The sequence of operations, together with the resulting weight vector after the gradient updates, is provided in Table 6. Observe that in the end, the value of the $j$-th coordinate of the weight vector is exactly the same as the $i$-th coordinate, and the operation copy($i$, $j$) is performed correctly.

We are going to use two training examples, interleaved with three add steps; their labels and non-zero coordinates are given in Table 6.

  • Let’s focus in the case where (upper half of every row in Table 6). Without loss of generality let since otherwise we can just perform reset using previously defined gadgets.

    The gradient update on the first example will not affect the weight vector, as the margin condition $y\langle w, a\rangle \geq 1$ holds. Then we perform an add step to move $w_j$ to an intermediate value. The gradient update on the second training example then does trigger an update. After the last two addition steps, we end up with the desired outcome.

  • The case $w_i = -1$ is similar to the previous case, and by tracking the gradient updates we end up with the desired outcome.

Table 6: Training data for copy.

A.2 Implementation of destructive_nand

We want to implement a NAND gate with inputs the coordinates $w_i$ and $w_j$, writing the result into $w_k$. Along the way we will zero out the input coordinates; that is why we call it a destructive NAND. The operations needed are provided in Table 7. Notice that we should end up with $w_k = +1$, except in the case $w_i = w_j = +1$, for which it should be $-1$.

We focus on the bottom line of each row in Table 7; the derivation for the other rows is completely similar. The weight vector has $w_i$ and $w_j$ set, and we want to output their NAND in the coordinate $w_k$. We will implement this in 5 steps. First, we use an add step to shift $w_k$ to a known starting value.

The next row is the only step that may be a bit unclear and requires some calculations, which we provide here. The training example has label $y$ and non-zero coordinates at $i$, $j$, and $k$; it is chosen so that the margin constraint is violated only when $w_i = w_j = +1$.

  • If the weights are $(w_i, w_j) = (-1, -1)$, then $y\langle w, a\rangle \geq 1$, so there is no update in this case.

  • If the weights are $(-1, +1)$, then again $y\langle w, a\rangle \geq 1$, so there is no update in this case.

  • If the weights are $(+1, -1)$, similarly there is no update.

  • If the weights are $(+1, +1)$, then $y\langle w, a\rangle < 1$, so the gradient in this case is $-y\,a$; every non-zero coordinate of the training example then induces an update to the corresponding weight coordinate, obtained by subtracting the gradient.

The remaining 3 steps are easily interpretable given our previous gadgets.

Table 7: Training data for destructive_nand.

A.3 Implementation of input_false

The effect of input_false($i$) is to map the $i$-th coordinate (which is either $+1$, $-1$, or $0$) to $-1$ if it is $0$; if it is already $\pm 1$, it should remain unchanged. The four add steps in Table 8 should be clear by now. Here we give the calculations of the gradients and updates for the three steps that contain training examples.

  • The first training example is chosen so that exactly one of the three possible values of $w_i$ violates the margin constraint $y\langle w, a\rangle < 1$; in that case the gradient step adds a constant to $w_i$, while in the other two cases there is no update.

  • The second training example behaves analogously, with a different value of $w_i$ triggering the update and the other two values leaving the weights unchanged.

  • Training on the final training example is similar to the first case above.

Table 8: Training data for input_false.

A.4 Implementation of set_if_true

This short gadget is given two coordinates $i$ and $j$, and it sets $w_j = +1$ only if $w_i = +1$; otherwise everything stays unchanged. We use it to detect whether, at any point in the circuit computation, the target binary string $z$ is reached, in which case a specially reserved bit of the weight vector (the first coordinate) is set to 1 to signal this fact.

We are going to use two training examples (and two add gadgets); the calculations explaining the derivations of Table 9 are given below:

  • The first training example updates the weight vector only when $w_i = +1$: in that case the gradient step adds constants to $w_i$ and $w_j$; when $w_i = -1$, the margin constraint is satisfied and there is no update.

  • The second training example likewise triggers an update in only one of the two resulting cases, adding constants to $w_i$ and $w_j$ as shown in Table 9; in the other case the margin constraint is satisfied and there is no update.

Table 9: Training data for set_if_true.

Appendix B Proof Extension for Regularization (Continued)

In this appendix, we give the new API, show how to modify the original reduction to use the augmented API, and then give an implementation of the API.

B.1 Augmented API for Regularization

Our augmented API is listed in Table 10. These five functions serve the same purposes as the functions of our original API (see Table 1), but they now accept additional parameters and return output so that our reduction can keep track of the magnitude of each weight.

reset, d_nand, input_false, and set_if_true have essentially the same behavior as before, but they now accept magnitude parameters and output the final magnitudes of the weights that they write to. A more drastic change was made to copy, which is replaced by copy2: the new gadget destroys the bit stored in its input weight. To compensate, it makes two copies, so that using it still increases the total number of copies of a Boolean value.

B.2 Reduction Modifications for Regularization

Our reduction still performs the same transformation of $C$ into $C'$. However, we will use an additional dimension (so now $d = n + m + 4$), which we denote with a new special index, $t_3$. As stated before, we keep a counter $M_i$ for each dimension $i$, decaying all counters by a factor of $1 - \eta\lambda$ after each training example we produce.

In most cases, the appropriate magnitude to pass to our gadgets is clear: we take the last magnitude we received from a gadget writing to this coordinate and decay it appropriately. There is one major exception: in the first phase of the reduction, we need to iterate over the input indices and call input_false. The correct input magnitude is actually based on the last time these weights were possibly edited, which is in the fourth phase of the reduction (during the previous pass over the data)! Luckily, in our implementation of this API the number of training examples needed to implement a gadget does not depend on the input magnitudes. As a result, we can either pick the appropriate values knowing the contents of all the phases, or we can run the reduction once with placeholder magnitudes and then perform a second pass once we know the total number of training examples and which training examples are associated with which API calls.

Other than managing these magnitudes, we also alter the second and fourth phases of our reduction to account for the revised copy function (this is why we need an additional dimension). In the new second phase of our reduction, we iterate over the gates in topological order. Again, we look at the associated NAND gate with its two inputs. We call:

  • copy2,

  • reset,

  • copy2,

  • copy2,

  • reset,

  • copy2, and

  • d_nand,

in that order, with the appropriate arguments and magnitudes.

Similarly, in the fourth phase of our reduction, we iterate over the input indices and call reset, copy2, copy2, reset, in that order, with the appropriate arguments and magnitudes.

The reason the reduction works is the same as before: the reduction forces the weights to simulate the computation of the circuit and a check for the target string $z$.