Training of photonic neural networks through in situ backpropagation
Abstract
Recently, integrated optics has gained interest as a hardware platform for implementing machine learning algorithms. Of particular interest are artificial neural networks, since matrixvector multiplications, which are used heavily in artificial neural networks, can be done efficiently in photonic circuits. The training of an artificial neural network is a crucial step in its application. However, currently on the integrated photonics platform there is no efficient protocol for the training of these networks. In this work, we introduce a method that enables highly efficient, in situ training of a photonic neural network. We use adjoint variable methods to derive the photonic analogue of the backpropagation algorithm, which is the standard method for computing gradients of conventional neural networks. We further show how these gradients may be obtained exactly by performing intensity measurements within the device. As an application, we demonstrate the training of a numerically simulated photonic artificial neural network. Beyond the training of photonic machine learning implementations, our method may also be of broad interest to experimental sensitivity analysis of photonic systems and the optimization of reconfigurable optics platforms.
I Introduction
Artificial neural networks (ANNs), and machine learning in general, are becoming ubiquitous for an impressively large number of applications LeCun et al. (2015). This has brought ANNs into the focus of research in not only computer science, but also electrical engineering, with hardware specifically suited to perform neural network operations actively being developed. There are significant efforts in constructing artificial neural network architectures using various electronic solidstate platforms Merolla et al. (2014); Prezioso et al. (2015), but ever since the conception of ANNs, a hardware implementation using optical signals has also been considered AbuMostafa and Pslatis (1987); Jutamulia (1996). In this domain, some of the recent work has been devoted to photonic spike processing Rosenbluth et al. (2009); Tait et al. (2014) and photonic reservoir computing Brunner et al. (2013); Vandoorne et al. (2014a), as well as to devising universal, chipintegrated photonic platforms that can implement any arbitrary ANN Shainline et al. (2017); Shen et al. (2017). Photonic implementations benefit from the fact that, due to the noninteracting nature of photons, linear operations – like the repeated matrix multiplications found in every neural network algorithm – can be performed in parallel, and at a lower energy cost, when using light as opposed to electrons.
A key requirement for the utility of any ANN platform is the ability to train the network using algorithms such as error backpropagation Rumelhart et al. (1986). Such training typically demands significant computational time and resources and it is generally desirable for error backpropagation to implemented on the same platform. This is indeed possible for the technologies of Refs. Merolla et al. (2014); Graves et al. (2016); Hermans et al. (2015) and has also been demonstrated e.g. in memristive devices Alibart et al. (2013); Prezioso et al. (2015). In optics, as early as three decades ago, an adaptive platform that could approximately implement the backpropagation algorithm experimentally was proposed Wagner and Psaltis (1987); Psaltis et al. (1988). However, this algorithm requires a number of complex optical operations that are difficult to implement, particularly in integrated optics platforms. Thus, the current implementation of a photonic neural network using integrated optics has been trained using a model of the system simulated on a regular computer Shen et al. (2017). This is inefficient for two reasons. First, this strategy depends entirely on the accuracy of the model representation of the physical system. Second, unless one is interested in deploying a large number of identical, fixed copies of the ANN, any advantage in speed or energy associated with using the photonic circuit is lost if the training must be done on a regular computer. Alternatively, training using a brute force, in situ computation of the gradient of the objective function has been proposed Shen et al. (2017). However, this strategy involves sequentially perturbing each individual parameter of the circuit, which is highly inefficient for large systems.
In this work, we propose a procedure, which we label the timereversal interference method (TRIM), to compute the gradient of the cost function of a photonic ANN by use of only in situ intensity measurements. Our procedure works by physically implementing the adjoint variable method (AVM), a technique that has typically been implemented computationally in the optimization and inverse design of photonic structures Georgieva et al. (2002); Veronis et al. (2004); Hughes et al. (2017). Furthermore, the method scales in constant time with respect to the number of parameters, which allows for backpropagation to be efficiently implemented in a hybrid optoelectronic network. Although we focus our discussion on a particular hardware implementation of a photonic ANN, our conclusions are derived starting from Maxwellâs equations, and may therefore be extended to other photonic platforms.
The paper is organized as follows: In Section II, we introduce the working principles of a photonic ANN based on the hardware platform introduced in Ref. Shen et al. (2017). We also derive the mathematics of the forward and backward propagation steps and show that the gradient computation needed for training can be expressed as a modal overlap. Then, in Section III we discuss how the adjoint method may be used to describe the gradient of the ANN cost function in terms of physical parameters. In Section IV, we describe our procedure for determining this gradient information experimentally using in situ intensity measurements. We give a numerical validation of these findings in Section V and demonstrate our method by training a model of a photonic ANN in Section VI. We provide final comments and conclude in Section VII.
Ii The Photonic Neural Network
In this Section, we introduce the operation and gradient computation of a feedforward photonic ANN. In its most general case, a feedforward ANN maps an input vector to an output vector via an alternating sequence of linear operations and elementwise nonlinear functions of the vectors, also called ‘activations’. A cost function, , is defined over the outputs of the ANN and the matrix elements involved in the linear operations are tuned to minimize over a number of training examples via gradientbased optimization. The ‘backpropagation algorithm’ is typically used to compute these gradients analytically by sequentially utilizing the chain rule from the output layer backwards to the input layer.
Here, we will outline these steps mathematically for a single training example, with the procedure diagrammed in Fig. 1a. We focus our discussion on the photonic hardware platform presented in Shen et al. (2017), which performs the linear operations using optical interference units (OIUs). The OIU is a mesh of controllable MachZehnder interferometers (MZIs) integrated in a silicon photonic circuit. By tuning the phase shifters integrated in the MZIs, any unitary operation on the input can be implemented Reck et al. (1994); Clements et al. (2016), which finds applications both in classical and quantum photonics Carolan et al. (2015); Harris et al. (2017). In the photonic ANN implementation from Ref. Shen et al. (2017), an OIU is used for each linear matrixvector multiplication, whereas the nonlinear activations are performed using an electronic circuit, which involves measuring the optical state before activation, performing the nonlinear activation function on an electronic circuit such as a digital computer, and preparing the resulting optical state to be injected to the next stage of the ANN.
We first introduce the notation used to describe the OIU, which consists of a number, , of singlemode waveguide input ports coupled to the same number of singlemode output ports through a linear and lossless device. In principle, the device may also be extended to operate on a different number of inputs and outputs. We further assume directional propagation such that all power flows exclusively from the input ports to the output ports, which is a typical assumption for the devices of Refs. Miller (2013a); Shen et al. (2017); Harris et al. (2017); Carolan et al. (2015); Reck et al. (1994); Miller (2013b); Clements et al. (2016). In its most general form, the device implements the linear operation
(1) 
where and are the modal amplitudes at the input and output ports, respectively, and , which we will refer to as the transfer matrix, is the offdiagonal block of the system’s full scattering matrix,
(2) 
Here, the diagonal blocks are zero because we assume forwardonly propagation, while the offdiagonal blocks are the transpose of each other because we assume a reciprocal system. and correspond to the input and output modal amplitudes, respectively, if we were to run this device in reverse, i.e. sending a signal in from the output ports.
Now we may use this notation to describe the forward and backward propagation steps in a photonic ANN. In the forward propagation step, we start with an initial input to the system, , and perform a linear operation on this input using an OIU represented by the matrix . This is followed by the application of a elementwise nonlinear activation, , on the outputs, giving the input to the next layer. This process repeats for the each layer until the output layer, . Written compactly, for
(3) 
Finally, our cost function is an explicit function of the outputs from the last layer, . This process is shown in Fig. 1(a).
To train the network, we must minimize this cost function with respect to the linear operators, , which may be adjusted by tuning the integrated phase shifters within the OIUs. While a number of recent papers have clarified how an individual OIU can be tuned by sequential, in situ methods to perform an arbitrary, predefined operation Miller (2013b, a, 2015); Annoni et al. (2017), these strategies do not straightforwardly apply to the training of ANNs, where nonlinearities and several layers of computation are present. In particular, the training of ANN requires gradient information which is not provided directly in the methods of Ref. Miller (2013b, a, 2015); Annoni et al. (2017).
In Ref. Shen et al. (2017), the training of the ANN was done ex situ on a computer model of the system, which was used to find the optimal weight matrices for a given cost function. Then, the final weights were recreated in the physical device, using an idealized model that relates the matrix elements to the phase shifters. Ref. Shen et al. (2017) also discusses a possible in situ method for computing the gradient of the ANN cost function through a serial perturbation of every individual phase shifter (‘brute force’ gradient computation). However, this gradient computation has an unavoidable linear scaling with the number of parameters of the system. The training method that we propose here operates without resorting to an external model of the system, while allowing for the tuning of each parameter to be done in parallel, therefore scaling significantly better with respect to the number of parameters when compared to the brute force gradient computation.
To introduce our training method we first use the backpropagation algorithm to derive an expression for the gradient of the cost function with respect to the permittivities of the phase shifters in the OIUs. In the following, we denote as the permittivity of a single, arbitrarily chosen phase shifter in layer , as the same derivation holds for each of the phase shifters present in that layer. Note that has an explicit dependence on , but all field components in the subsequent layers also depend implicitly on .
As a demonstration, we take a mean squared cost function
(4) 
where is a complexvalued target vector corresponding to the desired output of our system given input .
Starting from the last layer in the circuit, the derivative of the cost function with respect to the permittivity of one of the phase shifters in the last layer is given by
(5)  
(6)  
(7) 
where is elementwise vector multiplication, defined such that, for vectors and , the th element of the vector is given by . gives the real part, is the derivative of the th layer activation function with respect to its (complex) argument. We define the vector in terms of the error vector .
For any layer , we may use the chain rule to perform a recursive calculation of the gradients
(8)  
(9)  
(10) 
Figure 1(b) diagrams this process, which computes the vectors sequentially from the output layer to the input layer. A treatment for nonholomorphic activations is derived Appendix A.
We note that the computation of requires performing the operation , which corresponds physically to sending into the output end of the OIU in layer . In this way, our procedure ‘backpropagates’ the vectors and physically through the entire circuit.
Iii Gradient computation using the adjoint variable method
In the previous Section, we showed that the crucial step in training the ANN is computing gradient terms of the form , which contain derivatives with respect to the permittivity of the phase shifters in the OIUs. In this Section, we show how this gradient may be expressed as the solution to an electromagnetic adjoint problem.
The OIU used to implement the matrix , relating the complex mode amplitudes of input and output ports, can be described using firstprinciples electrodynamics. This will allow us to compute its gradient with respect to each , as these are the physically adjustable parameters in the system. Assuming a source at frequency , at steady state Maxwell’s equations take the form
(11) 
which can be written more succinctly as
(12) 
Here, describes the spatial distribution of the relative permittivity (), is the freespace wavenumber, is the electric field distribution, is the electric current density, and due to Lorentz reciprocity. Eq. (12) is the starting point of the finitedifference frequencydomain (FDFD) simulation technique Shin and Fan (2012), where it is discretized on a spatial grid, and the electric field is solved given a particular permittivity distribution, , and source, .
To relate this formulation to the transfer matrix , we now define source terms , , that correspond to a source placed in one of the input or output ports. Here we assume a total of input and output waveguides. The spatial distribution of the source term, , matches the mode of the th singlemode waveguide. Thus, the electric field amplitude in port is given by , and we may establish a relationship between and , as
(13) 
for over the input port indices, where is the th component of . Or more compactly,
(14) 
Similarly, we can define
(15) 
for over the output port indices, or,
(16) 
and, with this notation, Eq. (1) becomes
(17) 
We now use the above definitions to evaluate the cost function gradient in Eq. (10). In particular, with Eqs. (10) and (17), we arrive at
(18) 
Here is the modal source profile that creates the input field amplitudes at the input ports.
The key insight of the adjoint variable method is that we may interpret this expression as an operation involving the field solutions of two electromagnetic simulations, which we refer to as the ‘original’ (og) and the ‘adjoint’ (aj)
(19)  
(20) 
where we have made use of the symmetric property of .
Eq. (18) can now be expressed in a compact form as
(21) 
If we assume that this phase shifter spans a set of points, in our system, then, from Eq. (11), we obtain
(22) 
where is the Kronecker delta.
Inserting this into Eq. (21), we thus find that the gradient is given by the overlap of the two fields over the phaseshifter positions
(23) 
This result now allows for the computation in parallel of the gradient of the loss function with respect to all phase shifters in the system, given knowledge of the original and adjoint fields.
Iv Experimental measurement of gradient
We now introduce our timereversal interference method (TRIM) for computing the gradient from the previous section through in situ intensity measurements. This represents the most significant result of this paper. Specifically, we wish to generate an intensity pattern with the form , matching that of Eq. (23). We note that interfering and directly in the system results in the intensity pattern:
(24) 
the last term of which matches Eq. (23). Thus, the gradient can be computed purely through intensity measurements if the field can be generated in the OIU.
The adjoint field for our problem, , as defined in Eq. (20), is sourced by , meaning that it physically corresponds to a mode sent into the system from the output ports. As complex conjugation in the frequency domain corresponds to timereversal of the fields, we expect to be sent in from the input ports. Formally, to generate , we wish to find a set of input source amplitudes, , such that the output port source amplitudes, , are equal to the complex conjugate of the adjoint amplitudes, or . Using the unitarity property of transfer matrix for a lossless system, along with the fact that for output modes, the input mode amplitudes for the timereversed adjoint can be computed as
(25) 
As discussed earlier, is the transfer matrix from output ports to input ports. Thus, we can experimentally determine by sending into the device output ports, measuring the output at the input ports, and taking the complex conjugate of the result.
We now summarize the procedure for experimentally measuring the gradient of an OIU layer in the ANN with respect to the permittivities of this layer’s integrated phase shifters:

Send in the original field amplitudes and measure and store the intensities at each phase shifter.

Send into the output ports and measure and store the intensities at each phase shifter.

Compute the timereversed adjoint input field amplitudes as in Eq. (25).

Interfere the original and the timereversed adjoint fields in the device, measuring again the resulting intensities at each phase shifter.

Subtract the constant intensity terms from steps 1 and 2 and multiply by to recover the gradient as in Eq. (23).
This procedure is also illustrated in Fig. 2.
V Numerical Gradient Demonstration
We numerically demonstrate this procedure in Fig. 3 with a series of FDFD simulations of an OIU implementing a unitary matrix Reck et al. (1994). These simulations are intended to represent the gradient computation corresponding to one OIU in a single layer, , of a neural network with input and delta vector . In these simulations, we use absorbing boundary conditions on the outer edges of the system to eliminate backreflections. The relative permittivity distribution is shown in Fig. 3(a) with the positions of the variable phase shifters in blue. For demonstration, we simulate a specific case where , with unit amplitude in the bottom port and we choose . In Fig. 3(b), we display the real part of , corresponding to the original, forward field.
The real part of the adjoint field, , corresponding to the cost function is shown in Fig. 3(c). In Fig. 3(d) we show the real part of the timereversed copy of as computed by the method described in the previous section, in which is sent in through the input ports. There is excellent agreement, up to a constant, between the complex conjugate of the field pattern of (c) and the field pattern of (d), as expected.
In Fig. 3(e), we display the gradient of the objective function with respect to the permittivity of each point of space in the system, as computed with the adjoint method, described in Eq. (23). In Fig. 3(f), we show the same gradient information, but instead computed with the method described in the previous section. Namely, we interfere the field pattern from panel (b) with the field pattern from panel (d), subtract constant intensity terms, and multiply by the appropriate constants. Again, (b) and (d) agree with good precision.
We note that in a realistic system, the gradient must be constant for any stretch of waveguide between waveguide couplers because the interfering fields are at the same frequency and are traveling in the same direction. Thus, there should be no distance dependence in the corresponding intensity distribution. This is largely observed in our simulation, although small fluctuations are visible because of the proximity of the waveguides and the sharp bends, which were needed to make the structure compact enough for simulation within a reasonable time. In practice, the importance of this constant intensity is that it can be detected after each phase shifter, instead of inside of it.
Finally, we note that this numerically generated system experiences a total power loss of 41% due to scattering caused by very sharp bends and staircasing of the structure in the simulation. We also observe approximately 510% modedependent loss, as determined by measuring the difference in total transmitted power corresponding to injection at different input ports. Minimal amounts of reflection are also visible in the field plots. Nevertheless, TRIM still reconstructs the adjoint sensitivity with very good fidelity.
Vi Example of ANN training
Finally, we use the techniques from the previous Sections to numerically demonstrate the training of a photonic ANN to implement a logical XOR gate, defined by the following input to target () pairs
(26) 
This problem was chosen as demonstration of learning a nonlinear mapping from input to output Vandoorne et al. (2014b) and is simple enough to be solved with a small network with only four training examples.
As diagrammed in Fig. 6a, we choose a network architecture consisting of two unitary OIUs. On the forward propagation step, the binary representation of the inputs, , is sent into the first two input elements of the ANN and a constant value of is sent into the third input element, which serves to introduce artificial bias terms into the network. These inputs are sent through a unitary OIU and then the elementwise activation is applied. The output of this step is sent to another OIU and sent through another activation of the same form. Finally, the first output element is taken to be our prediction, , ignoring the last two output elements. Our network is repeatedly trained on the four training examples defined in Eq. (26) and using the meansquared cost function presented in Eq. (4).
For this demonstration, we utilized a matrix model of the system, as described in Reck et al. (1994); Clements et al. (2016), with mathematical details described in Appendix B. This model allows us to compute an output of the system given an input mode and the settings of each phase shifter. Although this is not a firstprinciple electromagnetic simulation of the system, it provides information about the complex fields at specific reference points within the circuit, which enables us to implement training using the backpropagation method as described in Section II, combined with the adjoint gradient calculation from Section III. Using these methods, at each iteration of training we compute the gradient of our cost function with respect to the phases of each of the integrated phase shifters, and sum them over the four training examples. Then, we perform a simple steepestdescent update to the phase shifters, in accordance with the gradient information. This is consistent with the standard training protocol for an ANN implemented on a conventional computer. Our network successfully learned the XOR gate in around 400 iterations. The results of the training are shown in Fig. 6bd.
We note that this is meant to serve as a simple demonstration of using the insitu backpropagation technique for computing the gradients needed to train photonic ANNs. However, our method may equally be performed on more complicated tasks, which we show in the Appendix C.
Vii Discussion and Conclusion
Here, we justify some of the assumptions made in this work. Our strategy for training a photonic ANN relies on the ability to create arbitrary complex inputs. We note that a device for accomplishing this has been proposed and discussed in Miller (2017). Our recovery technique further requires an integrated intensity detection scheme to occur in parallel and with virtually no loss. This may be implemented by integrated, transparent photodetectors, which have already been demonstrated in similar systems Annoni et al. (2017). Furthermore, as discussed, this measurement may occur in the waveguide regions directly after the phase shifters, which eliminates the need for phase shifter and photodetector components at the same location. Finally, in our procedure for experimentally measuring the gradient information, we suggested running isolated forward and adjoint steps, storing the intensities at each phase shifter for each step, and then subtracting this information from the final interference intensity. Alternatively, one may bypass the need to store these constant intensities by introducing a lowfrequency modulation on top of one of the two interfering fields in Fig. 2(c), such that the product term of Eq. (24) can be directly measured from the lowfrequency signal. A similar technique was used in Annoni et al. (2017).
We now discuss some of the limitations of our method. In the derivation, we had assumed the operator to be unitary, which corresponds to a lossless OIU. In fact, we note that our procedure is exact in the limit of a lossless, feedforward, and reciprocal system. However, with the addition of any amount of uniform loss, is still unitary up to a constant, and our procedure may still be performed with the added step of scaling the measured gradients depending on this loss (see a related discussion in Ref. Miller (2017)). Uniform loss conditions are satisfied in the OIUs experimentally demonstrated in Refs. Shen et al. (2017); Miller (2013b). Modedependent loss, such as asymmetry in the MZI mesh layout or fabrication errors, should be avoided as its presence limits the ability to accurately reconstruct the timereversed adjoint field. Nevertheless, our simulation in Fig. 3 indicates that an accurate gradient can be obtained even in the presence of significant modedependent loss. In the experimental structures of Refs. Shen et al. (2017); Miller (2013b), the modedependent loss is made much lower due to the choice of the MZI mesh. Thus we expect our protocol to work in practical systems. Our method, in principle, computes gradients in parallel and scales in constant time. In practice, to get this scaling would require careful design of the circuits controlling the OIUs.
Conveniently, since our method does not directly assume any specific model for the linear operations, it may gracefully handle imperfections in the OIUs, such as deviations from perfect 5050 splits in the MZIs. Lastly, while we chose to make an explicit distinction between the input ports and the output ports, i.e. we assume no backscattering in the system, this requirement is not strictly necessary. Our formalism can be extended to the full scattering matrix. However, this would require special treatment for subtracting the backscattering.
The problem of overfitting is one that must be addressed by ‘regularization’ in any practical realization of a neural network. Photonic ANNs of this class provide a convenient approach to regularization based on ‘dropout’ Srivastava et al. (2014). In the dropout procedure, certain nodes are probabilistically and temporarily ‘deletedâ from the network during train time, which has the effect of forcing the network to find alternative paths to solve the problem at hand. This has a strong regularization effect and has become popular in conventional ANNs. Dropout may be implemented simply in the photonic ANN by ‘shutting offâ channels in the activation functions during training. Specifically, at each time step and for each layer and element , one may set with some fixed probability.
In conclusion, we have demonstrated a method for performing backpropagation in an ANN based on a photonic circuit. This method works by physically propagating the adjoint field and interfering its timereversed copy with the original field. The gradient information can then be directly measured out as an insitu intensity measurement. While we chose to demonstrate this procedure in the context of ANNs, it is broadly applicable to any reconfigurable photonic system. One could imagine this setup being used to tune phased arrays Sun et al. (2013), optical delivery systems for dielectric laser accelerators Hughes et al. (2018), or other systems that rely on large meshes of integrated optical phase shifters. Furthermore, it may be applied to sensitivity analysis of photonic devices, enabling spatial sensitivity information to be measured as an intensity in the device.
Our work should enhance the appeal of photonic circuits in deep learning applications, allowing for training to happen directly inside the device in an efficient and scalable manner. Furthermore, this method is broadly applicable to integrated and adaptive optical systems, enabling the possibility for automatic selfconfiguration and optimization without resorting to brute force gradient computation or modelbased methods, which often do not perfectly represent the physical system.
Funding Information
Gordon and Betty Moore Foundation (GBMF4744); Schweizerischer Nationalfonds zur FÃ¶rderung der Wissenschaftlichen Forschung (P300P2_177721); Air Force Office of Scientific Research (FA95501710002).
References
 LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
 Merolla et al. (2014) P. A. Merolla, J. V. Arthur, R. AlvarezIcaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, Science 345, 668 (2014), http://science.sciencemag.org/content/345/6197/668.full.pdf .
 Prezioso et al. (2015) M. Prezioso, F. MerrikhBayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, Nature 521, 61 (2015), arXiv:1412.0611 .
 AbuMostafa and Pslatis (1987) Y. S. AbuMostafa and D. Pslatis, Scientific American 256, 88 (1987).
 Jutamulia (1996) S. Jutamulia, Science 28 (1996).
 Rosenbluth et al. (2009) D. Rosenbluth, K. Kravtsov, M. P. Fok, and P. R. Prucnal, Optics Express 17, 22767 (2009).
 Tait et al. (2014) A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, Journal of Lightwave Technology 32, 3427 (2014).
 Brunner et al. (2013) D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer, Nature Communications 4, 1364 (2013).
 Vandoorne et al. (2014a) K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, Nature Communications 5, 1 (2014a).
 Shainline et al. (2017) J. M. Shainline, S. M. Buckley, R. P. Mirin, and S. W. Nam, Physical Review Applied 7, 1 (2017), arXiv:1610.00053 .
 Shen et al. (2017) Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. BaehrJones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljacic, Nature Photonics 11, 441 (2017), arXiv:1610.02365 .
 Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing, edited by D. E. Rumelhart and R. J. McClelland, Vol. 1 (MIT Press, 1986) Chap. 8.
 Graves et al. (2016) A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. GrabskaBarwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis, Nature 538, 471 (2016), arXiv:arXiv:1410.5401v2 .
 Hermans et al. (2015) M. Hermans, M. Burm, T. Van Vaerenbergh, J. Dambre, and P. Bienstman, Nature Communications 6, 6729 (2015).
 Alibart et al. (2013) F. Alibart, E. Zamanidoost, and D. B. Strukov, Nature Communications 4, 2072 (2013).
 Wagner and Psaltis (1987) K. Wagner and D. Psaltis, Applied Optics 26, 5061 (1987).
 Psaltis et al. (1988) D. Psaltis, D. Brady, and K. Wagner, Applied Optics 27, 1752 (1988).
 Georgieva et al. (2002) N. Georgieva, S. Glavic, M. Bakr, and J. Bandler, IEEE Transactions on Microwave Theory and Techniques 50, 2751 (2002).
 Veronis et al. (2004) G. Veronis, R. W. Dutton, and S. Fan, Optics Letters 29, 2288 (2004).
 Hughes et al. (2017) T. Hughes, G. Veronis, K. P. Wootton, R. J. England, and S. Fan, Optics Express 25, 15414 (2017).
 Reck et al. (1994) M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, Physical Review Letters 73, 58 (1994), arXiv:9612010 [quantph] .
 Clements et al. (2016) W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walsmley, Optica 3, 1460 (2016).
 Carolan et al. (2015) J. Carolan, C. Harrold, C. Sparrow, E. MartínLópez, N. J. Russell, J. W. Silverstone, P. J. Shadbolt, N. Matsuda, M. Oguma, M. Itoh, G. D. Marshall, M. G. Thompson, J. C. Matthews, T. Hashimoto, J. L. O’Brien, and A. Laing, Science 349, 711 (2015), arXiv:1505.01182 .
 Harris et al. (2017) N. C. Harris, G. R. Steinbrecher, M. Prabhu, Y. Lahini, J. Mower, D. Bunandar, C. Chen, F. N. C. Wong, T. BaehrJones, M. Hochberg, S. Lloyd, and D. Englund, Nature Photonics 11, 447 (2017).
 Miller (2013a) D. A. B. Miller, Optics Express 21, 6360 (2013a), arXiv:1302.1593 .
 Miller (2013b) D. A. B. Miller, Photonics Research 1, 1 (2013b), arXiv:1303.4602 .
 Miller (2015) D. A. B. Miller, Optica 2, 747 (2015).
 Annoni et al. (2017) A. Annoni, E. Guglielmi, M. Carminati, G. Ferrari, M. Sampietro, D. A. Miller, A. Melloni, and F. Morichetti, Light: Science & Applications 6, e17110 (2017).
 Shin and Fan (2012) W. Shin and S. Fan, Journal of Computational Physics 231, 3406 (2012).
 Vandoorne et al. (2014b) K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, Nature Communications 5, 3541 (2014b).
 Miller (2017) D. A. B. Miller, Opt. Express 25, 29233 (2017).
 Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Journal of Machine Learning Research 15, 1929 (2014).
 Sun et al. (2013) J. Sun, E. Timurdogan, A. Yaacobi, E. S. Hosseini, and M. R. Watts, Nature 493, 195 (2013).
 Hughes et al. (2018) T. W. Hughes, S. Tan, Z. Zhao, N. V. Sapra, Y. J. Lee, K. J. Leedle, H. Deng, Y. Miao, D. S. Black, M. Qi, O. Solgaard, J. S. Harris, J. Vuckovic, R. L. Byer, and S. Fan, Physical Review Applied (Accepted 2018).
Appendix A Nonholomorphic Backpropagation
In the previous derivation, we have assumed that the functions are holomorphic. For each element of input , labeled , this means that the derivative of with respect to its complex argument is well defined, or the derivative
(27) 
does not depend on the direction that approaches in the complex plane.
Here we show how to extend the backpropagation derivation to nonholomorphic activation functions. We first examine the starting point of the backpropagation algorithm, considering the change in the meansquared loss function with respect to the permittivity of a phase shifter in the last layer OIU, as written in Eq. (7) of the main text as
(28) 
Where we had defined the error vector for simplicity and is the output of the final layer.
To evaluate this expression for nonholomorphic activation functions, we split and its argument into their real and imaginary parts
(29) 
where is the imaginary unit and and are the real and imaginary parts of , respectively.
We now wish to evaluate , which gives the following via the chain rule
(30) 
where we have dropped the layer index for simplicity. Here, terms of the form correspond to elementwise differentiation of the vector with respect to the vector . For example, the th element of the vector is given by .
Now, inserting into Eq. (28), we have
(31)  
(32) 
We now define real and imaginary parts of as and , respectively. Inserting the definitions of and in terms of and and doing some algebra, we recover
(33)  
(34)  
(35)  
(36) 
Finally, the expression simplifies to
(37)  
(38) 
As a check, if we insert the conditions for to be holomorphic, namely
(39) 
Eq. (36) simplifies to
(40)  
(41)  
(42)  
(43)  
(44) 
as before.
This derivation may be similarly extended to any layer in the network. For holomorphic activation functions, whereas we originally defined the vectors as
(45) 
for nonholomorphic activation functions, the respective definition is
(46) 
where and are the respective real and imaginary parts of , and are the real and imaginary parts of , and and are the real and imaginary parts of , respectively.
We can write this more simply as
(47) 
In polar coordinates where and , this equation becomes
(48) 
where all operations are elementwise.
Appendix B Photonic neural network simulation
In Sections 4 and 5 of the main text, we have shown, starting from Maxwell’s equations, how the gradient information defined for an arbitrary problem can be obtained through electric field intensity measurements. However, since the full electromagnetic problem is too large to solve repeatedly, for the purposes of demonstration of a functioning neural network, in Section 6 we use the analytic, matrix representation of a mesh of MZIs as described in Ref. Clements et al. (2016). Namely, for an even , the matrix of the OIU is parametrized as the product of unitary matrices:
(49) 
where each implements a number of twobytwo unitary operations corresponding to a given MZI, and is a diagonal matrix corresponding to an arbitrary phase delay added to each port. This is schematically illustrated in Fig. 5(a) for . For the ANN training, we need to compute terms of the form
(50) 
for an arbitrary phase and vectors and defined following the steps in the main text. Because of the feedforward nature of the OIUs, the matrix can also be split as
(51) 
where is a diagonal matrix which applies a phase shift in port (the other elements are independent of ), while and are the parts that precede and follow the phase shifter, respectively (Fig. 5(b)). Thus, Eq. (50) becomes
(52)  
where is the th element of the vector , and denotes the imaginary part. This result can be written more intuitively in a notation similar to the main text. Namely, if is the field amplitude generated by input from the right, measured right after the phase shifter corresponding to , while is the field amplitude generated by input from the right, measured at the same point, then
(53) 
By recording the amplitudes in all ports during the forward and the backward propagation, we can thus compute in parallel the gradient with respect to every phase shifter. Notice that, within this computational model, we do not need to go through the full procedure outlined in Section 4 of the main text. However, this procedure is crucial for the in situ measurement of the gradients, and works even in cases which cannot be correctly captured by the simplified matrix model used here.
Appendix C Training demonstration
In the main text we show how the insitu backpropagation method may be used to train a simple XOR network. Here we demonstrate training on a more complex problem. Specifically, we generate a set of one thousand training examples represented by input and target pairs. Here, where and are the independent inputs, which we constrain to be real for simplicity, and represents a mode added to the third port to make the norm of the same for each training example. In this case, we choose . Each training example has a corresponding label, which is encoded in the desired output, , as and for and respectively.
For a given and , we define and as the magnitude and phase of the vector in the 2Dplane, respectively. To generate the corresponding class label, we first generate a uniform random variable between 0 and 1, labeled , and then set if
(54) 
Otherwise, we set . For the demonstration, , , and . The underlying distribution thus resembles an oblong ring centered around , with added noise.
As diagrammed in Fig. 6(a), we use a network architecture consisting of six layers of unitary OIUs, with an elementwise activation after each unitary transformation except for the last in the series, which has an activation of . After the final activation, we apply an additional ‘softmax’ activation, which gives a normalized probability distribution corresponding to the predicted class of . Specifically, these are given by , where is the first/second element of the output vector of the last activation (the other two elements are ignored). The ANN prediction for the input is set as the larger one of these two outputs, while the total cost function is defined in the crossentropy form
(55) 
where is the cost function of the th example, the summation is over all training examples, and is the output from the target port, , as defined by the target output of the th example. We randomly split our generated examples into a training set containing 75% of the originally generated training examples, while the remaining 25% are used as a test set to evaluate the performance of our network on unseen examples.
As in the XOR demonstration, we utilized our matrix model of the system described in Section B. As in the main text, at each iteration of training we compute the gradient of the cost function with respect to the phases of each of the integrated phase shifters, and sum this over each of the training examples. For the backpropagation through the activation functions, since and are nonholomorphic, we use eq. 48 from Section A, to obtain
(56)  
(57) 
where is a vector containing the phases of and is given by the derivative of the crossentropy loss function for a single training example
(58) 
where is the Kronecker delta.
With this, we can now compute the gradient of the loss function of eq. 55 with respect to all trainable parameters, and perform a parallel, steepestdescent update to the phase shifters, in accordance with the gradient information. Our network successfully learned the this task in around 4000 iterations. The results of the training are shown in Fig. 6(b). We achieved a training and test accuracy of 91% on both the training and test sets, indicating that the network was not overfitting to the dataset. This can also be confirmed visually from Fig. 6(c). The lack of perfect predictions is likely due to the inclusion of noise.