Matrix optimization on universal unitary photonic devices
Abstract
Universal unitary photonic devices can apply arbitrary unitary transformations to a vector of input modes and provide a promising hardware platform for fast and energyefficient machine learning using light. We simulate the gradientbased optimization of random unitary matrices on universal photonic devices composed of imperfect tunable interferometers. If device components are initialized uniformrandomly, the locallyinteracting nature of the mesh components biases the optimization search space towards banded unitary matrices, limiting convergence to random unitary matrices. We detail a procedure for initializing the device by sampling from the distribution of random unitary matrices and show that this greatly improves convergence speed. We also explore mesh architecture improvements such as adding extra tunable beamsplitters or permuting waveguide layers to further improve the training speed and scalability of these devices.
pacs:
85.40.Bh=.08em \lightrulewidth=.05em \cmidrulewidth=.03em \belowrulesep=.65ex \belowbottomsep=0pt \aboverulesep=.4ex \abovetopsep=0pt \cmidrulesep=\cmidrulekern=.5em \defaultaddspace=.5em
I Introduction
Universal multiport interferometers are optical networks that perform arbitrary unitary transformations on input vectors of coherent light modes. Such devices can be used in applications including quantum computing (e.g. boson sampling, photon walks) Gräfe et al. (2016); Carolan et al. (2015); Spring et al. (2013); Harris et al. (2017), mode unscramblers Annoni et al. (2017), photonic neural networks Shen et al. (2017); Hughes et al. (2018); Williamson et al. (2019), and finding optimal channels through lossy scatterers Miller (2013a). While universal photonic devices have been experimentally realized at a relatively small scale Shen et al. (2017); Annoni et al. (2017), commercial applications such as hardware for energyefficient machine learning and signal processing can benefit from scaling the devices to up to modes. At this scale, fabrication imperfections and components with scaledependent sensitivities can negatively affect performance.
One canonical universal photonic device is the rectangular multiport interferometer mesh Clements et al. (2016) shown in Figure 1 interfering modes. In multiport interferometers, an dimensional vector is represented by an array of modes arranged in singlemode waveguides. A unitary operation is applied to the input vector by tuning MachZehnder interferometers (MZIs) represented by the red dots of Figure 1. Each MZI is a twoport optical component made of two 50:50 beamsplitters and two tunable singlemode phase shifters. Other mesh architectures have been proposed, such as the triangular mesh Reck et al. (1994) (shown in Appendix C), the universal cascaded binary tree architecture Miller (2013b), and lattice architectures where light does not move in a forwardonly direction Perez et al. (2017); Pérez et al. (2018, 2017).
The scalability of optimizing mesh architectures, especially using gradientbased methods, is limited by the ability of the locally interacting architecture to control the output powers in the mesh. If phase shifts in the mesh are initialized uniformrandomly, light propagates through the device in a manner similar to a random walk. The offdiagonal, nonlocal elements of the implemented unitary matrix tend to be close to zero because transitions between inputs and outputs that are far apart have fewer paths (e.g., input and output in Figure 1 have a single path). The resulting mesh therefore implements a unitary matrix with a banded structure that is increasingly pronounced as the matrix size increases.
In many applications such as machine learning Shen et al. (2017) and quantum computing Russell et al. (2017); Carolan et al. (2015), we avoid this banded unitary matrix behavior in favor of random unitary matrices. A random unitary matrix is achieved when the device phase shifts follow a distribution derived from random matrix theory Hurwitz (1897); Zyczkowski and Kus (1994); Diaconis and Forrester (2017); Spengler et al. (2012); Russell et al. (2017). In the random matrix theory model, we assign a sensitivity index to each component that increases towards the center of the mesh, as shown in Figure 1. The more sensitive components toward the center of the mesh require higher transmissivities and tighter optimization tolerances. If the required tolerances are not met, the implemented unitary matrix begins to show the undesired banded behavior.
In Section II, we introduce the photonic mesh architecture and sources of error that can exacerbate the banded unitary matrix problem. In Section III, we explicitly model the component settings to implement a random unitary matrix and ultimately avoid the banded unitary matrix problem. We propose a “Haar initialization” procedure that allows light to propagate uniformly to all outputs from any input. We use this procedure to initialize the gradientbased optimization of a photonic mesh to learn unknown random unitary matrices given training data. We show that this optimization converges even in the presence of significant simulated fabrication errors.
In Sections IV and V, we propose and simulate two alterations to the mesh architecture that further improve gradientbased optimization performance. First, we add redundant MZIs in the mesh to reduce convergence error by up to five orders of magnitude. Second, we permute the mesh interactions while maintaining the same number of tunable components, which increases allowable tolerances of phase shifters, decreases offdiagonal errors, and improves convergence time.
Ii Photonic Mesh
We define the photonic mesh when operated perfectly and then discuss how beam splitter or phase shift errors can affect device performance.
ii.1 Photonic unitary implementation
A singlemode phase shifter can perform an arbitrary transformation on its input. A phasemodulated MachZehnder interferometer (MZI) with perfect () beamsplitters can apply to its inputs a unitary transformation of the form:
(1)  
where is the beamsplitter operator, are upper phase shift operators. Equation 1 is represented diagrammatically by the configuration in Figure 1.^{1}^{1}1Other configurations with two independent phase shifters between the beamsplitters are ultimately equivalent for photonic meshes Miller (2015). If one or two singlemode phase shifters are added at the inputs, we can apply an arbitrary or transformation to the inputs, respectively.
We define the transmissivity and reflectivity of the MZI as:
(2)  
In this convention, when , we have (the MZI “bar state”), and when , we have (the MZI “cross state”).
If there are input modes and the interferometer is connected to waveguides and then we can embed the unitary from Equation 1 in dimensional space with a locallyinteracting unitary “Givens rotation” defined as:
(3) 
All diagonal elements are 1 except those labeled and , which have magnitudes of , and all offdiagonal elements are 0 except those labeled and , which have magnitudes of .
Arbitrary unitary transformations can be implemented on a photonic chip using only locally interacting MZIs Reck et al. (1994). In this paper, we focus on optimizing a rectangular mesh Clements et al. (2016) of MZIs; however, our ideas can be extended to other universal schemes, such as the triangular mesh Miller (2013c), as well.
In the rectangular mesh scheme Clements et al. (2016) of Figure 1, we represent in terms of locally interacting Givens rotations and singlemode phase shifts at the inputs represented by diagonal unitary :
(4) 
where our layerwise product leftmultiplies from to 1,^{2}^{2}2In general, for matrix products for a sequence , we define the multiplication order . the singlemode phase shifts are , and where the Givens rotations are parameterized by .^{3}^{3}3Since are periodic phase parameters, they are in halfopen intervals . In contrast, any must be in a closed interval to achieve all transmissivities . We define the top indices of each interacting mode for each vertical layer as the set . This vertical layer definition follows the convention of Refs. Jing et al. (2017); Hughes et al. (2018) and is depicted in Figure 1, where represents the index of the vertical layer.
ii.2 Beamsplitter error tolerances
The expressions in Equations 1 and 4 assume perfect fabrication. In practice, however, we would like to simulate how practical devices with errors in each transfer matrix in Equation 1 impact optimization performance.
In fabricated chip technologies, imperfect beamsplitters can have a split ratio error that change the behavior of the red 50:50 coupling regions in Figure 1 or in Equation 1. The resultant scattering matrix with imperfect beamsplitters can be written as:
(5)  
As shown in Appendix B, if we assume both beamsplitters have identical , we find is the realistic transmissivity, is the realistic reflectivity, and are the ideal transmissivity and reflectivity defined in Equation 2.
The unitary matrices in Equation 5 cannot express the full transmissivity range of the MZI, with errors of up to in the transmissivity, potentially limiting the performance of greedy progressive photonic algorithms Burgwal et al. (2017); Miller (2017); Flamini et al. (2017). Our Haar phase theory, which we develop in the following section, determines acceptable interferometer tolerances for calibration of a “perfect mesh” consisting of imperfect beamsplitters Miller (2015) given large . We will additionally show that simulated photonic backpropagation Hughes et al. (2018) with adaptive learning can adjust to nearly match the performance of perfect meshes with errors as high as for meshes of size .
ii.3 Phase shift tolerances
Another source of uncertainty in photonic meshes is the phase shift tolerances of the mesh which affect the matrices of Equation 1, shown in orange in Figure 1. Error sources such as thermal crosstalk or environmental drift may result in slight deviance of phase shifts in the mesh from intended operation. Such errors primarily affect the control parameters that control light propagation in the mesh by affecting the MZI split ratios. This nontrivial problem warrants a discussion of mean behavior and sensitivities (i.e., the distribution) of needed to optimize a random unitary matrix.
Iii Haar Initialization
iii.1 Cross state bias and sensitivity index
The convergence of global optimization depends critically on the sensitivity of each phase shift. The gradient descent optimization we study in this paper converges when the phase shifts are correct to within some acceptable range. This acceptable range can be rigorously defined in terms of average value and variance of phase shifts in the mesh that together define an unbiased (“Haar random”) unitary matrix.^{4}^{4}4A Haar random unitary is defined as GramSchmidt orthogonalization of standard normal complex vectors Spengler et al. (2012); Russell et al. (2017). To implement a Haar random unitary, some MZIs in the mesh need to be biased towards cross state ( near , near ) Burgwal et al. (2017); Russell et al. (2017). This cross state bias correspondingly “pinches” the acceptable range for transmissivity and phase shift near the limiting cross state configuration, resulting in higher sensitivity, as can be seen in Figure 3(b).
For an implemented Haar random unitary matrix, lowtolerance, transmissive MZIs are located towards the center of a rectangular mesh Russell et al. (2017); Burgwal et al. (2017) and the apex of a triangular mesh as proven in Appendix C. For both the triangular and rectangular meshes, the cross state bias and corresponding sensitivity for each MZI depends only on the total number of reachable waveguides ports, as proven in Appendix I. Based on this proof, we define the sensitivity index ,^{5}^{5}5Note that , and there are always MZIs that have a sensitivity index of . where and are the subsets of input and output waveguides reachable by light exiting or entering the MZI, respectively, and denotes set size. Figure 1 and Figure 2(a) show the sensitivity index for the rectangular mesh, which clearly increases towards the center MZI.
iii.2 Phase shift distributions and Haar phase
The external phase shifts do not affect the the transmissivity and therefore obey uniform random distributions Russell et al. (2017). In contrast, the phase shifts have a probability density function (PDF) that depends on Russell et al. (2017):
(6) 
The general shape of this distribution is presented in Figure 3(b), showing how an increase in biases towards the cross state with higher sensitivity.
We define the Haar phase as the cumulative distribution function (CDF) of starting from :
(7) 
iii.3 Haar initialization
In the physical setting, it is useful to find the inverse of Equation 8 to directly set the measurable transmissivity of each MZI using a uniformly varying Haar phase , a process we call “Haar initialization” shown in Figure 2(c, d):
(9)  
where the expression for is just a rearrangement of Equation 2.
Haar initialization can be achieved progressively using a procedure similar to that in Ref. Miller (2017). If the phase shifters in the mesh are all wellcharacterized, the transmissivities can be directly set Russell et al. (2017). We will show in Section V that Haar initialization improves the convergence speed of gradient descent optimization significantly.
We can also use Equation 9 to find the average transmissivity and reflectivity for an MZI parameterized by as is found through simulation in Ref. Burgwal et al. (2017):
(10)  
The average reflectivity shown in Figure 2(b) gives a simple interpretation for the sensitivity index shown in Figure 2(a). The average reflectivity is equal to the inverse of the total number of inputs and outputs reachable by the MZI minus the number of ports on either side of the device, . This is true regardless of whether is assigned for a triangular or rectangular mesh.
To see what the Haar initialization has accomplished, we can compare the field propagation through the rectangular mesh from a single input when Haar initialized versus uniform initialized in Figure 2(e). Physically, this corresponds to light in the mesh spreading out quickly from the input of the mesh and “interacting” more near the boundaries of the mesh (inputs, outputs, top, and bottom), as compared to the center of the mesh which has high transmissivity. In contrast, when phases are randomly set, the light effectively follows a random walk through the mesh, resulting in the field propagation pattern shown in Figure 2(f).
iii.4 Tolerance dependence on
While Haar initialization is based on how the average component reflectivity scales with , optimization convergence and device robustness ultimately depend on how phase shift tolerances scale with . The average sensitivity index in the mesh is . As shown in Figure 3(b, c), the standard deviation over the PDF decreases as increases. Therefore, a phase shifter’s allowable tolerance, which roughly correlates with , decreases as the total number of input and output ports affected by that component increases. Since increases linearly with , the required tolerance gets more restrictive at large , as shown in Figure 3(c). We find that the standard deviation is on the order radians for most values of in the specified range. Thus, if thermal crosstalk is ignored Shen et al. (2017), it is possible to implement a known random unitary matrix in a photonic mesh assuming perfect operation. However, we concern ourselves with onchip optimization given just input/output data, in which case the unitary matrix is unknown. In such a case, the decreasing tolerances do pose a challenge in converging to a global optimum as increases. We demonstrate this problem for in Section V.
To account for the scalability problem in global optimization, one strategy may be to design a component in such a way that the mesh MZIs can be controlled by Haar phase voltages as in Figure 3(d) and Equation 9. The transmissivity dependence on a periodic Haar phase (shown in Figure 3(d) and discussed in Appendix G), is markedly different from the usual sinusoidal dependence on periodic . The MZIs near the boundary vary in transmissivity over a larger voltage region than the MZIs near the center, where only small voltages are needed get to full transmissivity. This results in an effectively small control tolerance near small voltages. This motivates the modifications to the mesh architecture which we discuss in the next section.
Iv Architecture Modifications
We propose two architecture modifications that can relax the transmissivity tolerances in the mesh discussed in Section III and result in significant improvement in optimization.
iv.1 Redundant rectangular mesh (RRM)
By adding extra tunable MZIs, it is possible to greatly accelerate the optimization of a rectangular mesh to an unknown unitary matrix. The addition of redundant tunable layers to a redundant rectangular mesh is depicted in green in Figure 4(a). The authors in Ref. Burgwal et al. (2017) point out that using such “underdetermined meshes” (number of inputs less than the number of tunable layers in the mesh) can overcome photonic errors and restore fidelity in unitary construction algorithms. Adding layers to the mesh increases the overall optical depth of the device, but embedding smaller meshes with extra beamsplitter layers in a rectangular mesh of an acceptable optical depth does not pose intrinsic waveguide lossrelated problems.
iv.2 Permuting rectangular mesh (PRM)
Another method to accelerate the optimization of a rectangular mesh is to shuffle outputs at regular intervals within the rectangular mesh. This shuffling relaxes component tolerances and uniformity of the number of paths for each inputoutput transition. We use this intuition to formally define a permuting rectangular mesh. For simplicity,^{6}^{6}6If is not a power of 2, then one might consider the following approximate design: . Define , and let each have layers. assume for some positive integer . Define “rectangular permutation” operations that allow inputs to interact with waveguides at most away for . These rectangular permutation blocks can be implemented using a rectangular mesh composed of MZIs with fixed cross state phase shifts, as shown in Figure 4(b), or using lowloss waveguide crossings.
We now add permutation matrices into the middle of the rectangular mesh as follows
(11)  
where represents the nearest integer larger than .
There are two operations per block : an layer rectangular mesh which we abbreviate as , and the rectangular permutation mesh where block index . This is labelled in Figure 4(b).
V Simulations
Now that we have discussed the mesh modifications and Haar initialization, we simulate global optimization to show how our framework can improve convergence performance by up to five orders of magnitude, even in the presence of fabrication error.
v.1 Mesh initialization
We begin by discussing the importance of initializing the mesh to respect the cross state bias and sensitivity of each component for Haar random unitary matrices discussed in Section III. Uniform random phase initialization is problematic because it is agnostic of the sensitivity and average behavior of each component. We define this distribution of matrices as for a rectangular mesh for inputs and layers. As shown previously in Figure 2(f), any given input follows a random walklike propagation if phases are initialized uniformrandomly, so there will only be nonzero matrix elements within a “bandsize” about the diagonal. This bandsize decreases as circuit size increases as shown in Figure 5.
We compare the bandsizes of banded unitary matrices in simulations qualitatively as we do in Figure 5 or quantitatively as we do in Appendix D. We randomly generate , (permuting rectangular mesh with tunable layers), and (redundant rectangular mesh with extra tunable layers). Figure 5 shows a significant reduction in bandsize as grows larger for rectangular meshes. This phenomenon is not observed with permuting rectangular meshes which generally have the same bandsize as Haar random matrices (independent of ) as shown in in Figure 5 and Appendix D. This correlates with permuting rectangular meshes having faster optimization and less dependence on initialization.
Instead of initializing the mesh using uniform random phases, we use Haar initialization as in Equation 9 to avoid starting with a banded unitary configuration. This initialization, which we recommend for any photonic meshbased neural network application, dramatically improves convergence because it primes the optimization with the right average behavior for each component. We find in our simulations that as long as the initialization is calibrated towards higher transmissivity ( near ), larger mesh networks can also have reasonable convergence times similar to when the phases are Haarinitialized.
The proper initialization of permuting rectangular meshes is less clear because the tolerances and average behavior of each component have not yet been modeled. Our proposal is to initialize each tunable block as an independent mesh using the same definition for , except replacing with the number of layers in , . This is what we use as the Haar initialization equivalent in the permuting rectangular mesh case, although it is possible there may be better initialization strategies for the nonlocal mesh structure.
v.2 Optimization problem and synthetic data
After initializing the photonic mesh, we proceed to optimize the mean square error cost function for an unknown Haar random unitary :
(12) 
where the estimated unitary matrix function maps phase shift parameters to via Equations 4 or 11, and denotes the Frobenius norm. Since trigonometric functions parameterizing are nonconvex, we know that Equation 12 is a nonconvex problem. The nonconvexity of Equation 12 suggests learning a single unitary transformation in a deep neural network might have significant dependence on initialization.
To train the network, we generate random unitnorm complex input vectors of size and generate corresponding labels by multiplying them by the target matrix . We use a training batch size of . The synthetic training data of unitnorm complex vectors is therefore represented by . The minibatch training cost function is similar to the test cost function, . The test set is the identity matrix of size . The test cost function, in accordance with the training cost function definition, thus matches Equation 12.
v.3 Training algorithm
We simulate the global optimization of a unitary mesh using automatic differentiation in tensorflow, which can be physically realized using the in situ backpropagation procedure in Ref. Hughes et al. (2018). This optical backpropagation procedure physically measures using interferometric techniques, which can be extended to any of the architectures we discuss in this paper.
The onchip backpropagation approach is also likely faster for gradient computation than other training approaches such as the finite difference method mentioned in past onchip training proposals Shen et al. (2017). We find empirically that the Adam update rule (a popular firstorder adaptive update rule Kingma and Ba (2015)) outperforms standard stochastic gradient descent for the training of unitary networks. If gradient measurements for the phase shifts are stored during training, adaptive update rules can be applied using successive gradient measurements for each tunable component in the mesh. Such a procedure requires minimal computation (i.e., locally storing the previous gradient step) and can act as a physical test of the simulations we will now discuss. Furthermore, we avoid quasiNewton optimization methods such as LBFGS used in Ref. Burgwal et al. (2017) that cannot be implemented physically as straightforwardly as firstorder methods.
The models were trained using our open source simulation framework neurophox ^{7}^{7}7See https://github.com/solgaardlab/neurophox. using a more general version of the vertical layer definition proposed in Refs. Jing et al. (2017); Hughes et al. (2018). The models were programmed in tensorflow Abadi et al. (2016) and run on an NVIDIA GeForce GTX1080 GPU to improve optimization performance.
v.4 Results
We now compare training results for rectangular, redundant rectangular, and permuting rectangular meshes given . In our comparison of permuting rectangular meshes and rectangular meshes, we analyze performance when beamsplitter errors are distributed throughout the mesh as either or and when the are randomly or Haarinitialized (according to the PDF in Equation 6). We also analyze optimization perforamnces of redundant rectangular meshes where we vary the number of vertical MZI layers.
From our results, we report five key findings:

Optimization of rectangular meshes results in significant offdiagonal errors due to bias towards the banded matrix space of , as shown in Figure 6.

Permuting rectangular meshes converge faster than rectangular meshes despite having the same number of total parameters as shown in Figure 6.

Redundant rectangular meshes, due to increase in the number of parameters, have up to five orders of magnitude better convergence when the number of vertical layers are doubled compared to rectangular and permuting rectangular meshes, as shown in Figure 7.
The singular value decomposition (SVD) architecture discussed in Refs. Miller (2013c); Shen et al. (2017) consists of optical lossy components flanked on both sides by rectangular meshes and are capable of implementing any linear operation with reasonable device input power. Note that with some modifications (e.g. treating loss and gain elements like nonlinearities in the procedure of Ref. Hughes et al. (2018)), SVD architectures can also be trained physically using in situ backpropagation. We simulate the gradientbased optimization of SVD architectures using automatic differentiation in Appendix F.
Vi Discussion
vi.1 Haar initialization
For global optimization and robustness of universal photonic meshes, it is important to consider the required biases and sensitivities for each mesh component. Implementing any Haar random matrix requires that each component independently follows an average reflectivity within some tolerance. This requirement becomes more restrictive with the number of input and output ports accessible by each mesh component. For the rectangular mesh, this means the center mesh components are close to cross state and the most sensitive.
In a Haarinitialized mesh, as shown in Figure 2, the light injected into a single input port spreads out to all waveguides in the device uniformly regardless of . This is a preferable initialization for global optimization because Haar random matrices require this behavior. In contrast, when randomly initializing phases, the light only spreads out over a limited band of outputs. This band gets relatively small compared to the mesh gets larger as shown in Figure 9.
The average reflectivities given by Haar initialization may be useful for inverse design approaches Piggott et al. (2017) for compact tunable or passive multiport interferometers. The component tolerances may inform how robust phase shifters need to be given error sources such as thermal crosstalk Shen et al. (2017). The thermal crosstalk might make it difficult to achieve required tolerances for devices interfering up to modes that generally have phase shift tolerances just above radians.^{8}^{8}8Ref. Shen et al. (2017) propose a standard deviation of might be possible with further circuit characterization, which might be scalable based on Figure 3(c).
In our simulations in Section V, we assume that the control parameter for photonic meshes is linearly related to the phase shift. However, in many current phase shifter implementations, such as thermal phase shifters Shen et al. (2017), the phase is a nonlinear function of the control parameter (i.e., the voltage) and has minimum and maximum values, unlike the unbounded phase used in our optimization. In addition, like the Haar phase in our theory, the voltage acts as the CDF for transmissivities in the physical device, up to a normalization factor. Particular attention needs to be given to phase uncertainty as a function of voltage, since the Haar random distribution of internal MZI phases has small variance for large , as we show in Figure 3(c). As mentioned in Section III, the ideal transmissivityvoltage dependence with this consideration would be identical to the transmissivity vs Haar phase dependence in Figure 3(d).
vi.2 Applications of mesh optimization
Meshes can be tuned using either selfconfiguration Reck et al. (1994); Miller (2013c) or global optimizations (gradientbased Hughes et al. (2018) or derivativefree Tang et al. (2018)). The algorithmic optimizations proposed in Refs. Reck et al. (1994); Miller (2013c) assume that each component in the mesh can cover the entire split ratio range, which is not the case in presence of 50:50 beamsplitter errors. This ultimately leads to lower fidelity in the implemented unitary operation, which can be avoided using a doubleMZI architecture Miller (2015); Wilkes et al. (2016) or a vertical layerwise progressive algorithm Miller (2017). We explore a third alternative to overcome photonic errors; gradientbased global optimization is modelfree and, unlike algorithmic approaches, can efficiently tune photonic neural networks Hughes et al. (2018). This modelfree property makes gradientbased optimization robust to fabrication error; we show in Figure 6(a) that meshes with split ratio error variances of up to can be optimized nearly as well as a perfect mesh, particularly for permuting rectangular meshes.
In the regime of globally optimized meshes, we propose two strategies to modify the rectangular architecture: adding waveguide permutation layers and adding extra tunable vertical MZI layers. Both approaches relax the cross state requirements on the MZIs and accelerate the mesh optimization process. Nonlocal interference works by allowing inputs that are far away physically in the mesh to interact. These approaches are inspired by several recent proposals in machine learning and coherent photonics to design more error tolerant and efficient meshes, many of which use single layers of MZIs and nonlocal waveguide interactions Genz (1998); Mathieu and LeCun (2014); Flamini et al. (2017); Jing et al. (2017); such designs can also be considered to be in the same class of permuting architectures as our proposed permuting rectangular mesh. Adding extra tunable vertical layers, as proposed in Ref. Burgwal et al. (2017), simply adds more tunable paths for the light to achieve a desired output. As shown in Figure 6, we achieve up to five orders of magnitude improvement in convergence at the expense of doubling the mesh size and parameter space.
Like permuting rectangular meshes, multiplane light conversion successfully applies the nonlocal interference idea for efficient spatial mode multiplexing Labroille et al. (2014, 2016). In this protocol, alternating layers of transverse phase profiles and optical Fourier transforms (analogous to what our rectangular permutations accomplish) are applied to reshape input modes of light Labroille et al. (2014, 2016). A similar concept is used in unitary spatial mode manipulation, where stochastic optimization of deformable mirror settings allow for efficient mode conversion Morizur et al. (2010). Thus, the idea of efficient unitary learning via a Fourierinspired permuting approach has precedent in contexts outside of photonic MZI meshes.
An onchip optimization for multiplane light conversion has been accomplished experimentally in the past using simulated annealing Tang et al. (2018). The success of simulated annealing in experimentally training small unitary photonic devices Tang et al. (2018) (rather than gradient descent as is used in this work) suggests there are other algorithms aside from gradient descent that may effectively enable onchip training.
We propose that similar simulated annealing approaches might be made more efficient by sampling Haar phases from uniform distributions and flashing updates onto the device. Similar derivativefree optimizations may also be useful for quantum machine learning Schuld and Killoran (2019); Schuld et al. (2018); Killoran et al. (2018). Whether such approaches can compete with backpropagation for classical applications remains to be investigated. For experimental onchip tuning, simulated annealing has the attractive property of only requiring output detectors. For practical machine learning applications, however, there is currently more literature for backpropagationbased optimization. Furthermore, gradientbased approaches allow for continuous control of phase shifters during the optimization.
Our tensorflow simulations may be useful in the design of optical recurrent neural networks (RNNs) that use unitary operators parameterized by photonic meshes. Such “unitary RNNs” (URNNs) have already been simulated on conventional computers and show some promise in synthetic longterm memory tasks Jing et al. (2017); Dangovski et al. (2017). Unitary RNNs are physically implementable using a single mesh with optical nonlinearities and recurrent optoelectronic feedback, suggesting that the architecture discussed in this work is a scalable, energyefficient option for machine learning applications. It is possible that some tunable features such as the “bandedness” of unitaries implemented by rectangular MZI meshes can be useful (e.g. as an attention mechanism in sequence data) for certain deep learning tasks that use URNNs.
Vii Conclusion
The scalability of gradientbased optimization of Haar random unitary matrices on universal photonic meshes is limited by small reflectivities and MZI phase shifter sensitivities arising from the constraint of locally interacting components. As shown in Section III, the required average reflectivity and sensitivity for each MZI is inversely related to the total number of inputs and outputs affected by the MZI. If the tolerance requirements are not met by the physical components, optimization algorithms will have difficulty converging to a target unitary operator. As shown in Section V for the case of , convergence via in situ backpropagation is generally not achieved if phase shifters are initialized randomly. However, Haar initialization can sufficiently bias the optimization for convergence to a desired random unitary matrix, even in the presence of significant simulated beamsplitter fabrication errors.
In Section IV, we propose adding extra tunable beamsplitters or mesh nonlocalities to accelerate mesh optimization. Naive (uniform random) initialization on a standard photonic mesh has difficulty learning random unitary matrices via gradient descent. By introducing nonlocalities in the mesh, we can improve optimization performance without the need for extra parameters. A Haarinitialized redundant architecture can achieve five orders of magnitude less mean square error for a Haar random unitary matrix and decrease optimization time to such a matrix by at least two orders of magnitude, as shown in Figure 7. Our findings suggest that architecture choice and initialization of photonic mesh components may prove important for increasing the scalability and stability of reconfigurable universal photonic devices and their many classical and quantum applications Annoni et al. (2017); Shen et al. (2017); Spring et al. (2013); Killoran et al. (2018); Schuld et al. (2018); Schuld and Killoran (2019); Arrazola et al. (2019); Miller (2013c, b).
Acknowledgements
This work was funded by the Air Force Office of Scientific Research, specifically for the Center for EnergyâEfficient 3D Neuromorphic Nanocomputing (CEE3N), Grant No. FA955018110186. We would like to thank Tyler Hughes, Momchil Minkov, Nate Abebe, Dylan Black, and Ian Williamson for illuminating discussions.
References
 Gräfe et al. (2016) Markus Gräfe, RenÃ© Heilmann, Maxime Lebugle, Diego GuzmanSilva, Armando PerezLeija, and Alexander Szameit, “Integrated photonic quantum walks,” Journal of Optics (2016), 10.1088/20408978/18/10/103002.
 Carolan et al. (2015) Jacques Carolan, Christopher Harrold, Chris Sparrow, Enrique MartínLópez, Nicholas J. Russell, Joshua W. Silverstone, Peter J. Shadbolt, Nobuyuki Matsuda, Manabu Oguma, Mikitaka Itoh, Graham D. Marshall, Mark G. Thompson, Jonathan C.F. Matthews, Toshikazu Hashimoto, Jeremy L. O’Brien, and Anthony Laing, “Universal linear optics,” Science (2015), 10.1126/science.aab3642.
 Spring et al. (2013) Justin B. Spring, Benjamin J. Metcalf, Peter C. Humphreys, W. Steven Kolthammer, Xian Min Jin, Marco Barbieri, Animesh Datta, Nicholas ThomasPeter, Nathan K. Langford, Dmytro Kundys, James C. Gates, Brian J. Smith, Peter G.R. Smith, and Ian A. Walmsley, “Boson sampling on a photonic chip,” Science (2013), 10.1126/science.1231692.
 Harris et al. (2017) Nicholas C. Harris, Gregory R. Steinbrecher, Mihika Prabhu, Yoav Lahini, Jacob Mower, Darius Bunandar, Changchen Chen, Franco N.C. Wong, Tom BaehrJones, Michael Hochberg, Seth Lloyd, and Dirk Englund, “Quantum transport simulations in a programmable nanophotonic processor,” Nature Photonics 11, 447–452 (2017).
 Annoni et al. (2017) Andrea Annoni, Emanuele Guglielmi, Marco Carminati, Giorgio Ferrari, Marco Sampietro, David Ab Miller, Andrea Melloni, and Francesco Morichetti, “Unscrambling light  Automatically undoing strong mixing between modes,” Light: Science and Applications 6 (2017), 10.1038/lsa.2017.110.
 Shen et al. (2017) Yichen Shen, Nicholas C. Harris, Scott Skirlo, Mihika Prabhu, Tom BaehrJones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Soljačić, “Deep learning with coherent nanophotonic circuits,” Nature Photonics 11, 441–446 (2017).
 Hughes et al. (2018) Tyler W. Hughes, Momchil Minkov, Yu Shi, and Shanhui Fan, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5, 864 (2018).
 Williamson et al. (2019) Ian A. D. Williamson, Tyler W. Hughes, Momchil Minkov, Ben Bartlett, Sunil Pai, and Shanhui Fan, “Reprogrammable ElectroOptic Nonlinear Activation Functions for Optical Neural Networks,” arXiv preprint (2019).
 Miller (2013a) David A. B. Miller, “Establishing Optimal Wave Communication Channels Automatically,” Journal of Lightwave Technology, Vol. 31, Issue 24, pp. 39873994 31, 3987–3994 (2013a).
 Clements et al. (2016) William R. Clements, Peter C. Humphreys, Benjamin J. Metcalf, W. Steven Kolthammer, and Ian A. Walmsley, “An Optimal Design for Universal Multiport Interferometers,” Optica , 1–8 (2016).
 Reck et al. (1994) Michael Reck, Anton Zeilinger, Herbert J. Bernstein, and Philip Bertani, “Experimental realization of any discrete unitary operator,” Physical Review Letters 73, 58–61 (1994).
 Miller (2013b) David A. B. Miller, “Selfaligning universal beam coupler,” Optics Express 21, 6360 (2013b).
 Perez et al. (2017) Daniel Perez, Ivana Gasulla, Jose Capmany, and Richard A. Soref, “Hexagonal waveguide mesh design for universal multiport interferometers,” in 2016 IEEE Photonics Conference, IPC 2016 (2017) pp. 285–286.
 Pérez et al. (2018) Daniel Pérez, Ivana Gasulla, Lee Crudgington, David J. Thomson, Ali Z. Khokhar, Ke Li, Wei Cao, Goran Z. Mashanovich, and JosÃ© Capmany, “Silicon RFPhotonics Processor Reconfigurable Core,” in European Conference on Optical Communication, ECOC (2018).
 Pérez et al. (2017) Daniel Pérez, Ivana Gasulla, Lee Crudgington, David J. Thomson, Ali Z. Khokhar, Ke Li, Wei Cao, Goran Z. Mashanovich, and JosÃ© Capmany, “Multipurpose silicon photonics signal processor core,” Nature Communications (2017), 10.1038/s41467017007141.
 Russell et al. (2017) Nicholas J. Russell, Levon Chakhmakhchyan, Jeremy L. O’Brien, and Anthony Laing, “Direct dialling of Haar random unitary matrices,” New Journal of Physics (2017), 10.1088/13672630/aa60ed.
 Hurwitz (1897) Adolf Hurwitz, “über die Erzeugung der Invarianten durch Integration,” Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, MathematischPhysikalische Klasse , 71–72 (1897).
 Zyczkowski and Kus (1994) K. Zyczkowski and M. Kus, “Random unitary matrices,” Journal of Physics A: General Physics (1994), 10.1088/03054470/27/12/028.
 Diaconis and Forrester (2017) Persi Diaconis and Peter J. Forrester, “Hurwitz and the origins of random matrix theory in mathematics,” Random Matrices: Theory and Applications 06, 1730001 (2017).
 Spengler et al. (2012) Christoph Spengler, Marcus Huber, and Beatrix C. Hiesmayr, “Composite parameterization and Haar measure for all unitary and special unitary groups,” Journal of Mathematical Physics (2012), 10.1063/1.3672064.
 Miller (2015) David A. B. Miller, “Perfect optics with imperfect components,” Optica 2, 747 (2015).
 Miller (2013c) David A. B. Miller, “Selfconfiguring universal linear optical component [Invited],” Photonics Research 1, 1 (2013c).
 Jing et al. (2017) Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Soljačić, “Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNNs,” in Proceedings of Machine Learning Research (2017) pp. 1733–1741.
 Burgwal et al. (2017) Roel Burgwal, William R. Clements, Devin H. Smith, James C. Gates, W. Steven Kolthammer, Jelmer J. Renema, and Ian A. Walmsley, “Using an imperfect photonic network to implement random unitaries,” Optics Express 25, 28236 (2017).
 Miller (2017) David A. B. Miller, ‘‘Setting up meshes of interferometers â reversed local light interference method,” Optics Express 25, 29233 (2017).
 Flamini et al. (2017) Fulvio Flamini, NicolÃ² Spagnolo, Niko Viggianiello, Andrea Crespi, Roberto Osellame, and Fabio Sciarrino, “Benchmarking integrated linearoptical architectures for quantum information processing,” Scientific Reports 7, 15133 (2017).
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Lei Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations (2015).
 Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow: A System for LargeScale Machine Learning,” in Operating Systems Design and Implementation (Savannah, GA, 2016) pp. 265–283.
 Piggott et al. (2017) Alexander Y. Piggott, Jan Petykiewicz, Logan Su, and Jelena Vučković, “Fabricationconstrained nanophotonic inverse design,” Scientific Reports (2017), 10.1038/s41598017019392.
 Tang et al. (2018) Rui Tang, Takuo Tanemura, Samir Ghosh, Keijiro Suzuki, Ken Tanizawa, Kazuhiro Ikeda, Hitoshi Kawashima, and Yoshiaki Nakano, “Reconfigurable alloptical onchip MIMO threemode demultiplexing based on multiplane light conversion,” Optics Letters 43, 1798 (2018).
 Wilkes et al. (2016) C. M. Wilkes, X. Qiang, J. Wang, R. Santagati, S. Paesani, X. Zhou, D. A. B. Miller, G. D. Marshall, M. G. Thompson, and J. L. OâBrien, “60ââdB highextinction autoconfigured MachâZehnder interferometer,” Optics Letters 41, 5318 (2016).
 Genz (1998) Alan Genz, ‘‘Methods for Generating Random Orthogonal Matrices,” Monte Carlo and QuasiMonte Carlo Methods , 199–213 (1998).
 Mathieu and LeCun (2014) Michael Mathieu and Yann LeCun, “Fast Approximation of Rotations and Hessians matrices,” arXiv preprint (2014).
 Labroille et al. (2014) Guillaume Labroille, Bertrand Denolle, Pu Jian, Jean FranÃ§ois Morizur, Philippe Genevaux, and Nicolas Treps, “Efficient and mode selective spatial mode multiplexer based on multiplane light conversion,” in 2014 IEEE Photonics Conference, IPC 2014 (2014).
 Labroille et al. (2016) Guillaume Labroille, Pu Jian, Nicolas Barré, Bertrand Denolle, and JeanFranÃ§ois Morizur, “Mode Selective 10Mode Multiplexer based on MultiPlane Light Conversion,” Optical Fiber Communication Conference (2016), 10.1364/OFC.2016.Th3E.5.
 Morizur et al. (2010) JeanFranÃ§ois Morizur, Lachlan Nicholls, Pu Jian, Seiji Armstrong, Nicolas Treps, Boris Hage, Magnus Hsu, Warwick Bowen, Jiri Janousek, and HansA. Bachor, “Programmable unitary spatial mode manipulation,” Journal of the Optical Society of America A 27, 2524 (2010).
 Schuld and Killoran (2019) Maria Schuld and Nathan Killoran, “Quantum Machine Learning in Feature Hilbert Spaces,” Physical Review Letters 122, 040504 (2019).
 Schuld et al. (2018) Maria Schuld, Alex Bocharov, Krysta Svore, and Nathan Wiebe, “Circuitcentric quantum classifiers,” arXiv preprint (2018).
 Killoran et al. (2018) Nathan Killoran, Thomas R. Bromley, Juan Miguel Arrazola, Maria Schuld, NicolÃ¡s Quesada, and Seth Lloyd, “Continuousvariable quantum neural networks,” arXiv preprint (2018), 10.1016/j.pepi.2014.06.011.
 Dangovski et al. (2017) Rumen Dangovski, Li Jing, and Marin Soljacic, “Rotational Unit of Memory,” in International Conference on Learning Representations (2017).
 Arrazola et al. (2019) Juan Miguel Arrazola, Thomas R Bromley, Josh Izaac, Casey R Myers, Kamil Brádler, and Nathan Killoran, “Machine learning method for state preparation and gate synthesis on photonic quantum computers,” Quantum Science and Technology 4, 024004 (2019).
Appendix A Software
To reproduce the results of this paper, the reader can be directed to neurophox, an opensource Python package that implements the optimizations and simulations of this paper in numpy and tensorflow. The exact code used to generate the results is provided in the neurophoxnotebooks repository.
Appendix B Derivation of beamsplitter errors
Unitary matrices generated by lossless MZIs are prone to errors in beamsplitter fabrication. We introduce the error to our expression derived in Equation 1, which is twice the displacement in beamsplitter split ratio from . Beamsplitter gates with error are defined as where are transmissivity and reflectivity amplitudes that result in slight variations from a beamsplitter. We use this error definition since it is a measurable quantity in the chip; in fact, there are strategies to minimize directly Miller (2015). The unitary matrix that we implement in presence of beamsplitter errors becomes
(13)  
If , which is a reasonable practical assumption for nearby fabricated structures, then solving for in terms of :
(14)  
Similarly, we can solve for :
(15) 
As we have discussed in this paper (and as we later show in Figure 12), photonic errors (standard deviation of 0.1) can affect the optimized phase shifts for unitary matrices. The above constraints on and suggest that limited transmissivity is likely in the presence of fabrication errors, which can inhibit progressive setup of unitary meshes Miller (2015); Burgwal et al. (2017). However, we will later show through tensorflow simulation that in situ backpropagation updates can to some extent address this issue using a more sophisticated experimental protocol involving phase conjugation and interferometric measurements Hughes et al. (2018).
Appendix C Haar measure
In this section, we outline a proof for the Haar measure of a unitary matrix in terms of the physical parameters of a photonic mesh to supplement our discussion of Haar phase and the proof in Ref. Russell et al. (2017). The Haar measure for can be defined in two physical basis representations: the measurement basis represents measurements after each MZI and the transmissivity basis represents the transmissivity of each MZI.
To make our explanation simpler, we will adopt the orthogonalization protocol used by Ref. Reck et al. (1994). In this representation, we define the triangular mesh as
(16)  
where is a diagonal matrix representing a single mode phase shift at index .
The operators represent the diagonal layers of the triangular mesh and their role is to project inputs from Hilbert space dimension from to recursively until we reach a single mode phase shift in . Our proof moves the same direction as Reck’s orthogonalization procedure; starting from , we progressively find the for each in decreasing order. For each layer , there are complex hyperspherical coordinates ( “amplitude” coordinates and “phase” coordinates). The first column vector of can be recovered by shining light (using a unit power ) through the top port of the layer (given by ) and measuring the output fields in the triangular mesh generated by , as shown in Figure 8(b). As mentioned in Refs. Reck et al. (1994); Miller (2013c), progressive optimization moves in the opposite direction; the desired output fields are shined back into the device and the transmissivities and phases for each layer (moving from to ) can be progressively tuned until all the power lies in the top input port for that layer.
The measurement basis is an unbiased Haar measure (as shown in Ref. Russell et al. (2017) using Gaussian random vectors) and can be physically represented by the power measured at waveguides due to shining light through the top input port for that layer. Unlike the proof in Ref. Russell et al. (2017), we choose our constraint that the input power rather than , which introduces a normalization prefactor in our Haar measure by integration over all possible .^{9}^{9}9This prefactor is exactly . This allows us to ignore the power in the final output port because energy conservation ensures we have the constraint . Therefore, our simplified Cartesian basis for each is (ignoring the normalization prefactor):
(17) 
Now we represent the Cartesian power quantities explicitly in terms of the component transmissivities, which we have defined already to be . Using the same convention as hyperspherical coordinates, we get the following recursive relation for as shown diagrammatically by following the path of light from the top input port in Figure 8(b):
(18) 
Intuitively, Equation 18 implies that the power measured at port is given by light that is transmitted by the first components along the path of light and then reflected by the th component. In other words, follows a geometric distribution.
We can use Equation 18 to find the Jacobian relating the and the . We find that we have a lower triangular matrix with diagonal elements for
(19) 
We know is lower triangular since for all , from Equation 18.
Since the determinant of a lower triangular matrix is the same as the product of the diagonal, we can directly evaluate the unbiased measure (off by a normalization constant) as
(20)  
To get the total Haar measure, we multiply the volume elements for the orthogonal components . We get from this procedure that the sensitivity index for a triangular mesh in Equation 20 (independent of ), which can be seen using Figure 8. We can express this Haar measure in terms of , the probability distribution for the transmissivity, and , the probability distribution for the phase shift corresponding to that same transmissivity, assuming appropriate choice for the triangular mesh:
(21)  
We can now normalize Equation 20 using the normalization factor for to get and then substitute to get our desired expression for :
(22)  
Finally, we can recover the Haar phase parameter (i.e. the cumulative density function) in terms of either or :
(23) 
Appendix D Unitary bandsizes
We would like to quantify the bandedness of matrices implemented by the meshes with randomly initialized phases. We define the bandsize as the minimum number of matrix elements whose absolute value squared sums to . Note that our bandsize measurement is agnostic of the ordering of the inputs and outputs, and is therefore agnostic to any permutations that may be applied at the end of the decomposition. In photonics terms, if , let measure the fraction of output waveguides over which of the power is distributed when light is input into waveguide . The bandsize is averaged over all . Sampling from our matrix distributions, we observe the relationship between the bandsize (given ) and the dimension in Figure 9.
Appendix E Introducing photonic errors in a redundant mesh
When photonic errors are added to the redundant mesh, specifically the layer mesh, we observe a slight decrease in optimization performance in Figure 10, similar to what we observed for the rectangular and permuting rectangular meshes in Figure 7. This decrease in performance, however, is less concerning considering that we still achieve a mean square error of around , suggesting that RRM might be more robust to photonic errors even during onchip optimization.
Appendix F Photonic singular value decomposition simulations
We compare the simulated performance of such rectangular and permuting rectangular architectures in the singular value decomposition (SVD) configuration discussed in Refs. Miller (2013c); Shen et al. (2017). Such architectures would allow one to perform arbitary linear operations with a relatively small footprint, and may have some other useful dimensionalityreduction properties in machine learning contexts.
In SVD, we represent complex matrix as , where is a diagonal matrix implemented onchip with singlemode gain or attenuating elements and are unitary matrices implemented in a photonic mesh. While has free parameters, any global optimization for a photonic SVD implementation using rectangular meshes can have at most free parameters, with equality when . In the triangular architecture discussed in Ref. Miller (2013c), the total complexity of parameters can be exactly when setting a subset of the beamsplitters to bar state. In the case where the total number of singular values for is , we get tunable elements. Additionally, there is an “effective redundancy” in that some vectors in are more important than others due to the singular values.
In our simulations, we investigate an SVD architecture for for composed of the unitaries and . Note that such an architecture is redundant when , so we focus on the simple case of .
We define our train and test cost functions analogous to the unitary meansquared error cost functions as
(24)  
where is defined in Section V.
We randomly generate by expressing , where . The synthetic training batches of unitnorm complex vectors are represented by .
Assuming a procedure similar to Hughes et al. (2018) can be used in presence of gains and optimization, the permuting rectangular mesh converges slightly faster but is significantly more resilient to uniform random phase initialization compared to the rectangular mesh as shown in Figure 11. Both optimizations are minimally affected by beamsplitter error, unlike what is seen in the unitary optimization case.
Appendix G Periodic parameters
We comment on our reported values of in the checkerboard plots in Figures 3 (of the main text) and 12. Since our simulated optimization does not have the explicit constraint that , we report the “absolute ,” where we map all values of to some value in . This corresponds to the transformation (assuming is originally between and ):
(25) 
Note a similar treatment as Equation 25 can be used to represent the Haar phase in terms of a “periodic” Haar phase with period 2:
(26) 
Appendix H Training simulation comparisons
In Figure 12, we compare the performance for our unitary network experiment over our aforementioned conditions in Section V. For each plot, we also have an associated video, showing how the parameter distributions, estimates, and errors vary during the course of the optimization, available online.^{10}^{10}10See https://av.tib.eu/series/520/photonic+optimization.
There are several takeaways from these plots. First, the reflectivity of the MZIs near the center of the mesh are much smaller in the optimized rectangular meshes than in the permuting rectangular meshes. Second, the gradient descent algorithm has a hard time finding the regime of Haar random matrices after a uniform random phase initialization. The values of are much larger than they need to be even 100 iterations into the optimization. This is likely evidence of a “vanishing gradient” problem when the mesh is not Haarinitialized. Finally, an important observation for the meshes with beamsplitter error is that the distribution shifts slightly towards in the rectangular mesh. This is a consequence of the limits in reflectivity and transmissivity in each MZI due to beamsplitter fabrication error as discussed in Section II.
Our simulated permuting rectangular implementation uses the same layer definitions as defined in Equation 11 except the with the most layers are in the center of the mesh, and the with the fewest layers are near the inputs and outputs of the mesh. In Figure 4, and would be switched, and for , the order is . We find this configuration to be the best permuting rectangular mesh so far in our experiments, although the architecture in Equation 11 gives improvements over the rectangular mesh.
Appendix I An equivalent definition for
Let be the sensitivity index for an MZI (“node”) at (waveguide, layer) coordinates in a local decomposition for an unitary operator. We define the “row coordinate” or waveguide index from the MZI’s operator coupling waveguides and , and we define the “column coordinate” or layer index to be , where is the maximum number of operators applied to a reachable input (This is equivalent to the vertical layers definition in Figure 1.). The reachable inputs are the subset of input modes affecting the immediate inputs of the MZI at , and the reachable outputs are the subset of output modes affected by the immediate outputs of the MZI.
Following the definitions in Ref. Russell et al. (2017), in the triangular scheme, , and in the rectangular scheme, where is the number of nodes on the diagonal (measured along paths of constant ) containing a rotation parameterized by , and is a sequence of decreasing odd integers , followed by increasing even integers , as defined in Russell et al. (2017). We prove below that for both the triangular and rectangular meshes, .
Lemma 1.
In the triangular mesh, .
Proof.
In the triangular mesh (shown for in Figure 8) , so we wish to show that , or:
(27) 
Suppose Equation 27 holds for some arbitrary in the mesh, such that . First, induct on : if we take and , then and . Next, induct on : if we take and , then and . In both cases, Equation 27 holds.
Traversals by 2 along or from a starting node can reach all nodes with the same parity of and , so we need two base cases. Consider the apex node at , and one of its neighbors at , . The former has and the latter has and . In both cases, Equation 27 is satisfied, so the lemma holds by induction. ∎
Lemma 2.
In the rectangular mesh, .
Proof.
In the rectangular mesh, , as defined in Ref. Russell et al. (2017). Define orthogonal axes and on the lattice such that for a node at , traveling in the direction gives the neighboring node at and traveling in the direction gives the neighboring node at , as depicted in Figure 14. For even {odd} , let the node at have and the node at have . Then there is a onetoone mapping such that , as shown in Figure 14, and it suffices to prove the lemma by induction in this diagonal basis.
Since is defined to be the length of a diagonal along paths of constant , it depends only on , so we rewrite explicitly:
(28) 
Similarly, since is enumerated along a diagonal, it depends only on , and we convert from the sequence definition of Ref. Russell et al. (2017) to an explicit lattice form:
(29) 
In this diagonal basis, we want to show that
(30) 
There are two boundaries at which separate four quadrants that must be considered, depicted by gray lines in Figure 14. We will induct on and within each quadrant, then induct on or across each of the two boundaries.
Suppose that Equation 30 holds for some arbitrary in the mesh, such that . First, we induct on and within each quadrant; the results are tabulated in Table 1. In every case, , so Equation 30 remains satisfied.
Next, we induct across the boundaries, shown in Table 2. Again, in every case, , satisfying Equation 30.
Finally, note that the base case of the top left MZI at , holds, with . This completes the proof in the basis, and since there is a onetoone mapping between , holds by induction. ∎
Quadrant  Induction 
