Evaluation of Complex-Valued Neural Networks on Real-Valued Classification Tasks
Abstract
Complex-valued neural networks are not a new concept. However, real-valued models have often been favoured over complex-valued models because of difficulties in training and performance. When comparing real-valued and complex-valued neural networks, existing literature often ignores the number of parameters, resulting in comparisons of neural networks of vastly different sizes. We find that when real and complex neural networks of similar capacity are compared, complex models perform equal to or slightly worse than real-valued models for a range of real-valued classification tasks. The use of complex numbers allows neural networks to handle noise on the complex plane. When classifying real-valued data with a complex-valued neural network, the imaginary parts of the weights follow their real parts. This behaviour is indicative of a task that does not require a complex-valued model. We further investigated this in a synthetic classification task. We can transfer many activation functions from the real to the complex domain using different strategies. The weight initialisation of complex neural networks, however, remains a significant problem.
1 Introduction
In recent years complex-valued neural networks have been successfully applied to a variety of tasks, specifically in signal processing, where the input data has a natural interpretation in the complex domain. Complex-valued neural networks are often compared to real-valued networks. For such a comparison we need to ensure that the architectures are comparable in their model size and capacity. This aspect of the comparison is rarely studied or only dealt with superficially. A metric for capacity is the number of real-valued parameters. The introduction of complex numbers into a model increases the computational complexity and the number of real-valued parameters, but assumes a certain structure of weights and data input.
This paper explores the performance of complex-valued multilayer perceptrons (MLPs) of varying depth and width. We present a complex-valued multilayer perceptron architecture and its training process, and we consider the choice of activation function and the number of real-valued parameters in both the complex and the real case on benchmark classification tasks with real-valued data. We propose two methods to construct comparable networks: 1) setting a fixed number of real-valued neurons per layer and 2) setting a fixed budget of real-valued parameters. As benchmark tasks we choose MNIST digit classification [18], CIFAR10 image classification [17], CIFAR100 image classification [17] and Reuters newswire topic classification (Reuters-21578, Distribution 1.0). We use classification of synthetic data for further investigation.
2 Related Literature
Complex-valued neural networks were first formally described by Clarke [8]. Several authors have since proposed complex versions of the backpropagation algorithm based on gradient descent [6, 10, 19]. Inspired by work on multi-valued threshold logic [1] from the 1970s, a multi-valued neuron and neural network were defined by Aizenberg et al. [4, 3], who also extended this idea to quaternions. In the 2000s, complex neural networks were successfully applied to a variety of tasks [22, 12, 21, 25]. These tasks mainly involved the processing and analysis of complex-valued data or data with an intuitive mapping to complex numbers. In particular, images and signals in their wave form or Fourier transformation were used as input data to complex-valued neural networks [15].
Another natural application of complex numbers is convolution [7], which is used in image and signal processing. While real convolutions are widely used in deep learning for image processing, it is possible to replace them with complex convolutions [26, 13, 23, 14].
The properties of complex numbers and matrices can be used to define constraints on deep learning models. Complex-valued recurrent networks that constrain their weights to be unitary matrices, introduced by Arjovsky et al. [5] and further developed by Wisdom et al. [29], reduce the impact of vanishing or exploding gradients.
More recently, complex-valued neural networks have been used to learn filters as embeddings of images and audio signals [27, 24, 9]. In addition, tensor factorisation has been applied to complex embeddings to predict the edges between entities of knowledge bases [28].
Despite their successes, complex neural networks have been less popular than their real-valued counterparts, potentially because the training process and architecture design are less intuitive, which stems from stricter requirements for the differentiability of activation functions in the complex plane [31, 16, 20].
When comparing complex-valued neural networks with real-valued neural networks, many publications ignore the number of parameters altogether [3], compare only the number of parameters of the entire model [26], or do not distinguish between complex and real-valued parameters and units [30]. From the perspective of this paper such comparisons are equivalent to comparing models of different sizes. We systematically explore the performance of multilayer perceptrons on simple classification tasks with respect to the activation function, width and depth.
3 Complex-Valued Neural Networks
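We define a complex-valued neuron analogous to its real-valued counterpart and consider its differences in structure and training. The complex neuron can be defined as:
(1) $o = \phi\left(\sum_{i=1}^{n} w_i x_i + b\right) = \phi\left(\mathbf{w}^{T}\mathbf{x} + b\right)$
with an activation function $\phi$ applied to the input $\mathbf{x} \in \mathbb{C}^{n}$, complex weight $\mathbf{w} \in \mathbb{C}^{n}$ and complex bias $b \in \mathbb{C}$. Arranging $m$ neurons into a layer:
(2) $\mathbf{o} = \phi\left(\mathbf{W}\mathbf{x} + \mathbf{b}\right)$
with an input $\mathbf{x} \in \mathbb{C}^{n}$, weight matrix $\mathbf{W} \in \mathbb{C}^{m \times n}$ and bias $\mathbf{b} \in \mathbb{C}^{m}$.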
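As an illustration of Equations 1 and 2, the following minimal sketch evaluates one complex-valued layer, assuming the usual affine form $\phi(\mathbf{W}\mathbf{x} + \mathbf{b})$ with one row of $\mathbf{W}$ per neuron. The function name `complex_layer` and the toy values are hypothetical and only serve the example.

```python
def complex_layer(W, x, b, phi):
    """Return phi(W x + b) for a complex weight matrix W (one row per neuron)."""
    return [
        phi(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


def identity(z):
    """No activation function."""
    return z


# Hypothetical toy values: two neurons, two complex inputs.
W = [[1 + 1j, 0.5 - 2j],
     [-1j, 2 + 0j]]
x = [1 + 0j, 1j]
b = [0j, 1 + 1j]

out = complex_layer(W, x, b, identity)
print(out)
```

With the identity activation the two neurons output $3 + 1.5i$ and $1 + 2i$; any activation $\phi$ that accepts a Python `complex` can be passed in its place.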
The activation function $\phi$ in the above definitions can be a function $\phi: \mathbb{C} \to \mathbb{R}$ or $\phi: \mathbb{C} \to \mathbb{C}$. We will consider the choice of the non-linear activation function in more detail in Section 6. In this work, we choose a simple real-valued loss function, but complex-valued loss functions could be the subject of future work. There is no total ordering on the field of complex numbers that is compatible with its arithmetic, since $i^2 = -1$ would have to be non-negative like every square in an ordered field. A complex-valued loss function would require defining a partial ordering on complex numbers (similar to a linear matrix inequality).
The training process in the complex domain differs, because activation functions are often not entirely complex-differentiable.
Definition 3.1.
Analogous to a real function, a complex function $f: \mathbb{C} \to \mathbb{C}$ is complex-differentiable at a point $z_0$ of an open subset $\Omega \subset \mathbb{C}$ if there exists a limit such that
(3) $f'(z_0) = \lim_{z \to z_0} \frac{f(z) - f(z_0)}{z - z_0}$
If the function $f$ is complex-differentiable at all points of $\Omega$ it is called holomorphic. While in the real-valued case the existence of a limit is sufficient for a function to be differentiable, the complex definition in Equation 3 implies a stronger property, because the limit must be the same along every direction in which $z$ approaches $z_0$.
Definition 3.2.
A complex function $f(x + iy) = u(x, y) + i\,v(x, y)$ with real-differentiable functions $u(x, y)$ and $v(x, y)$ is complex-differentiable if they satisfy the Cauchy-Riemann Equations:
(4) $\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}$
We represent a complex number $z \in \mathbb{C}$ with two real numbers, $z = x + iy$. For $f$ to be holomorphic, the limit not only needs to exist for the two functions $u(x, y)$ and $v(x, y)$, but the partial derivatives must also satisfy the Cauchy-Riemann Equations. This also means that a function can be non-holomorphic (i.e. not complex-differentiable) in $z$, but still be analytic in its parts $(x, y)$. Hence, real differentiability of the functions $u$ and $v$ alone is not sufficient for $f$ to be holomorphic; the Cauchy-Riemann Equations (Definition 3.2) must hold as well.
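The Cauchy-Riemann Equations can be checked numerically. The sketch below assumes central finite differences with a hypothetical step size `h` and compares the holomorphic hyperbolic tangent with a rectifier applied separately to the real and imaginary parts; all helper names are illustrative.

```python
import cmath


def cauchy_riemann_residual(f, z, h=1e-6):
    """Return |u_x - v_y| + |u_y + v_x| at z, estimated by central differences."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)            # u_x + i v_x
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # u_y + i v_y
    u_x, v_x = df_dx.real, df_dx.imag
    u_y, v_y = df_dy.real, df_dy.imag
    return abs(u_x - v_y) + abs(u_y + v_x)


def split_relu(z):
    """Rectifier applied separately to the real and imaginary part."""
    return complex(max(0.0, z.real), max(0.0, z.imag))


z0 = 0.3 - 0.7j  # arbitrary test point away from the kinks of split_relu
res_tanh = cauchy_riemann_residual(cmath.tanh, z0)
res_relu = cauchy_riemann_residual(split_relu, z0)
print(res_tanh, res_relu)
```

At this test point the residual of the hyperbolic tangent vanishes up to numerical error, while the residual of the split rectifier is close to one, since $\partial u / \partial x = 1$ but $\partial v / \partial y = 0$ there.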
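In order to apply the chain rule to non-holomorphic functions, we can utilise the fact that many non-holomorphic functions are differentiable with respect to their real and imaginary parts. We consider the complex function $f$ to be a function of $z$ and its complex conjugate $\bar{z}$. Effectively, we choose a different basis for our partial derivatives.
(5) $\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\frac{\partial f}{\partial y}\right), \qquad \frac{\partial f}{\partial \bar{z}} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\frac{\partial f}{\partial y}\right)$
These derivatives are a consequence of Wirtinger calculus (or $\mathbb{CR}$-calculus). They allow the application of the chain rule to many non-holomorphic functions of multiple complex variables $z_n$:
(6) $\frac{\partial (f \circ g)}{\partial z_i} = \sum_{n=1}^{N} \left(\frac{\partial f}{\partial z_n} \circ g\right) \frac{\partial g_n}{\partial z_i} + \left(\frac{\partial f}{\partial \bar{z}_n} \circ g\right) \frac{\partial \bar{g}_n}{\partial z_i}$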
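A small numerical sketch of the Wirtinger derivatives follows. It assumes the standard definitions $\partial f / \partial z = \frac{1}{2}(\partial f / \partial x - i\,\partial f / \partial y)$ and $\partial f / \partial \bar{z} = \frac{1}{2}(\partial f / \partial x + i\,\partial f / \partial y)$ and approximates the real partial derivatives by central differences; the function names and the step size are hypothetical.

```python
def wirtinger(f, z, h=1e-6):
    """Return (df/dz, df/dz_bar) at z from central differences in x and y."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    d_z = 0.5 * (df_dx - 1j * df_dy)
    d_zbar = 0.5 * (df_dx + 1j * df_dy)
    return d_z, d_zbar


def intensity(z):
    """|z|^2 written as z * conj(z); not holomorphic."""
    return z * z.conjugate()


def square(z):
    """z^2; holomorphic, so its conjugate derivative vanishes."""
    return z * z


z0 = 1 + 2j
print(wirtinger(intensity, z0))  # approximately (conj(z0), z0)
print(wirtinger(square, z0))     # approximately (2 * z0, 0)
```

For $|z|^2 = z\bar{z}$ both derivatives exist ($\bar{z}$ and $z$ respectively) although the function is not holomorphic, which is what makes a chain rule over $z$ and $\bar{z}$ usable in training.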
Many non-holomorphic functions are also not entirely differentiable with respect to their real parts. The general practice of computing gradients only at specific points allows using a wide range of complex activation functions. The training process, however, can become numerically unstable. The unstable training process makes it necessary to devise special methods to avoid problematic regions of the function. The Wirtinger calculus, described above, provides an alternative method for computing the gradient that also improves the stability of the training process.
4 Interaction of Parameters
Any complex number $z = a + ib \in \mathbb{C}$ can be represented by two real numbers: the real part $a = \Re(z)$ and the imaginary part $b = \Im(z)$ or, equivalently, as magnitude $r = |z| = \sqrt{a^2 + b^2}$ and phase (angle) $\varphi = \arg(z)$. Consequently, any complex-valued function of one or more complex variables can be expressed as a function of two real variables, $f(z) = f(a, b) = f(r, \varphi)$.
Despite this straightforward use and representation in neural networks, complex numbers define an interaction between the two parts. Consider the operations necessary in the regression outlined in Equation 2 to be composed of real and imaginary parts (or, equivalently, magnitude and phase). Each element $w = a + ib$ of the weight matrix $\mathbf{W}$ interacts with an element $x = c + id$ of an input $\mathbf{x}$:
(7) $wx = (a + ib)(c + id) = (ac - bd) + i(ad + bc)$
Equivalently, in polar form using Euler's formula $e^{i\varphi} = \cos\varphi + i\sin\varphi$, with $w = r_1 e^{i\varphi_1}$ and $x = r_2 e^{i\varphi_2}$:
(8) $wx = r_1 e^{i\varphi_1}\, r_2 e^{i\varphi_2} = r_1 r_2\, e^{i(\varphi_1 + \varphi_2)}$
Complex parameters increase the computational complexity of a neural network, as more operations are required. Instead of a single real-valued multiplication, up to four real multiplications and two real additions are required. As Equations 7 and 8 show, this cost can be reduced significantly depending on the implementation and representation chosen.
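The two representations can be compared with a short sketch. It assumes a Cartesian product with four real multiplications and a polar product with one real multiplication and one addition; the function names are hypothetical, and converting between the two forms has a cost of its own that is not counted here.

```python
import cmath


def mul_cartesian(a, b, c, d):
    """(a + ib)(c + id): four real multiplications, one subtraction, one addition."""
    return (a * c - b * d, a * d + b * c)


def mul_polar(r1, phi1, r2, phi2):
    """r1 e^{i phi1} * r2 e^{i phi2}: one real multiplication, one real addition."""
    return (r1 * r2, phi1 + phi2)


w, x = 1 + 2j, 3 - 1j
re, im = mul_cartesian(w.real, w.imag, x.real, x.imag)
r, phi = mul_polar(*cmath.polar(w), *cmath.polar(x))

print(re, im)              # Cartesian result
print(cmath.rect(r, phi))  # polar result converted back for comparison
```

Both routes give $5 + 5i$ for this pair (the polar route up to rounding), so the choice between them is purely a matter of cost and of which representation the rest of the model uses.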
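Consequently, simply doubling the number of real-valued parameters per layer is not sufficient to achieve the same effect as in complex-valued neural networks. This is illustrated when a complex number $z = a + ib = r e^{i\varphi}$ is expressed in an equivalent matrix representation, specifically as a matrix in the ring $M_2(\mathbb{R})$ of real $2 \times 2$ matrices:
(9) $z \mapsto \begin{pmatrix} a & -b \\ b & a \end{pmatrix}$
(10) $= r \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}$
This augmented representation facilitates computing the multiplication of an input $\mathbf{x} = \mathbf{c} + i\mathbf{d}$ with a complex-valued weight matrix $\mathbf{W} = \mathbf{A} + i\mathbf{B}$ as:
(11) $\begin{pmatrix} \Re(\mathbf{W}\mathbf{x}) \\ \Im(\mathbf{W}\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \mathbf{A} & -\mathbf{B} \\ \mathbf{B} & \mathbf{A} \end{pmatrix} \begin{pmatrix} \mathbf{c} \\ \mathbf{d} \end{pmatrix}$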
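The augmented representation can be verified with plain Python lists. The sketch assumes the block structure $[[\mathbf{A}, -\mathbf{B}], [\mathbf{B}, \mathbf{A}]]$ for $\mathbf{W} = \mathbf{A} + i\mathbf{B}$ and stacks the real part of the input above its imaginary part; the helper names and values are hypothetical.

```python
def matvec(M, v):
    """Plain matrix-vector product for lists of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def augmented(W):
    """Build the real block matrix [[A, -B], [B, A]] from complex W = A + iB."""
    top = [[w.real for w in row] + [-w.imag for w in row] for row in W]
    bottom = [[w.imag for w in row] + [w.real for w in row] for row in W]
    return top + bottom


W = [[1 + 2j, -1j],
     [0.5 + 0j, 2 - 1j]]
x = [1 + 1j, 2 - 3j]

complex_out = matvec(W, x)
stacked_x = [z.real for z in x] + [z.imag for z in x]
real_out = matvec(augmented(W), stacked_x)

print(complex_out)  # complex result
print(real_out)     # real parts followed by imaginary parts
```

The real computation returns the real parts of the complex product followed by its imaginary parts. The block matrix has twice as many rows and columns as $\mathbf{W}$ but only as many free real parameters, because $\mathbf{A}$ and $\mathbf{B}$ each appear twice.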
This interaction means that architecture design needs to be reconsidered in order to accommodate this structure. A deep learning architecture that performs well with real-valued parameters may not work for complex-valued parameters and vice versa. Models that do not accommodate the structure, or tasks that do not require complex-valued representations, will not improve in performance.
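Our experiments show that real-valued data does not require this structure. With weights $w = a + ib$ and $\mathbf{W} = \mathbf{A} + i\mathbf{B}$ and inputs $x = c + id$ and $\mathbf{x} = \mathbf{c} + i\mathbf{d}$, the imaginary part of a real-valued input is zero ($d = 0$, $\mathbf{d} = \mathbf{0}$), so Equations 7 and 11 simplify to:
(12) $wx = ac + i\,bc, \qquad \begin{pmatrix} \Re(\mathbf{W}\mathbf{x}) \\ \Im(\mathbf{W}\mathbf{x}) \end{pmatrix} = \begin{pmatrix} \mathbf{A}\mathbf{c} \\ \mathbf{B}\mathbf{c} \end{pmatrix}$
For the training this means that the real parts dominate the overall classification of a real-valued data point. In later sections we discuss our experimental results and illustrate the training with a synthetic classification task.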
5 Capacity
The number of (real-valued) parameters is a metric to quantify the capacity of a network, i.e. its ability to approximate structurally complex functions. With too many parameters the model tends to overfit the data, while with too few parameters it tends to underfit.
A consequence of representing a complex number $a + ib$ using the real numbers $(a, b)$ is that the number of real parameters of each layer is doubled: $p_{\mathbb{C}} = 2 p_{\mathbb{R}}$. The number of real-valued parameters per layer should be equal (or at least as close as possible) between a real-valued architecture and its complex-valued counterpart. This ensures that the models have the same capacity, so that performance differences are caused by introducing complex numbers as parameters and not by a capacity difference.
Consider the number of parameters in a fully connected layer in the real case and in the complex case. Let $n$ be the input dimension and $m$ the number of neurons; then the number of parameters of a real-valued layer $p_{\mathbb{R}}$ and of a complex layer $p_{\mathbb{C}}$ is given by
(13) $p_{\mathbb{R}} = n \times m, \qquad p_{\mathbb{C}} = 2(n \times m)$
For a multilayer perceptron with $k$ hidden layers of $m$ neurons each and output dimension $c$, the number of real-valued parameters without bias is given by:
(14) $p_{\mathbb{R}} = n \times m + k(m \times m) + m \times c, \qquad p_{\mathbb{C}} = 2(n \times m + k(m \times m) + m \times c)$
At first glance, designing comparable multilayer neural network architectures, i.e. with the same number of real-valued parameters in each layer, is trivial. However, halving the number of neurons in every layer will not achieve parameter comparability, because the number of neurons defines both the output dimension of a layer and the input dimension of the following layer. We addressed this problem by choosing MLP architectures with an even number $k$ of hidden layers and a number of neurons per layer alternating between $m$ and $m/2$. This yields the same number of real parameters in each layer of a complex-valued MLP as in the real-valued network. Let us consider the dimensions of outputs and weights with $k$ hidden layers. For the real-valued case:
(15) 
where $m_i$ is the number of (complex or real) neurons of the $i$-th layer. The equivalent using complex-valued neurons would be:
(16) 
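The layer-wise parameter matching can be checked with a few lines of Python. The sketch assumes that the complex network starts with $m/2$ neurons and then alternates between $m$ and $m/2$, that biases are ignored, and that the input and output dimensions are those of MLP-flattened MNIST ($n = 784$, $c = 10$). These assumptions were chosen because they reproduce the parameter counts 50,816, 59,008, 67,200 and 83,584 reported in the results; the function names are hypothetical.

```python
def real_params(n, m, k, c):
    """Per-layer weights of a real MLP: n x m, then k times m x m, then m x c."""
    dims = [n] + [m] * (k + 1) + [c]
    return [d_in * d_out for d_in, d_out in zip(dims, dims[1:])]


def complex_params(n, m, k, c):
    """Per-layer real parameters of a complex MLP with widths m/2, m, m/2, ...

    Every complex weight counts as two real parameters.
    """
    widths = [m // 2 if i % 2 == 0 else m for i in range(k + 1)]
    dims = [n] + widths + [c]
    return [2 * d_in * d_out for d_in, d_out in zip(dims, dims[1:])]


for k in (0, 2, 4, 8):
    r = real_params(784, 64, k, 10)
    z = complex_params(784, 64, k, 10)
    print(k, sum(r), sum(z), r == z)
```

For an odd number of hidden layers the last hidden layer of the complex network would have $m$ instead of $m/2$ neurons and the output layer would no longer match, which is why the depth is restricted to even $k$.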
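Another approach to the design of comparable architectures is to work with a parameter budget. Given a fixed budget of real parameters $p_{\mathbb{R}}$, we can define a real or complex MLP with an even number $k$ of hidden layers such that the network's parameters are within that budget. The hidden layers and the input layer have the same number of real neurons $m_{\mathbb{R}}$ or complex neurons $m_{\mathbb{C}}$. The number of neurons in the last layer is defined by the number of classes $c$. Solving $p_{\mathbb{R}} = k m^2 + (n + c)m$ (real case) and $p_{\mathbb{R}} = 2(k m^2 + (n + c)m)$ (complex case) for $m$ gives:
(17) $m_{\mathbb{R}} = \begin{cases} -\frac{n + c}{2k} + \sqrt{\left(\frac{n + c}{2k}\right)^2 + \frac{p_{\mathbb{R}}}{k}} & \text{if } k > 0 \\ \frac{p_{\mathbb{R}}}{n + c} & \text{otherwise} \end{cases}$
(18) $m_{\mathbb{C}} = \begin{cases} -\frac{n + c}{2k} + \sqrt{\left(\frac{n + c}{2k}\right)^2 + \frac{p_{\mathbb{R}}}{2k}} & \text{if } k > 0 \\ \frac{p_{\mathbb{R}}}{2(n + c)} & \text{otherwise} \end{cases}$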
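The budget-based widths can be computed directly. The sketch below assumes that the budget covers $k m^2 + (n + c)m$ real weights (twice that in the complex case), that biases are ignored, and that the result is rounded to the nearest integer; `width_for_budget` is a hypothetical helper and $n = 784$, $c = 10$ are the assumed MNIST dimensions.

```python
import math


def width_for_budget(budget, n, c, k, complex_valued=False):
    """Solve k*m^2 + (n + c)*m = p for m, with p counted in weights of the chosen type."""
    p = budget / 2 if complex_valued else budget  # a complex weight costs two reals
    if k == 0:
        return p / (n + c)
    half = (n + c) / (2 * k)
    return -half + math.sqrt(half ** 2 + p / k)


widths = {}
for k in (0, 2, 4, 8):
    m_real = round(width_for_budget(500_000, 784, 10, k))
    m_complex = round(width_for_budget(500_000, 784, 10, k, complex_valued=True))
    widths[k] = (m_real, m_complex)
    print(k, m_real, m_complex)
```

Under these assumptions the sketch yields 630/315, 339/207, 268/170 and 205/134 units for $k = 0, 2, 4, 8$, which agrees with the unit counts listed for the 500,000-parameter experiments.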
6 Activation Functions
In any neural network an important decision is the choice of non-linearity. With the same number of parameters in each layer, we are able to study the effects that activation functions have on the overall performance. An important theorem to be considered for the choice of activation function is Liouville's theorem. The theorem states that any bounded function that is holomorphic on the entire complex plane must be constant. Hence, we need to choose unbounded and/or non-holomorphic activation functions.
To investigate the performance of complex models assuming a function which is linearly separable in the complex parameters we chose the identity function. This allows us to identify tasks that may not be linearly separable with real-valued neurons, but are linearly separable with complex-valued neurons. An example would be the approximation of the XOR function [2]. The hyperbolic tangent is a well-studied function and is defined for both complex and real numbers. The rectifier linear unit (ReLU) is also well understood and frequently used in a real-valued setting, but has not been considered in a complex-valued setting. It illustrates separate application on the two parts of a complex number. The magnitude and squared magnitude functions are chosen to map complex numbers to real numbers.

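Identity (or no activation function):
(19) $\mathrm{identity}(z) = z$
Hyperbolic tangent:
(20) $\tanh(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
Rectifier linear unit (ReLU):
(21) $\mathrm{relu}(z) = \max(0, \Re(z)) + i\max(0, \Im(z))$
Intensity (or magnitude squared):
(22) $|z|^2 = \Re(z)^2 + \Im(z)^2$
Magnitude (or complex absolute):
(23) $|z| = \sqrt{\Re(z)^2 + \Im(z)^2}$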
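The five activation functions can be written for Python's built-in `complex` type as follows. This is a minimal scalar sketch; it assumes that the rectifier is applied separately to the real and imaginary part, and the function names are hypothetical.

```python
import cmath


def identity(z):
    """No activation function."""
    return z


def ctanh(z):
    """Complex hyperbolic tangent."""
    return cmath.tanh(z)


def split_relu(z):
    """Rectifier applied separately to real and imaginary part."""
    return complex(max(0.0, z.real), max(0.0, z.imag))


def intensity(z):
    """Squared magnitude, maps C to R."""
    return z.real ** 2 + z.imag ** 2


def magnitude(z):
    """Magnitude (complex absolute value), maps C to R."""
    return abs(z)


z = 3 - 4j
for phi in (identity, ctanh, split_relu, intensity, magnitude):
    print(phi.__name__, phi(z))
```

Only the last two functions return real numbers; the first three keep the output in the complex domain.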
Before applying the logistic function in the last layer we use another function to obtain a real-valued loss. We chose the squared magnitude $|z|^2$. The intensity or probability amplitude of two interfering waves gives us a geometrically and probabilistically interpretable output.
(24) 
For an output vector
(25) 
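The output stage can be sketched as follows, assuming that the logistic function in the last layer is a softmax over the classes (a sigmoid for a single output would be analogous). The function name and the example values are hypothetical.

```python
import math


def output_probabilities(o):
    """Map complex outputs to real scores |o_j|^2, then apply a softmax."""
    scores = [z.real ** 2 + z.imag ** 2 for z in o]
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = output_probabilities([1 + 1j, 0.5j, -2 + 0j])
print(probs)
```

Because only the magnitude enters the score, the phase of an output has no influence on the predicted class: the third output wins here although its real part is negative.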
7 Experiments
To compare real and complex-valued multilayer perceptrons (Figure 1) we investigated them in various classification tasks. In all of the following experiments the task was to assign a single class to each real-valued data point using complex-valued multilayer perceptrons:

Experiment 1: We tested MLPs with $k \in \{0, 2, 4, 8\}$ hidden layers, a fixed width of $m$ units in each layer in real-valued architectures and alternating $m$ and $m/2$ units (e.g. 64 and 32) in complex-valued architectures (see Section 5). We applied no fixed parameter budget. We tested the models on MNIST digit classification, CIFAR10 image classification, CIFAR100 image classification and Reuters topic classification. Reuters topic classification and MNIST digit classification use $m = 64$ units per layer, CIFAR10 and CIFAR100 use $m = 128$ units per layer.

Experiment 2: We tested MLPs with a fixed budget of 500,000 real-valued parameters. The MLPs have variable width according to the depth and the parameter budget and are tested on MNIST digit classification, CIFAR10 image classification, CIFAR100 image classification and Reuters topic classification. All tested activation functions are introduced in Section 6. We rounded the number of units given by Equations 17 and 18 to the nearest integer.
We used the weight initialisation discussed by Trabelsi et al. [26] for all our experiments. To reduce the impact of the initialisation we trained each model 10 times. Each run trained the model for 100 epochs with the Adam optimiser. We used categorical or binary cross entropy as the loss function, depending on the task. Correspondingly, we used softmax or sigmoid as the activation function for the last fully connected layer.
8 Results
Tables 1, 2, 3, 4 show the results for MLPs with variable depth, fixed width and no parameter budget (Experiment 1). Tables 5, 6, 7, 8 show the results for MLPs with variable width according to depth and a fixed parameter budget of 500,000 real-valued parameters (Experiment 2). In our experiments the accuracies achieved by complex and real-valued multilayer perceptrons are often close to each other. Nevertheless, the real-valued networks outperform the complex-valued networks in most configurations, and the complex-valued neural networks often fail to learn any structure from the data.



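Table 1: MNIST, Experiment 1 (fixed width, no parameter budget). Test accuracy of real and complex MLPs.
Hidden layers  Real parameters  Activation  Real MLP  Complex MLP
k = 0  50,816  identity  0.9282  0.9509
k = 0  50,816  tanh  0.9761  0.9551
k = 0  50,816  ReLU  0.9780  0.9710
k = 0  50,816  |z|^2  0.9789  0.9609
k = 0  50,816  |z|  0.9770  0.9746
k = 2  59,008  identity  0.9274  0.9482
k = 2  59,008  tanh  0.9795  0.8923
k = 2  59,008  ReLU  0.9804  0.9742
k = 2  59,008  |z|^2  0.9713  0.6573
k = 2  59,008  |z|  0.9804  0.9755
k = 4  67,200  identity  0.9509  0.9468
k = 4  67,200  tanh  0.9802  0.2112
k = 4  67,200  ReLU  0.9816  0.9768
k = 4  67,200  |z|^2  0.8600  0.2572
k = 4  67,200  |z|  0.9789  0.9738
k = 8  83,584  identity  0.9242  0.1771
k = 8  83,584  tanh  0.9796  0.1596
k = 8  83,584  ReLU  0.9798  0.9760
k = 8  83,584  |z|^2  0.0980  0.0980
k = 8  83,584  |z|  0.9794  0.1032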



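Table 2: Reuters, Experiment 1 (fixed width, no parameter budget). Test accuracy of real and complex MLPs.
Hidden layers  Real parameters  Activation  Real MLP  Complex MLP
k = 0  642,944  identity  0.8116  0.7939
k = 0  642,944  tanh  0.8117  0.7912
k = 0  642,944  ReLU  0.8081  0.7934
k = 0  642,944  |z|^2  0.8050  0.7885
k = 0  642,944  |z|  0.8068  0.7992
k = 2  651,136  identity  0.8005  0.7836
k = 2  651,136  tanh  0.7978  0.7320
k = 2  651,136  ReLU  0.7921  0.7854
k = 2  651,136  |z|^2  0.7725  0.6874
k = 2  651,136  |z|  0.7996  0.7823
k = 4  659,328  identity  0.7925  0.7787
k = 4  659,328  tanh  0.7814  0.4199
k = 4  659,328  ReLU  0.7734  0.7671
k = 4  659,328  |z|^2  0.5895  0.0650
k = 4  659,328  |z|  0.7863  0.7694
k = 8  675,712  identity  0.7929  0.7796
k = 8  675,712  tanh  0.7542  0.1861
k = 8  675,712  ReLU  0.7555  0.7676
k = 8  675,712  |z|^2  0.0053  0.0053
k = 8  675,712  |z|  0.7671  0.7524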



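Table 3: CIFAR10, Experiment 1 (fixed width, no parameter budget). Test accuracy of real and complex MLPs.
Hidden layers  Real parameters  Activation  Real MLP  Complex MLP
k = 0  394,496  identity  0.4044  0.1063
k = 0  394,496  tanh  0.4885  0.1431
k = 0  394,496  ReLU  0.4902  0.4408
k = 0  394,496  |z|^2  0.5206  0.1000
k = 0  394,496  |z|  0.5256  0.1720
k = 2  427,264  identity  0.4039  0.1000
k = 2  427,264  tanh  0.5049  0.1672
k = 2  427,264  ReLU  0.5188  0.496
k = 2  427,264  |z|^2  0.1451  0.1361
k = 2  427,264  |z|  0.5294  0.1000
k = 4  460,032  identity  0.4049  0.1000
k = 4  460,032  tanh  0.4983  0.1549
k = 4  460,032  ReLU  0.8445  0.6810
k = 4  460,032  |z|^2  0.1000  0.1000
k = 4  460,032  |z|  0.5273  0.1000
k = 8  525,568  identity  0.4005  0.1027
k = 8  525,568  tanh  0.4943  0.1365
k = 8  525,568  ReLU  0.5072  0.4939
k = 8  525,568  |z|^2  0.1000  0.1000
k = 8  525,568  |z|  0.5276  0.1000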



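Table 4: CIFAR100, Experiment 1 (fixed width, no parameter budget). Test accuracy of real and complex MLPs.
Hidden layers  Real parameters  Activation  Real MLP  Complex MLP
k = 0  406,016  identity  0.1758  0.0182
k = 0  406,016  tanh  0.2174  0.0142
k = 0  406,016  ReLU  0.1973  0.1793
k = 0  406,016  |z|^2  0.2314  0.0158
k = 0  406,016  |z|  0.2423  0.0235
k = 2  438,784  identity  0.1720  0.0100
k = 2  438,784  tanh  0.2314  0.0146
k = 2  438,784  ReLU  0.2400  0.2123
k = 2  438,784  |z|^2  0.0143  0.0123
k = 2  438,784  |z|  0.2411  0.0100
k = 4  471,552  identity  0.1685  0.0100
k = 4  471,552  tanh  0.2178  0.0157
k = 4  471,552  ReLU  0.2283  0.2059
k = 4  471,552  |z|^2  0.0109  0.0100
k = 4  471,552  |z|  0.2313  0.0100
k = 8  537,088  identity  0.1677  0.0100
k = 8  537,088  tanh  0.2000  0.0130
k = 8  537,088  ReLU  0.2111  0.1956
k = 8  537,088  |z|^2  0.0100  0.0100
k = 8  537,088  |z|  0.2223  0.0100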

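Table 5: MNIST, Experiment 2 (budget of 500,000 real-valued parameters). Test accuracy of real and complex MLPs.
Hidden layers  Real units  Complex units  Activation  Real MLP  Complex MLP
k = 0  630  315  identity  0.9269  0.9464
k = 0  630  315  tanh  0.9843  0.9467
k = 0  630  315  ReLU  0.9846  0.9828
k = 0  630  315  |z|^2  0.9843  0.9654
k = 0  630  315  |z|  0.9857  0.9780
k = 2  339  207  identity  0.9261  0.9427
k = 2  339  207  tanh  0.9852  0.6608
k = 2  339  207  ReLU  0.9878  0.9835
k = 2  339  207  |z|^2  0.9738  0.8331
k = 2  339  207  |z|  0.9852  0.9748
k = 4  268  170  identity  0.9254  0.2943
k = 4  268  170  tanh  0.9838  0.2002
k = 4  268  170  ReLU  0.9862  0.9825
k = 4  268  170  |z|^2  0.8895  0.2875
k = 4  268  170  |z|  0.9846  0.9870
k = 8  205  134  identity  0.9250  0.1136
k = 8  205  134  tanh  0.9810  0.1682
k = 8  205  134  ReLU  0.9851  0.9824
k = 8  205  134  |z|^2  0.0980  0.0980
k = 8  205  134  |z|  0.9803  0.1135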

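Table 6: Reuters, Experiment 2 (budget of 500,000 real-valued parameters). Test accuracy of real and complex MLPs.
Hidden layers  Real units  Complex units  Activation  Real MLP  Complex MLP
k = 0  50  25  identity  0.8072  0.7970
k = 0  50  25  tanh  0.8112  0.7832
k = 0  50  25  ReLU  0.8054  0.7925
k = 0  50  25  |z|^2  0.8037  0.7929
k = 0  50  25  |z|  0.8059  0.7912
k = 2  49  25  identity  0.7992  0.7809
k = 2  49  25  tanh  0.7952  0.7289
k = 2  49  25  ReLU  0.7898  0.7751
k = 2  49  25  |z|^2  0.7778  0.6887
k = 2  49  25  |z|  0.7716  0.7911
k = 4  49  25  identity  0.7636  0.7854
k = 4  49  25  tanh  0.7796  0.4550
k = 4  49  25  ReLU  0.7658  0.7676
k = 4  49  25  |z|^2  0.5823  0.0289
k = 4  49  25  |z|  0.7809  0.7573
k = 8  48  24  identity  0.7760  0.7663
k = 8  48  24  tanh  0.7449  0.1799
k = 8  48  24  ReLU  0.7182  0.7484
k = 8  48  24  |z|^2  0.0053  0.0053
k = 8  48  24  |z|  0.7449  0.7302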

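Table 7: CIFAR10, Experiment 2 (budget of 500,000 real-valued parameters). Test accuracy of real and complex MLPs.
Hidden layers  Real units  Complex units  Activation  Real MLP  Complex MLP
k = 0  162  81  identity  0.4335  0.1006
k = 0  162  81  tanh  0.5032  0.1676
k = 0  162  81  ReLU  0.5007  0.4554
k = 0  162  81  |z|^2  0.5179  0.1006
k = 0  162  81  |z|  0.5263  0.2381
k = 2  148  77  identity  0.4069  0.1000
k = 2  148  77  tanh  0.5205  0.1673
k = 2  148  77  ReLU  0.5269  0.4963
k = 2  148  77  |z|^2  0.1395  0.1273
k = 2  148  77  |z|  0.5315  0.1000
k = 4  138  74  identity  0.4052  0.1000
k = 4  138  74  tanh  0.5218  0.1475
k = 4  138  74  ReLU  0.5203  0.4975
k = 4  138  74  |z|^2  0.1065  0.1010
k = 4  138  74  |z|  0.5234  0.1000
k = 8  123  69  identity  0.4050  0.1003
k = 8  123  69  tanh  0.5162  0.1396
k = 8  123  69  ReLU  0.5088  0.4926
k = 8  123  69  |z|^2  0.1000  0.1000
k = 8  123  69  |z|  0.5194  0.1000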

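Table 8: CIFAR100, Experiment 2 (budget of 500,000 real-valued parameters). Test accuracy of real and complex MLPs.
Hidden layers  Real units  Complex units  Activation  Real MLP  Complex MLP
k = 0  158  79  identity  0.2807  0.0314
k = 0  158  79  tanh  0.2308  0.0193
k = 0  158  79  ReLU  0.2153  0.1935
k = 0  158  79  |z|^2  0.2364  0.0124
k = 0  158  79  |z|  0.2439  0.0279
k = 2  144  75  identity  0.1723  0.0100
k = 2  144  75  tanh  0.2440  0.0203
k = 2  144  75  ReLU  0.2481  0.2224
k = 2  144  75  |z|^2  0.0155  0.0151
k = 2  144  75  |z|  0.2453  0.0100
k = 4  135  72  identity  0.1727  0.0100
k = 4  135  72  tanh  0.2397  0.0150
k = 4  135  72  ReLU  0.2381  0.2147
k = 4  135  72  |z|^2  0.0122  0.0100
k = 4  135  72  |z|  0.2390  0.0100
k = 8  121  67  identity  0.1706  0.0100
k = 8  121  67  tanh  0.2209  0.0164
k = 8  121  67  ReLU  0.2167  0.2027
k = 8  121  67  |z|^2  0.0100  0.0100
k = 8  121  67  |z|  0.2191  0.0100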
Complex-valued MLPs can be used to classify short dependencies (e.g. MNIST digit classification) or a short text as a bag-of-words (e.g. Reuters topic classification). For the two image classification tasks CIFAR10 and CIFAR100 the results indicate that a complex-valued MLP with most activation functions does not learn any structure in the data. These two tasks require larger weight matrices in the first layer, and weight initialisation is still a significant problem.
The best non-linearity in complex neural networks is the rectifier linear unit (ReLU) applied to the imaginary and real parts, similarly to the real-valued models. The magnitude $|z|$ and the hyperbolic tangent outperform ReLU in some settings, particularly in the real-valued case. However, the results using the rectifier linear unit are much more stable. Despite the similarity of the activation functions $|z|$ and $|z|^2$, their performance differs significantly in all tasks. The magnitude $|z|$ consistently outperforms the squared magnitude $|z|^2$. In these classification benchmarks the activation function is the deciding factor for the overall performance of a given model. The activation may allow the network to recover from a bad initialisation and use the available parameters appropriately. An example would be the ReLU activation in the CIFAR tasks of Experiments 1 and 2 (Tables 3, 4, 7, 8).
As expected, we observe that with a fixed number of neurons per layer (Experiment 1) and increasing depth, the complex and real-valued accuracy increases. As we are increasing the total number of parameters, the model capacity increases. An exception here is Reuters topic classification, where the performance decreases with increasing depth. When choosing the number of neurons per layer according to a given parameter budget (Experiment 2 using Equations 17, 18), the performance decreases significantly as model depth increases. Taken together with the results from Experiment 1, this suggests that the width of each layer is more important than the overall depth of the complete network.
We observed that the performance variance between the 10 initialisations is very high. We hypothesised that weight initialisation in complex MLPs becomes much more difficult with increasing depth. Hence, their performance is highly unstable. We confirmed this by training a complex MLP ($k = 2$, $\tanh$) with 100 runs (instead of 10 runs) on the Reuters classification task. The result shows a similar behaviour to the other results: the performance gap decreases when the model is initialised more often. We found a test accuracy of 0.7748 in the complex-valued case in comparison to 0.7978 in the real-valued case (Table 2).
9 Discussion
For many applications that involve data with an interpretation on the complex plane (e.g. signals), complex-valued neural networks have already been shown to be superior [15]. All selected tasks in our work use real-valued input data. We observe that complex-valued neural networks do not perform as well as expected for the selected tasks and that real-valued architectures outperform their complex versions. This finding seems counterintuitive at first, since every real value is just a special case of a complex number with a zero imaginary part. Solving a real-valued problem with a complex-valued model allows the model a greater degree of freedom to approximate the function. This raises the question of why complex-valued models are inferior to real models for the classification of real-valued data.
In further examination of the training process we observed that the imaginary parts of the complex weights always follow the real parts of the weights. We show this behaviour representatively with two tasks in Figure 2 and Figure 3.
We further investigate this behaviour with synthetic classification tasks, taking into account the information flow within an MLP. We create two synthetic classification tasks. Random complex data points are to be classified using a complex-valued MLP according to the quadrant of their sum, or as a separate class if the sum is close to the origin (Figure 4). Real data points follow. This is equivalent to a projection of the complex data points to (Figure 5). We classify complex resp. real input with Gaussian noise and dimensions using a complex-valued MLP with hidden layers and with each units per layer. Again, we observe that the weight initialisation is a significant problem. However, the complex model can reliably approximate the underlying complex or real functions achieving training accuracy of , test accuracy of . We observe that the training process using complex input develops differently from that for real input. Using complex input, the real and imaginary parts develop independently and then reach convergence (Figure 6). When training with real-valued synthetic data, the magnitudes of the imaginary parts follow the real parts very closely (Figure 7). Independently of the initialisation, the real parts converge a few epochs before the imaginary parts (between 3 and 5 epochs). We also tested non-linear and linear complex-valued regression approximating a real and a complex function. The imaginary part of the weights in a regression problem converges to zero or does not change if initialised with zero.
To explain the behaviour of the imaginary weights, consider the computations of a complex-valued MLP, which combine two real weight matrices, as shown in Figure 8.
We see that the real and imaginary parts of the weights act identically on the input in order to reach a classification. The classification is thus the average of two identical classifications. If in the training phase the averaged absolute values of the weights' imaginary parts follow those of the real parts, either the imaginary part of the input is distributed exactly the same way as its real part, or the considered task simply does not benefit from using a complex-valued hypothesis.
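The comparison of averaged absolute values can be sketched in a few lines. The weight snapshots below are hypothetical toy values, not results from our experiments, and the helper name `mean_abs_parts` is illustrative.

```python
def mean_abs_parts(weights):
    """Return (mean |Re w|, mean |Im w|) over a flat list of complex weights."""
    n = len(weights)
    mean_re = sum(abs(w.real) for w in weights) / n
    mean_im = sum(abs(w.imag) for w in weights) / n
    return mean_re, mean_im


# Hypothetical weight snapshots taken after three consecutive epochs.
history = [
    [0.1 + 0.1j, -0.2 + 0.2j],
    [0.3 - 0.3j, -0.4 + 0.4j],
    [0.5 + 0.5j, -0.6 - 0.6j],
]

curves = [mean_abs_parts(snapshot) for snapshot in history]
gaps = [abs(mean_re - mean_im) for mean_re, mean_im in curves]
print(curves)
print(gaps)
```

In this toy history the two curves coincide at every epoch, which corresponds to the imaginary parts following the real parts; clearly diverging curves would instead point to a task that makes use of the complex structure.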
Moreover, we observed that complex-valued neural networks are much more sensitive to their initialisation than real-valued neural networks. This sensitivity increases with the size of the network. The weight initialisation suggested by Trabelsi et al. [26] can reduce this problem, but does not solve it. This initialisation method is a complex generalisation of the variance-scaled initialisation by Glorot and Bengio [11]. Other possible initialisations include the use of the random search algorithm (RSA) [31], which requires significantly more computation. We eventually tried to mitigate the problem by running each experiment multiple times with different random initialisations. The initialisation of complex weights, however, is still a significant and unsolved problem and requires further investigation.
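For illustration, a variance-scaled complex initialisation in the spirit of [26] can be sketched as follows. The sketch assumes a Rayleigh-distributed magnitude with scale $\sigma = 1/\sqrt{n_{in} + n_{out}}$ and a uniformly distributed phase; these details, the inverse-CDF sampling and the function name `complex_glorot_init` are assumptions of this example and should be checked against [26].

```python
import cmath
import math
import random


def complex_glorot_init(n_in, n_out, rng):
    """Sample an n_out x n_in complex matrix with Rayleigh magnitude and uniform phase."""
    # Assumed scale: E|w|^2 = 2 * sigma^2 = 2 / (n_in + n_out).
    sigma = 1.0 / math.sqrt(n_in + n_out)

    def sample():
        # Inverse-CDF sample of a Rayleigh(sigma) magnitude; 1 - u lies in (0, 1].
        magnitude = sigma * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        phase = rng.uniform(-math.pi, math.pi)
        return cmath.rect(magnitude, phase)

    return [[sample() for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
W = complex_glorot_init(64, 32, rng)
mean_sq = sum(abs(w) ** 2 for row in W for w in row) / (64 * 32)
print(mean_sq, 2 / (64 + 32))
```

The empirical mean of $|w|^2$ should lie close to $2/(n_{in} + n_{out})$; a single draw can still be unfavourable, which is consistent with the need for repeated runs described above.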
The unboundedness of the activation functions can cause numerical instability of the learning process. It can lead to a failed learning process (e.g. the gradient is practically infinite). If the learning process hits such a point in the function (e.g. a singularity), it is difficult to recover the training. This can be avoided by constraining the function or by normalising weights or gradients. With increasing depth and structural complexity these options may be impractical due to their computational cost. Alternatively, the problem could be prevented in the design stage by choosing a bounded and entirely complex-differentiable activation function. Finding such a function is difficult, since by Liouville's theorem the only such functions are constant. Another possibility is to avoid the problem in practice by applying real activation functions separately to the real and imaginary parts (the same or different real functions). The rectifier linear unit is one of these functions. While it is not entirely real-differentiable, we found that the training process is much more stable and the performance improved. Despite the mathematical difficulties due to differentiability, we can in practice transfer many insights from the real to the complex domain.
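The singularities mentioned above can be made concrete with the complex hyperbolic tangent, whose poles lie at $z = i(\pi/2 + k\pi)$, where $\cosh(z) = 0$. The sketch below evaluates `cmath.tanh` next to such a pole and shows one possible mitigation, rescaling a complex gradient to a maximum magnitude. The helper `clip_magnitude` and the threshold are hypothetical and not necessarily the procedure used in the experiments.

```python
import cmath
import math

pole = 1j * math.pi / 2                # cosh(z) = 0 here, so tanh(z) has a pole
near_pole = cmath.tanh(pole + 1e-8)    # evaluate a tiny step away from the pole
print(abs(near_pole))                  # very large magnitude


def clip_magnitude(g, max_norm):
    """Scale a complex value so that |g| <= max_norm while keeping its phase."""
    norm = abs(g)
    if norm <= max_norm:
        return g
    return g * (max_norm / norm)


clipped = clip_magnitude(near_pole, 1.0)
print(abs(clipped))
```

Clipping bounds the magnitude but keeps the phase, so the direction of the update in the complex plane is preserved.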
In summary, real-valued models pose an upper performance limit for real-valued tasks when compared to complex-valued models of similar capacity, because the real and imaginary parts act identically on the input. An investigation of the information and gradient flow can help to identify tasks that benefit from complex-valued neural networks. In consideration of existing literature and our findings, we recommend that complex neural networks be used for classification tasks if the data is naturally in the complex domain or can be meaningfully moved to the complex plane. The network should reflect the interaction of the real and imaginary parts of the weights with the input data. If the structure is ignored, the model may not be able to utilise the greater degree of freedom. It will most likely also require more initialisations and computational time due to the more complicated training process.
10 Conclusion
This work considers a comparison between complex and real-valued multilayer perceptrons in benchmark classification tasks. We found that complex-valued MLPs perform similarly or worse for the classification of real-valued data, even though the complex-valued model allows larger degrees of freedom. We recommend the use of complex numbers in neural networks if a) the input data has a natural mapping to complex numbers, b) the noise in the input data is distributed on the complex plane or c) complex-valued embeddings can be learned from real-valued data. We can identify tasks that would benefit from it by comparing the training behaviour (e.g. by the average absolute values) of real and imaginary weights. If the imaginary part does not follow the real part's general behaviour across epochs, the task benefits from assuming a complex hypothesis.
Other aspects to consider for the model design are activation functions, weight initialisation strategies and the trade-off between performance, model size and computational costs. In our work, the best performing activation function is the componentwise application of the rectifier linear unit. We can transfer many real and complex activation functions by applying them separately on the two real parts, using Wirtinger calculus or a gradient-based approach combined with strategies to avoid certain points. The initialisation described by Trabelsi et al. [26] can help to reduce the initialisation problem, but further investigation is required. As with many other architectures, the introduction of complex numbers as parameters is also a trade-off between the task-specific performance, the size of the model (i.e. the number of real-valued parameters) and the computational cost.
Acknowledgements
Nils Mönning was supported by the EPSRC via a Doctoral Training Grant (DTG) Studentship. Suresh Manandhar was supported by EPSRC grant EP/I037512/1, A Unified Model of Compositional & Distributional Semantics: Theory and Application.
References
 Aizenberg [1977] Igor Aizenberg. ’multiplevalued threshold logic’ translated by claudio moraga. 1977.
 Aizenberg [2016] Igor Aizenberg. ComplexValued Neural Networks with MultiValued Neurons. Springer Publishing Company, Incorporated, 2016. ISBN 3662506319, 9783662506318.
 Aizenberg and Moraga [2007] Igor Aizenberg and Claudio Moraga. Multilayer feedforward neural network based on multivalued neurons (mlmvn) and a backpropagation learning algorithm. Soft Computing, 11(2):169–183, Jan 2007. ISSN 14337479.
 Aizenberg and Aizenberg [1992] Naum N. Aizenberg and Igor N. Aizenberg. Cnn based on multivalued neuron as a model of associative memory for grey scale images. In CNNA ’92 Proceedings Second International Workshop on Cellular Neural Networks and Their Applications, pages 36–41, Oct 1992. doi: 10.1109/CNNA.1992.274330.
 Arjovsky et al. [2015] Martín Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. CoRR, abs/1511.06464, 2015.
 Benvenuto and Piazza [1992] N. Benvenuto and F. Piazza. On the complex backpropagation algorithm. IEEE Transactions on Signal Processing, 40(4):967–969, Apr 1992. ISSN 1053587X. doi: 10.1109/78.127967.
 Bruna et al. [2015] Joan Bruna, Soumith Chintala, Yann LeCun, Serkan Piantino, Arthur Szlam, and Mark Tygert. A theoretical argument for complexvalued convolutional networks. CoRR, abs/1503.03438, 2015.
 Clarke [1990] Thomas L. Clarke. Generalization of neural networks to the complex plane. In 1990 IJCNN International Joint Conference on Neural Networks, pages 435–440 vol.2, June 1990.
 Drude et al. [2016] Lukas Drude, Bhiksha Raj, and Reinhold Häb-Umbach. On the appropriateness of complexvalued neural networks for speech enhancement. In INTERSPEECH, 2016.
 Georgiou and Koutsougeras [1992] G. M. Georgiou and C. Koutsougeras. Complex domain backpropagation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 39(5):330–334, May 1992. ISSN 10577130. doi: 10.1109/82.142037.
 Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
 Goh et al. [2006] S.L. Goh, M. Chen, D.H. Popović, K. Aihara, D. Obradovic, and D.P. Mandic. Complexvalued forecasting of wind profile. Renewable Energy, 31(11):1733–1750, 2006. doi: https://doi.org/10.1016/j.renene.2005.07.006.
 Guberman [2016] Nitzan Guberman. On complex valued convolutional neural networks. CoRR, abs/1602.09046, 2016.
 Haensch and Hellwich [2010] R. Haensch and O. Hellwich. Complexvalued convolutional neural networks for object detection in polsar data. In 8th European Conference on Synthetic Aperture Radar, pages 1–4, June 2010.
 Hirose [2009] A. Hirose. Complexvalued neural networks: The merits and their origins. In 2009 International Joint Conference on Neural Networks, pages 1237–1244, June 2009.
 Hirose [2004] Akira Hirose. ComplexValued Neural Networks: Theories and Applications (Series on Innovative Intelligence, 5). World Scientific Press, 2004. ISBN 9812384642.
 Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Nitta [1993] T. Nitta. A backpropagation algorithm for complex numbered neural networks. In Proceedings of 1993 International Conference on Neural Networks (IJCNN93Nagoya, Japan), volume 2, pages 1649–1652 vol.2, Oct 1993.
 Nitta [2014] Tohru Nitta. Learning dynamics of the complexvalued neural network in the neighborhood of singular points. Journal of Computer and Communications, 2(1):27–32, 2014. doi: 10.4236/jcc.2014.21005.
 Özbay [2008] Yüksel Özbay. A new approach to detection of ecg arrhythmias: Complex discrete wavelet transform based complex valued artificial neural network. Journal of Medical Systems, 33(6):435, Sep 2008. doi: 10.1007/s1091600892051.
 Park and Jeong [2002] DongChul Park and TaeKyun Jung Jeong. Complexbilinear recurrent neural network for equalization of a digital satellite channel. IEEE Transactions on Neural Networks, 13(3):711–725, May 2002. ISSN 10459227. doi: 10.1109/TNN.2002.1000135.
 Popa [2017] C. A. Popa. Complexvalued convolutional neural networks for realvalued image classification. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 816–822, May 2017. doi: 10.1109/IJCNN.2017.7965936.
 Sarroff et al. [2015] Andy M. Sarroff, Victor Shepardson, and Michael A. Casey. Learning representations using complexvalued nets. CoRR, abs/1511.06351, 2015.
 Suksmono and Hirose [2002] A. B. Suksmono and A. Hirose. Adaptive noise reduction of insar images based on a complexvalued mrf model and its application to phase unwrapping problem. IEEE Transactions on Geoscience and Remote Sensing, 40(3):699–709, March 2002. ISSN 01962892. doi: 10.1109/TGRS.2002.1000329.
 Trabelsi et al. [2017] Chiheb Trabelsi, Sandeep Subramanian, Negar Rostamzadeh, Soroush Mehri, Dmitriy Serdyuk, João Felipe Santos, Yoshua Bengio, and Christopher Pal. Deep complex networks. 2017.
 Trouillon et al. [2016] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning (ICML), volume 48, pages 2071–2080, 2016.
 Trouillon et al. [2017] Théo Trouillon, Christopher R. Dance, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. CoRR, abs/1702.06879, 2017.
 Wisdom et al. [2016] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Fullcapacity unitary recurrent neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4880–4888. Curran Associates, Inc., 2016.
 Zhang et al. [2017] Z. Zhang, H. Wang, F. Xu, and Y. Q. Jin. Complexvalued convolutional neural network and its application in polarimetric sar image classification. IEEE Transactions on Geoscience and Remote Sensing, 55(12):7177–7188, Dec 2017. ISSN 01962892. doi: 10.1109/TGRS.2017.2743222.
 Zimmermann et al. [2011] HansGeorg Zimmermann, Alexey Minin, and Victoria Kusherbaeva. Comparison of the complex valued and real valued neural networks trained with gradient descent and random search algorithms. In ESANN, 2011.