# Deep Quaternion Networks

## Abstract

The field of deep learning has seen significant advancement in recent years. However, much of the existing work has focused on real-valued numbers. Recent work has shown that a deep learning system using the complex numbers can be deeper for a set parameter budget compared to its real-valued counterpart. In this work, we explore the benefits of generalizing one step further into the hyper-complex numbers, quaternions specifically, and provide the architecture components needed to build deep quaternion networks. We go over quaternion convolutions, present a quaternion weight initialization scheme, and present algorithms for quaternion batch-normalization. These pieces are tested by end-to-end training on the CIFAR-10 and CIFAR-100 data sets to show the improved convergence compared to a real-valued network.

## 1 Introduction

There have been many advances in deep neural network architectures in the past few years. One such improvement is a normalization technique called batch normalization [10] that standardizes the activations of layers inside a network using minibatch statistics. It has been shown to regularize the network as well as provide faster and more stable training. Another improvement comes from architectures that add so-called shortcut paths to the network. These shortcut paths typically connect later layers to earlier layers, which allows stronger gradients to propagate to the earlier layers. This method can be seen in Highway Networks [21] and Residual Networks [9]. Other work has been done to find new activation functions with more desirable properties. One example is the exponential linear unit (ELU) [4], which attempts to keep activations standardized. All of the above methods combat the vanishing gradient problem [7] that plagues deep architectures. With solutions to this problem appearing, it is only natural to move to a system that will allow one to construct deeper architectures with as low a parameter cost as possible.

Other work in this area has explored the use of complex numbers and of hyper-complex numbers, such as quaternions, which generalize the complex numbers. Using complex numbers in recurrent neural networks (RNNs) has been shown to increase learning speed and provide a more noise-robust memory retrieval mechanism [1]. The first formulation of complex batch normalization and complex weight initialization is presented by [22], where they achieve some state-of-the-art results on the MusicNet data set. Hyper-complex numbers are less explored in neural networks, but have seen use in manual image and signal processing techniques [3]. Examples of using quaternion values in networks are mostly limited to architectures that take in quaternion inputs or predict quaternion outputs, but do not have quaternion weight values [19]. There are some more recent examples of building models that use quaternions represented as real values. In [17] a quaternion multi-layer perceptron (QMLP) is used for document understanding, and [13] uses a similar approach for processing multi-dimensional signals.

Building on [22], our contribution in this paper is to formulate quaternion convolution, batch normalization, and weight initialization. Quaternion batch normalization raises a difficulty beyond the complex case that we had to overcome, as there is no analytic form for our inverse square root matrix.

## 2 Motivation and Related Work

The ability of quaternions to effectively represent spatial transformations and analyze multi-dimensional signals makes them promising for applications in artificial intelligence.

One common use of quaternions is for representing rotations in a more compact form. PoseNet [11] used a quaternion as the target output in their model, where the goal was to regress the 6-DOF camera pose from a single RGB image. The ability to encode rotations may make a quaternion network more robust to rotational variance.

Quaternion representation has also been used in signal processing. Oppenheim and Lim [16] showed that the amount of information in the phase of an image is sufficient to recover the majority of the information encoded in its magnitude. The phase also encodes information such as shapes, edges, and orientations. Quaternions can be represented as a 2 x 2 matrix of complex numbers, which gives them a group of phases potentially holding more information compared to a single phase. Bülow and Sommer [2] used the higher complexity representation of quaternions by extending Gabor’s complex signal to a quaternion one, which was then used for texture segmentation. Another use of quaternion filters is shown in [20], where they introduce a new class of filter based on convolution with hyper-complex masks and present three color edge detecting filters. These filters rely on a three-space rotation about the grey line of RGB space and, when applied to a color image, produce an almost greyscale image with color edges where the original image had a sharp change of color. A quaternionic extension of the feed-forward neural network, for processing multi-dimensional signals, is shown in [13]. They expect that quaternion neurons operate on multi-dimensional signals as single entities, unlike real-valued neurons, which deal with each element of a signal independently. A convolutional neural network (CNN) should be able to learn a powerful set of quaternion filters for more impressive tasks.

Another major motivation, discussed in [22], is that complex numbers are more efficient and provide more robust memory mechanisms compared to the reals [3]. They continue that residual networks have a similar architecture to associative memories, since the residual shortcut paths compute their residual and then sum it into the memory provided by the identity connection. Again, given that quaternions can be represented as a complex group, they may provide an even more efficient and robust memory mechanism.

## 3 Quaternion Network Components

This section presents the components needed to obtain a working deep quaternion network. Some of the longer derivations are shown in the appendix.

### 3.1 Quaternion Representation

In 1833 Hamilton proposed that complex numbers be defined as the set of ordered pairs of real numbers. He then began working to see if triplets could extend multiplication of complex numbers. In 1843 he discovered a way to multiply in four dimensions instead of three, but the multiplication lost commutativity. This construction is now known as the quaternions. Quaternions are composed of four components: one real part and three imaginary parts. They are typically denoted as

$$\mathbb{H} = \{a + bi + cj + dk : a, b, c, d \in \mathbb{R}\}$$

where $a$ is the real part, $(i, j, k)$ denotes the three imaginary axes, and $(b, c, d)$ denotes the three imaginary components. Quaternions are governed by the following arithmetic:

$$i^2 = j^2 = k^2 = ijk = -1$$

which leads to the noncommutative multiplication rules

$$ij = k, \quad jk = i, \quad ki = j, \quad ji = -k, \quad kj = -i, \quad ik = -j$$

Since we will be performing quaternion arithmetic using reals, it is useful to embed $\mathbb{H}$ into a real-valued representation. There exists an injective homomorphism from $\mathbb{H}$ to the matrix ring $M(4, \mathbb{R})$, where $M(4, \mathbb{R})$ is the ring of 4 x 4 real matrices. The 4 x 4 matrix can be written as

$$\begin{bmatrix} a & -b & -c & -d \\ b & a & -d & c \\ c & d & a & -b \\ d & -c & b & a \end{bmatrix}$$

This representation of quaternions is not unique, but we will stick to the above in this paper. It is also possible to represent $\mathbb{H}$ as $M(2, \mathbb{C})$, where $M(2, \mathbb{C})$ is the ring of 2 x 2 complex matrices.
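The embedding can be checked numerically. The following is a minimal sketch, assuming the left-multiplication form of the 4 x 4 matrix (first column `a, b, c, d`); the function names `hamilton`, `to_matrix`, and `matmul` are illustrative and not taken from the paper. It verifies that multiplying two embedded matrices agrees with embedding the Hamilton product, which is what the homomorphism property requires.

```python
def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    )


def to_matrix(q):
    """Real 4 x 4 matrix acting as left multiplication by q (assumed form)."""
    a, b, c, d = q
    return [
        [a, -b, -c, -d],
        [b, a, -d, c],
        [c, d, a, -b],
        [d, -c, b, a],
    ]


def matmul(x, y):
    """Plain 4 x 4 matrix product on nested lists."""
    return [[sum(x[r][n] * y[n][c] for n in range(4)) for c in range(4)]
            for r in range(4)]


# Integer inputs keep the comparison exact.
p = (1, 2, 3, 4)
q = (5, -6, 7, -8)
homomorphic = matmul(to_matrix(p), to_matrix(q)) == to_matrix(hamilton(p, q))

# Basis elements reproduce the noncommutative rules ij = k and ji = -k.
unit_i, unit_j, unit_k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
ij = hamilton(unit_i, unit_j)
ji = hamilton(unit_j, unit_i)
```

The same check with `q` and `p` swapped gives a different product, which illustrates the loss of commutativity.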

With our real-valued representation, a quaternion convolution layer can be expressed with real values as follows. Say that the layer has $N$ feature maps such that $N$ is divisible by 4. We let the first $N/4$ feature maps represent the real components, the second $N/4$ represent the $i$ imaginary components, the third $N/4$ represent the $j$ imaginary components, and the last $N/4$ represent the $k$ imaginary components.

### 3.2 Quaternion Differentiability

In order for the network to perform backpropagation, the cost function and activation functions used must be differentiable with respect to the real, $i$, $j$, and $k$ components of each quaternion parameter of the network. As the complex chain rule is shown in [22], we provide the quaternion chain rule, which can be seen in Section 7.1 of the Appendix.

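### 3.3 Quaternion Convolution

Convolution in the quaternion domain is done by convolving a quaternion filter matrix $W = A + iB + jC + kD$ by a quaternion vector $h = w + ix + jy + kz$. Performing the convolution by using the distributive property and grouping terms, one gets

$$\begin{aligned} W * h = {} & (A * w - B * x - C * y - D * z) \\ & + i\,(A * x + B * w + C * z - D * y) \\ & + j\,(A * y - B * z + C * w + D * x) \\ & + k\,(A * z + B * y - C * x + D * w) \end{aligned}$$

Using a matrix to represent the components of the convolution we have:

$$\begin{bmatrix} \mathcal{R}(W * h) \\ \mathcal{I}(W * h) \\ \mathcal{J}(W * h) \\ \mathcal{K}(W * h) \end{bmatrix} = \begin{bmatrix} A & -B & -C & -D \\ B & A & -D & C \\ C & D & A & -B \\ D & -C & B & A \end{bmatrix} * \begin{bmatrix} w \\ x \\ y \\ z \end{bmatrix}$$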

An example is shown in Figure 1.
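The grouping of terms can be sketched in one dimension. This is an illustrative sketch only: the names `conv1d` and `quaternion_conv1d` are hypothetical, the filter parts are called `A, B, C, D` and the input parts `w, x, y, z` as an assumed naming, and the real-valued operation is the unflipped "valid" sliding product that CNN frameworks usually call convolution. Each output component is a signed sum of four real convolutions, following the Hamilton product.

```python
def conv1d(kernel, signal):
    """Real-valued 'valid' sliding product (no kernel flip, CNN convention)."""
    steps = len(signal) - len(kernel) + 1
    return [sum(kernel[t] * signal[s + t] for t in range(len(kernel)))
            for s in range(steps)]


def combine(*signed_terms):
    """Sum several (sign, sequence) pairs element by element."""
    length = len(signed_terms[0][1])
    return [sum(sign * seq[n] for sign, seq in signed_terms)
            for n in range(length)]


def quaternion_conv1d(filt, inp):
    """Quaternion convolution built from sixteen real convolutions."""
    A, B, C, D = filt   # real, i, j, k parts of the filter
    w, x, y, z = inp    # real, i, j, k parts of the input
    c = conv1d
    real = combine((1, c(A, w)), (-1, c(B, x)), (-1, c(C, y)), (-1, c(D, z)))
    part_i = combine((1, c(B, w)), (1, c(A, x)), (-1, c(D, y)), (1, c(C, z)))
    part_j = combine((1, c(C, w)), (1, c(D, x)), (1, c(A, y)), (-1, c(B, z)))
    part_k = combine((1, c(D, w)), (-1, c(C, x)), (1, c(B, y)), (1, c(A, z)))
    return real, part_i, part_j, part_k


# With length-1 filters and inputs the operation reduces to a single
# Hamilton product: (1 + 2i + 3j + 4k)(5 - 6i + 7j - 8k).
single = quaternion_conv1d(([1], [2], [3], [4]), ([5], [-6], [7], [-8]))

# A length-2 filter sliding over a length-5 input gives 4 output positions.
longer = quaternion_conv1d(
    ([1, 0], [0, 1], [1, 1], [0, 0]),
    ([1, 2, 3, 4, 5], [0, 1, 0, 1, 0], [2, 2, 2, 2, 2], [5, 4, 3, 2, 1]),
)
```

In a full layer the same sign pattern would be applied to 2-D convolutions over the four groups of feature maps.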

### 3.4 Quaternion Batch-Normalization

Batch-normalization [10] is used by the vast majority of deep networks to stabilize and speed up training. It works by keeping the activations of the network at zero mean and unit variance. The original formulation of batch-normalization only works for real values. Applying batch normalization to complex or hyper-complex numbers is more difficult: one cannot simply translate and scale them such that their mean is 0 and their variance is 1, because this would not give equal variance in the multiple components of a complex or hyper-complex number. To overcome this for complex numbers a whitening approach is used [22], which scales the data by the square root of their variances along each of the two principal components. We use the same approach, but must whiten 4D vectors.

However, an issue arises in that there is no convenient way to calculate the inverse square root of a 4 x 4 matrix. It turns out that the square root is not necessary and we can instead use the Cholesky decomposition on our covariance matrix. The details of why this works for whitening can be found in Section 7.2 of the Appendix. Now our whitening is accomplished by multiplying the 0-centered data $(x - \mathbb{E}[x])$ by $W$:

$$\tilde{x} = W(x - \mathbb{E}[x])$$

where $W$ is one of the matrices from the Cholesky decomposition of $V^{-1}$, and $V$ is the covariance matrix given by:

$$V = \begin{bmatrix} \mathrm{Cov}(x_r, x_r) & \mathrm{Cov}(x_r, x_i) & \mathrm{Cov}(x_r, x_j) & \mathrm{Cov}(x_r, x_k) \\ \mathrm{Cov}(x_i, x_r) & \mathrm{Cov}(x_i, x_i) & \mathrm{Cov}(x_i, x_j) & \mathrm{Cov}(x_i, x_k) \\ \mathrm{Cov}(x_j, x_r) & \mathrm{Cov}(x_j, x_i) & \mathrm{Cov}(x_j, x_j) & \mathrm{Cov}(x_j, x_k) \\ \mathrm{Cov}(x_k, x_r) & \mathrm{Cov}(x_k, x_i) & \mathrm{Cov}(x_k, x_j) & \mathrm{Cov}(x_k, x_k) \end{bmatrix}$$

where Cov is the covariance and $x_r$, $x_i$, $x_j$, and $x_k$ are the real, $i$, $j$, and $k$ components of $x$ respectively.

Real-valued batch normalization also uses two learned parameters, $\beta$ and $\gamma$. Our shift parameter $\beta$ must shift a quaternion value, so it is a quaternion value itself with real, $i$, $j$, and $k$ learnable components. The scaling parameter $\gamma$ is a symmetric matrix of size matching $V$ given by:

$$\gamma = \begin{bmatrix} \gamma_{rr} & \gamma_{ri} & \gamma_{rj} & \gamma_{rk} \\ \gamma_{ri} & \gamma_{ii} & \gamma_{ij} & \gamma_{ik} \\ \gamma_{rj} & \gamma_{ij} & \gamma_{jj} & \gamma_{jk} \\ \gamma_{rk} & \gamma_{ik} & \gamma_{jk} & \gamma_{kk} \end{bmatrix}$$

Because of its symmetry it has only ten learnable parameters. The components of the whitened input $\tilde{x}$ have variance 1, so the diagonal of $\gamma$ is initialized to a constant chosen to obtain a modulus of 1 for the variance of the normalized value. The off-diagonal terms of $\gamma$ and all components of $\beta$ are initialized to 0. The quaternion batch normalization is defined as:

$$\mathrm{BN}(\tilde{x}) = \gamma \tilde{x} + \beta$$
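The whitening step can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it factors the covariance as $V = LL^\top$ and solves $L\tilde{x} = x - \mathbb{E}[x]$ by forward substitution, which corresponds to the whitening matrix $W = L^{-1}$ (so that $W^\top W = V^{-1}$) rather than factoring $V^{-1}$ directly. The helper names, the `eps` stabilizer, and the identity scaling used in the demonstration are all assumptions made for illustration.

```python
import math
import random


def mean_and_covariance(batch):
    """Per-component mean, centered rows, and 4 x 4 covariance of a batch."""
    n = len(batch)
    mean = [sum(row[c] for row in batch) / n for c in range(4)]
    centered = [[row[c] - mean[c] for c in range(4)] for row in batch]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(4)]
           for a in range(4)]
    return mean, centered, cov


def cholesky(v):
    """Lower-triangular L with L L^T = v (v symmetric positive-definite)."""
    size = len(v)
    low = [[0.0] * size for _ in range(size)]
    for k in range(size):
        for i in range(k + 1):
            partial = sum(low[k][j] * low[i][j] for j in range(i))
            if i == k:
                low[k][k] = math.sqrt(v[k][k] - partial)
            else:
                low[k][i] = (v[k][i] - partial) / low[i][i]
    return low


def forward_substitute(low, rhs):
    """Solve low @ out = rhs for a lower-triangular matrix low."""
    out = []
    for r in range(len(rhs)):
        acc = sum(low[r][c] * out[c] for c in range(r))
        out.append((rhs[r] - acc) / low[r][r])
    return out


def quaternion_batch_norm(batch, gamma, beta, eps=1e-5):
    """Whiten each 4-vector, then apply the 4 x 4 scale and the shift."""
    _, centered, cov = mean_and_covariance(batch)
    for d in range(4):
        cov[d][d] += eps  # assumed small stabilizer on the diagonal
    low = cholesky(cov)
    whitened = [forward_substitute(low, row) for row in centered]
    scaled = [[sum(gamma[r][c] * x[c] for c in range(4)) + beta[r]
               for r in range(4)] for x in whitened]
    return whitened, scaled


# Correlated toy data: rows are (real, i, j, k) components.
rng = random.Random(0)
batch = []
for _ in range(500):
    g = [rng.gauss(0.0, 1.0) for _ in range(4)]
    batch.append([g[0], 0.5 * g[0] + g[1], 2.0 * g[2] + 1.0, 0.3 * g[0] + g[3]])

identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
whitened, normalized = quaternion_batch_norm(batch, identity, [0.0] * 4)
whitened_mean, _, whitened_cov = mean_and_covariance(whitened)
```

After whitening, `whitened_cov` is close to the identity, so all four components have unit variance and are decorrelated before the learned scale and shift are applied.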

### 3.5 Quaternion Weight Initialization

The proper initialization of weights is vital to convergence of deep networks. In this work we derive our quaternion weight initialization using the same procedure as Glorot and Bengio [6] and He et al. [8].

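To begin we find the variance of a quaternion weight, written in polar form as

$$W = |W| e^{(\cos\phi_1 i + \cos\phi_2 j + \cos\phi_3 k)\theta} = |W|\left(\cos\theta + (\cos\phi_1 i + \cos\phi_2 j + \cos\phi_3 k)\sin\theta\right)$$

where $|W|$ is the magnitude, $\theta$ and $\phi$ are angle arguments, and $\cos^2\phi_1 + \cos^2\phi_2 + \cos^2\phi_3 = 1$ [18].

Variance is defined as

$$\mathrm{Var}(W) = \mathbb{E}[|W|^2] - (\mathbb{E}[W])^2$$

but since $W$ is symmetric around 0 the term $(\mathbb{E}[W])^2$ is 0. We do not have a direct way to calculate $\mathbb{E}[|W|^2]$, so we make use of the magnitude $|W|$ of a quaternion with normally distributed components, which follows an independent normal distribution with four degrees of freedom (DOFs). We can then calculate the expected value of $|W|^2$ to find our variance

$$\mathbb{E}[|W|^2] = \int_0^\infty x^2 f(x)\, dx = 4\sigma^2$$

where $f(x)$ is the four DOF distribution shown in the Appendix.

And since $\mathrm{Var}(W) = \mathbb{E}[|W|^2]$, we now have the variance of $W$ expressed in terms of a single parameter $\sigma$:

$$\mathrm{Var}(W) = 4\sigma^2$$

To follow the Glorot and Bengio [6] initialization we have $\mathrm{Var}(W) = 2/(n_{in} + n_{out})$, where $n_{in}$ and $n_{out}$ are the number of input and output units respectively. Setting this equal to $4\sigma^2$ and solving for $\sigma$ gives $\sigma = 1/\sqrt{2(n_{in} + n_{out})}$. To follow the He et al. [8] initialization, which is specialized for rectified linear units (ReLUs) [15], we have $\mathrm{Var}(W) = 2/n_{in}$, which again setting equal to $4\sigma^2$ and solving for $\sigma$ gives $\sigma = 1/\sqrt{2 n_{in}}$.

As shown in the polar form above, the weight has components $|W|$, $\theta$, and $\phi$. We can initialize the magnitude using our four DOF distribution defined with the appropriate $\sigma$ based on which initialization scheme we are following. The angle components are initialized using the uniform distribution between $-\pi$ and $\pi$, where we ensure the constraint on $\phi$.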
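The initialization procedure can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the magnitude is drawn as the length of four independent zero-mean normal values (which has the four DOF distribution), and the direction cosines are obtained by normalizing three uniform draws, which is one assumed way to satisfy the unit-sum-of-squares constraint on the angle arguments.

```python
import math
import random


def glorot_sigma(n_in, n_out):
    """Sigma such that 4 * sigma^2 = 2 / (n_in + n_out)."""
    return 1.0 / math.sqrt(2.0 * (n_in + n_out))


def he_sigma(n_in):
    """Sigma such that 4 * sigma^2 = 2 / n_in."""
    return 1.0 / math.sqrt(2.0 * n_in)


def sample_quaternion_weight(sigma, rng):
    """Draw one quaternion weight (real, i, j, k) in polar form."""
    # Magnitude: length of a 4-vector of independent N(0, sigma^2) values.
    magnitude = math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(4)))
    # Rotation angle, uniform between -pi and pi.
    theta = rng.uniform(-math.pi, math.pi)
    # Direction cosines: normalize three uniform draws (assumed scheme).
    norm = 0.0
    while norm == 0.0:
        raw = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(v * v for v in raw))
    axis = [v / norm for v in raw]
    return (
        magnitude * math.cos(theta),
        magnitude * axis[0] * math.sin(theta),
        magnitude * axis[1] * math.sin(theta),
        magnitude * axis[2] * math.sin(theta),
    )


rng = random.Random(0)
sigma = he_sigma(8)  # 1 / sqrt(16) = 0.25
weights = [sample_quaternion_weight(sigma, rng) for _ in range(20000)]

# Empirical E[|W|^2], which should be close to 4 * sigma^2 = 2 / n_in.
mean_square = sum(sum(c * c for c in w) for w in weights) / len(weights)
```

With `n_in = 8` the target variance is `2 / 8 = 0.25`, and the empirical `mean_square` lands close to that value.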

## 4 Experimental Results

Our experiments only covered image classification, using both the CIFAR-10 and CIFAR-100 benchmarks. We use the same architecture as the large model in [22], which is a 110-layer residual model similar to the one in [9]. There is one difference between the real-valued network and the ones used for both the complex and hyper-complex valued networks. Because the datasets are all real-valued, the network must learn the imaginary or quaternion components. We use the same technique as [22], where there is an additional block immediately after the input which will learn the hyper-complex components:

"Since all datasets that we work with are real-valued, we present a way to learn their imaginary components to let the rest of the network operate in the complex plane. We learn the initial imaginary component of our input by performing the operations present within a single real-valued residual block."

This means that, to maintain the same approximate parameter count, the number of convolution kernels for the complex network was increased. We did not, however, increase the number of convolution kernels for the quaternion trials, so any increase in model performance comes from the quaternion filters, and at a lower hardware budget.

The architecture consists of 3 stages of repeating residual blocks, where at the end of each stage the images are downsized by strided convolutions. Each stage also doubles the previous stage’s number of convolution kernels. The last layers are a global average pooling layer followed by a single fully connected layer with a softmax function, which classifies the input as one of the 10 classes in CIFAR-10 or one of the 100 classes in CIFAR-100.

We also followed their training procedure, using the backpropagation algorithm with stochastic gradient descent and Nesterov momentum [14] set at 0.9. The norm of the gradients is clipped to 1 and a custom learning rate scheduler is used. The scheduler is the same as the one used in [22], for a direct comparison in performance. The learning rate is initially set to 0.01 for the first 10 epochs, then set to 0.1 from epoch 10 to 100, and then cut by a factor of 10 at epochs 120 and 150. Table 1 presents our results alongside the real and complex valued networks. Our quaternion model outperforms the real and complex networks on both datasets with a smaller parameter budget.
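The learning rate schedule described above can be written as a small function. This is a sketch under one stated assumption: the text does not say what happens between epochs 100 and 120, so the rate is assumed to stay at 0.1 until the first cut at epoch 120. The name `learning_rate` and the zero-based epoch index are also assumptions.

```python
def learning_rate(epoch):
    """Learning rate for a zero-based epoch index (assumed convention)."""
    if epoch < 10:
        return 0.01   # warm-up for the first 10 epochs
    if epoch < 120:
        return 0.1    # main phase; assumed to hold until epoch 120
    if epoch < 150:
        return 0.01   # first cut by a factor of 10
    return 0.001      # second cut by a factor of 10


# Rates at the boundaries of each phase.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 9, 10, 119, 120, 149, 150)}
```

Such a function can be passed to whatever scheduler hook the training framework provides.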

## 5 Conclusions

We have extended work on complex valued networks by exploring quaternion values. We presented the building blocks required to build and train deep quaternion networks and used them to test residual architectures on two common image classification benchmarks. We show that they have competitive performance by beating both the real and complex valued networks with fewer parameters. Future work will be needed to test quaternion networks on tasks outside of image classification, such as semantic segmentation or audio processing.

## 6 Acknowledgments

We would like to thank James Dent of the University of Louisiana at Lafayette Physics Department for helpful discussions. We also thank Fugro for research time on this project.

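## 7 Appendix

### 7.1 The Generalized Quaternion Chain Rule for a Real-Valued Function

Let $L$ be a real-valued loss function and $q$ be a quaternion variable such that $q = a + bi + cj + dk$ where $a, b, c, d \in \mathbb{R}$; then

$$\nabla_L(q) = \frac{\partial L}{\partial q} = \frac{\partial L}{\partial a} + i\frac{\partial L}{\partial b} + j\frac{\partial L}{\partial c} + k\frac{\partial L}{\partial d}$$

Now let $g = m + ni + oj + pk$, with $m, n, o, p \in \mathbb{R}$, be another quaternion variable where $q$ can be expressed in terms of $g$; we then have

$$\begin{aligned} \nabla_L(g) = \frac{\partial L}{\partial g} = {} & \frac{\partial L}{\partial m} + i\frac{\partial L}{\partial n} + j\frac{\partial L}{\partial o} + k\frac{\partial L}{\partial p} \\ = {} & \left(\frac{\partial L}{\partial a}\frac{\partial a}{\partial m} + \frac{\partial L}{\partial b}\frac{\partial b}{\partial m} + \frac{\partial L}{\partial c}\frac{\partial c}{\partial m} + \frac{\partial L}{\partial d}\frac{\partial d}{\partial m}\right) \\ & + i\left(\frac{\partial L}{\partial a}\frac{\partial a}{\partial n} + \frac{\partial L}{\partial b}\frac{\partial b}{\partial n} + \frac{\partial L}{\partial c}\frac{\partial c}{\partial n} + \frac{\partial L}{\partial d}\frac{\partial d}{\partial n}\right) \\ & + j\left(\frac{\partial L}{\partial a}\frac{\partial a}{\partial o} + \frac{\partial L}{\partial b}\frac{\partial b}{\partial o} + \frac{\partial L}{\partial c}\frac{\partial c}{\partial o} + \frac{\partial L}{\partial d}\frac{\partial d}{\partial o}\right) \\ & + k\left(\frac{\partial L}{\partial a}\frac{\partial a}{\partial p} + \frac{\partial L}{\partial b}\frac{\partial b}{\partial p} + \frac{\partial L}{\partial c}\frac{\partial c}{\partial p} + \frac{\partial L}{\partial d}\frac{\partial d}{\partial p}\right) \end{aligned}$$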

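### 7.2 Whitening a Matrix

Let $X$ be an $n$ x $n$ matrix with mean $\mu$, and let $\mathrm{cov}(X) = \Sigma$ be the symmetric covariance matrix of the same size. Whitening a matrix linearly decorrelates the input dimensions, meaning that whitening transforms $X$ into $Z$ such that $\mathrm{cov}(Z) = I$, where $I$ is the identity matrix [12]. The matrix $Z$ can be written as:

$$Z = W(X - \mu)$$

where $W$ is an $n$ x $n$ ‘whitening’ matrix. Since $\mathrm{cov}(Z) = I$ it follows that:

$$\begin{aligned} \mathbb{E}[ZZ^\top] &= I \\ \mathbb{E}[W(X - \mu)(X - \mu)^\top W^\top] &= I \\ W \Sigma W^\top &= I \\ W \Sigma W^\top W &= W \\ W^\top W &= \Sigma^{-1} \end{aligned}$$

From $W^\top W = \Sigma^{-1}$ it is clear that the Cholesky decomposition provides a suitable (but not unique) method of finding $W$.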

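### 7.3 Cholesky Decomposition

Cholesky decomposition is an efficient way to implement LU decomposition for symmetric positive-definite matrices, which allows us to find the square root. Consider $AX = b$, $A = [a_{ij}]_{n \times n}$, and $a_{ij} = a_{ji}$; then the Cholesky decomposition of $A$ is given by $A = LL^\top$ where

$$L = \begin{bmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{bmatrix}$$

Let $l_{ki}$ be the $k$-th row and $i$-th column entry of $L$, then

$$l_{ki} = \begin{cases} 0, & k < i \\ \sqrt{a_{ii} - \sum_{j=1}^{i-1} l_{kj}^2}, & k = i \\ \dfrac{1}{l_{ii}}\left(a_{ki} - \sum_{j=1}^{i-1} l_{ij}\, l_{kj}\right), & i < k \end{cases}$$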
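As a small worked instance of the entry-by-entry construction, consider a 2 x 2 symmetric positive-definite matrix. The matrix below is chosen purely for illustration and does not come from the experiments.

```latex
A = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}, \qquad
l_{11} = \sqrt{a_{11}} = 2, \qquad
l_{21} = \frac{a_{21}}{l_{11}} = 1, \qquad
l_{22} = \sqrt{a_{22} - l_{21}^2} = \sqrt{2}

L = \begin{bmatrix} 2 & 0 \\ 1 & \sqrt{2} \end{bmatrix}, \qquad
L L^\top = \begin{bmatrix} 4 & 2 \\ 2 & 1 + 2 \end{bmatrix} = A
```

Each entry depends only on entries already computed in earlier columns and rows, which is why the factorization can be built in a single pass.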

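### 7.4 4 DOF Independent Normal Distribution

Consider the four-dimensional vector $Y = (S, T, U, V)$ which has components that are normally distributed, centered at zero, and independent. Then $S$, $T$, $U$, and $V$ all have density functions

$$f_S(x; \sigma) = f_T(x; \sigma) = f_U(x; \sigma) = f_V(x; \sigma) = \frac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}$$

Let $X$ be the length of $Y$, which means $X = \sqrt{S^2 + T^2 + U^2 + V^2}$. Then $X$ has the cumulative distribution function

$$F_X(x; \sigma) = \iiiint_{H_x} f_S(s; \sigma)\, f_T(t; \sigma)\, f_U(u; \sigma)\, f_V(v; \sigma)\, ds\, dt\, du\, dv$$

where $H_x$ is the four-dimensional ball

$$H_x = \left\{(s, t, u, v) : \sqrt{s^2 + t^2 + u^2 + v^2} < x\right\}$$

We then can write the integral in polar representation

$$F_X(x; \sigma) = \frac{1}{4\pi^2\sigma^4} \int_0^{2\pi}\!\int_0^{\pi}\!\int_0^{\pi}\!\int_0^x r^3 e^{-r^2/(2\sigma^2)} \sin^2\phi_1 \sin\phi_2\, dr\, d\phi_1\, d\phi_2\, d\phi_3 = \frac{1}{2\sigma^4} \int_0^x r^3 e^{-r^2/(2\sigma^2)}\, dr$$

The probability density function of $X$ is the derivative of its cumulative distribution function, so we use the fundamental theorem of calculus on the last expression to finally arrive at

$$f_X(x; \sigma) = \frac{d}{dx} F_X(x; \sigma) = \frac{1}{2\sigma^4}\, x^3 e^{-x^2/(2\sigma^2)}$$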
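The resulting density can be sanity-checked numerically. The sketch below assumes the density has the form $x^3 e^{-x^2/(2\sigma^2)} / (2\sigma^4)$, the standard length distribution of four independent zero-mean normal components. The function names, the midpoint rule, and the truncation of the integral at $12\sigma$ are choices made for illustration. It checks that the density integrates to 1 and that its second moment equals $4\sigma^2$, the value used in the weight initialization.

```python
import math


def four_dof_pdf(x, sigma):
    """Density of the length of four independent N(0, sigma^2) components."""
    return x ** 3 * math.exp(-x * x / (2.0 * sigma ** 2)) / (2.0 * sigma ** 4)


def integrate(func, upper, steps=100_000):
    """Midpoint-rule integral of func over [0, upper]."""
    dx = upper / steps
    return sum(func((n + 0.5) * dx) for n in range(steps)) * dx


sigma = 0.5
upper = 12.0 * sigma  # the tail beyond 12 sigma is negligible

total_mass = integrate(lambda x: four_dof_pdf(x, sigma), upper)
second_moment = integrate(lambda x: x * x * four_dof_pdf(x, sigma), upper)
```

For `sigma = 0.5` the second moment comes out at `4 * 0.25 = 1.0` up to integration error.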

### References

1. Unitary evolution recurrent neural networks.
Martin Arjovsky, Amar Shah, and Yoshua Bengio. In International Conference on Machine Learning, pages 1120–1128, 2016.
2. Hypercomplex signals-a novel extension of the analytic signal to the multidimensional case.
Thomas Bülow and Gerald Sommer. IEEE Transactions on signal processing, 49(11):2844–2852, 2001.
3. Hypercomplex spectral signal representations for the processing and analysis of images.
Thomas Bülow. Universität Kiel. Institut für Informatik und Praktische Mathematik, 1999.
4. Fast and accurate deep network learning by exponential linear units (elus).
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. arXiv preprint arXiv:1511.07289, 2015.
5. Associative long short-term memory.
Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. arXiv preprint arXiv:1602.03032, 2016.
6. Understanding the difficulty of training deep feedforward neural networks.
Xavier Glorot and Yoshua Bengio. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
7. Untersuchungen zu dynamischen neuronalen netzen.
Sepp Hochreiter. Diploma, Technische Universität München, 91, 1991.
8. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
9. Deep residual learning for image recognition.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
10. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
Sergey Ioffe and Christian Szegedy. In International Conference on Machine Learning, pages 448–456, 2015.
11. Posenet: A convolutional network for real-time 6-dof camera relocalization.
Alex Kendall, Matthew Grimes, and Roberto Cipolla. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
12. Optimal whitening and decorrelation.
Agnan Kessy, Alex Lewin, and Korbinian Strimmer. The American Statistician, (just-accepted), 2017.
13. Feed forward neural network with random quaternionic neurons.
Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki Matsui. Signal Processing, 136:59–68, 2017.
14. A method of solving a convex programming problem with convergence rate O(1/k^2).
Yurii Nesterov.
15. Rectified linear units improve restricted boltzmann machines.
Vinod Nair and Geoffrey E Hinton. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
16. The importance of phase in signals.
Alan V Oppenheim and Jae S Lim. Proceedings of the IEEE, 69(5):529–541, 1981.
17. Quaternion neural networks for spoken language understanding.
Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, Georges Linarès, and Renato De Mori. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pages 362–368. IEEE, 2016.
18. The polar form of a quaternion.
Robert Piziak and Danny Turner. 2002.
19. Neural networks with complex and quaternion inputs.
Adityan Rishiyur. arXiv preprint cs/0607090, 2006.
20. Colour image filters based on hypercomplex convolution.
Stephen J Sangwine and Todd A Ell. IEE Proceedings-Vision, Image and Signal Processing, 147(2):89–93, 2000.
21. Training very deep networks.
Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. In Advances in neural information processing systems, pages 2377–2385, 2015.
22. Deep complex networks.
Chiheb Trabelsi, Olexa Bilaniuk, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J Pal. arXiv preprint arXiv:1705.09792, 2017.
23. Full-capacity unitary recurrent neural networks.
Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.