# Trainable Spectrally Initializable Matrix Transformations in Convolutional Neural Networks

###### Abstract

In this work, we investigate the application of trainable and spectrally initializable matrix transformations on the feature maps produced by convolution operations. While previous literature has already demonstrated the possibility of adding static spectral transformations as feature processors, our focus is on more general trainable transforms. We study the transforms in various architectural configurations on six datasets of different nature: from medical (ColorectalHist, HAM10000) and natural (Flowers, ImageNet) images to historical documents (CB55) and handwriting recognition (GPDS). With rigorous experiments that control for the number of parameters and randomness, we show that networks utilizing the introduced matrix transformations outperform vanilla neural networks. Accuracy increases by an average of 2.2% across all datasets. In addition, we show that spectral initialization leads to significantly faster convergence than random initialization of the matrix transformations. The transformations are implemented as auto-differentiable PyTorch modules that can be incorporated into any neural network architecture. The entire code base is open-source.

Abbreviation | Meaning
---|---
fxd | Fixed Layer
ufx | Unfixed Layer
mxd | Mixed Layer
sp | Spectral Layer
fc | Fully Connected Linear Layer
CNN | Convolutional Neural Network
MLP | Multilayer Perceptron
ReLU | Rectified Linear Unit
NN | Neural Network
DFT | Discrete Fourier Transformation
iDFT | Inverse Discrete Fourier Transformation
DCT | Discrete Cosine Transformation
iDCT | Inverse Discrete Cosine Transformation
GPU | Graphics Processing Unit
LDA | Linear Discriminant Analysis
PCA | Principal Component Analysis
MRI | Magnetic Resonance Imaging
FTIR | Fourier Transformed Infrared Spectrum
IRMAS | Instrument Recognition in Musical and Audio Signals
CIFAR-10 | Canadian Institute For Advanced Research
MNIST | Modified National Institute of Standards and Technology
FashionMNIST | Fashion Modified National Institute of Standards and Technology
ColorectalHist | Colorectal Cancer Histology
GAP | Global Average Pooling

## 1 Introduction

In recent years, neural networks have experienced a renaissance, leading to numerous performance breakthroughs in many machine learning tasks [29, 20]. Besides the simpler Multilayer Perceptron (MLP) [28, 29], Convolutional Neural Networks (CNNs) [9, 21] in particular have become increasingly popular. The progressively deeper architectures of neural networks allowed them to successfully learn hierarchical representations of data, in other words, to implicitly learn suitable feature representations. This, in turn, makes it feasible to train on raw data in an end-to-end fashion, as is currently the state of the art in image recognition [20]. CNNs generally operate on locally distinct parts of the input. This location-invariant behavior can be highly beneficial to network performance. As a disadvantage, vanilla CNNs are not able to apply transformations that concern the global image (or feature map) structure, e.g., translation, rotation, scaling, mirroring and shearing. These transformations are common in a variety of applications such as visual computing and signal processing. Prominent examples of such transformations are unitary spectral transforms such as the Discrete Fourier Transformation (DFT) or Discrete Cosine Transformation (DCT). Depending on the input domain, such transforms can lead to better representations of the data, making the task at hand substantially easier (faster convergence and/or higher accuracy). So far, these transformations are often applied manually in a preprocessing step, before feeding the data into the CNN. This is common practice in audio signal or structured image related learning tasks [6, 11, 39].

In this work, we introduce trainable matrix transformations on images or feature maps, based on the structure of the DCT and DFT. The idea is similar in nature to spatial transformer networks [16], but instead of input-conditioned spatial transforms we explore the possibility of learning a more general matrix transformation that can be initialized as a pre-specified unitary transform (DCT and DFT) and is applicable to the whole input domain of a respective layer. The rationale for initializing in the spectral domain is based on the observation that it often leads to more compact feature representations, resulting in faster convergence [27]. The main advantage of such trainable transforms over a fixed, preprocessing-based image transformation is that they require less expert knowledge and can adjust automatically to a given input domain, effectively allowing for end-to-end learning.

### Contribution

The main contribution of this paper is two-fold.
First, we implement trainable linear matrix transformation modules, which can be initialized either as a spectral transform (DCT, DFT) or with the random normalized initialization [12].
All transforms are differentiable, can be trained with regular gradient-descent-based algorithms, and are thus straightforward to incorporate into regular CNNs in any major framework.
Second, to evaluate the usefulness of the transforms in CNNs, we rigorously evaluated classification performance on four publicly available 2D datasets.
To the best of our knowledge, we are the first to develop and integrate trainable linear transforms in the proposed way.
Our PyTorch-based open-source implementations are freely available as a pip-installable Python package (https://github.com/NarayanSchuetz/SpectralLayersPyTorch) and have already been integrated (https://github.com/NarayanSchuetz/DeepDIVA) into the DeepDIVA [2] deep learning framework, thus enabling full reproducibility of experiments.

## 2 Related Work

The structure of the spectral domain has been exploited for many feature extraction purposes. Pictures of tissue samples, cancer structure or Magnetic Resonance Imaging (MRI) profit from features obtained through spectral transformations. In [39] the authors have shown that mammographies can be classified well by using features from original and DCT-transformed pictures. Fourier Transformed Infrared Spectrum (FTIR) of relevant tissue samples for cancer detection has been around since the 90s [23, 35]. Currently, integration of FTIR in classification tasks provides better results in cancer detection [10]. In [34], DCT is used both for dimensionality reduction of brain tumor MRIs, as well as for feature extraction, serving as input to a Neural Network (NN). Regular or irregular patterns of different sizes, i.e., the tissue structure and other medically relevant details, are well suited for analysis with spectral methods. This is opposed to recognition of large objects, e.g. faces, cars or people, where usage of CNNs is well established [19, 13, 15].

Work from Rippel et al. investigates the use of DFT in neural networks [27].
They introduce the concepts of spectral pooling and spectral parametrization.
Spectral pooling is comparable to low-pass filtering the data and has a similar effect on the number of parameters as *max-pooling* on the original image.
In spectral parametrization, convolutional filters are trained in the spectral domain.
After training, the filters are transformed back to the spatial domain.
This method generates the same filters as obtained through regular training, but the convergence rate is much higher.
The authors of [11] showed that static DCT transforms applied to feature maps led to better convergence rates on different datasets.
In [18], a trainable DCT layer is implemented to match the classic MLP layers; it showed results similar to those of a Gabor filter layer in speech recognition.
The work in [16] introduces the idea of applying parameterized spatial transformations to feature maps.
In [5] and [37], scattering transforms are used to generate more stable representations of data in a theoretically well-founded but practically more complex manner.

Reducing the complexity of linear layers has been tackled by various groups [33, 24]. In [33], the structure of special matrix layers, such as Vandermonde, Toeplitz or Cauchy matrices, is exploited to reduce memory usage. In [24], so-called structured efficient linear layers (SELLs), a combination of diagonal, sparse or permutation matrices, and implementations of Fourier, Hadamard and Cosine transformations are investigated. They show that the use of these SELLs allows dense layers to be approximated while reducing the memory usage and complexity of the NN.

The idea to initialize weights in NN according to prior assumptions or information has proven successful time and time again. This can be seen in commonly used transfer learning applications [26] but also with more specific initialization of theoretically well-founded linear functions like Gabor filters [25], Principal Component Analysis (PCA) [30] and Linear Discriminant Analysis (LDA) [3]. In the latter, they pushed initialization further by using label information to directly produce networks with classification abilities.

## 3 Matrix Transforms in Neural Networks

In this section, the mathematical background of the implemented linear mappings is explained in detail.

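For two vector spaces $\mathcal{V}$ and $\mathcal{W}$, a function $f: \mathcal{V} \to \mathcal{W}$ is a linear mapping if for $x, y \in \mathcal{V}$ and any scalar $c$ it holds that $f(x + y) = f(x) + f(y)$ and $f(cx) = c\,f(x)$.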

In this paper, we look at layers of NNs, implemented in the form of two matrix multiplications. They can be described by a linear mapping

$$f:\ \mathbb{R}^{M\times N} \to \mathbb{R}^{U\times V}, \tag{1}$$

$$X \mapsto A\,X\,B. \tag{2}$$

This is a composition of two linear mappings $f = f_B \circ f_A$, with $f_A(X) = AX$ for $A \in \mathbb{R}^{U\times M}$ and $f_B(Z) = ZB$ for $B \in \mathbb{R}^{N\times V}$. As a side note, the linear mapping as described in (2) can be reformulated using the Kronecker product. While this is a convenient approach when solving linear systems, it is not advisable for the implementation in a neural network. The resulting block matrices would have more parameters than the original ones, which would unnecessarily increase computational complexity. More details on the Kronecker product and why we chose not to use it are provided in Appendix D.2. Besides this general linear mapping, in this paper we look at two specific mappings, the Discrete Fourier Transformation (DFT) and the Discrete Cosine Transformation (DCT). These are commonly considered spectral transformations, as they transform a spatial representation into a spectral (frequency) representation.
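The parameter argument against the Kronecker formulation can be illustrated with a small sketch. It assumes the two-sided mapping $X \mapsto AXB$ with $A$ of size $U \times M$ and $B$ of size $N \times V$ (these symbols and the toy dimensions are assumptions made for illustration), and uses the standard identity that, for row-wise flattening, $AXB$ equals $(A \otimes B^{\top})$ applied to the flattened $X$. It is plain Python, not the implementation of the released package.

```python
import random


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    return [list(col) for col in zip(*P)]


def kron(P, Q):
    """Kronecker product of P (a x b) and Q (c x d), an (a*c) x (b*d) matrix."""
    return [[p * q for p in p_row for q in q_row] for p_row in P for q_row in Q]


# Toy dimensions (assumed for illustration): X is M x N, the output is U x V.
M, N, U, V = 4, 5, 3, 2
rng = random.Random(0)
A = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(U)]
B = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(N)]
X = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]

# Two small matrix multiplications.
Y = matmul(matmul(A, X), B)

# Equivalent single multiplication on the row-wise flattened input.
K = kron(A, transpose(B))
x_flat = [x for row in X for x in row]
y_flat = [sum(k * x for k, x in zip(k_row, x_flat)) for k_row in K]

max_err = max(abs(a - b) for a, b in zip(y_flat, [y for row in Y for y in row]))
params_two_matrices = U * M + N * V      # entries of A plus entries of B
params_kronecker = (U * V) * (M * N)     # entries of the Kronecker matrix

print(max_err, params_two_matrices, params_kronecker)
```

Both routes produce the same output, but the Kronecker matrix stores $UV \cdot MN$ entries where the two factors store only $UM + NV$, and the gap grows quickly with the size of the feature map.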

For a real-valued image $X \in \mathbb{R}^{M\times N}$, the Discrete Fourier Transformation (DFT) can be written as:

$$Y_{u,v} = \frac{1}{\sqrt{MN}} \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} X_{m,n}\, e^{-2\pi i\left(\frac{um}{M}+\frac{vn}{N}\right)} \tag{3}$$

$$\qquad = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} (W_M)_{u,m}\, X_{m,n}\, (W_N)_{n,v} \tag{4}$$

for $u = 0,\dots,U-1$ and $v = 0,\dots,V-1$. The 2D-DFT of $X$ is $Y = W_M X W_N$. The variables $W_M$ and $W_N$ are the Fourier transformation matrices, with entries $(W_M)_{u,m} = \frac{1}{\sqrt{M}}\, e^{-2\pi i\, um/M}$ and $(W_N)_{n,v} = \frac{1}{\sqrt{N}}\, e^{-2\pi i\, vn/N}$. The parameters $U$ and $V$ denote the number of vertical and horizontal frequency components by which the input image is represented. Both $W_M$ and $W_N$ are unitary matrices, making the two matrix multiplications unitary transformations, i.e., the matrix is only rotated and possibly mirrored, but not scaled. While the DFT matrices have Vandermonde structure, this will not be exploited, in order to avoid complex parameters and operations. To ensure real parameters, the real and imaginary parts are separated and computed individually using Euler’s formula $e^{i\varphi} = \cos\varphi + i \sin\varphi$. In our implementation, every complex parameter $z = a + ib$ is treated as a real-valued vector $(a, b)$, with $a$ denoting the real part and $b$ the imaginary part. This doubles the number of parameters for the DFT implementation. More details on the implementation are given in the supplementary material in Appendix A. For a real-valued input signal, the Fourier transform has important symmetry properties. These and their implications for the transformation are discussed in [27] and in Appendix B. The back transformation of the DFT is implemented in the same manner as the forward transformation [1]. More details on it are given in Appendix C.
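The separation into real and imaginary parts can be sketched as follows. The example assumes the unitary normalization $1/\sqrt{M}$ per matrix and square transforms (as many frequencies as pixels); the function names are hypothetical and the code is a plain-Python illustration, not the authors' PyTorch module. It builds the cosine and sine parts of the DFT matrix, applies the 2D-DFT to a real image using only real-valued products, and compares the result with a direct complex evaluation.

```python
import cmath
import math


def dft_parts(n):
    """Real and imaginary parts of the unitary n x n DFT matrix (Euler's formula)."""
    scale = 1.0 / math.sqrt(n)
    re = [[scale * math.cos(2 * math.pi * u * m / n) for m in range(n)] for u in range(n)]
    im = [[-scale * math.sin(2 * math.pi * u * m / n) for m in range(n)] for u in range(n)]
    return re, im


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def add(P, Q, sign=1.0):
    return [[p + sign * q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]


def dft2_real_arithmetic(X):
    """2D-DFT of a real image, computed with real-valued matrices only."""
    re_m, im_m = dft_parts(len(X))
    re_n, im_n = dft_parts(len(X[0]))
    # (Re_M + i Im_M) X (Re_N + i Im_N), expanded into real and imaginary parts.
    y_re = add(matmul(matmul(re_m, X), re_n), matmul(matmul(im_m, X), im_n), sign=-1.0)
    y_im = add(matmul(matmul(re_m, X), im_n), matmul(matmul(im_m, X), re_n))
    return y_re, y_im


def dft2_reference(X):
    """Direct evaluation of the double sum with complex exponentials."""
    M, N = len(X), len(X[0])
    return [[sum(X[m][n] * cmath.exp(-2j * math.pi * (u * m / M + v * n / N))
                 for m in range(M) for n in range(N)) / math.sqrt(M * N)
             for v in range(N)] for u in range(M)]


X = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]
y_re, y_im = dft2_real_arithmetic(X)
ref = dft2_reference(X)
max_err = max(abs(complex(y_re[u][v], y_im[u][v]) - ref[u][v])
              for u in range(len(X)) for v in range(len(X[0])))

# Unitarity: the DFT matrix is symmetric, so the real part of W W^H is
# Re*Re + Im*Im, which should equal the identity.
re4, im4 = dft_parts(4)
gram = add(matmul(re4, re4), matmul(im4, im4))
unitarity_err = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
                    for i in range(4) for j in range(4))
print(max_err, unitarity_err)
```

Keeping the two parts as separate real matrices is what doubles the number of trainable parameters relative to a real-valued transform of the same size.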

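The 2D-DCT II of an image $X \in \mathbb{R}^{M\times N}$ can be written as:

$$Y_{u,v} = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} \alpha_u\, \beta_v\, X_{m,n}\, \cos\!\left[\frac{\pi}{M}\left(m+\tfrac{1}{2}\right)u\right] \cos\!\left[\frac{\pi}{N}\left(n+\tfrac{1}{2}\right)v\right] \tag{5}$$

$$\qquad = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} (C_M)_{u,m}\, X_{m,n}\, (C_N)_{n,v} \tag{6}$$

for $u = 0,\dots,U-1$ and $v = 0,\dots,V-1$, with the scaling factors $\alpha_0 = \sqrt{1/M}$, $\alpha_u = \sqrt{2/M}$ for $u > 0$, and $\beta_0 = \sqrt{1/N}$, $\beta_v = \sqrt{2/N}$ for $v > 0$. This transformation is similar to the DFT, but the resulting signal is real. The two variables $C_M$ and $C_N$, with entries $(C_M)_{u,m} = \alpha_u \cos\!\left[\frac{\pi}{M}\left(m+\tfrac{1}{2}\right)u\right]$ and $(C_N)_{n,v} = \beta_v \cos\!\left[\frac{\pi}{N}\left(n+\tfrac{1}{2}\right)v\right]$, are the orthogonal cosine transformation matrices, i.e., the matrix is rotated and possibly mirrored, but not scaled. As with the DFT, the parameters $U$ and $V$ denote the number of frequencies used in the transformation to represent the input image in the frequency space. Again, the back transformation is implemented in the same manner as the forward transformation. More details on it are in Appendix C. (There are four common variants of the DCT. While all have advantages and disadvantages, the DCT II is the most commonly used [1].)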
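As a minimal sketch of a DCT initialization, the following plain-Python example builds an orthonormal DCT-II matrix, applies it from both sides to a small image, and inverts the transform by multiplying with the transposes. The orthonormal scaling and the helper names are assumptions made for this illustration; the released package may construct its matrices differently.

```python
import math


def dct2_matrix(n):
    """Orthonormal DCT-II matrix C with C[u][m] = a_u * cos(pi / n * (m + 1/2) * u)."""
    rows = []
    for u in range(n):
        a_u = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([a_u * math.cos(math.pi / n * (m + 0.5) * u) for m in range(n)])
    return rows


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    return [list(col) for col in zip(*P)]


X = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]
C_m, C_n = dct2_matrix(len(X)), dct2_matrix(len(X[0]))

# Forward 2D-DCT: left factor acts on rows, transposed right factor on columns.
Y = matmul(matmul(C_m, X), transpose(C_n))

# Because both matrices are orthogonal, the inverse uses the transposes.
X_back = matmul(matmul(transpose(C_m), Y), C_n)

gram = matmul(C_n, transpose(C_n))
orthogonality_err = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
                        for i in range(3) for j in range(3))
reconstruction_err = max(abs(a - b) for ra, rb in zip(X, X_back) for a, b in zip(ra, rb))
print(orthogonality_err, reconstruction_err)
```

The lowest-frequency coefficient `Y[0][0]` is the sum of all pixels divided by $\sqrt{MN}$, which gives a quick sanity check on the scaling.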

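Both the 2D-DFT and the 2D-DCT are linear mappings as defined in (1) and (2), mapping from $\mathbb{R}^{M\times N}$ to $\mathbb{R}^{U\times V}$ (DCT) or $\mathbb{C}^{U\times V}$ (DFT), respectively. As they are linear mappings, the gradient of this composition of matrix-matrix multiplications is the product of the transformation matrices [27]. For $Y = AXB$ and a scalar loss $\mathcal{L}$:

$$\frac{\partial \mathcal{L}}{\partial X} = A^{\top}\, \frac{\partial \mathcal{L}}{\partial Y}\, B^{\top} \tag{7}$$

There is a strong similarity between this linear mapping as depicted in (7) and the classic linear layer of the MLP that is defined for a vector input $x \in \mathbb{R}^{N}$ and is given as $y = \sigma(Wx + b)$, with $\sigma$ being the activation function. A short overview of the similarities and differences is given in Appendix D.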
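The gradient of the two-sided product can be checked numerically. The sketch below assumes a real-valued mapping $Y = AXB$ and a toy loss $\mathcal{L} = \tfrac{1}{2}\sum Y_{u,v}^2$ (both chosen only for illustration), for which the chain rule gives $\partial\mathcal{L}/\partial X = A^{\top} Y B^{\top}$, and compares this against central finite differences.

```python
import random


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    return [list(col) for col in zip(*P)]


def loss(A, X, B):
    """Toy loss: half the sum of squared outputs of Y = A X B."""
    Y = matmul(matmul(A, X), B)
    return 0.5 * sum(y * y for row in Y for y in row)


rng = random.Random(1)
U, M, N, V = 3, 4, 5, 2   # assumed toy dimensions
A = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(U)]
B = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(N)]
X = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]

# Analytic gradient: for this loss dL/dY = Y, hence dL/dX = A^T Y B^T.
Y = matmul(matmul(A, X), B)
grad_X = matmul(matmul(transpose(A), Y), transpose(B))

# Central finite differences, one entry of X at a time.
eps = 1e-5
max_err = 0.0
for m in range(M):
    for n in range(N):
        original = X[m][n]
        X[m][n] = original + eps
        up = loss(A, X, B)
        X[m][n] = original - eps
        down = loss(A, X, B)
        X[m][n] = original
        numeric = (up - down) / (2 * eps)
        max_err = max(max_err, abs(numeric - grad_X[m][n]))
print(max_err)
```

In an auto-differentiation framework this backward pass is derived automatically; the check only confirms that no operation beyond two further matrix products is involved.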

## 4 Experimental Setting

### 4.1 Task and Data

We consider the image classification task. Given an input, we produce a single output label which corresponds to what is contained in the image. To assess the generality of our findings, we perform this on images from different source domains, i.e., images from histologic sections, dermatoscopic images, high-resolution scans of historical documents and natural images. The choice of datasets is deliberate. We believe that to prove the effectiveness of our method it is necessary to study realistic problems. Therefore, the chosen datasets include images from sensitive domains such as medical imaging, where the application of spectral analysis has proven useful in the past, and signatures, where it is common to find datasets of limited size due to the difficulties in collecting and annotating large amounts of data. Additionally, the choice is steered by the strengths of the matrix-layer approach, which include its ability to handle large input dimensions with a reasonable number of trainable parameters (see Appendix D). Specifically, we use the following public datasets.

The Colorectal Cancer Histology (ColorectalHist) dataset [17] is an image collection of specimens from histologic sections that depict eight mutually exclusive tissue types.

The Human Against Machine (HAM10000) dataset [36] is composed of pigmented skin lesion images, which include a representative collection of all (seven) important diagnostic categories in the realm of pigmented lesions.

The CB55 is a manuscript extract from the DIVA-HisDB dataset [32], which consists of challenging medieval manuscripts, precisely annotated for the evaluation of several Document Image Analysis (DIA) tasks. The task consists of distinguishing the four classes: main text, comment, decoration and background. Given the intractably large input size of every page (high-resolution scans at 24MP), in this work we selected fixed-size square patches and assigned them the label of their central pixel, thus creating a classification dataset.

The Flowers Recognition dataset [4] contains images of five different types of flowers. The data collection is based on Flickr, Google images and Yandex images.

The ImageNet dataset [7] is a well-known dataset used in several computer vision tasks. In this work we used the Large Scale Visual Recognition Challenge 2016 (ILSVRC2016) subset, which contains objects belonging to different categories.

### 4.2 Convolutional Architecture

In order to operate on image-based datasets, we designed a CNN model, shown in Figure 1. The core structure consists of 3 module-block layers, followed by a Global Average Pooling (GAP) layer and concluded by a fully connected classification layer. There are two types of modules: the baseline and the matrix transformation ones. The baseline module is composed of a convolution layer followed by a Leaky ReLU activation function and is shown in the middle of Figure 1. The matrix transformation module is similar to the baseline module with the addition of a matrix transformation right after the convolutional layer, as shown on the right of Figure 1. This way, a linear mapping, as defined in equation (2), is performed on the feature space and then subjected to the activation function, as is common practice with neural networks. The receptive fields of the networks cover the whole image; thus, spectral transforms can bring benefits with regard to global structure.

In order to have comparable results, regardless of the configuration, each model has the same number of parameters (within a tolerance), which is roughly 135K. Details on the number of parameters in each network are provided in Appendix E.
The exact numbers are provided together with the source code, such that full reproducibility is guaranteed (details in Section 4.4).
Our goal is to investigate the effectiveness and applicability of general and special matrix transformations and not to set new state-of-the-art results.
To that end, we rely on exact control with regard to the number of parameters and protocols in the respective architectures.
In fact, the architecture is relatively small compared with state-of-the-art models, as our main goal is investigating the effectiveness of matrix layers; a comparison with state-of-the-art models is still given in the experiments.

The reason behind our custom architecture choice is control. Using only complex high-end architectures would harm the scientific methodology by introducing unfair comparisons and preventing the measurement of the effect of our novel contribution in isolation. With only 3 layers and no additional features (such as batch normalization or skip connections), we have more control and can thus assess whether the observed behaviour is affected by the initialization of our matrix transformations. Inserting our layers into an existing, very complex network would hinder the reliability of causal conclusions, since the performance boost could be a byproduct of some other hidden effect. Finally, the experimental protocol features a fixed parameter budget, which is nearly impossible to achieve without altering the original “host” network architecture.

### 4.3 Training Parameters and Optimization

In this work, we run experiments with two different configurations (ways of arranging the modules). One, named “Single”, applies only one matrix transform (after the first convolution operation), i.e., one matrix module is followed by two baseline modules. The other, named “Multi”, employs multiple transforms, one after each convolutional layer, i.e., three matrix modules are used. In both configurations the matrices are initialized with either the random normalized initialization (RND) [12], the DFT or the DCT. For better interpretability, the “Multi” configurations with spectrally initialized transforms employ the respective inverse transform in the second (middle) layer. Table 1 gives an overview of all models we use.
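The arrangement of modules in the “Single” and “Multi” configurations can be written down as a small helper. This is a sketch only: the function name and the string labels (`"conv"`, `"matrix[DCT]"`, and so on) are hypothetical placeholders for the actual layers and are not the API of the released package.

```python
def build_modules(init, configuration):
    """Describe the layer sequence of one model.

    init: "RND", "DFT" or "DCT"; configuration: "Single" or "Multi".
    Returns three module blocks followed by pooling and the classifier.
    """
    inverse = {"DFT": "iDFT", "DCT": "iDCT"}
    if configuration == "Single":
        # One matrix module, followed by two baseline modules.
        transforms = [init, None, None]
    elif configuration == "Multi":
        # Spectral variants use the inverse transform in the middle module;
        # RND has no inverse counterpart and is used in all three.
        transforms = [init, inverse.get(init, init), init]
    else:
        raise ValueError("configuration must be 'Single' or 'Multi'")

    modules = []
    for transform in transforms:
        block = ["conv"]
        if transform is not None:
            block.append("matrix[" + transform + "]")
        block.append("leaky_relu")
        modules.append(block)
    return modules + [["global_average_pooling"], ["fully_connected"]]


for name in ("RND", "DFT", "DCT"):
    for configuration in ("Single", "Multi"):
        print(name, configuration, build_modules(name, configuration))
```

Enumerating the three initializations over the two configurations yields the six matrix-transform models compared in the experiments.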

In all the experiments we use Stochastic Gradient Descent (SGD) with 0.9 momentum, cross entropy loss as loss function and L2 regularization. All other training details are provided along with the open-sourced code (Section 4.4).

The choice of hyper-parameters is very important, since we are comparing different initialization methods. We use a black-box hyper-parameter optimization solution which automates model tuning to select the best values for each model we run experiments with [31]. The results reported in Table 1 are selected as the mean and standard deviation of 20 independent runs, using the best hyper-parameters found after 30 optimization iterations - for each model and dataset separately.

### 4.4 Reproducibility

## 5 Results and Analysis

Table 1: Accuracy in % (mean ± standard deviation) of all models on all datasets.

#Param | Model | Col. Hist | HAM10000 | CB55 | Flowers | ImageNet | GPDS
---|---|---|---|---|---|---|---
Baselines | | | | | | |
132K | BaselineConv | 83.2 ± 6.5 | 61.4 ± 2.7 | 90.0 ± 0.3 | 65.1 ± 2.2 | 26.9 ± 0.1 | 82.4 ± 3.2
141K | BaselineDeep | 82.1 ± 4.5 | 58.4 ± 3.5 | 90.2 ± 0.5 | 63.4 ± 2.6 | 28.9 ± 0.3 | 81.3 ± 2.9
Matrix Transform | | | | | | |
137K | RND Single | 86.2 ± 1.5 | 64.5 ± 2.5 | 97.3 ± 0.4 | 67.1 ± 2.5 | 27.2 ± 0.1 | 77.7 ± 8.2
138K | RND Multi | 84.7 ± 2.1 | 73.4 ± 0.9 | 97.3 ± 0.3 | 60.0 ± 9.5 | 24.9 ± 1.0 | 83.2 ± 2.3
Spectral | | | | | | |
131K | DFT Single | 87.0 ± 1.9 | 63.4 ± 3.0 | 97.6 ± 0.4 | 68.1 ± 3 | 23.7 ± 1.3 | 66.6 ± 4.1
135K | DFT Multi | 90.1 ± 2 | 70.8 ± 4.7 | 97.7 ± 0.2 | 69.7 ± 3 | 22.0 ± 0.6 | 73.8 ± 6.1
137K | DCT Single | 83.9 ± 1.3 | 66.4 ± 3.4 | 97.2 ± 0.3 | 66.7 ± 4.7 | 26.0 ± 0.4 | 81.8 ± 2.8
138K | DCT Multi | 88.1 ± 1 | 74.1 ± 0.7 | 97.7 ± 0.2 | 70.0 ± 1.8 | 22.5 ± 0.3 | 86.2 ± 2.5
References | | | | | | |
57M | AlexNet [19] | 83.6 ± 3 | 66.3 ± 2.7 | 96.9 ± 0.1 | 71.1 ± 3.2 | 44.8 ± 0.1 | 92.0 ± 1.2
11M | ResNet-18 [14] | 85.6 ± 8.9 | 76.4 ± 0.6 | 97.5 ± 0.3 | 70.8 ± 4.9 | 52.0 ± 0.2 | 98.0 ± 0.3

### 5.1 Performance Comparison

The results of the different models on the datasets are shown in Table 1.
Across all datasets, the models with added matrix transformations perform better overall than the baseline model, with a mean gain of 2.2% accuracy.
However, the performances differ greatly across specific datasets and initialization techniques.
The performance gains range from an average of 7.5% on CB55 to 1.8% on Flowers, with a single peak of 12.7% on HAM10000.
The performance loss spans from an average of -2.5% on ImageNet to -4.1% on GPDS, with a single peak of -15.8% on GPDS.
Notably, the DFT-initialized models show relatively poor performance on GPDS, whereas the DCT outperforms the baseline. While we are unsure of the exact causes of these lower performances, we believe that the Fourier transform does not capture the particular properties of signature data (black text on white background, always centered in the image) as well as the discrete cosine transform does.
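The aggregate figures quoted above can be recomputed from the mean accuracies in Table 1. The sketch below assumes that gains are measured against BaselineConv and averaged over the six matrix-transform models; this reading reproduces the reported averages, except that GPDS comes out at roughly -4.2 rather than the quoted -4.1.

```python
# Mean accuracies (%) copied from Table 1; list positions follow DATASETS.
DATASETS = ["ColorectalHist", "HAM10000", "CB55", "Flowers", "ImageNet", "GPDS"]
BASELINE = [83.2, 61.4, 90.0, 65.1, 26.9, 82.4]  # BaselineConv
MATRIX_MODELS = {
    "RND Single": [86.2, 64.5, 97.3, 67.1, 27.2, 77.7],
    "RND Multi": [84.7, 73.4, 97.3, 60.0, 24.9, 83.2],
    "DFT Single": [87.0, 63.4, 97.6, 68.1, 23.7, 66.6],
    "DFT Multi": [90.1, 70.8, 97.7, 69.7, 22.0, 73.8],
    "DCT Single": [83.9, 66.4, 97.2, 66.7, 26.0, 81.8],
    "DCT Multi": [88.1, 74.1, 97.7, 70.0, 22.5, 86.2],
}

# Average gain over the baseline, per dataset.
gain_per_dataset = {}
for i, dataset in enumerate(DATASETS):
    diffs = [acc[i] - BASELINE[i] for acc in MATRIX_MODELS.values()]
    gain_per_dataset[dataset] = sum(diffs) / len(diffs)

# Mean of the per-dataset gains, plus the single largest gain and loss.
overall_gain = sum(gain_per_dataset.values()) / len(gain_per_dataset)
all_diffs = [acc[i] - BASELINE[i]
             for acc in MATRIX_MODELS.values() for i in range(len(DATASETS))]
largest_gain, largest_loss = max(all_diffs), min(all_diffs)

for dataset, gain in gain_per_dataset.items():
    print(f"{dataset}: {gain:+.2f}")
print(f"overall: {overall_gain:+.2f}, peak gain: {largest_gain:+.1f}, "
      f"peak loss: {largest_loss:+.1f}")
```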

These results suggest that the structure of the dataset is essential for the choice of architectural components, including whether to add this new component. Depending on what information is relevant to the task at hand, changing the representation of the data from the spatial to the frequency domain is more or less beneficial. Evidence suggests that datasets where the relevant classification information is uniformly distributed over the image, i.e., tissue classification and other pattern-heavy problems, profit more from our approach than datasets where the relevant information is concentrated in a few closely spaced pixels. Moreover, datasets that exhibit repeating structures or patterns seem to benefit the most from the spectral layer. This hypothesis is supported by empirical evidence, e.g., on ColorectalHist, where the spectrally initialized network (DFT Multi) surpasses the reference networks (AlexNet and ResNet-18) by 6.5% and 4.5%, respectively. This is promising considering the large gap in terms of the number of parameters (and architectural advances) featured in these models. This trend is reinforced by the other medical dataset (HAM10000), the historical one (CB55), and, to a smaller extent, by the natural image dataset (Flowers).

In contrast, we expect datasets like ImageNet to benefit less from using the matrix transformations. In fact, the information distribution within an image of such datasets does not exhibit the aforementioned properties. For example, border pixels contribute less than more central pixels in object classification datasets because of the photographer's bias to center the subjects being captured.

### 5.2 Regularization Properties

To assess whether the additional performance is due to a regularizing effect introduced by the non-separability of the matrix multiplication operation inside a module, we run the same experiments with another baseline model (referred to as “BaselineDeep” in Table 1), which has an additional convolutional layer with subsequent non-linearity. The results, however, indicate that this is not the case. It is thus reasonable to assume that the introduced transformations themselves have a beneficial effect on convergence behavior (Figure 2) and accuracy (Table 1). The addition of these linear transformations allows our simple three-layer model to match and often outperform the AlexNet and ResNet-18 models, which possess orders of magnitude more parameters (420 and 81 times more, respectively) and improvements such as batch-normalization layers or skip connections.
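The quoted parameter ratios follow from the rounded parameter counts in Table 1 and the budget of roughly 135K parameters for the matrix-transform models. A short check, using only these rounded values:

```python
# Rounded parameter counts from Table 1 and the approximate 135K budget.
reference_params = {"AlexNet": 57_000_000, "ResNet-18": 11_000_000}
matrix_model_params = 135_000

ratios = {name: count / matrix_model_params
          for name, count in reference_params.items()}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} times more parameters")
```

With these inputs the ratios are about 422 and 81, in line with the figures in the text.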

### 5.3 Analysis of the Learned Transformations

While the observed convergence speedups of the DFT and DCT initialized transforms were expected and are in line with similar observations from other literature [27, 11], the considerable classification performance gains came as a surprise. We suspect that the transforms, and especially the fact that they are trainable, can be useful, allowing for the exploitation of the global input domain-specific structure. The potentially more meaningful and sparser representation produced by the DCT and DFT initialized transforms might also lead to smoother error surfaces, which are easier to optimize and decrease the chance of becoming stuck in a local optimum. To investigate further, in Figure 3 we visualize the trained weights of the first transformation matrix with regard to the different weight initializations (DCT, DFT, RND). All three transforms are trained on the CB55 dataset, which features historical document classification based on the central pixel (thus, the central pixel decides on the class). The vertical pattern emerging in all three matrices shows that the transform is adjusted to this particular, globally present classification scenario. Moreover, in Figure 4, we visualize the effect that the first matrix module block has on the input patch (a): before (b) and after (c) training on CB55. It seems that the weights of the higher mixed frequencies become smaller in magnitude, which is supported by the darker corners in (c) as well as by the negative average of the normalized difference between before and after training (d). Further, visually, there is a central pattern, which becomes more prominent after training (b, c). This is, in our opinion, a reasonable effect, as we are operating on text input data.

### 5.4 Single vs Multiple Matrix Transforms

Adding multiple transforms (one after each convolutional operation) leads to significant performance gains, especially for the spectral initialization, as shown in Table 1. In fact, the “Multi” configurations with spectral initialization (DFT/DCT) generally perform better than their “Single” counterparts. The margin varies from almost 10% on HAM10000 to a very marginal one on CB55 and is even negative on ImageNet. These results are aligned with the work done in [24], where a combination of specially structured linear layers, featuring both a forward and backward Cosine transformation, is shown to perform well. In the same manner, our “Multi” configuration seems to be able to leverage these “forth-and-back” transformations to some extent. Allowing the network to work on the data both in the normal and spectral domains seems to be highly beneficial.

In the case of the random initialization, there are two cases in which the gain is negative, specifically on ColorectalHist and Flowers. This is, to some extent, surprising because RND Multi outperforms RND Single by roughly 9% on HAM10000, and exhibits similar behavior to the spectral counterpart on CB55. Moreover, on closer inspection, we observe that the convergence rate of RND Multi is appreciably slower than that of its single counterpart on the Flowers dataset, as can be seen in subfigures (d) and (a). Our tentative explanation is that the larger number of dissonant degrees of freedom in the later stages of the network (as opposed to the spectrally initialized ones) dampens the magnitude of the relative error, which gets back-propagated to the lower layers. We are, however, unsure why this phenomenon is not observed with the same strength on all datasets.

### 5.5 Comparison of Different Initialization

The classification performance of differently initialized matrix transformations on the feature maps of CNNs appears to be mostly dataset dependent. For example, DFT initialization performs better on ColorectalHist but loses against the DCT initialized counterpart on HAM10000. On CB55 and Flowers, both spectral initializations show similar results. However, it seems that the dataset CB55 exhibits a common ground around 97% accuracy, as all models but the baselines achieve a similar result. Regarding the spectral vs. random initialization, overall, the two unitary spectral transforms (DCT and DFT) do perform better, with an average gap of 1.6% between the best spectral model and the best RND model. Even so, the main difference can be seen in convergence behavior. Here, models containing the randomly initialized transform tend to converge significantly more slowly, as can be seen in Figure 2.
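The 1.6% gap can be reproduced from the mean accuracies in Table 1 under one plausible reading: for each dataset, take the best spectral model and the best RND model, and average the differences over the six datasets. This reading is an assumption; the text does not spell out the exact procedure.

```python
# Mean accuracies (%) from Table 1, in the order
# ColorectalHist, HAM10000, CB55, Flowers, ImageNet, GPDS.
RND = {
    "RND Single": [86.2, 64.5, 97.3, 67.1, 27.2, 77.7],
    "RND Multi": [84.7, 73.4, 97.3, 60.0, 24.9, 83.2],
}
SPECTRAL = {
    "DFT Single": [87.0, 63.4, 97.6, 68.1, 23.7, 66.6],
    "DFT Multi": [90.1, 70.8, 97.7, 69.7, 22.0, 73.8],
    "DCT Single": [83.9, 66.4, 97.2, 66.7, 26.0, 81.8],
    "DCT Multi": [88.1, 74.1, 97.7, 70.0, 22.5, 86.2],
}

gaps = []
for i in range(6):
    best_spectral = max(acc[i] for acc in SPECTRAL.values())
    best_rnd = max(acc[i] for acc in RND.values())
    gaps.append(best_spectral - best_rnd)

average_gap = sum(gaps) / len(gaps)
print([round(g, 1) for g in gaps], round(average_gap, 2))
```

Under this reading, ImageNet is the only dataset on which the best RND model is ahead of the best spectral one.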

### 5.6 Limitations and Outlook

The obtained results are very promising and the final framework is easy to implement and use. The most prominent drawback of using the proposed transformation seems to be the need to assess whether or not the dataset shows patterns that are suitable to be highlighted by such transformations. While this can generally be estimated by analyzing the visual appearance of images, a more analytical tool to support this decision would be very helpful. A promising direction would be the approach presented in [38]. As discussed in Section 5.1, datasets with uniformly distributed classification information, such as tissue, profit from spectral transformations. When measuring the occlusion sensitivity of such images in a trained CNN, we would expect a high probability of correct classification, regardless of which part of the image is covered. On the other hand, for datasets like ImageNet the outcome depends more strongly on which part of the image is occluded, as shown in [38]. This coincides with the datasets our network structures do not perform well on. Occlusion sensitivity might therefore serve as a measure for identifying well-suited or ill-suited datasets.
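The proposed diagnostic can be sketched in a few lines. The example slides an occluding square over an image and records the score of the correct class at each position; the spread of these scores indicates how strongly the prediction depends on location. The two `predict` functions are hypothetical toy stand-ins for a trained CNN (one using the whole image, one using only the center) and the helper names are assumptions for illustration.

```python
def occlusion_sensitivity(image, predict, patch, fill=0.0):
    """Score of the correct class for each position of an occluding square.

    image: 2D list of floats; predict: callable mapping an image to the score
    of the correct class; patch: side length (and stride) of the square.
    """
    height, width = len(image), len(image[0])
    heatmap = []
    for top in range(0, height - patch + 1, patch):
        row = []
        for left in range(0, width - patch + 1, patch):
            occluded = [list(r) for r in image]  # copy, then blank one square
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    occluded[y][x] = fill
            row.append(predict(occluded))
        heatmap.append(row)
    return heatmap


def spread(heatmap):
    """Difference between the highest and lowest score in the heatmap."""
    values = [v for row in heatmap for v in row]
    return max(values) - min(values)


def predict_uniform(img):
    # Toy stand-in: evidence is spread evenly over the whole image.
    values = [v for row in img for v in row]
    return sum(values) / len(values)


def predict_centered(img):
    # Toy stand-in: only the central 4 x 4 region carries evidence.
    values = [v for row in img[2:6] for v in row[2:6]]
    return sum(values) / len(values)


image = [[1.0] * 8 for _ in range(8)]
spread_uniform = spread(occlusion_sensitivity(image, predict_uniform, patch=2))
spread_centered = spread(occlusion_sensitivity(image, predict_centered, patch=2))
print(spread_uniform, spread_centered)
```

A small spread corresponds to the tissue-like case, where covering any part of the image changes the score equally; a large spread corresponds to the object-centered case.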

In the case of the spectrally initialized transforms, it could also be promising to adapt spectral pooling operations [27], potentially further exploiting the more compact spectral representations.

In this work, we constrained our investigations to the image domain, although this framework could be applied to higher-dimensional data as well. Further research should also focus on the impact of using spectrally initialized matrix transforms in neural networks applied to tasks such as video or medical tomography data analysis. Overall, our results suggest that transforms, such as the spectral ones, are an often neglected part of image classification tasks. It is commonly assumed that CNNs do not need the input to be transformed, probably because standard benchmark datasets reflect tasks where the original spatial representation is already ideal or close to it. However, as we found, less mainstream image domains can benefit from transforms, even more so if the transforms are trainable.

## 6 Conclusion

This work introduces trainable matrix transformations that are applied on feature maps and can be initialized with two unitary transforms, namely the DCT and DFT (and their respective inversions). Experiments on six challenging datasets show that the transforms can exploit global input domain structure, resulting in significantly better results when compared to a similar baseline model, with an average performance gain of 2.2% accuracy. We further show that the spectral initialization as DCT or DFT brings substantial speedups in terms of convergence, when compared to random initialization.

All of the source code, documentation, and experimental setups are realized in the DeepDIVA framework and made publicly available in GitHub repositories. The methods proposed in this paper lay out the basis for the realization of trainable spectral transformations and can be extended to higher-dimensional data as well.

## Acknowledgment

The work in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number _.

## References

- [1] (1974) Discrete cosine transform. IEEE transactions on Computers 100 (1), pp. 90–93. Cited by: §3, footnote 3.
- [2] (2018-08) DeepDIVA: A Highly-Functional Python Framework for Reproducible Experiments. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, USA. External Links: 1805.00329 Cited by: §1, §4.4.
- [3] (2017) Historical Document Image Segmentation with LDA-Initialized Deep Neural Networks. In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, pp. 95–100. Cited by: §2.
- [4] Flowers Recognition. Note: https://www.kaggle.com/alxmamaev/flowers-recognition (Accessed: 18-05-2019). Cited by: §4.1.
- [5] (2014) Deep scattering spectrum. IEEE Transactions on Signal Processing 62 (16), pp. 4114–4128. Cited by: §2.
- [6] (1974) An experimental automatic word recognition system. JSRU Report 1003 (5), pp. 33. Cited by: §1.
- [7] (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.1.
- [8] (2015-03) Static Signature Synthesis: A Neuromotor Inspired Approach for Biometrics. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 667–680. External Links: ISSN 0162-8828 Cited by: §4.1.
- [9] (1980) Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pp. 267–285. Cited by: §1.
- [10] (2013) Fourier-transform infrared spectroscopy coupled with a classification machine for the analysis of blood plasma or serum: a novel diagnostic approach for ovarian cancer. Analyst 138 (14), pp. 3917–3926. Cited by: §2.
- [11] (2016) Deep feature extraction in the DCT domain. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pp. 3536–3541. Cited by: §1, §2, §5.3.
- [12] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §1, §4.3.
- [13] (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §2.
- [14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1.
- [15] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. Cited by: §2.
- [16] (2015) Spatial transformer networks. CoRR abs/1506.02025. External Links: Link, 1506.02025 Cited by: §1, §2.
- [17] (2016) Multi-class texture analysis in colorectal cancer histology. Scientific reports 6, pp. 27988. Cited by: §4.1.
- [18] (2013) The joint optimization of spectro-temporal features and neural net classifiers. In Text, Speech, and Dialogue, I. Habernal and V. Matoušek (Eds.), Berlin, Heidelberg, pp. 552–559. Cited by: §2.
- [19] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §2, Table 1.
- [20] (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §1.
- [21] (1990) Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pp. 396–404. Cited by: §1.
- [22] (2019) Combining Graph Edit Distance and Triplet Networks for Offline Signature Verification. Pattern Recognition Letters 125, pp. 527–533. External Links: Document Cited by: §4.1.
- [23] (1996) Breast cancer detection by fourier transform infrared spectrometry. Vibrational spectroscopy 10 (2), pp. 341–346. Cited by: §2.
- [24] (2016-03-19) ACDC: a structured efficient linear layer. External Links: Link, 1511.05946 Cited by: §2, §5.4.
- [25] (2018-05) Initialization of convolutional neural networks by Gabor filters. In 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4. External Links: Document Cited by: §2.
- [26] (2010) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §2.
- [27] (2015) Spectral representations for convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 2449–2457. Cited by: Appendix B, §1, §2, §3, §3, §5.3, §5.6.
- [28] (1990) The multilayer perceptron as an approximation to a bayes optimal discriminant function. IEEE Transactions on Neural Networks 1 (4), pp. 296–298. Cited by: §1.
- [29] (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
- [30] (2017) PCA-initialized deep neural networks applied to document image analysis. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, Vol. 1, pp. 877–882. Cited by: §2.
- [31] (2014) SigOpt reference manual. External Links: Link Cited by: §4.3.
- [32] (2016) DIVA-HISDB: a precisely annotated large dataset of challenging medieval manuscripts. In Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on, pp. 471–476. Cited by: §4.1.
- [33] (2015) Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 3088–3096. External Links: Link Cited by: §2.
- [34] (2013) Brain tumor classification using discrete cosine transform and probabilistic neural network. In Signal Processing Image Processing & Pattern Recognition (ICSIPR), 2013 International Conference on, pp. 92–96. Cited by: §2.
- [35] (1999) Factor analysis of cancer fourier transform infrared evanescent wave fiberoptical (ftir-few) spectra. Lasers in Surgery and Medicine: The Official Journal of the American Society for Laser Medicine and Surgery 24 (5), pp. 382–388. Cited by: §2.
- [36] (2018) The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. External Links: Document Cited by: §4.1.
- [37] (2016) Complex linear projection (clp): a discriminative approach to joint feature extraction and acoustic modeling. Cited by: §2.
- [38] (2014) Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Vol. 8689, pp. 818–833. External Links: ISBN 978-3-319-10589-5 978-3-319-10590-1, Link, Document Cited by: §5.6.
- [39] (1996) Digital mammography: mixed feature neural network with spectral entropy decision for detection of microcalcifications. IEEE transactions on medical imaging 15 (5), pp. 589–597. Cited by: §1, §2.

## Appendix A Layer Implementation

The Discrete Fourier Transformation (DFT) as defined in (4) returns a complex signal. As of November 2018, *PyTorch* cannot handle complex numbers directly (see https://github.com/pytorch/pytorch/issues/755).
Under the assumption that all input signals are real valued, the DFT can be separated into the real and imaginary parts of the Fourier coefficients using Euler’s formula

$$e^{i\varphi} = \cos\varphi + i\sin\varphi$$

For a given Fourier weight matrix applying Euler’s formula leads to

Here, denotes the real part and the imaginary part of . This is done for all entries of the weight matrix in (4)

The DFT can then be computed as

Effectively, the DFT has twice as many parameters as the DCT. The DFT does not contain more information than the DCT; the surplus of parameters is due to the symmetry properties of the DFT for real-valued input, as discussed in Appendix B. Our implementation can handle the DFT output in two different ways. In the first, the output consists of the real part and the imaginary part of the transformation, which are concatenated, effectively doubling the size of the feature map, i.e., the output is $[\Re(Y), \Im(Y)]$, where $Y$ denotes the complex-valued DFT output. In the second, the amplitude of the output signal is computed, as is usual in frequency analysis in signal processing, i.e., the output is $\sqrt{\Re(Y)^2 + \Im(Y)^2}$, evaluated element-wise. The resulting feature map has the same dimension as the input.

In our evaluations, the complex valued output is used.
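The separation into real and imaginary parts can be illustrated with a minimal one-dimensional sketch in plain Python. It is not the PyTorch module itself: the function names are hypothetical, the unitary scaling $1/\sqrt{N}$ is an assumption, and a single 1D signal replaces the 2D feature map for brevity. The complex weight $e^{-2\pi i kn/N}$ is stored as two real matrices (cosine and negative sine), so only real arithmetic is needed.

```python
import math


def dft_real_imag_matrices(n):
    """Real and imaginary parts of the DFT matrix W[k][j] = exp(-2*pi*i*k*j/n)/sqrt(n).

    By Euler's formula, Re(W) = cos(.)/sqrt(n) and Im(W) = -sin(.)/sqrt(n).
    The 1/sqrt(n) scaling is an assumption made to keep the transform unitary.
    """
    scale = 1.0 / math.sqrt(n)
    re_w = [[scale * math.cos(2 * math.pi * k * j / n) for j in range(n)]
            for k in range(n)]
    im_w = [[-scale * math.sin(2 * math.pi * k * j / n) for j in range(n)]
            for k in range(n)]
    return re_w, im_w


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dft_layer(x, mode="complex"):
    """Apply the real-valued DFT to a real signal x.

    mode="complex":   concatenate real and imaginary parts (length 2n).
    mode="amplitude": element-wise sqrt(Re^2 + Im^2) (length n).
    """
    re_w, im_w = dft_real_imag_matrices(len(x))
    re_y, im_y = matvec(re_w, x), matvec(im_w, x)
    if mode == "complex":
        return re_y + im_y
    if mode == "amplitude":
        return [math.hypot(a, b) for a, b in zip(re_y, im_y)]
    raise ValueError("mode must be 'complex' or 'amplitude'")
```

In a trainable layer, the two real matrices would be the parameters, which is where the factor of two relative to the DCT comes from.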

## Appendix B Redundancy of data

For a real-valued input signal , its Fourier transform has important symmetry properties. For even , and . For odd , and . Where denotes the complex conjugate of . As a result, the DFT of even (respectively odd) length real input signals are defined by the first (respectively ) entries of the transformed signal. The remaining (respectively ) entries can be computed from the existing ones. Or in other words, the degree of freedom for the DFT of an even (respectively odd) real input signal is (respectively ).
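The symmetry can be checked numerically. The sketch below uses the standard identity $X_{N-k} = \overline{X_k}$ for the DFT of a real signal of length $N$ and shows that the first $\lfloor N/2 \rfloor + 1$ coefficients determine the rest. The function names are hypothetical, and the unnormalized DFT is used for simplicity, which does not affect the symmetry.

```python
import cmath


def dft(x):
    """Unnormalized DFT of a sequence x."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]


def is_conjugate_symmetric(spectrum, tol=1e-9):
    """True if X[k] equals the complex conjugate of X[(n - k) mod n] for all k."""
    n = len(spectrum)
    return all(abs(spectrum[k] - spectrum[(n - k) % n].conjugate()) < tol
               for k in range(n))


def independent_entries(n):
    """Number of leading coefficients that determine the spectrum of a real
    signal: n/2 + 1 for even n and (n + 1)/2 for odd n."""
    return n // 2 + 1


def rebuild_spectrum(head, n):
    """Recover all n coefficients from the leading ones via conjugation."""
    return [head[k] if k < len(head) else head[n - k].conjugate()
            for k in range(n)]
```

For a complex-valued input the check fails, in line with the remark that the redundancy disappears in that case.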

If the input signal is complex valued, then there are no more redundancies, but the layer implementation as explained in Appendix A would have to be expanded by a complex valued input .

While this redundancy of parameters allows memory to be saved, it can pose a problem for the back transformation of the DFT. For a DFT signal to be back transformable to a real-valued signal, these symmetry properties need to be fulfilled. Consequently, about half of the parameters need to be fixed to meet this requirement. This is also discussed in [27]. So far, our implementation does not handle this redundancy in the unfixed layer case.

## Appendix C Back transformation

Back transformations of both DFT and DCT can be implemented in a similar manner to the forward transformation. As the DFT matrices are unitary, the definition of the inverse matrices is straightforward. The inverse of a unitary matrix $U$ is its conjugate transpose, $U^{-1} = U^{H}$. Alternatively, the definition for the backward transformation can be used to derive the back-transformation matrices

Using Euler’s formula again, these can be expanded and implemented as described in Appendix A.
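As a quick numerical check of the unitarity argument, the following plain-Python sketch builds a DFT matrix with the assumed scaling $1/\sqrt{N}$ and verifies that its conjugate transpose undoes the forward transformation. All function names are hypothetical.

```python
import cmath
import math


def unitary_dft_matrix(n):
    """DFT matrix with entries exp(-2*pi*i*k*j/n) / sqrt(n) (assumed scaling)."""
    scale = 1.0 / math.sqrt(n)
    return [[scale * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n)]
            for k in range(n)]


def conjugate_transpose(matrix):
    """Swap rows and columns and conjugate every entry."""
    rows, cols = len(matrix), len(matrix[0])
    return [[matrix[r][c].conjugate() for r in range(rows)] for c in range(cols)]


def matmul(a, b):
    """Plain matrix-matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


w = unitary_dft_matrix(5)
w_inverse = conjugate_transpose(w)

signal = [[1.0], [-2.0], [0.5], [3.0], [0.0]]          # column vector
round_trip = matmul(w_inverse, matmul(w, signal))      # should equal signal
```

With an unnormalized DFT matrix, the conjugate transpose would additionally have to be divided by $N$.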

## Appendix D Linear layers and matrices

### D.1 Linear layers in MLP

In an MLP, linear layers are represented by matrix-vector multiplications with and . The weight matrix fully connects two different layers and . There are parameters in the weight matrix .

The 2D-DFT and 2D-DCT are linear matrix-matrix multiplications as in equations (4) and (6). They consist of linear operations on matrices, usually with . There are parameters in the weight matrix for an input of dimension . If the input is flattened row first, , the weight matrix is adjusted to a sparse matrix . The new weight matrix is a block matrix

(8) | ||||

(9) |

with the identity matrix of size and the entries of the original matrix. The matrix has density , i.e., sparsity . This is comparable to a not-fully connected linear layer with shared weights.
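The block structure can be illustrated for one simple case. The sketch assumes a square $N \times N$ weight matrix $W$ with no zero entries, applied from the left to an $N \times N$ input $X$ that is flattened row first. Under these assumptions the flattened weight matrix is $W \otimes I_N$, with blocks $w_{ij} I_N$ and density $1/N$. The function names are hypothetical, and the case treated in the text may differ in its details.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def flatten_rows(matrix):
    """Row-first (row-major) flattening."""
    return [value for row in matrix for value in row]


def density(matrix):
    """Fraction of non-zero entries."""
    total = len(matrix) * len(matrix[0])
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


n = 3
w = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
x = [[1, 0, 2], [-1, 3, 1], [0, 2, -2]]

w_flat = kron(w, identity(n))                     # N^2 x N^2 block matrix
y_from_flat = matvec(w_flat, flatten_rows(x))     # acts on the flattened input
y_direct = flatten_rows(matmul(w, x))             # same result as W X
```

Only $N^3$ of the $N^4$ entries of the flattened matrix are non-zero, and they repeat the $N^2$ values of $W$, which matches the description of a not-fully connected layer with shared weights.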

### D.2 Kronecker Product

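For two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$, the Kronecker product $A \otimes B$ is a block matrix:

$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix} \tag{10}$$

with $A \otimes B \in \mathbb{R}^{mp \times nq}$. The Kronecker product can be used for a more convenient representation of certain matrix equations. Given four matrices $A$, $B$, $C$ and $X$ of appropriate dimensions, the equation $AXB = C$ can be rewritten using the Kronecker product and matrix vectorization:

$$(B^{T} \otimes A)\,\operatorname{vec}(X) = \operatorname{vec}(C) \tag{11}$$

This allows the application of Gaussian elimination or another simple-to-use algorithm to solve for $X$.

If all matrices are of size $N \times N$, the complexity of multiplying the Kronecker product matrix with $\operatorname{vec}(X)$ is $\mathcal{O}(N^4)$, while the complexity of the original matrix multiplication is $\mathcal{O}(N^3)$.
The two matrices $A$ and $B$ have a total of $2N^2$ parameters, while $A \otimes B$ has $N^4$ parameters. Rank information is maintained, $\operatorname{rank}(A \otimes B) = \operatorname{rank}(A)\operatorname{rank}(B)$.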
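The vectorization identity behind this rewriting, $\operatorname{vec}(AXB) = (B^{T} \otimes A)\operatorname{vec}(X)$ with column-wise vectorization, can be verified on a small example. The sketch below is plain Python with hypothetical helper names and integer matrices, so the comparison is exact; the matrix names $A$, $X$, $B$ are chosen for illustration.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def vec(matrix):
    """Column-wise vectorization: stack the columns on top of each other."""
    return [value for column in zip(*matrix) for value in column]


a = [[1, 2], [3, 4]]
x = [[0, 1], [2, -1]]
b = [[2, 0], [1, 3]]

lhs = vec(matmul(matmul(a, x), b))            # vec(A X B), O(N^3) work
big = kron(transpose(b), a)                   # B^T (x) A, an N^2 x N^2 matrix
rhs = matvec(big, vec(x))                     # O(N^4) work

n = len(a)
params_factors = 2 * n * n                    # parameters in A and B
params_kron = len(big) * len(big[0])          # parameters in the product
```

For $N = 2$ the two factors hold 8 parameters while the Kronecker matrix holds 16, and the gap grows as $2N^2$ versus $N^4$.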

For the application to equation (2), the Kronecker product can be used as

(12) | |||||

(13) |

A disadvantage of this method is the necessity to maintain the structure of the parameters of as described in (10). If all parameters of can be optimized independently, the original structure of can get lost and this layer will result in independent parameters instead of independent parameters. The resulting matrix will no longer represent and , i.e., it will no longer be the Kronecker product. Alternatively, if and the structure of the Kronecker product is to be maintained, there have to be additional constraints on the parameters, rendering the whole optimization even more complex. In other words, the inherent block structure described in equation (10) cannot be maintained during training, resulting in a simple fully connected MLP layer as described in section D.1, but without the sparsity. At the same time, both the computational complexity and the number of trainable parameters increase. The advantages of the Kronecker product cannot be exploited in this scenario.

## Appendix E Computing the Number of Network Parameters

In order to operate on image-based datasets, we design a CNN model, shown in Figure 1.
The core structure consists of three module blocks (explained in the next sections), followed by global average pooling (GAP) and concluded by a fully connected classification layer.
In order to have comparable results, regardless of the configuration, each model has the same number of parameters (within a tolerance), which is roughly 135k.
Because of this, the size of the feature map or the number of filters for a specific convolution layer may vary across different configurations.
The exact numbers are provided together with the source code, such that full reproducibility is guaranteed (details in Section 4.4).

### E.1 Baseline Module

The baseline module is composed of a convolution layer followed by a Leaky ReLU activation function and is shown in Figure 1. The number of parameters in this module is computed as , where and are the convolution filter sizes, is the number of convolutional filters and the input depth.

### E.2 Regular Module

The regular module is similar to the baseline module with the addition of a spectral operation right after the convolutional layer, as shown in Figure 1. This way, a spectral transformation is performed on the feature space, and its output is then passed through the activation function, as is common practice with neural networks. The number of parameters of this module is the sum of parameters of the convolutional layer and the spectral block. For the convolutional layer the procedure is identical to that shown for the baseline module, whereas for the spectral block the number of parameters is , with and being the width and height of the feature map after the convolutional layer, and is the number of frequencies examined by the spectral transformation. In our experiments and are set to be equal to and , therefore the final number of parameters for the spectral layer can be computed as and for the 2D-DCT and 2D-DFT layer, respectively.
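A parameter count of this kind could be sketched as below. The concrete formulas are assumptions made for illustration only: the convolution is counted as filter width × filter height × number of filters × input depth, with an optional bias term, and the spectral block is assumed to use one pair of transform matrices (one per spatial axis) shared across all channels, with the DFT storing separate real and imaginary parts as in Appendix A. All function and argument names are hypothetical.

```python
def conv_params(filter_w, filter_h, n_filters, in_depth, bias=False):
    """Weights of a convolution layer; the bias term is optional (assumption)."""
    weights = filter_w * filter_h * n_filters * in_depth
    return weights + (n_filters if bias else 0)


def spectral_params(width, height, freq_w=None, freq_h=None, kind="dct"):
    """Parameters of a 2D transform with one matrix per spatial axis.

    The number of frequencies defaults to the feature map size, as in the
    experiments. The DFT doubles the count (real and imaginary parts).
    """
    freq_w = width if freq_w is None else freq_w
    freq_h = height if freq_h is None else freq_h
    real_params = width * freq_w + height * freq_h
    if kind == "dct":
        return real_params
    if kind == "dft":
        return 2 * real_params
    raise ValueError("kind must be 'dct' or 'dft'")


def regular_module_params(filter_w, filter_h, n_filters, in_depth,
                          map_w, map_h, kind="dct"):
    """Convolution followed by a spectral block; the activation has no parameters."""
    return (conv_params(filter_w, filter_h, n_filters, in_depth)
            + spectral_params(map_w, map_h, kind=kind))
```

Under these assumptions, a 32 × 32 feature map adds 2048 parameters for a DCT block and 4096 for a DFT block, which shows why the number of filters has to be adjusted to keep the total budget fixed.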

### E.3 Global Average Pooling

The Global Average Pooling has no parameters to be trained.

### E.4 Classification Layer

The final classification layer is a simple fully connected layer and its number of parameters is with and being the width and height of the feature map after the GAP, and the number of output classes on this particular dataset.