Historical Document Image Segmentation with LDA-Initialized Deep Neural Networks
In this paper, we present a novel approach for initializing the weights of deep neural networks layer-wise using Linear Discriminant Analysis (LDA). Typically, the weights of a deep neural network are initialized with random values, with greedy layer-wise pre-training (usually as a Deep Belief Network or as an auto-encoder) or by re-using the layers from another network (transfer learning). Hence, either many training epochs are needed before meaningful weights are learned, or a rather similar dataset is required for seeding the fine-tuning in transfer learning. In this paper, we describe how to turn an LDA into either a neural layer or a classification layer. We analyze the initialization technique on historical documents. First, we show that an LDA-based initialization is quick and leads to a very stable initialization. Furthermore, for the task of layout analysis at pixel level, we investigate the effectiveness of LDA-based initialization and show that it outperforms state-of-the-art random weight initialization methods.
Very Deep Neural Networks (DNN) are now widely used in machine learning for solving tasks in various domains.
Although artificial neurons have been around for a long time (mcculloch1943logical), the depth of commonly used artificial neural networks has started to increase significantly only in roughly the last 15 years (mit-book). (Deep neural architectures were proposed much earlier, but they have not been heavily used in practice (888).) This is due both to the comeback of layer-wise training methods (ballard1987modular), referred to as Deep Belief Networks (hinton2006fast) and often composed of Restricted Boltzmann Machines (smolensky1986), and to the higher computational power available to researchers.
Historical Document Image Analysis (DIA) is an example of a domain where DNN have been successfully applied recently. As historical documents can be quite diverse, simple networks with few inputs usually lead to poor results, so large networks have to be used. The diversity of the documents has several origins: different degradations (e.g., ink fading or stains), complexity and variability of the layouts, overlapping of contents, writing styles, bleed-through, etc.
Because of their success, a lot of resources have been invested into research and development of DNN. However, they still suffer from two major drawbacks. The first is that, despite the computational power of new processors and GPUs, the training of DNN still takes some time. Especially for large networks, the training time becomes a crucial issue, not only because there are more weights to use in the computations, but also because more training epochs are required for the weights to converge. The second drawback is that initializing the weights of DNN with random values implies that different networks will find different local error minima.
In our previous work (seuretalberti2017pca) we proposed to initialize a Convolutional Neural Network (CNN) layer-wise with Principal Component Analysis (PCA) instead of random initialization. We have also shown that features which are good for one task do not necessarily generalize well to other tasks (alberti2017whatYouExpectIsNOTWhatYouGet; alberti2017questioningFeatures). By extension, we argue that features obtained by maximizing the variance of the input data, which is what PCA features do, might not be the optimal ones for performing classification tasks. To this end, we investigate the performance of initializing a CNN layer-wise with a goal-oriented (supervised) algorithm such as LDA by performing layout analysis at the pixel level on historical documents.
In this paper, we present a novel initialization method based on LDA which allows us to quickly initialize the weights of a CNN layer-wise with data-based values. A neural layer can be initialized to perform either feature extraction (LDA space transform) or classification (LDA discriminants). We show that such an initialization is very stable (it leads to highly similar patterns of weights in networks initialized on different random samples from the same dataset), converges faster and reaches better performance than the same architecture initialized with random weights. Additionally, even before fine-tuning, a network initialized with LDA exhibits noticeable results on the classification task.
A brief review of the literature relevant to this work follows.
Random Neural Network Initialization
There are currently three main trends for neural network initialization: layer-wise unsupervised pre-training (ballard1987modular; hinton2006fast), transfer learning (caruana1998multitask) and random initial weights (bottou1988reconnaissance). Random initialization is fast and simple to implement. The most used approach is to initialize the weights of a neuron in $[-1/\sqrt{N}, 1/\sqrt{N}]$, where $N$ is the number of inputs of the neuron.
PCA Neural Network Initialization
In our previous work (seuretalberti2017pca) we successfully initialized a CNN layer-wise with PCA. In that work, we introduced a mathematical framework for generating a Convolutional Auto-Encoder (CAE) out of the PCA, taking into account the bias of neural layers, and provided a deep analysis of the behavior of PCA-initialized networks, both for CAE and CNN, with a focus on historical document images. Krähenbühl et al. (krahenbuhl2015data) conducted similar but independent research in which, while investigating data-dependent initialization, they used PCA matrices as the initial weights of neural layers. However, they mainly focus on K-means initialization and do not investigate PCA initialization in depth.
Linear Discriminant Analysis in Neural Networks
The link between Neural Networks and LDA has been investigated by many authors. Among them, Webb and Lowe (webb1990) have shown that the outputs of the hidden layers of multi-layer perceptrons maximize the network discriminant function, explicitly performing a non-linear transformation of the data into a space in which the classes may be more easily separated. Demir and Ozmehmet (demir2003) presented online local learning algorithms for updating LDA features incrementally using error-correcting and Hebbian learning rules. Recently, Dorfer et al. (dorfer2015) have shown how to learn linearly separable latent representations in an end-to-end fashion on a DNN. To the best of our knowledge, there have been no attempts to use LDA for direct NN initialization.
2. Mathematical Formulation
In this section we explain the general idea of the Linear Discriminant Analysis and then give the mathematical formulation for using it both as a feature extractor and as a classifier. Giving an exhaustive and complete explanation of the LDA algorithm is beyond the scope of this paper. We keep the notation and the mathematical background as simple as possible by limiting ourselves to the information essential for understanding this paper. Unless stated otherwise we use the following notation: $x_i$ is the $i$-th element of $x$.
2.1. LDA in a Nutshell
LDA seeks to reduce dimensionality while preserving as much of the class discriminatory information as possible (ricardogutierrezosuna). Assume we have a set of observations $X$ belonging to $C$ different classes. The goal of LDA is to find a linear transformation (projection) matrix $L$ that converts the set of labelled observations $X$ into another coordinate system such that the linear class separability is maximized and the variance of each class is minimized.
2.2. LDA vs PCA
Both LDA and PCA are linear transformation methods and are closely related to each other (martineza2001). However, they pursue two completely different goals (see Figure 1):
PCA: Looks for the directions (components) that maximize the variance in the dataset. It therefore does not need to consider class labels.
LDA: Looks for the directions (components) that maximize the class separation, and for this it needs class labels.
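The difference between the two goals can be illustrated with a small NumPy sketch. The toy data below is purely hypothetical (two elongated Gaussian classes, not document patches): the variance is largest along the first axis, whereas the classes are separated along the second axis. The two-class LDA direction is computed with the textbook formula $S_W^{-1}(\mu_b - \mu_a)$, which is an assumption of this sketch rather than the implementation used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two elongated classes: large variance along axis 0, class separation along axis 1.
n = 500
cov = np.array([[9.0, 0.0], [0.0, 0.25]])
class_a = rng.multivariate_normal([0.0, -1.0], cov, size=n)
class_b = rng.multivariate_normal([0.0, 1.0], cov, size=n)
X = np.vstack([class_a, class_b])

# PCA: leading eigenvector of the data covariance (class labels are ignored).
eigval, eigvec = np.linalg.eigh(np.cov(X - X.mean(axis=0), rowvar=False))
pca_dir = eigvec[:, np.argmax(eigval)]

# LDA (two-class case): direction S_W^{-1} (mu_b - mu_a), which needs the labels.
mu_a, mu_b = class_a.mean(axis=0), class_b.mean(axis=0)
S_W = (class_a - mu_a).T @ (class_a - mu_a) + (class_b - mu_b).T @ (class_b - mu_b)
lda_dir = np.linalg.solve(S_W, mu_b - mu_a)
lda_dir /= np.linalg.norm(lda_dir)

print("PCA direction:", np.round(pca_dir, 3))
print("LDA direction:", np.round(lda_dir, 3))
```

On this data the PCA direction follows the axis of largest variance, while the LDA direction follows the axis along which the two classes differ.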
2.3. LDA as Feature Extractor
In our previous work (seuretalberti2017pca) we successfully initialized a NN layer-wise with PCA. Here, we exploit the similarities between the mathematical formulations of PCA and LDA to initialize a NN layer so that it performs the LDA space transformation. Recall that a standard artificial neural layer takes as input a vector $x$, multiplies it by a weight matrix $W$, adds a bias vector $b$, and applies a non-linear activation function $f$ to the result to obtain the output $y$:

$$y = f(W \cdot x + b) \qquad (1)$$

The LDA space transformation operation can be written in the same fashion:

$$y_{lda} = L \cdot x \qquad (2)$$

Ideally, we would like to have $y = y_{lda}$. This is not possible because of the non-linearity introduced by the function $f$. However, since $f$ does not change the sign of the output, we can safely apply it to the LDA as well, obtaining what we call an activated LDA, which behaves like a neural layer:

$$y_{alda} = f(L \cdot x) \qquad (3)$$

Let $W = L$ and $b = 0$, then we have:

$$y = y_{alda}$$
This shows that the transformation matrix $L$ can be used to quickly initialize the weights of a neural layer, which will then perform the best possible class separation obtainable within a single layer with regard to the training input data of that layer. Note that inputs coming from previous layers might not be optimal for the task, thus fine-tuning LDA-initialized networks will improve the classification accuracy of the top layer.
The rows of the matrix $L$ are the sorted eigenvectors of the square matrix $J$ (see Equation 4); the eigenvectors are sorted according to the corresponding eigenvalue in descending order. Typically with LDA one might take only the subset of the $C-1$ largest (non-orthogonal) eigenvectors (where $C$ denotes the number of classes); however, in this case, as the size of $L$ has to match that of $W$, the number of eigenvectors taken is decided by the network architecture. This also implies that with a standard implementation of LDA we cannot have more neurons in the layer than input dimensions (there are variants of LDA which allow extracting an arbitrary number of features (wang2010) (diaf2013)).
The matrix $J$ is obtained as:

$$J = S_W^{-1} S_B \qquad (4)$$

where $S_W$ and $S_B$ are the within-class and between-class scatter matrices, respectively (raschka2014lda). Let $\mu_c$ denote the within-class mean of class $c$, and $\mu$ denote the overall mean of all classes. The scatter matrices are then computed as follows:
where $\bar{N}$ is the mean number of points per class and $N_c$ is the number of points belonging to class $c$.
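The feature-extraction initialization of this section can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the code used in our experiments: it uses the standard unweighted textbook scatter matrices, $S_W = \sum_c \sum_{x \in c} (x - \mu_c)(x - \mu_c)^T$ and $S_B = \sum_c N_c (\mu_c - \mu)(\mu_c - \mu)^T$, which may differ from a weighting by the mean number of points per class. The function names, the unit-length normalization of the eigenvectors and the toy data are hypothetical. The eigenvectors of $S_W^{-1} S_B$ are obtained through an equivalent symmetric problem so that all values stay real, which assumes that $S_W$ is positive definite (more samples than input dimensions).

```python
import numpy as np

def scatter_matrices(X, y):
    """Unweighted within-class (S_W) and between-class (S_B) scatter matrices."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    return S_W, S_B

def lda_transform_matrix(X, y, n_outputs):
    """Rows are eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue."""
    S_W, S_B = scatter_matrices(X, y)
    # With S_W = chol chol^T, an eigenvector u of chol^{-1} S_B chol^{-T}
    # gives the eigenvector v = chol^{-T} u of S_W^{-1} S_B (same eigenvalue).
    chol = np.linalg.cholesky(S_W)
    M = np.linalg.solve(chol, np.linalg.solve(chol, S_B).T)
    eigval, U = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(eigval)[::-1][:n_outputs]
    V = np.linalg.solve(chol.T, U[:, order])
    V /= np.linalg.norm(V, axis=0)  # unit-length directions (a choice of this sketch)
    return V.T

def softsign(z):
    """Sign-preserving activation function."""
    return z / (1.0 + np.abs(z))

# Hypothetical toy data: 3 classes in 5 dimensions, standing in for flattened patches.
rng = np.random.default_rng(0)
centers = rng.normal(scale=0.5, size=(3, 5))
y = np.repeat(np.arange(3), 200)
X = centers[y] + rng.normal(scale=0.1, size=(600, 5))

W = lda_transform_matrix(X, y, n_outputs=4)  # weights: W = L
b = np.zeros(4)                              # bias: b = 0
features = softsign(X @ W.T + b)             # output of the activated LDA layer
```

The number of rows kept (`n_outputs`) plays the role of the number of neurons in the layer, so it cannot exceed the input dimension in this sketch.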
2.4. LDA as Classifier
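Even though LDA is mostly used for dimensionality reduction, it can also be used to directly perform data classification. To do so, one must compute the discriminant function $\delta_c(x)$ for each class $c$:

$$\delta_c(x) = x^T \Sigma^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c \qquad (7)$$

where $\mu_c$ is the mean of class $c$, and $\pi_c$ and $\Sigma$ are the prior probability (friedman2001elements) of class $c$ and the pooled covariance matrix. Let $N$ be the total number of observations in $X$ and $N_c$ the number of observations in class $c$, then the priors can be estimated as $\pi_c = N_c / N$, and $\Sigma$ computed as:

$$\Sigma = \frac{1}{N - C} \sum_{c=1}^{C} \sum_{x \in c} (x - \mu_c)(x - \mu_c)^T \qquad (8)$$

An observation $x$ will then be classified into class $\hat{c}$ as:

$$\hat{c} = \arg\max_{c} \delta_c(x) \qquad (9)$$

The entire vector $\delta$ can be computed in matrix form (for all classes) given an input vector $x$:

$$\delta = W \cdot x + b \qquad (10)$$

where the entries of the bias vector $b$ are the constant part of Equation 7, $b_c = -\frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c$, and the rows of the weight matrix $W$ are the linear part of Equation 7, such that at row $c$ we have $W_c = \mu_c^T \Sigma^{-1}$.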
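The classification-layer initialization can be sketched in the same spirit. The sketch assumes the textbook linear discriminant functions (friedman2001elements), with a pooled covariance matrix normalized by the number of observations minus the number of classes; the function name and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np

def lda_classification_layer(X, y):
    """Weights and biases of a linear layer computing the LDA discriminants."""
    classes = np.unique(y)
    n, d = X.shape
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    pooled = np.zeros((d, d))
    for c, mu_c in zip(classes, means):
        centered = X[y == c] - mu_c
        pooled += centered.T @ centered
    pooled /= n - len(classes)
    # Row c of W is the linear part of the discriminant: (Sigma^{-1} mu_c)^T.
    W = np.linalg.solve(pooled, means.T).T
    # Entry c of b is the constant part: -1/2 mu_c^T Sigma^{-1} mu_c + log(pi_c).
    b = -0.5 * np.sum(W * means, axis=1) + np.log(priors)
    return W, b

# Hypothetical toy data: three well-separated 2-D classes.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
y = np.repeat(np.arange(3), 100)
X = centers[y] + rng.normal(size=(300, 2))

W, b = lda_classification_layer(X, y)
scores = X @ W.T + b                  # one discriminant value per class
predicted = np.argmax(scores, axis=1)  # class with the largest discriminant
accuracy = np.mean(predicted == y)
print("accuracy of the untrained layer:", accuracy)
```

No activation function is applied here, since taking the class with the largest discriminant only depends on the ordering of the outputs.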
3. Experiments Methodology
In this section we introduce the dataset, the architecture and the experimental setting used in this work, such that the results we obtained can be reproduced by anyone else.
To conduct our experiments we used the DIVA-HisDB dataset (simistira_2016_diva), which is a collection of three medieval manuscripts with a particularly complex layout (see Figure 2): CB55 (Cologny-Geneve, Fondation Martin Bodmer, Cod. Bodmer 55), CSG18 (St. Gallen, Stiftsbibliothek, Cod. Sang. 18, codicological unit 4) and CSG863 (St. Gallen, Stiftsbibliothek, Cod. Sang. 863).
The dataset consists of 150 pages in total and it is publicly available (http://diuf.unifr.ch/hisdoc/diva-hisdb). In particular, there are 20 pages/manuscript for training, 10 pages/manuscript for validation and 10 test pages. There are four classes in total: background, comment, decoration and text.
The images are in JPG format, scanned at 600 dpi, RGB color. The ground truth of the database is available both at pixel level and in the PAGE XML (pletschacher2010page) format. We chose this dataset as it has been recently used for an ICDAR competition on layout analysis (simistira_2017_competition). To perform our experiments we used a scaled version of factor in order to significantly reduce the computation time.
3.2. Network Architecture
When designing a NN, there is no trivial way to determine the optimal architecture hyper-parameters (wang1994optimal)(kavzoglu1999determining) and the usual approach is to find them by trial and error (validation). In this work we are not interested in finding the best performing network topology, as we are comparing the results of different initialization techniques on the same network. Therefore we used parameters similar to those of our previous work on this dataset (simistira_2016_diva). The parameters presented in the following table define the CNN architecture in terms of the number of layers, the size of the input patches with their respective offsets (some literature refers to the offset as stride) and the number of hidden layers. Each layer has a Soft-Sign activation function and the total input patch covered by the CNN is pixels. On top of these feature extraction layers we put a single classification layer with 4 neurons: one for each class in the dataset.
3.3. Experimental Setup
In order to investigate the effectiveness of our novel initialization method, we measure the performance of the same network (see Section 3.2) once initialized with LDA and once with random weights. We evaluate the network for the task of layout analysis at pixel level in a multi-class, single-label setting (this means that a pixel belongs to one class only, but it could be one of many different classes).
Initializing with LDA
First we evaluate the stability of the LDA initialization with respect to the number of training samples used to compute it (see Figure 6). After validating this hyper-parameter, we use it for all other experiments.
When initializing a multi-layer network with LDA we start by computing LDA on raw input patches and use the transformation matrix to initialize the first layer. We then apply a forward pass with the first layer to all raw input patches and use the output to compute LDA again, such that we can use the new transformation matrix to initialize the second layer. This procedure is repeated until the last layer is initialized. At this point, we add a classification layer that we initialize in the same fashion as the others, but with the linear discriminant matrix (see Section 2) rather than with the transformation matrix. The whole procedure takes less than two minutes with our experimental setting.
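The layer-wise procedure can be sketched as follows. To keep the example short, the sketch makes several simplifying assumptions that are not part of our experimental setting: it uses fully connected layers on flattened inputs instead of convolutional layers, unweighted textbook scatter matrices, unit-length eigenvectors, and small synthetic data. All function names are hypothetical.

```python
import numpy as np

def softsign(z):
    return z / (1.0 + np.abs(z))

def class_stats(X, y):
    """Class labels, class means and summed within-class scatter matrix."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    S_W = sum((X[y == c] - m).T @ (X[y == c] - m) for c, m in zip(classes, means))
    return classes, means, S_W

def lda_transform(X, y, n_out):
    """Rows: leading eigenvectors of S_W^{-1} S_B (via a symmetric problem)."""
    classes, means, S_W = class_stats(X, y)
    mu = X.mean(axis=0)
    S_B = sum(np.sum(y == c) * np.outer(m - mu, m - mu) for c, m in zip(classes, means))
    chol = np.linalg.cholesky(S_W)
    M = np.linalg.solve(chol, np.linalg.solve(chol, S_B).T)
    eigval, U = np.linalg.eigh((M + M.T) / 2)
    V = np.linalg.solve(chol.T, U[:, np.argsort(eigval)[::-1][:n_out]])
    return (V / np.linalg.norm(V, axis=0)).T

def lda_discriminants(X, y):
    """Weights and biases of the linear discriminant (classification) layer."""
    classes, means, S_W = class_stats(X, y)
    pooled = S_W / (len(X) - len(classes))
    priors = np.array([np.mean(y == c) for c in classes])
    W = np.linalg.solve(pooled, means.T).T
    return W, -0.5 * np.sum(W * means, axis=1) + np.log(priors)

def init_network(X, y, layer_sizes):
    layers, H = [], X
    for n_out in layer_sizes:
        W, b = lda_transform(H, y, n_out), np.zeros(n_out)
        layers.append((W, b))
        H = softsign(H @ W.T + b)  # forward pass; its output feeds the next LDA
    layers.append(lda_discriminants(H, y))  # classification layer on top
    return layers

def forward(layers, X):
    H = X
    for W, b in layers[:-1]:
        H = softsign(H @ W.T + b)
    W, b = layers[-1]
    return H @ W.T + b

# Hypothetical toy data: 3 classes in 6 dimensions with small input values.
rng = np.random.default_rng(2)
centers = np.zeros((3, 6))
centers[1, 0] = 0.4
centers[2, 1] = 0.4
y = np.repeat(np.arange(3), 200)
X = centers[y] + rng.normal(scale=0.03, size=(600, 6))

layers = init_network(X, y, layer_sizes=[6, 4])
accuracy = np.mean(np.argmax(forward(layers, X), axis=1) == y)
print("accuracy before any training:", accuracy)
```

Each hidden layer is initialized from the output of the layers below it, so the whole initialization needs only one pass over the layers and no gradient computation.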
Initializing with random weights
For the random initialization we simply set the weight matrices to be randomly distributed in the range $[-1/\sqrt{N}, 1/\sqrt{N}]$, where $N$ is the number of inputs of the neurons of the layer being initialized.
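A minimal sketch of such a random initialization is given below. It assumes the commonly used interval $[-1/\sqrt{N}, 1/\sqrt{N}]$, a uniform distribution and zero biases; these three choices, as well as the function name, are assumptions of the sketch.

```python
import numpy as np

def random_layer(n_inputs, n_outputs, rng):
    """Uniform random weights in [-1/sqrt(N), 1/sqrt(N)], N = inputs per neuron."""
    bound = 1.0 / np.sqrt(n_inputs)
    W = rng.uniform(-bound, bound, size=(n_outputs, n_inputs))
    b = np.zeros(n_outputs)  # zero bias is an assumption of this sketch
    return W, b

rng = np.random.default_rng(3)
W, b = random_layer(n_inputs=25, n_outputs=8, rng=rng)
print(W.shape, float(np.abs(W).max()))
```

Scaling the interval with the number of inputs keeps the weighted sum of each neuron in a range where the activation function is not saturated.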
Once the networks are initialized (both LDA and random) we evaluate their performance on the test set, first before any training and then again after each training epoch. We train them for epochs (where one epoch corresponds to training samples) with mini-batches of size . We optimize using standard SGD (Stochastic Gradient Descent) with learning rate . In order to reduce the role randomness plays in the experiments, in a single run we show the same input patches to both an LDA and a random network, so that pair-wise they see the same input, and the final results are computed by averaging 10 runs.
The evaluation metric chosen is the mean Intersection over Union, because it is much stricter than accuracy and, in particular, it is not class-size biased (alberti2017evaluation). We measure it with an open-source tool (available at https://github.com/DIVA-DIA/LayoutAnalysisEvaluator).
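The difference between the two measures can be illustrated with a short sketch. It computes the per-class Intersection over Union from flat label arrays and averages over the classes; skipping classes that appear in neither the ground truth nor the prediction is a choice of this sketch, and the evaluation tool mentioned above may handle such details differently.

```python
import numpy as np

def mean_iu(y_true, y_pred, n_classes):
    """Mean over classes of |intersection| / |union| of the pixel sets."""
    ious = []
    for c in range(n_classes):
        truth, pred = (y_true == c), (y_pred == c)
        union = np.logical_or(truth, pred).sum()
        if union == 0:
            continue  # class absent from both ground truth and prediction
        ious.append(np.logical_and(truth, pred).sum() / union)
    return float(np.mean(ious))

# Hypothetical example: 90 pixels of class 0, 10 pixels of class 1,
# and a classifier that always predicts class 0.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

accuracy = float(np.mean(y_true == y_pred))
miu = mean_iu(y_true, y_pred, n_classes=2)
print("accuracy:", accuracy, "mean IU:", miu)
```

In this example the accuracy is 0.9 although the small class is never found, whereas the mean IU is 0.45 because the class that is missed entirely contributes an IU of zero.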
4. Features Visualization
In this section we show and briefly discuss the visualization of the features of the CNN initialized with LDA. Figure 3 shows the features of the first three layers of the network initialized with LDA and of the first layer of a randomly initialized network, for the CSG863 manuscript.
Unsurprisingly, the features produced by the random initialization are not visually appealing, as they are very close to being just gray with noise. In fact, they do not represent anything meaningful at all (for this reason, we displayed only the first layer in Figure 4: the images of layers two and three did not convey additional information).
On the other hand, the features produced by the LDA initialization are a completely different story. Notice how on the first layer (Figure 3a) there are 3 meaningful features, which is exactly the number of classes minus one (see details in Section 2). We expected the first three features to be "mono-color" and very different from the others, as standard LDA typically projects the points into a subspace of dimension $C-1$, where $C$ is the number of classes (see details in Section 2). Moreover, the other 21 features do look like the random ones, but are much more colorful, which means that their values are further away from zero.
Regarding the second and third layers (Figures 3c and 3d), as convolution is involved it is difficult to interpret their visualization in an intuitive way. We can, however, observe that in the second layer the first three features are also significantly different from the other ones, and that this is no longer entirely true in the third layer.
5. Results Analysis
We measured the mean IU of the networks during their training phase, evaluating them after each epoch; Figure 5 shows their performance. The LDA initialization is very stable, as all networks started with almost the same mean IU. The random initialization, however, leads to a very high variance of the classification quality at the beginning of the training.
We can also note that the curves of the LDA-initialized networks have very similar shapes for all three manuscripts, thus their behavior can be considered rather predictable. In contrast, the random initialization leads to three different curve shapes, one per manuscript, so we cannot predict how randomly-initialized networks would behave on other manuscripts.
The LDA initialization has two major advantages over the random one. First, the initial mean IU of LDA-initialized networks is clearly higher than that of randomly-initialized networks, as shown in Table 1. The table also includes the percentage of correctly classified pixels, a measurement less punitive than the mean IU but which sheds light from another angle on the advantages of LDA initialization. Second, the LDA initialization quickly leads to much better local minima. In the case of CSG863, none of the 10 random networks has finished converging after 100 epochs, while LDA-initialized networks have almost finished converging after 60 epochs.
These advantages can be explained by looking at the features obtained by LDA initialization shown in Figure 3. Some of the filters in all three layers of the network already contain useful structures before the training starts, thus fewer weight adaptations are needed.
In the case of CB55 and CSG18, randomly-initialized networks all seem to find similar solutions, and end with very low mean IU variance. Observing only these results, one could think this is the best that can be obtained with the network topology we used, yet the LDA initialization proves this assertion wrong.
6. Conclusion and Outlook
In this paper, we have investigated a new approach for initializing DNN using LDA. We show that such an initialization is more stable, converges faster and reaches better performance than random weight initialization. This leads to significantly shorter training time for layout analysis tasks at the cost of an initialization time that can be considered negligible.
This study has been conducted only on a relatively small CNN, so the generality of the aforementioned findings should be investigated for deeper networks. Also, as the focus was not on achieving a high level of accuracy, the design of our test was kept small and simple. Consequently, the results obtained should not be compared to state-of-the-art ones.
As future work, we intend to study the joint use of multiple statistical methods (such as PCA and LDA) to initialize a much deeper DNN and to extend the performance tests to other classification tasks (e.g., image recognition, digit recognition).
Finally, we believe that a good network initialization might be a solution to reduce the training time of DNN significantly.
The work presented in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number _.