Sparse arrays of signatures for online character recognition
Abstract
In mathematics the signature of a path is a collection of iterated integrals, commonly used for solving differential equations. We show that the path signature, used as a set of features for consumption by a convolutional neural network (CNN), improves the accuracy of online character recognition, that is, the task of reading characters represented as a collection of paths. Using datasets of letters, numbers, Assamese and Chinese characters, we show that the first, second, and even the third iterated integrals contain useful information for consumption by a CNN.
On the CASIA-OLHWDB1.1 3755 Chinese character dataset, our approach gave a test error of 3.58%, compared with 5.61% [4] for a traditional CNN. A CNN trained on the CASIA-OLHWDB1.0-1.2 datasets won the ICDAR2013 Online Isolated Chinese Character recognition competition.
Computationally, we have developed a sparse CNN implementation that makes it practical to train CNNs with many layers of max-pooling. Extending the MNIST dataset by translations, our sparse CNN gets a test error of 0.31%.
Keywords: online character recognition, signature, iterated integrals, convolutional neural network
Condensed running title: Sparse arrays of signatures
1 Introduction
Two rather different techniques work well for online Chinese character recognition. One approach is to render the strokes into a $40 \times 40$ bitmap embedded in a $48 \times 48$ grid, and then to use a deep convolutional neural network (CNN) as a classifier [4]. Another is to draw the character on an $8 \times 8$ grid, and then in each square calculate a histogram measuring the amount of movement in each of the 8 compass directions, producing a 512-dimensional vector to classify [2].
Intuitively, the first representation records more accurately where the pen went, while the second is better at recording the direction the pen was taking. We attempt to get the best of both worlds by producing an enhanced picture of the character using the path iterated-integral signature. This value-added picture of the character records the pen’s location, direction and the forces that were acting on the pen as it moved.
CNNs start with an input layer of size $N \times N \times M$. The first two dimensions are spatial; the third dimension is simply a list of features available at each point; for example, $M = 1$ for grayscale images and $M = 3$ for color images. When calculating the path signature, we have a choice of how many iterated integrals to calculate. If we calculate the zeroth, first, second, …, up to the $m$-th iterated integrals, then the resulting input vectors are $M = 2^{m+1} - 1$ dimensional.
This representation is sparse. We only calculate path signatures where the pen actually went: for the majority of the $N \times N$ spatial locations, the $M$-dimensional input vector is simply taken to be all zeros. Taking advantage of this sparsity makes it practical to train much larger networks than would be practical with a traditional CNN implementation.
2 Sparse CNNs: DeepCNet
Inspired by [4], we have considered a simple family of CNNs with alternating convolutional and max-pooling layers. Let DeepCNet$(l,k)$ denote the CNN with

an input layer of size $N \times N \times M$ where $N = 3 \times 2^l$,

$k$ convolutional filters of size $3 \times 3$ in the first layer,

$nk$ convolutional filters of size $2 \times 2$ in layers $n = 2, \dots, l$,

a layer of $2 \times 2$ max-pooling after each convolution layer, and

a fully-connected final hidden layer of size $(l+1)k$.
For example, DeepCNet$(4,100)$ is the architecture from [4] with input layer size $N = 48 = 3 \times 2^4$ and four layers of max-pooling:
input-100C3-MP2-200C2-MP2-300C2-MP2-400C2-MP2-500N-output
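As a rough illustration of the definition above, the following sketch lists the layers of a DeepCNet together with the spatial size of each layer. It assumes the parameterization DeepCNet$(l,k)$ with input size $N = 3 \times 2^l$, and it assumes "valid" convolutions (no padding), which is what makes the spatial sizes divide evenly; the function name `deepcnet_layers` is hypothetical.

```python
def deepcnet_layers(l, k):
    """Return (name, spatial_size) pairs for a DeepCNet(l, k).

    Assumes an input of size N = 3 * 2**l and unpadded ("valid") convolutions.
    """
    size = 3 * 2 ** l
    layers = [("input", size)]
    for n in range(1, l + 1):
        f = 3 if n == 1 else 2          # 3x3 filters in layer 1, 2x2 afterwards
        size = size - f + 1             # a valid f x f convolution shrinks the grid
        layers.append((f"{n * k}C{f}", size))
        size //= 2                      # 2x2 max-pooling halves the grid
        layers.append(("MP2", size))
    layers.append((f"{(l + 1) * k}N", 1))   # fully-connected hidden layer
    layers.append(("output", 1))
    return layers


arch = deepcnet_layers(4, 100)
print("-".join(name for name, _ in arch))
# input-100C3-MP2-200C2-MP2-300C2-MP2-400C2-MP2-500N-output
print([size for _, size in arch])
# [48, 46, 23, 22, 11, 10, 5, 4, 2, 1, 1]
```

Under these assumptions the grid entering the fully-connected layer is always $2 \times 2$, whatever the value of $l$.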
For general input the cost of the forward operation, in particular calculating the first few hidden layers, is very high. For sparse input, the cost of calculating the lower hidden layers is much reduced, and evaluating the upper layers becomes the computational bottleneck.
When designing a CNN, it is important that the input field size is strictly larger than the objects to be recognized; CNNs do a better job distinguishing features in the center of the input field. However, padding the input in this way is normally expensive. An interesting side effect of using sparsity is that the cost of padding the input disappears.
2.1 Character scale
For character recognition, we choose a scale $n \times n$ on which to draw the characters. For the Latin alphabet and Arabic numerals, one might copy MNIST and take $n = 28$. Chinese characters have a much higher level of detail: [4] uses $n = 40$, constrained by the computational complexity of evaluating dense CNNs.
Given $n$, we must choose the parameter $l$, which fixes the input size $N = 3 \times 2^l$, such that the characters fit comfortably into the input layer. DeepCNets seem to work best when $n$ is approximately $N/3$. There are a couple of ways of justifying the $n \approx N/3$ rule:

To process the $n \times n$ sized input down to a zero-dimensional quantity, the number of levels of $2 \times 2$ max-pooling should be approximately $\log_2 n$.

Counting the number of paths through the CNN from the input to output layers reveals a plateau; see Figure 1. Each corner of the input layer has only one route to the output layer; in contrast, the central $(N/3) \times (N/3)$ points in the input layer each have $9 \times 4^{l-1}$ such paths.
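The path-counting argument can be checked numerically. The sketch below is only an illustration under stated assumptions: a network with $l$ levels, input size $3 \times 2^l$, an unpadded $3 \times 3$ convolution in the first layer, unpadded $2 \times 2$ convolutions afterwards, and $2 \times 2$ max-pooling after each. It walks backwards from the final $2 \times 2$ grid and counts, for every input pixel, the routes leading to the top of the network. The function name `path_counts` is hypothetical.

```python
def path_counts(l):
    """Number of routes from each input pixel to the top layer."""
    filters = [3] + [2] * (l - 1)
    counts = [[1, 1], [1, 1]]            # one route from each final grid site
    for f in reversed(filters):
        # Undo the pooling: each pre-pool site feeds exactly one pooled site.
        p = 2 * len(counts)
        counts = [[counts[i // 2][j // 2] for j in range(p)] for i in range(p)]
        # Undo the convolution: input site (i + a, j + b) feeds output site (i, j).
        q = p + f - 1
        below = [[0] * q for _ in range(q)]
        for i in range(p):
            for j in range(p):
                for a in range(f):
                    for b in range(f):
                        below[i + a][j + b] += counts[i][j]
        counts = below
    return counts


counts = path_counts(3)                  # a 24 x 24 input layer
peak = max(map(max, counts))
print(counts[0][0], peak, sum(row.count(peak) for row in counts))
# 1 144 64
```

For $l = 3$ the corner has a single route, while an $8 \times 8$ central block attains the maximum of $9 \times 4^2 = 144$ routes.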
2.2 Sparsity
For DeepCNets with , training the network is in general hard work, even using a GPU. However, for character recognition, we can speed things up by taking advantage of the sparse nature of the input, and the repetitive nature of CNN calculations. Essentially, we are memoizing the filtering and pooling operations.
First imagine putting an all-zero array into the input layer. As you evaluate the CNN in the forward direction, the translational invariance of the input is propagated to each of the hidden layers in turn. We can therefore think of each hidden variable as having a ground state corresponding to receiving no meaningful input; the ground state is generally non-zero because of bias terms. When the input array is sparse, one only has to calculate the values of the hidden variables where they differ from their ground state.
To forward propagate the network we calculate two types of list: lists of the non-ground-state vectors (which have size $M$, $k$, $2k$, …, the number of features in each layer) and lists specifying where the vectors fit spatially in the CNN. This representation is very loosely biologically inspired. The human visual cortex separates into two streams of information: the dorsal (where) and ventral (what) streams.
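The ground-state idea can be sketched for a single convolutional layer. This is a minimal pure-Python illustration, not the implementation used for the experiments: the dictionary layout, the function names `conv_site` and `sparse_conv`, and the ReLU nonlinearity are all assumptions made for the example. Each layer is stored as a dictionary of the sites that differ from the ground state (the "what") keyed by their position (the "where"), plus one shared ground-state vector.

```python
def conv_site(get, i, j, f, weights, bias):
    """One output vector of an unpadded f x f convolution followed by ReLU."""
    out = []
    for w_o, b_o in zip(weights, bias):
        s = b_o
        for a in range(f):
            for b in range(f):
                s += sum(w * v for w, v in zip(w_o[a][b], get(i + a, j + b)))
        out.append(max(s, 0.0))          # ReLU is an assumed nonlinearity
    return out


def sparse_conv(active, ground, size, f, weights, bias):
    """Evaluate the layer only where the output can differ from its ground state.

    active -- {(i, j): vector} for input sites that differ from `ground`
    ground -- the vector shared by every other input site
    Returns (new_active, new_ground, new_size).
    """
    new_size = size - f + 1
    new_ground = conv_site(lambda i, j: ground, 0, 0, f, weights, bias)

    def get(i, j):
        return active.get((i, j), ground)

    # Output sites whose receptive field touches at least one active input site.
    candidates = {(i - a, j - b) for (i, j) in active
                  for a in range(f) for b in range(f)
                  if 0 <= i - a < new_size and 0 <= j - b < new_size}
    new_active = {}
    for (i, j) in candidates:
        v = conv_site(get, i, j, f, weights, bias)
        if v != new_ground:
            new_active[(i, j)] = v
    return new_active, new_ground, new_size


# One input feature, two output features, 3 x 3 filters on an 8 x 8 layer.
weights = [[[[1.0] for _ in range(3)] for _ in range(3)],
           [[[-1.0] for _ in range(3)] for _ in range(3)]]
bias = [0.5, 0.5]
active, ground, size = sparse_conv({(4, 4): [1.0]}, [0.0], 8, 3, weights, bias)
print(len(active), ground, size)
# 9 [0.5, 0.5] 6
```

Only 9 of the 36 output sites are evaluated here; the remaining 27 share the ground state, which is non-zero because of the bias terms.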
2.3 MNIST as a sparse dataset
To test the sparse CNN implementation we used MNIST [7]. The $28 \times 28$ digits have on average 150 non-zero pixels. Placing the digits in the middle of an $N \times N$ input layer produces a sparse dataset, as 150 is much smaller than $N^2$.
It is common to extend the MNIST training set by translations and/or elastic distortions. Here we only use translations of the training set, adding bounded random shifts in the $x$ and $y$ directions. Training a very small-but-deep network, DeepCNet, for a very long time (1000 repetitions of the training set) gave a test error of 0.58%. Using a GeForce GTX 680 graphics card, we can classify 3000 characters per second.
We tried increasing the number of hidden units. Training DeepCNet for 200 repetitions of the training set gave a test error of 0.46%.
Dropout, in combination with increasing the number of hidden units and the training time, generally improves ANN performance [6]. DeepCNet has seven layers of matrix multiplication. Dropping out 50% of the input to the fourth layer and above during training resulted in a test error of 0.31%.
3 The sparse signature grid representation
The expression of the information contained in a path in the form of iterated integrals was pioneered by K.T. Chen [3]. More recently, path signatures have been used to solve differential equations driven by rough paths [11, 12]. The signature extracts enough information from the path to solve any linear differential equation and uniquely characterizes paths of finite length [5].
The signature has been used in sound compression [10]. A stereo audio recording can be seen as a highly oscillating path in $\mathbb{R}^2$. Storing a truncated version of the path signature allows a version of the audio signal to be reconstructed.
Although computing the signature of a path is easy, the inverse problem is rather more difficult. The limiting factor in [10] was the lack of an efficient algorithm for reconstructing a path from its truncated signature when . We sidestep the inverse problem by learning to recognize the signatures directly.
3.1 Computation of the path signature
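Let $[0,T]$ denote a time interval and let $\mathbb{R}^d$ with $d = 2$ denote the writing surface. Consider a pen stroke: a continuous function $X : [0,T] \to \mathbb{R}^d$. For positive integers $k$ and intervals $[s,t] \subset [0,T]$, the $k$-th iterated integral of $X$ is the $d^k$-dimensional vector (i.e. a tensor in $(\mathbb{R}^d)^{\otimes k}$) defined by
$$X^k_{s,t} = \int_{s < u_1 < \dots < u_k < t} 1 \, dX_{u_1} \otimes \dots \otimes dX_{u_k}.$$
By convention, the $k = 0$ iterated integral is simply the number one. The $k = 1$ iterated integral is the displacement of the path. The $k = 2$ iterated integral is related to the curvature of the path.
As $k$ increases, it is a case of diminishing returns. The iterated integrals increase rapidly in dimension whilst carrying less information about the large scale shape of $X$. We therefore consider truncated signatures. The signature, truncated at level $m$, is the collection of the iterated integrals,
$$S(X)^m_{s,t} = \left(1, X^1_{s,t}, X^2_{s,t}, \dots, X^m_{s,t}\right).$$
With $d = 2$, the dimension of this object is
$$2^0 + 2^1 + \dots + 2^m = 2^{m+1} - 1. \tag{1}$$
Let $\Delta_{s,t} = X_t - X_s$ denote the path displacement. Thinking of $\Delta_{s,t}$ as a row vector, the tensor product corresponds to the Kronecker matrix product (kron in MATLAB). When $X$ is a straight line, the signature can be calculated exactly:
$$X^k_{s,t} = \frac{\Delta_{s,t}^{\otimes k}}{k!}, \qquad k = 0, 1, \dots, m. \tag{2}$$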
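For a piecewise-linear path, the straight-line formula can be applied to each segment and the pieces combined. The sketch below is an illustration rather than the code used for the experiments: it assumes that level $k$ of a straight line with displacement $\Delta$ is $\Delta^{\otimes k}/k!$, and it joins segments with Chen's identity, $X^k_{s,u} = \sum_{j=0}^{k} X^j_{s,t} \otimes X^{k-j}_{t,u}$, a standard property of signatures that is not stated in the text above. Tensors are stored as flat Python lists, and the function names are hypothetical.

```python
from math import factorial


def kron(u, v):
    """Tensor (Kronecker) product of two flattened tensors."""
    return [a * b for a in u for b in v]


def line_signature(delta, m):
    """Levels 0..m of the signature of a straight line with displacement delta."""
    levels, power = [[1.0]], [1.0]
    for k in range(1, m + 1):
        power = kron(power, delta)               # delta tensored with itself k times
        levels.append([x / factorial(k) for x in power])
    return levels


def concatenate(sig_a, sig_b):
    """Chen's identity: signature of path a followed by path b."""
    out = []
    for k in range(len(sig_a)):
        level = [0.0] * len(sig_a[k])
        for j in range(k + 1):
            term = kron(sig_a[j], sig_b[k - j])
            level = [x + y for x, y in zip(level, term)]
        out.append(level)
    return out


def path_signature(points, m):
    """Truncated signature of the piecewise-linear path through `points`."""
    sig = line_signature([0.0] * len(points[0]), m)   # signature of a trivial path
    for p, q in zip(points, points[1:]):
        delta = [b - a for a, b in zip(p, q)]
        sig = concatenate(sig, line_signature(delta, m))
    return sig


# A right-angle turn: one unit along x, then one unit along y.
print(path_signature([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], 2))
# [[1.0], [1.0, 1.0], [0.5, 1.0, 0.0, 0.5]]
```

In the example, the first level is the total displacement $(1, 1)$, and half the difference between the two off-diagonal second-level terms, $(1.0 - 0.0)/2$, is the signed area enclosed between the path and its chord.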
3.2 Representing pen strokes
Each character collected by an electronic stylus is represented by a sequence of pen strokes. Each stroke is represented by a sequence of points that we will treat as a piecewise linear path.
Recall that $n$ denotes the scale at which characters will be drawn. We use another parameter $\delta$ to describe very approximately the scale of path curvature.
Here is an algorithm that takes a character and uses the first $m$ iterated integrals to construct an $N \times N \times M$ CNN-input-layer array, with $M = 2^{m+1} - 1$ as in (1).

Initialize an array of size $N \times N \times M$ to all zeros. Think of this as an $N \times N$ array of vectors in $\mathbb{R}^M$; the first two dimensions correspond to the writing surface, and the third corresponds to the elements of the signature.

Rescale the character to fit in an $n \times n$ box placed in the center of the input layer. Let $X^i$ denote the $i$-th character stroke, parameterized to have unit speed.

Let $X^i(t)$ denote a point moving along the $i$-th stroke. Mapping $X^i(t)$ into the $N \times N$ grid, store the signature of the stroke over the window $[t - \delta, t + \delta]$, truncated at level $m$, in the appropriate column of the array.
Note that the first element of the signature corresponds to the zeroth iterated integral, which is always a one. Thus if we look at the first slice of our 3D array we see a 2D bitmap picture of the character. If $m \ge 1$, the next two layers contain the first iterated integrals: they encode the direction the pen was moving. If $m \ge 2$, the next four layers encode the second iterated integrals, and so on.
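The steps above can be sketched in a few dozen lines. This is a hedged illustration, not the code behind the reported results, and several details are assumptions: strokes are resampled at a fixed arc-length `step` to approximate a unit-speed parameterization, the signature window is a fixed number of samples either side of each point (a hypothetical half-width `delta` in arc length) and is clipped at the ends of a stroke, later samples overwrite earlier ones that fall in the same grid square, and the array is stored sparsely as a dictionary in which missing squares are implicitly all zeros. All function names are hypothetical.

```python
from math import factorial, hypot


def kron(u, v):
    return [a * b for a in u for b in v]


def signature(points, m):
    """Flattened signature (levels 0..m) of a piecewise-linear path."""
    d = len(points[0])
    sig = [[1.0]] + [[0.0] * d ** k for k in range(1, m + 1)]
    for p, q in zip(points, points[1:]):
        delta = [b - a for a, b in zip(p, q)]
        seg, power = [[1.0]], [1.0]
        for k in range(1, m + 1):                 # straight-line formula
            power = kron(power, delta)
            seg.append([x / factorial(k) for x in power])
        sig = [[sum(terms) for terms in zip(*(kron(sig[j], seg[k - j])
                                               for j in range(k + 1)))]
               for k in range(m + 1)]             # Chen's identity
    return [x for level in sig for x in level]


def resample(stroke, step):
    """Points spaced `step` apart in arc length along a polyline."""
    out = [tuple(stroke[0])]
    carry = 0.0                                   # distance since the last sample
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        length = hypot(x1 - x0, y1 - y0)
        pos = step - carry
        while pos <= length:
            r = pos / length
            out.append((x0 + r * (x1 - x0), y0 + r * (y1 - y0)))
            pos += step
        carry = length - (pos - step)
    return out


def signature_grid(strokes, N, n, m, delta, step=0.5):
    """Sparse N x N array: {(row, col): vector of length 2**(m+1) - 1}."""
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    extent = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    scale = (n - 1) / extent
    # Offsets that centre the character's bounding box in the input layer.
    ox = (N - scale * (max(xs) + min(xs))) / 2
    oy = (N - scale * (max(ys) + min(ys))) / 2
    half = max(1, round(delta / step))            # window half-width in samples
    grid = {}
    for stroke in strokes:
        pts = resample([(scale * x + ox, scale * y + oy) for x, y in stroke], step)
        for t, (x, y) in enumerate(pts):
            window = pts[max(0, t - half): t + half + 1]
            grid[(int(y), int(x))] = signature(window, m)
    return grid


# A single horizontal stroke drawn at scale n = 8 in a 24 x 24 input layer.
grid = signature_grid([[(0.0, 0.0), (7.0, 0.0)]], N=24, n=8, m=2, delta=1.0)
print(sorted(grid))
```

Only the 8 grid squares the stroke passes through are stored; every other square of the $24 \times 24$ layer is implicitly zero.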
4 Results
4.1 10, 26 and 183 character classes
We will first look at three relatively small datasets [1] to study the effect of increasing the signature truncation level $m$.

The Pendigits dataset contains handwritten digits 0-9. It contains about 750 training characters for each of the ten classes.

The UJIpenchars database includes the 26 lowercase letters. The training set contains 80 characters for each of the 26 classes.

The Online Handwritten Assamese Characters Dataset contains 45 samples of 183 Indo-Aryan characters. We used the first $t$ handwriting samples as the training set, and the remaining $45 - t$ samples for a test set ($t = 15$ or 36).
To make the comparisons interesting, we deliberately restrict the DeepCNet $l$ and $k$ parameters. The justification for this is that increasing $l$ and $k$ is computationally expensive. In contrast, increasing the truncation level $m$ only increases the cost of evaluating the first layer of the CNN; in general for sparse CNNs the first layer represents only a small fraction of the total cost of evaluating the network. Thus increasing $m$ is cheap, and we will see that it tends to improve test performance.
We tried smaller and larger networks, and using the training set with and without increasing its size by affine transformations (a random mix of scalings, rotations and translations). The table shows the test set error rates.
Dataset        DeepCNet(l,k)  n   Transforms  m=0    m=1    m=2    m=3
Pendigits      (3,10)         10              3.37%  1.60%  1.32%  1.09%
Pendigits      (5,50)         32              0.94%  0.43%  0.40%  0.40%
UJI lowercase  (4,10)         20              18.3%  16.3%  15.3%  13.4%
UJI lowercase  (5,50)         32              7.4%   6.6%   6.9%   6.9%
Assamese-15    (5,20)         32              48.9%  40.3%  39.8%  34.8%
Assamese-15    (5,50)         32              12.5%  11.9%  11.0%  11.0%
Assamese-36    (6,50)         64              2.2%   2.3%   1.6%   2.3%
The results show that the first and second, and even the third iterated integrals carry information that CNNs can use to improve generalization from the training to the test set.
4.2 Casia
The CASIA-OLHWDB1.1 [9] database contains samples from 300 writers, allocated into training and test sets. It contains samples of the 3755 GBK level-1 Chinese characters.
A test error of 5.61% is achieved using a deep CNN applied to pictures drawn from the pen data [4]. Their program took advantage of a couple of features that are often used in the context of CNNs. Rather than simply drawing a binary bitmap of the character, they convolved the images with a Gaussian blur. They also used elastic distortions.
We trained a DeepCNet with and . We went through the training data 80 times. For the first 40 repetitions, we randomized the placement of the training characters in a small neighborhood of the center of the input layer. This gave a test error of 4.44%. We then applied affine transformations to the training characters for another 40 repetitions, giving a test error of 4.01%.
Modifying the above network by adding dropout, with a separate dropout rate for each level, resulted in a test error of 3.58%.
5 Conclusion
We have studied two methods for improving the performance of CNNs for online handwritten character recognition: enhancing the pictures with signature information, and using sparsity to increase the depth of the networks we can train. They work well together on a variety of alphabets.
This work raises a number of questions:

Besides the path signature, there are many other ways of describing the shape of a path. You could calculate 8-direction histograms [2] to give a sparse input layer. Or you could use an unsupervised learning algorithm to characterize path segments. What is the best way of describing paths for CNNs?

Can our approach be applied to natural images using curves extracted from the image by some deterministic curve tracing algorithm?

Our CNN is sparse with respect to the two spatial dimensions, but not in terms of the third feature-set dimension. Predictive Sparse Decomposition [13] results in sparse feature vectors. Can doubly-sparse CNNs be built to recognize images more efficiently?
6 Acknowledgement
Many thanks to Fei Yin, Qiu-Feng Wang and Cheng-Lin Liu for their hard work organizing the ICDAR2013 competition.
References
[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[2] Z. L. Bai and Q. A. Huo. A study on the use of 8-directional features for online handwritten Chinese character recognition. In ICDAR, pages I: 262–266, 2005.
[3] Kuo-Tsai Chen. Integration of paths - a faithful representation of paths by non-commutative formal power series. Trans. Amer. Math. Soc., 89:395–407, 1958.
[4] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649, 2012.
[5] Ben Hambly and Terry Lyons. Uniqueness for the signature of a path of bounded variation and the reduced path group. Ann. of Math. (2), 171(1):109–167, 2010.
[6] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[7] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[8] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR’04. IEEE Press, 2004.
[9] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang. CASIA online and offline Chinese handwriting databases. In Proc. 11th International Conference on Document Analysis and Recognition (ICDAR), Beijing, China, pages 37–41, 2011.
[10] T. Lyons and N. Sidorova. Sound compression - a rough path approach. In Proceedings of the 4th International Symposium on Information and Communication Technologies, pages 223–229, 2005.
[11] Terry Lyons and Zhongmin Qian. System control and rough paths. Oxford Mathematical Monographs. Oxford University Press, Oxford, 2002. Oxford Science Publications.
[12] Terry J. Lyons, Michael Caruana, and Thierry Lévy. Differential equations driven by rough paths, volume 1908 of Lecture Notes in Mathematics. Springer, Berlin, 2007.
[13] Marc’Aurelio Ranzato, Christopher S. Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In Bernhard Schölkopf, John C. Platt, and Thomas Hoffman, editors, NIPS, pages 1137–1144. MIT Press, 2006.