Topological Deep Learning
This work introduces the Topological CNN (TCNN), which encompasses several topologically defined convolutional methods. Manifolds with important relationships to the natural image space are used to parameterize image filters which are used as convolutional weights in a TCNN. These manifolds also parameterize slices in layers of a TCNN across which the weights are localized. We show evidence that TCNNs learn faster, on less data, with fewer learned parameters, and with greater generalizability and interpretability than conventional CNNs. We introduce and explore TCNN layers for both image and video data. We propose extensions to 3D images and 3D video.
Genevera Allen, Sayan Mukherjee, Boaz Nadler
machine learning, convolutional neural network, topology, topological data analysis
- 1 Introduction
- 2 Background: CNNs
- 3 Topological Convolutional Layers
- 4 2D Images
- 5 Video
It was observed in LeCun et al. (1998) that one motivation for the construction of Convolutional Neural Networks (CNNs) was that they permit a sparsification based on the geometry of the space of features. In the case of convolutional neural networks for images, the geometry used was that of a two-dimensional grid of features, in this case pixels. M. Robinson has also pointed out the importance of geometries or topologies on spaces of features, coining the term topological signal processing to describe this notion Robinson (2014). In this paper, we study a space of image filters closely related to a subfamily of the Gabor filters, whose geometry is that of a well-known geometric object, the Klein bottle.
We implement the use of the Klein bottle geometry and its image filters via additional structure on convolutional layers. We call neural networks with these layers Topological Convolutional Neural Networks (TCNNs). We perform experiments on image and video data. The results show significant improvement in TCNNs compared to conventional CNNs with respect to various metrics.
Deep neural network (NN) architectures are the preeminent tools for many image classification tasks, since they are capable of distinguishing between a large number of classes with a high degree of precision and accuracy Guo et al. (2016). CNNs are components of the most commonly used neural network architecture for image classification, e.g. see He et al. (2016); Rawat and Wang (2017); Krizhevsky et al. (2012). The characterizing property of CNNs is their use of convolutional layers which take advantage of the 2-dimensional topology of an image to sparsify a fully connected network and employ weight sharing across slices. Each convolutional layer in a CNN assembles spatially local features, e.g. textures, lines, and edges, into complex global features, e.g. the location and classification of objects. CNNs are also used to classify videos; see e.g. Soomro et al. (November, 2012), Schuldt et al. (2004), and Gorelick et al. (2007). CNNs have several major drawbacks including that the models are difficult to interpret, require large datasets, and often do not generalize well to new data Zheng et al. (2018). It has been demonstrated that as CNNs grow in size and complexity they often do not enjoy a proportional increase in utility He et al. (2016). This suggests that bigger, deeper models alone will not continue to advance image classification.
The topological structure in the layers of a TCNN is inspired by prior research on natural image data with topological data analysis (TDA). Small patches in natural images cluster around a Klein bottle embedded in the space of patches Carlsson et al. (2008). The patches corresponding to this Klein bottle are essentially edges, lines, and interpolations between those; see top panels of Figure 3(b)(c). It has been shown through TDA that CNN weights often arrive at these same Klein bottle patches after training Carlsson and Gabrielsson (2020). The key idea of TCNNs is to use the empirically discovered Klein bottle, and the image patches it represents, directly in the structure of the convolutional layers. For example, the Klein bottle image patches are used in TCNNs as convolutional filters that are fixed during training. There is an analogue of the Klein bottle for video data. Because of symmetries present in the geometric models of feature spaces, there are generalized notions of weight sharing which encode invariances under rotation and black-white reversal.
The method is not simply an addition of features to particular data sets, but is in fact a general methodology that can be applied to all image or video data sets. In the case of images, it builds in the notion that edges and lines are important features for any image data set, and that this notion should be included and accounted for in the architecture for any kind of image data analysis. TDA contains a powerful set of methods that can be used to discover these latent manifolds on which data sit Bubenik (2015); Chazal et al. (2017); Maroulas et al. (2019); Sgouralis et al. (2017). These methods can then be used to inform the parametrization of the TCNN.
TCNNs are composed of two new types of convolutional layers which construct topological features and restrict convolutions based on embeddings of topological manifolds into the space of images (and similarly for video). TCNNs are inherently easier to interpret than CNNs, since in one type of layer (Circle Features layer and Klein Features layer; Definition 9) the local convolutional kernels are easily interpreted as points on the Klein bottle, all of which have clear visual meaning (e.g. edges and lines). Following a Klein Features layer it makes sense to insert our other type of topological layer (Circle One Layer and Klein One Layer; Definitions 7 and 8), in which the 2D slices in the input and output are parameterized by a discretization of the Klein bottle and all weights between slices that are farther away than a fixed threshold distance on the Klein bottle are held at zero throughout training. This has the effect of localizing connections in this convolutional layer to slices that are nearby to each other as points on the Klein bottle. The idea of this ‘pruned’ convolutional layer is that it aggregates the output of Klein bottle filters from the first layer that are all nearby each other. Both of these new types of topological convolutional layers can be viewed as a strict form of regularization, which explains the ability of the TCNN to more effectively generalize to new data. We also provide versions of these layers for video and 3D convolutions.
The main points of the paper are as follows.
The method provides improved performance on measures of accuracy, speed of learning and data requirements, and generalization over standard CNNs.
Our method can be substituted for standard convolutional neural network layers that occur in other approaches, and one should expect improvements in these other situations as well. In this paper, we compare TCNNs to traditional CNNs with the same architecture except for the modified convolutional layers. The comparatively better performance of the TCNNs provides evidence that TCNNs improve the general methodology of CNNs, with the expectation that these advantages can be combined with other methods known to be useful. For example, we use TCNN layers in a ResNet for video classification; see Section 5. As another example, we expect state-of-the-art transfer learning results to improve when some convolutional layers in the network are replaced by TCNN layers.
Our approach suggests a general methodology for building analogues of neural networks in contexts other than static images. We carry this out for the study of video data. In this case, the Klein bottle is replaced by a different manifold, the tangent bundle of the translational Klein bottle (Section 3.2). There are also straightforward extensions to 3D imaging and video. The improvement afforded by these methods is expected to increase as the data complexity increases, and we find this to be the case in the passage from static images to video.
The simple geometries of the feature spaces enable reasoning about the features to use. In the video situation, we found that using the entire manifold was computationally infeasible, and selected certain natural submanifolds within it, which allowed for the improved performance. Even in the case of static images, there are natural submanifolds of the Klein bottle which might be sufficient for the study of specific classes of images, such as line drawings, and restricting to them would produce more efficient computation.
There are more general methods of obtaining geometries on feature spaces, which do not require that the feature space occur among the family of manifolds already studied by mathematicians. One such method is the so-called Mapper construction Singh et al. (2007). A way to use such structures to produce a sparsified neural network structure has been developed (Carlsson and Gabrielsson, 2020, Section 5.3). The manifold methods described in this paper are encapsulated in the general framework described in Remark 6 which can be applied to other domains.
Because of the intuitive geometric nature of the approach, it permits additional transparency into the performance of the neural network.
There are no pretrained components in any of the models considered in this paper. This is significant in particular because the vast majority of state-of-the-art models for classification of the UCF-101 video dataset do use pretrained components Kalfaoglu et al. (2020), Qiu et al. (2019), Carreira and Zisserman (2017).
The structure of the paper is as follows. In Section 2 we recall the basic structure of convolutional neural networks and we set up notation. TCNNs are introduced in Section 3; the version for image data is in Section 3.1, the video version is in Section 3.2, and a connection with Gabor filters is explained in Section 3.3. Then in Section 4 we describe experiments and results comparing TCNNs to standard CNNs on image data. In Section 5 we do similar experiments on video data.
2 Background: CNNs
In this section we describe in detail the components of the CNN critical to the construction of the TCNN. One of the central motivations of the structure of a CNN is the desire to balance the learning of spatially local and global features. A convolutional layer can be thought of as a sparsified version of the fully connected, feed-forward (FF) layer, with additional homogeneity enforced across subsets of weights. The traditional CNN makes use of the latent image space by creating a small perceptive field with respect to the distance in which weights can be nonzero, thus sparsifying the parameter space by enforcing locality in the image. Additionally, homogeneity of weights across different local patches is enforced, further reducing the parameter space by using the global structure of the grid.
To describe the structure of CNNs and TCNNs, we adopt the language in Carlsson and Gabrielsson (2020), which we summarize as needed. To aid the reader, we provide an accompanying visual guide in Figure 1.
We describe a feed forward neural network (FFNN) as a directed, acyclic graph (Definition 1).
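Definition 1. A Feed Forward Neural Network (FFNN) is a directed acyclic graph $\Gamma$ with a vertex set $V(\Gamma)$ satisfying the following properties:
- $V(\Gamma)$ is decomposed as the disjoint union of its layers, $V(\Gamma) = V_0(\Gamma) \sqcup V_1(\Gamma) \sqcup \cdots \sqcup V_r(\Gamma)$.
- If $v \in V_i(\Gamma)$, then every edge $(v, w)$ of $\Gamma$ satisfies $w \in V_{i+1}(\Gamma)$.
- For every non-initial node $w \in V_i(\Gamma)$, $i > 0$, there is at least one $v \in V_{i-1}(\Gamma)$ such that $(v, w)$ is an edge of $\Gamma$.
The vertices in $V(\Gamma)$ are also called nodes. For all $i$, $V_i(\Gamma)$ consists of the nodes in layer-$i$. The layer $V_0(\Gamma)$ consists of the inputs to the neural network (Figure 1 (a)). The last layer $V_r(\Gamma)$ consists of the outputs.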
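Definition 2. Let $\Gamma$ be a FFNN with vertex set $V(\Gamma)$.
- Often we suppress $\Gamma$ in the notation, writing $V$ for the set of all nodes and $V_i$ for the set of nodes in layer $i$.
- For any $v \in V_i$, the set $\Gamma(v)$ consists of all $w \in V_{i+1}$ such that $(v, w)$ is an edge in $\Gamma$, and the set $\Gamma^{-1}(v)$ consists of all $w \in V_{i-1}$ such that $(w, v)$ is an edge in $\Gamma$.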
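To describe the edges between nodes in successive layers, we use the notion of a correspondence between sets $X$ and $Y$, which is simply a subset $C \subseteq X \times Y$ of the product. For $x \in X$ and $y \in Y$, we define the subsets
$$C(x) = \{ y \in Y : (x, y) \in C \} \subseteq Y, \qquad C^{-1}(y) = \{ x \in X : (x, y) \in C \} \subseteq X.$$
Note that $C$ is determined by the subsets $C(x)$ for $x \in X$, and it is also determined by $C^{-1}(y)$ for $y \in Y$. In this way, a correspondence is a generalization of a map from $X$ to $Y$: given an element $x \in X$, the correspondence provides a subset $C(x)$ of $Y$. We denote correspondences by $C: X \to Y$.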
We adopt the convention that given nodes $v \in V_i$ and $w \in V_{i+1}$, the edge $(v, w)$ is in $\Gamma$ if and only if $w \in C(v)$. Note that this implies $\Gamma(v) = C(v)$. We call $C: V_i \to V_{i+1}$ the edge-defining correspondence of the layer $V_{i+1}$. In this way, a FFNN specifies an edge-defining correspondence between each pair of successive layers, and conversely, choices of edge-defining correspondences between successive layers specify the edges in a FFNN. The simplest type of layer in a neural network is as follows.
Definition 3. Let $V_{i+1}$ be a layer in a FFNN. We call $V_{i+1}$ a fully connected layer if the edge-defining correspondence $C \subseteq V_i \times V_{i+1}$ is the entire set. In that case, we denote this correspondence $C_c$.
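As an illustration of the correspondence language above, the following minimal Python sketch represents a correspondence as a set of pairs. The helper names `image`, `preimage`, and `fully_connected` are hypothetical and chosen only for this example; they do not come from the paper or from any library.

```python
from itertools import product


def image(C, x):
    """Return C(x): every y such that (x, y) belongs to the correspondence C."""
    return {y for (a, y) in C if a == x}


def preimage(C, y):
    """Return C^{-1}(y): every x such that (x, y) belongs to the correspondence C."""
    return {x for (x, b) in C if b == y}


def fully_connected(X, Y):
    """The fully connected correspondence is the entire product X x Y."""
    return set(product(X, Y))


# Toy layers: two input nodes and three output nodes (illustrative values).
X = {"a", "b"}
Y = {0, 1, 2}
C = {("a", 0), ("a", 1), ("b", 2)}

# The correspondence is recovered from the subsets C(x), as noted in the text.
rebuilt = {(x, y) for x in X for y in image(C, x)}
```

Under the convention in the text, the edges of the layer are exactly the pairs in `C`, so a sparser `C` means a sparser layer.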
We proceed to describe convolutional layers. We model digital images as grids indexed by $\mathbb{Z}^2$. Modifications of our constructions to finite size images will be clear. The values of the grid specifying a grayscale image are then triples $(x, y, p)$, where $p$ is the intensity at $(x, y)$. Equivalently, an image is a real-valued map on $\mathbb{Z}^2$. Similarly, videos are modeled as grids indexed by $\mathbb{Z}^3$, where the first two dimensions are the spatial dimensions and the third dimension is time. This generalizes to grids indexed by $\mathbb{Z}^d$ for any positive integer $d$.
In a CNN, the nodes in each convolutional layer form multiple grids of the same size. We model the nodes in a convolutional layer as the product of a finite index set $X$ with a grid, $V_i = X \times \mathbb{Z}^d$. With this notation, the graph structure of a CNN is specified as in the following definition. A convolutional layer is a sparsification of a fully connected layer which enforces locality in the grids $\mathbb{Z}^d$. For further detail, see Carlsson and Gabrielsson (2020). We explain the homogeneity restrictions on the weights in (1).
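Definition 4. Let $V_{i+1}$ be a layer in a FFNN. We call $V_{i+1}$ a convolutional layer or a normal one layer (NOL) if $V_i = X \times \mathbb{Z}^d$ and $V_{i+1} = X' \times \mathbb{Z}^d$ for some finite sets $X$ and $X'$ and a positive integer $d$, and if for some fixed threshold $s \geq 0$ the edge-defining correspondence $C: V_i \to V_{i+1}$ is of the form
$$C = C_c \times C_{d,s},$$
where $C_c: X \to X'$ is the fully connected correspondence and $C_{d,s}: \mathbb{Z}^d \to \mathbb{Z}^d$ is the correspondence given by
$$C_{d,s}(z) = \{ z' \in \mathbb{Z}^d : d_{L^\infty}(z, z') \leq s \}$$
for all $z \in \mathbb{Z}^d$. Here, $d_{L^\infty}$ is the $L^\infty$-metric on $\mathbb{Z}^d$ defined by
$$d_{L^\infty}(z, z') = \max_{1 \leq j \leq d} |z_j - z'_j|.$$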
Definition 5. A Convolutional Neural Network (CNN) is a FFNN such that the first layers are convolutional layers and the final layers are fully connected.
It is common to have pooling layers following convolutional layers in a CNN. Pooling layers downsample the size of the grids. They can be used in TCNNs in the same way and for the same purposes as traditionally used in CNNs. We do not discuss pooling in this paper for simplicity.
The convolutional correspondence maps a vertex to a vertex set that is localized with respect to the threshold. In a 2-dimensional image, the above definition gives the typical square convolutional filter. This construction results in spatially meaningful graph edges.
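To make the locality condition concrete, here is a small Python sketch that enumerates the grid points connected to a given point. It assumes an integer threshold `s`, a non-strict inequality, and a max-coordinate (L-infinity) distance, which is the choice that yields a square window; the function names are hypothetical.

```python
from itertools import product


def linf_distance(z, w):
    """Largest absolute coordinate difference between two grid points."""
    return max(abs(a - b) for a, b in zip(z, w))


def conv_neighbors(z, s):
    """All integer grid points within L-infinity distance s of z.

    z is a tuple of ints (length 2 for images, 3 for video);
    s is assumed here to be a non-negative integer threshold.
    """
    offsets = product(range(-s, s + 1), repeat=len(z))
    return {tuple(a + o for a, o in zip(z, off)) for off in offsets}


# A threshold of 1 gives a 3 x 3 window around a pixel and a
# 3 x 3 x 3 window around a video voxel.
window_2d = conv_neighbors((0, 0), 1)
window_3d = conv_neighbors((5, 5, 5), 1)
```

In this sketch a threshold `s` corresponds to a filter with `2 * s + 1` cells along each axis.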
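The graph structures given in Definitions 1, 3, and 4 yield the skeleton of a CNN. To pass data through the network, we need a system of functions on the nodes and edges of $\Gamma$ that pass data from layer $V_i$ to layer $V_{i+1}$. These functions are called activations and weights. The weights are real numbers associated to each edge,
$$\lambda_{v,w} \in \mathbb{R} \quad \text{for each edge } (v, w) \text{ of } \Gamma.$$
Let $V_i = X \times \mathbb{Z}^d$ and $V_{i+1} = X' \times \mathbb{Z}^d$, as in a CNN. Denote $v = (x, z) \in V_i$ and $w = (x', z') \in V_{i+1}$. Then the homogeneity of the weights, a characteristic of a CNN, is the translational invariance
$$\lambda_{(x, z), (x', z')} = \lambda_{(x, z + t), (x', z' + t)} \quad \text{for all } t \in \mathbb{Z}^d. \tag{1}$$
The activations associate to each vertex $v$ a real number $a_v$ and a function $f_v: \mathbb{R} \to \mathbb{R}$. To pass data from layer $V_i$ to layer $V_{i+1}$ means to determine $a_w$ for $w \in V_{i+1}$ from the values $a_v$, $v \in V_i$, via the formula
$$a_w = f_w\Big( \sum_{v \in \Gamma^{-1}(w)} \lambda_{v,w} \, a_v \Big).$$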
The activation functions in neural networks usually map to or . The output activations are a probability distribution on the output nodes of the network, so they are non-negative real numbers. We use the activation function ReLU, defined as $\mathrm{ReLU}(x) = \max(0, x)$, for all non-terminal layers, and the softmax as the terminal layer activation function. We choose a common optimization method, adaptive moment estimation (Adam), to update the weights based on the gradients computed by back-propagation.
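The weight sharing and activations described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the implementation used in the experiments: it handles a single input slice and a single output slice, uses a plain loop instead of an optimized convolution, and the function names are hypothetical.

```python
import numpy as np


def relu(x):
    """ReLU activation: max(0, x), applied elementwise."""
    return np.maximum(x, 0.0)


def softmax(z):
    """Softmax over a 1D array, shifted by the maximum for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation, no padding).

    Reusing the same kernel at every position is the translation-invariant
    weight sharing that characterizes a convolutional layer.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
img = rng.normal(size=(6, 6))
ker = rng.normal(size=(3, 3))
activations = relu(conv2d_valid(img, ker))
```

Shifting the input by one pixel shifts the output by one pixel, which is the practical consequence of the homogeneity of the weights.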
Figure 1 (a,b,c) displays an example of a weight and activation system for a convolutional layer from to , where denotes a finite set of cardinality . The correspondence is , where is fully connected and localizes connections at distance . Panel (a) shows the input , each weight-matrix in (b) is a vector of coefficients from , (c) shows the resulting activations in .
3 Topological Convolutional Layers
In Section 3.1 we introduce new types of neural network layers that form our TCNNs for 2D image classification. In Section 3.2 we introduce layers for TCNNs used for video classification. Section 3.3 demonstrates the connection between our constructions and Gabor filters.
3.1 2D Images
Locality in a typical convolutional neural network is a function of the distance between cells, which is specified by the convolutional correspondence (see Definition 4) in the case of a $d$-dimensional image. We add novel, topological criteria to this notion of locality through metrics on topological manifolds. The general technique is described in the following remark.
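Remark 6. Let $M$ be a manifold and let $X, X' \subset M$ be two discretizations of $M$, meaning finite sets of points. Let $V_i = X \times \mathbb{Z}^d$ and $V_{i+1} = X' \times \mathbb{Z}^d$ be successive layers in a FFNN. Fix a threshold $s \geq 0$. Let $d_M$ be a metric on $M$. Define a correspondence $C_M: X \to X'$ by
$$C_M(x) = \{ x' \in X' : d_M(x, x') \leq s \}$$
for all $x \in X$. Together with another threshold $s' \geq 0$, this defines a correspondence $C: V_i \to V_{i+1}$ by
$$C = C_M \times C_{d,s'},$$
where $C_{d,s'}$ is the convolutional correspondence from Definition 4. This means that
$$C(x, z) = C_M(x) \times C_{d,s'}(z)$$
for all $(x, z) \in X \times \mathbb{Z}^d$.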
The first example we give is a layer that localizes with respect to position on a circle in addition to the usual locality in a convolutional layer. Let $S^1$ be the unit circle in the plane $\mathbb{R}^2$. A typical discretization of $S^1$ is the set of $n$-th roots of unity for some positive integer $n$.
Let be two discretizations of the circle. Let and be successive layers in a FFNN. Fix a threshold .
The circle correspondence is defined by
for all , where the metric is given by
We call a circle one layer (COL) if, for some other threshold , the edge defining correspondence is of the form
where is the convolutional correspondence from Definition 4. This means that
for all .
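The following NumPy sketch shows one way a circle one layer could restrict connections between slices. It assumes the circle is discretized by the angles of the n-th roots of unity and that the metric is arc length on the unit circle; the metric actually used in the definition may differ, and all function names are hypothetical.

```python
import numpy as np


def circle_points(n):
    """Angles of the n-th roots of unity on the unit circle."""
    return 2.0 * np.pi * np.arange(n) / n


def arc_distance(a, b):
    """Arc-length distance between angles a and b (assumed metric)."""
    diff = np.abs(a - b) % (2.0 * np.pi)
    return np.minimum(diff, 2.0 * np.pi - diff)


def circle_mask(X, X_prime, s):
    """mask[i, j] is True when input slice i may connect to output slice j."""
    # A small tolerance guards against floating-point error at the threshold.
    return arc_distance(X[:, None], X_prime[None, :]) <= s + 1e-9


X = circle_points(8)
mask = circle_mask(X, X, s=np.pi / 4)

# Weights with shape (input slices, output slices, kernel height, kernel width).
weights = np.random.default_rng(0).normal(size=(8, 8, 3, 3))
pruned = weights * mask[:, :, None, None]
```

With eight slices and a threshold of one step around the circle, each slice connects only to itself and its two neighbors. Reapplying the mask after every optimizer step is one simple way to keep the excluded weights at zero during training.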
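Next, we define a layer that localizes weights with respect to a metric on the Klein bottle $\mathcal{K}$. See Figure 2 for a visualization of the nodes and weights. Recall that $\mathcal{K}$ is the $2$-dimensional manifold obtained from $\mathbb{R}^2$ as a quotient by the relations $(\theta_1, \theta_2) \sim (\theta_1 + 2k\pi, \theta_2 + 2l\pi)$ for $k, l \in \mathbb{Z}$ and $(\theta_1, \theta_2) \sim (\theta_1 + \pi, -\theta_2)$. The construction uses an embedding $F_{\mathcal{K}}$ of $\mathcal{K}$ into the vector space of quadratic functions on the square $[-1, 1]^2$, motivated by the embedded Klein bottle observed in Carlsson et al. (2008) and its appearance in the weights of CNNs as observed in Carlsson and Gabrielsson (2020). An image patch in the embedded Klein bottle has a natural ‘orientation’ given by the angle $\theta_1$. Visually, one sees lines through the center of the image at angle $\theta_1 + \pi/2$; see the top right image in Figure 3. The embedding is given by
$$F_{\mathcal{K}}(\theta_1, \theta_2)(x, y) = \sin(\theta_2)\,\big(\cos(\theta_1)\, x + \sin(\theta_1)\, y\big) + \cos(\theta_2)\, Q\big(\cos(\theta_1)\, x + \sin(\theta_1)\, y\big), \tag{2}$$
where $Q(t) = 2t^2 - 1$. As given, $F_{\mathcal{K}}$ is a function on the torus, which is parameterized by the two angles $\theta_1$ and $\theta_2$. It actually defines a function on $\mathcal{K}$ since it satisfies $F_{\mathcal{K}}(\theta_1 + 2k\pi, \theta_2 + 2l\pi) = F_{\mathcal{K}}(\theta_1, \theta_2)$ and $F_{\mathcal{K}}(\theta_1 + \pi, -\theta_2) = F_{\mathcal{K}}(\theta_1, \theta_2)$.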
Let be two finite subsets of the Klein bottle. Let and be successive layers in a FFNN. Fix a threshold .
The Klein correspondence is defined by
for all , where the metric is defined by
We call a Klein one layer (KOL) if, for some other threshold , the edge defining correspondence is of the form
which means that
for all .
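A Klein one layer can be sketched in the same spirit, with the distance between two points of the Klein bottle measured through the image patches they represent. The sketch below makes two assumptions that should be checked against the definitions: the patch for angles (theta1, theta2) is taken to be sin(theta2) * t + cos(theta2) * (2 t^2 - 1) with t = cos(theta1) x + sin(theta1) y, and the metric is an L2 distance between patches, approximated on a sampling grid. Function names are hypothetical.

```python
import numpy as np


def klein_patch(theta1, theta2, n=21):
    """Sample the assumed Klein bottle patch on an n x n grid over [-1, 1]^2."""
    xs = np.linspace(-1.0, 1.0, n)
    x, y = np.meshgrid(xs, xs, indexing="xy")
    t = np.cos(theta1) * x + np.sin(theta1) * y
    return np.sin(theta2) * t + np.cos(theta2) * (2.0 * t**2 - 1.0)


def klein_distance(p, q, n=21):
    """Approximate L2 distance between the patches of p and q = (theta1, theta2)."""
    diff = klein_patch(*p, n=n) - klein_patch(*q, n=n)
    cell_area = (2.0 / (n - 1)) ** 2  # area element of the sampling grid
    return np.sqrt(np.sum(diff**2) * cell_area)


def klein_mask(X, X_prime, s):
    """mask[i, j] is True when slices i and j are within distance s."""
    return np.array([[klein_distance(p, q) <= s for q in X_prime] for p in X])


# Two coordinate pairs that describe the same point under the assumed
# identification (theta1, theta2) ~ (theta1 + pi, -theta2).
p = (0.3, 0.7)
p_identified = (0.3 + np.pi, -0.7)
same_point_distance = klein_distance(p, p_identified)
```

Because the distance is computed from the patches rather than from the raw angles, identified coordinate pairs are automatically at distance zero, so the mask is well defined on the Klein bottle.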
We define two other layers based on the circle and the Klein bottle (Definition 9). First, we define an embedding of the circle into the space of functions on by composing with the embedding , i.e.
Now the idea is to build convolutional layers with fixed weights given by discretizations of and . This is motivated by Carlsson and Gabrielsson (2020) which showed that convolutional neural networks (in particular VGG16) learn the filters . Instead of forcing the neural networks to learn these weights, we initialize the network with these weights. Intuitively, this should cause the network to train more quickly to high accuracy. Moreover, we choose to fix these weights during training (gradient ) to prevent overfitting, which we conjecture contributes to our observed improvement in generalization to new data. We also use discretizations of the images given by the full Klein bottle as weights, motivated by the reasoning that the trained weights observed in Carlsson and Gabrielsson (2020) are exactly the high-density image patches found in Carlsson et al. (2008) which cluster around the Klein bottle. These layers with fixed weights can be thought of as a typical pretrained convolutional layer in a network.
Let or and let be a finite subset. Let and be successive layers in a FFNN. Suppose is a convolutional layer with threshold (Definition 4). Then is called a Circle Features (CF) layer or a Klein Features (KF) layer, respectively, if the weights are given for by a convolution over of the filter of size with values
In summary, we have the following. Both the circle and the Klein bottle can be discretized into a finite subset $X$ by specifying evenly spaced values of the angles. Given such a discretization $X$, the convolutional layers in a TCNN have slices indexed by $X$. Note that $X$ has a metric induced by the embedding and the metric on functions on the square. The COL and KOL layers are convolutional layers with slices indexed by $X$ where all weights between slices whose distances in $X$ are greater than some fixed threshold are forced to be zero. The CF and KF layers are convolutional layers with slices indexed by $X$ and such that the weights are instantiated on the slice corresponding to $\kappa \in X$ to be the image of $\kappa$ under the embedding, discretized to the appropriate grid shape; examples of these weights are shown in Figure 3. These weights are fixed during training. See also the visual guide in Figure 1.
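As a rough illustration of the fixed filters in a Klein Features layer, the sketch below builds a small filter bank from evenly spaced angles. It assumes the same patch formula as the Klein bottle embedding (with quadratic term 2 t^2 - 1), takes theta1 in [0, pi) and theta2 in [0, 2 pi) as a fundamental domain, and discretizes each patch by sampling at the centers of a k x k grid of cells; the definition may instead average the patch over each cell. Names and default sizes are hypothetical.

```python
import numpy as np


def klein_filter(theta1, theta2, k=3):
    """Sample the assumed Klein bottle patch at the centers of a k x k grid."""
    centers = (2.0 * np.arange(k) + 1.0) / k - 1.0  # cell centers in [-1, 1]
    x, y = np.meshgrid(centers, centers, indexing="xy")
    t = np.cos(theta1) * x + np.sin(theta1) * y
    return np.sin(theta2) * t + np.cos(theta2) * (2.0 * t**2 - 1.0)


def klein_filter_bank(n1=4, n2=4, k=3):
    """Stack filters for evenly spaced angles; shape (n1 * n2, k, k)."""
    theta1s = np.pi * np.arange(n1) / n1
    theta2s = 2.0 * np.pi * np.arange(n2) / n2
    return np.stack([klein_filter(a, b, k) for a in theta1s for b in theta2s])


bank = klein_filter_bank()

# theta2 = pi / 2 gives an edge-like (odd) filter, theta2 = 0 a line-like (even) one.
edge = klein_filter(0.0, np.pi / 2)
line = klein_filter(0.0, 0.0)
```

In a deep learning framework these arrays would be loaded as the kernels of the first convolutional layer and excluded from the optimizer, so that they stay fixed during training.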
3.2 Video
For video, the space of features of interest is parameterized by the tangent bundle of the translational Klein bottle. The translational Klein bottle is a $3$-dimensional manifold and its tangent bundle is $6$-dimensional. These manifolds parameterize video patches (3) in a manner related to the Klein bottle parameterization of 2D image patches from (2).
Before providing the precise definitions of and as well as the formulas for the parameterizations, we describe the idea roughly. An image patch in the embedded Klein bottle has a natural ‘orientation’ given by the angle . Visually, one sees lines through the center of the image at angle . Given a real number , there is a 2D image patch given by the translation of by units along the line through the origin at angle , i.e. along the line perpendicular to the lines in the image. One can extend this image to a video that is constant in time. Videos that change in time are obtained by enlarging to its tangent bundle . The tangent bundle consists of pairs of a point and a vector tangent to . The embedding sends such a pair to a video patch that at time is the image For example, is the video patch that translates at unit speed along the line through the origin at angle . Similarly, is the video patch that rotates at unit speed.
Precisely, we use the following construction. Denote the coordinates on by the variables . The variables parameterize and the variables parameterize the tangent spaces. To be precise, is given as the quotient of by the relations for all and , and similarly can be described as a quotient of . We suppress further discussion of these relations because they are only significant to this work in that they are respected by the embeddings and .
Let . Denote by the space of continuous functions , which represent image patches at infinite resolution, and similarly denote the space of video patches by . The embeddings
are given by
Using the embedding , we define a metric on by pulling back the metric on ,
The metric allows us to define a new type of layer in a neural network. Recall the version of the convolutional correspondence from Definition 4.
(6D Moving Klein Correspondence) Let be two finite subsets. Let and be successive layers in a FFNN. Fix a threshold .
The 6D Moving Klein correspondence is defined by
for all , where the metric is defined in (4).
We call a 6D Moving Klein one layer (6MKOL) if, for some other threshold , the edge defining correspondence is of the form
which means that
for all .
There are particular submanifolds of whose corresponding video patches we conjecture to be most relevant for video classification. In 6MKOL layers, we often choose and to be subsets of these submanifolds. One reason to do this is that discretizing the -dimensional manifold results in a large number of filters, significantly bloating the size of the neural network. Indeed, discretizing and into values each, as in our Klein bottle experiments, and discretizing the other dimensions into only values produces points in . Moreover, these are videos rather than static images, so they contain many pixels: pixels for video with time steps. Another reason to specialize to submanifolds is one general philosophy of this paper: Well-chosen features, rather than an abundance of features, provide better generalization due to less over-fitting.
The five -dimensional submanifolds of that we choose to work with are
Under the embedding , corresponds to the Klein bottle images held stationary in time, corresponds to the Klein bottle images translating in time perpendicular to their center line, as described above, where the sign controls the direction of translation, and corresponds to the Klein bottle images rotating either clockwise or counterclockwise depending on the sign .
We now define convolutional layers with fixed weights given by discretizations of the video patches corresponding to the manifolds and . These can be viewed as pretrained convolutional layers that would typically appear as the first layer in a pretrained video classifier. The motivation is the same as for the analogous CF and KF layers defined for images in Definition 9 – faster training and better generalization to new data due to initializing the network with meaningful filters that are fixed during training.
Let be any subset, for example a submanifold such as or (see (5)), or unions of such submanifolds. Let be a finite subset.
Let and be successive layers in a FFNN. Suppose is a convolutional layer with threshold (i.e., the edge-defining correspondence is as in Definition 4). Then is called an M-Features (M-F) layer if the weights are given for by a convolution over of the filter of size with values
3.3 Gabor filters versus Klein bottle filters
The Klein bottle filters given as in (2) are related to Gabor filters. In fact, apart from a minor difference, they are a particular type of Gabor filter. The purpose of this section is to explain this relationship. The significance of this relationship is that, while Gabor filters are commonly used in image recognition tasks, our constructions use a particular 2-parameter family of Gabor filters that is especially important, as identified by the analysis in Carlsson et al. (2008). Restricting to this 2-parameter family provides a compact set of filters that can be effectively used as pretrained weights in a neural network. One may use other families of Gabor filters for this purpose, but then the question is on what basis one chooses a particular family. The Klein bottle filters are a topologically justified choice. In Section 4.1.6, we compare the performance of some other choices of Gabor filters with the Klein bottle filters.
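Recall that the Gabor filters are functions on the square $[-1, 1]^2$ given by
$$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\Big( -\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2} \Big) \cos\Big( 2\pi \frac{x'}{\lambda} + \psi \Big), \qquad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta.$$
The Klein bottle filters do not taper off in intensity near the edges of the square, so we would like to remove this tapering from the Gabor filters too for a proper comparison. This is done by removing the exponential term from $g$, or equivalently, setting $\sigma = \infty$. This simultaneously removes the dependence on $\gamma$. So we have the restricted class of filters
$$g(x, y; \lambda, \theta, \psi) = \cos\Big( 2\pi \frac{x'}{\lambda} + \psi \Big).$$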
Given parameterizing the Klein bottle filter , we claim that a similar Gabor filter is given by
To see the similarity between and , we examine the ‘primary circle’ and the ‘small circle’ . The formula for other values of interpolates along the Klein bottle between these two circles. On the primary circle, we have
These are both odd functions of that are equal to at and are equal to at . Similarly, on the small circle, we have
These are both even functions of that are equal to at and are equal to at .
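The removal of the tapering discussed in this section can be checked numerically. The sketch below uses the standard real-valued Gabor form, a Gaussian envelope times a cosine carrier, which is assumed here to be the parameterization intended in the text; the parameter names `lam`, `theta`, `psi`, `sigma`, and `gamma` follow common usage and are not taken from the paper.

```python
import numpy as np


def gabor(x, y, lam, theta, psi, sigma=np.inf, gamma=1.0):
    """Standard real Gabor filter; sigma = inf removes the Gaussian envelope."""
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + gamma**2 * yr**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * xr / lam + psi)


xs = np.linspace(-1.0, 1.0, 5)
x, y = np.meshgrid(xs, xs, indexing="xy")

# With sigma = inf the envelope is identically 1, so gamma has no effect.
untapered_a = gabor(x, y, lam=2.0, theta=0.3, psi=0.5, gamma=0.5)
untapered_b = gabor(x, y, lam=2.0, theta=0.3, psi=0.5, gamma=3.0)

# A finite sigma tapers the filter toward the edges of the square.
tapered = gabor(x, y, lam=2.0, theta=0.3, psi=0.5, sigma=0.5, gamma=0.5)
```

The untapered filter depends only on the wavelength, the orientation, and the phase, which is what makes a comparison with a 2-parameter family such as the Klein bottle filters possible once the wavelength is tied to the other parameters.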
4 2D Images
4.1 Experiments and Results
We conduct several experiments on the image datasets described in Section 4.1.1. On individual datasets, we investigate the effect of Gaussian noise on training the TCNN (Section 4.1.2), the interpretability of TCNN activations (Section 4.1.3), and the learning rate of TCNNs in terms of testing accuracy over a number of batches (Section 4.1.4). Across different datasets, we investigate the generalization accuracy of TCNNs when trained on one dataset and tested on another (Section 4.1.5). We compare TCNNs to traditional CNNs in all of these domains. We also test other choices of Gabor filters versus Klein filters in the KF layers; see Section 4.1.6. All of these experiments use one or two convolutional layers, followed by a 3-layer fully connected network terminating in 10 or 2 nodes, depending on whether the network classifies digits or cats and dogs. For additional details on the neural network models, metaparameters, and train/test splits, see Section 4.2.
4.1.1 Description of Data
We perform digit classification on 3 datasets: MNIST LeCun et al. (1998), SVHN Netzer et al. (2011) and USPS Hull (1994). These datasets are quite different from each other in style, while all consisting of images of digits 0 through 9. In particular, a human can easily identify to which dataset a particular image belongs. This makes generalization between the datasets a non-trivial task, because neural networks that train on one of the datasets will in general overfit to the particular style and idiosyncrasies present in the given data, which will be inconsistent with different-looking digits from the other datasets. The datasets come at significantly different resolutions: $28 \times 28$, $32 \times 32$, and $16 \times 16$, respectively. Additionally the sizes of the datasets vary widely: roughly , , and , respectively. SVHN digits are typeset whereas MNIST and USPS are handwritten. MNIST and USPS are direct 2-D impressions with significant pre-processing already done, while the SVHN numbers are natural images of typeset digits with tilts, warping, presence of secondary digits, and other irregularities.
We also use two collections of labeled images of cats and dogs: Cats vs. Dogs Kaggle (2013) (which we call Kaggle), and the cat and dog labeled images from CIFAR-10, see Krizhevsky (2012). Note that the Kaggle Cats vs. Dogs dataset, upon our download, included several images which were empty/corrupt and could not be loaded, so the size reported here is the total number of loadable images. This seems to be typical for this dataset. These datasets contain and images, respectively. The resolutions of the images in each dataset are and , respectively. Since we use these datasets to test the generalization accuracy of a neural network trained on one and tested on the other, we down-resolve the Kaggle data to