# Efficient Topological Layer based on Persistent Landscapes

## Abstract

We propose a novel topological layer for general deep learning models based on persistent landscapes, in which we can efficiently exploit underlying topological features of the input data structure. We use the robust DTM function and show differentiability with respect to layer inputs, for a general persistent homology with arbitrary filtration. Thus, our proposed layer can be placed anywhere in the network architecture and feed critical information on the topological features of input data into subsequent layers to improve the learnability of the networks toward a given task. A task-optimal structure of the topological layer is learned during training via backpropagation, without requiring any input featurization or data preprocessing. We provide a tight stability theorem, and show that the proposed layer is robust towards noise and outliers. We demonstrate the effectiveness of our approach by classification experiments on various datasets.

Keywords: topological data analysis, deep learning, persistent diagram, persistent homology, topological feature, stability theorem

## 1 Introduction

With its strong generalizability, deep learning has become one of the most pervasively applied techniques in machine learning. However, there is still no general principle for choosing an optimal model architecture for a given task, and performance often varies drastically from task to task. To improve the learnability of deep learning models, various architectures and layer structures have been proposed. Some approaches devise efficient data processing methods through specialized feature maps. For instance, it is widely known that inserting a convolutional layer greatly improves visual object recognition and other tasks in computer vision (e.g., Krizhevsky et al., 2012; LeCun et al., 2016). On the other hand, a large body of work in computer vision focuses on choosing optimal initial architectures (Simonyan and Zisserman, 2014; He et al., 2016; Szegedy et al., 2015). Moreover, a substantial number of recent studies have explored how various properties of neural networks (e.g., the depth, width, and connectivity) relate to their expressivity and generalization capability (e.g., Raghu et al., 2017; Daniely et al., 2016; Guss and Salakhutdinov, 2018).

In this paper, we explore an alternative way to enhance the learnability of deep learning models by introducing a novel topological layer that feeds topological features of the underlying data structure into an arbitrary network. The power of topology lies in its capacity to differentiate sets in a topological space in a robust and meaningful geometric way (Carlsson, 2009; Ghrist, 2008). It provides important insight into the global "shape" of the data structure via persistent homology (Zomorodian and Carlsson, 2005). The use of topological methods in data analysis has been limited by the difficulty of combining the main tool of the subject, persistent homology, with statistics and machine learning. However, a series of recent studies have reported notable success in utilizing topological methods in data analysis (e.g., Zhu, 2013; Nanda and Sazdanović, 2014; Tralie and Perea, 2018; Seversky et al., 2016; Brown and Knudson, 2009; Gamble and Heo, 2010; Pereira and de Mello, 2015; Umeda, 2017; Liu et al., 2016; Venkataraman et al., 2016; Emrani et al., 2014).

Specifically, we design a novel topological layer, based on tools from topological data analysis, that can be placed into any deep learning model. The proposed topological layer consists of many structure elements, each of which is a differentiable parametrized projection that is trained to pass critical information on the topological features of the layer's input through the network in order to improve the task-specific learning performance. There are at least three benefits of using the topological layer in deep learning: 1) we can efficiently extract robust global features of input data that otherwise would not be readily accessible via traditional feature maps, 2) the optimal structure of the layer for a given learning task can be easily learned via backpropagation during training, and 3) with a proper filtration it can be applied to arbitrarily complicated data structures even without any data preprocessing.

Related Work. The idea of incorporating topological theories into deep learning has been explored only recently, and mostly via feature engineering methods that use fixed, predefined features containing topological information (e.g., Umeda, 2017; Liu et al., 2016). Guss and Salakhutdinov (2018); Rieck et al. (2018) proposed a complexity measure for neural network architectures based on topological data analysis. Carlsson and Gabrielsson (2018) applied topological approaches to deep convolutional networks to understand and improve the computations of the network. Hofer et al. (2017) first developed a technique to input persistent diagrams into neural networks by introducing the topological layer. Poulenard et al. (2018); Gabrielsson et al. (2019); Hofer et al. (2019); Carrière et al. (2019) also proposed topology loss functions and topology layers in particular forms. Nevertheless, all the previous approaches for topological layer/loss suffer from the following limitations: 1) they rely on a particular parametrized map or filtration, 2) they lack a stability result or the result is limited to a particular type of input data representation, and 3) most importantly, the differentiability of persistent homology is not guaranteed for an arbitrary input setting, so there is no guarantee that the layer can be used in the middle of deep networks in general.

Contribution. This paper presents a new topological layer that does not suffer from the above limitations. Our topological layer does not rely on a particular filtration or parametrized mapping but still possesses favorable theoretical properties. The proposed layer is designed based on weighted persistent landscapes so that it suffers less from extreme persistence distortion. We verify that our stability bound can be tighter than that of Hofer et al. (2017), and that the layer is also stable with respect to small perturbations, noise, and outliers in the input data. Importantly, we show that our layer is differentiable with respect to the layer input for a general persistent homology with any valid filtration, so we can place it anywhere in the network to extract topological features where they might be useful.

## 2 Background and definitions

Topological data analysis (TDA) is a recent and emerging field of data science that relies on topological and geometric tools to infer relevant features for possibly complex data (Carlsson, 2009). In this section, we briefly review basic concepts and main tools in TDA which we will use to develop our own topological layer. We refer the interested reader to Chazal and Michel (2017); Hatcher (2002); Edelsbrunner and Harer (2010); Chazal et al. (2009, 2016b) for further details and formal definitions.

Throughout, we will let denotes a subset of , and denotes a finite collection of points from an arbitrary space , and let denote the open ball centered at with radius .

### 2.1 Simplicial complex, persistent homology, and diagrams

When inferring topological properties of from its finite collection of samples , we rely on simplicial complex , a discrete structure built over the observed points to provide a topological approximation of the underlying space. Two common examples are the Čech complex and the Vietoris-Rips complex. The Čech complex is the simplicial complex where -simplices correspond to the nonempty intersection of balls centered at vertices. The Vietoris-Rips complex, also referred to as Rips complex, is the simplicial complex where simplexes are built based on pairwise distances among its vertices. We refer to Section A for formal definitions.

A collection of simplicial complexes satisfying whenever is called a filtration of . A typical way of setting the filtration is through a monotonic function on the simplex. A function is monotonic if whenever is a face of . Now we let , then the monotonicity implies that is a subcomplex of and whenever . In this paper, we assume that the filtration is built from a monotonic function.

Persistent homology (Barannikov, 1994; Zomorodian and Carlsson, 2005; Edelsbrunner et al., 2000; Chazal et al., 2014a) is a multiscale approach to represent topological features of the complex . Specifically, one can apply homology in some degree (e.g., -dimension: connected components, -dimension: loops, -dimension: cavities, …) with coefficients in some field to the above finite filtered complex, again to obtain a sequence of vector spaces and linear maps such that

$$
H(K_0) \to H(K_1) \to \cdots \to H(K_n)
$$

which we call a persistence module. It turns out that the persistence module can be completely described by a finite sequence of pairs with . For such pair , there is a choice of a nonzero homology class that is not in the image of and whose image is nonzero in but is zero in . We often say is born at and dies at , and will refer to each as birth-death pair. In this sense, we call the persistence of a homological feature (or homological persistence, collectively). Note that for now we have but we can generalize to by associating to a corresponding increasing sequence of real numbers in the filtration. Finally, considering these pairs as points in the plane, one obtains the persistence diagram as below.

###### Definition 2.1.

Let . A persistent diagram is a finite multiset of . We let denote the set of all such ’s.

We will often write to indicate that persistent diagrams are drawn from the simplicial complex constructed on original data source . In computational settings, there are always only finitely many points with finite numbers in the persistence diagram and we usually truncate ’s that persist until the maximum filtration value at that value. Thus, as in Bubenik (2018) we make the following assumption.

###### Assumption A1.

Throughout this paper, we will only consider persistence diagram space where every persistent diagram consists of finitely many points with .

Lastly, we define the following metrics to measure the distance between two persistent diagrams.

###### Definition 2.2 (Bottleneck and Wasserstein distance).

Given two persistence diagrams and , their bottleneck distance () and -th Wasserstein distance () are defined as

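$$
\begin{aligned}
d_B(D_X, D_Y) &= \inf_{\gamma\in\Gamma}\,\sup_{p\in D_X} \|p-\gamma(p)\|_\infty, \\
W_q(D_X, D_Y) &= \Big[\inf_{\gamma\in\Gamma}\sum_{p\in D_X} \|p-\gamma(p)\|_\infty^{q}\Big]^{\frac{1}{q}},
\end{aligned}
$$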

respectively, where is the usual -norm and the set consists of all the bijections , where Diag is the diagonal with infinite multiplicity.

Note that for , for any given . As tends to infinity, the Wasserstein distance approaches the bottleneck distance.

### 2.2 Persistent Landscapes

A persistence diagram is a multiset, which makes it difficult to analyze and feed as input to learning or statistical methods. Hence, it is useful to transform the persistent homology into a functional Hilbert space, where the analysis is easier and learning methods can be directly applied. Good examples include persistent landscapes (Bubenik, 2015, 2018; Bubenik and Dłotko, 2017) and silhouettes (Chazal et al., 2015, 2014b), both of which are real-valued functions that further summarize the crucial information contained in a persistence diagram. Here we briefly introduce the persistent landscapes as we use them to design our layer.

Landscapes. Let denote a persistent diagram that contains off-diagonal birth-death pairs . We first define a set of functions for each birth-death pair in as follows:

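$$
\Lambda_p(t) =
\begin{cases}
t-b, & t\in\left[b, \frac{b+d}{2}\right] \\
d-t, & t\in\left(\frac{b+d}{2}, d\right] \\
0, & \text{otherwise.}
\end{cases}
\tag{1}
$$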

For each birth-death pair , is piecewise linear. Then the persistence landscape of the persistent diagram is defined by the sequence of functions , where

$$
\lambda_k(t) = \underset{p}{\operatorname{kmax}}\;\Lambda_p(t), \quad t\in[0,T],\ k\in\mathbb{N},
\tag{2}
$$

and is the -th largest value in the set. Hence, the persistence landscape is a function . is large enough to satisfy . is commonly denoted by -th order persistent landscape.

Note that landscape functions can be evaluated over , and are easy to compute. Many recent studies including previously listed ones have revealed that this kind of functional summaries of the persistence module not only show favorable theoretical properties, but can be easily averaged and used for subsequent statistics and machine learning modeling Chazal et al. (2014b); Bubenik (2018); Bubenik and Dłotko (2017).
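To make equations (1) and (2) concrete, the following is a minimal NumPy sketch that evaluates the first few landscape functions of a small diagram on a grid. The function names `tent` and `landscapes` and the toy diagram are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def tent(t, b, d):
    """Tent function of equation (1) for one birth-death pair (b, d)."""
    # max(0, min(t - b, d - t)) equals t - b on [b, (b+d)/2],
    # d - t on ((b+d)/2, d], and 0 elsewhere.
    return np.maximum(0.0, np.minimum(t - b, d - t))

def landscapes(diagram, t, k_max):
    """Return an array of shape (k_max, len(t)); row k-1 holds lambda_k(t)."""
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    t = np.asarray(t, dtype=float)
    # One row per birth-death pair, one column per evaluation point.
    tents = tent(t[None, :], diagram[:, :1], diagram[:, 1:])
    # Sort every column in decreasing order: row k-1 is the k-th largest value.
    ordered = -np.sort(-tents, axis=0)[:k_max]
    out = np.zeros((k_max, len(t)))  # orders beyond the number of pairs stay 0
    out[: ordered.shape[0]] = ordered
    return out

# Toy diagram (hypothetical values) with two nested birth-death pairs.
diagram = [(0.0, 4.0), (1.0, 3.0)]
t = np.linspace(0.0, 4.0, 5)  # evaluation points 0, 1, 2, 3, 4
lam = landscapes(diagram, t, k_max=3)
```

Here the first row of `lam` follows the larger tent, the second row follows the smaller one, and the third row is identically zero because the diagram has only two pairs.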

### 2.3 Distance to measure (DTM) function

The Distance to measure (DTM) (Chazal et al., 2011, 2016a) is a robustified version of the distance function. More precisely, the DTM for a probability distribution with parameter and is defined as

$$
d_{\mu,m_0}(x) = \left(\frac{1}{m_0}\int_0^{m_0} \big(\delta_{\mu,m}(x)\big)^{r}\,dm\right)^{1/r},
$$

where . If is not specified, then is used as a default. When is a weighted empirical measure , where is an indicator function, the empirical DTM is

$$
\hat{d}_{m_0}(x) = d_{P_n,m_0}(x) = \left(\frac{\sum_{X_i\in N_k(x)} \varpi'_i\,\|X_i-x\|^{r}}{m_0\sum_{i=1}^{n}\varpi_i}\right)^{1/r},
\tag{3}
$$

where is the subset of containing the nearest neighbors of , is such that , and for one of ’s that is -th nearest neighbor of and otherwise. Hence the empirical DTM behaves similarly to the -nearest distance with . The DTM is a preferred choice for the filtration function, since the persistence diagram computed on the DTM is less prone to noise.
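As an illustration of equation (3), the sketch below computes the empirical DTM in the simplest special case. It assumes uniform weights (every weight equal to 1) and that the product of `m0` and the sample size is an integer `k`, so that the adjusted weights coincide with the original ones and the formula reduces to a power mean of the `k` nearest-neighbor distances. The function name `empirical_dtm` is hypothetical.

```python
import numpy as np

def empirical_dtm(points, queries, m0, r=2.0):
    """Empirical DTM of equation (3), assuming uniform weights and integer m0 * n."""
    points = np.asarray(points, dtype=float)
    queries = np.asarray(queries, dtype=float)
    n = len(points)
    k = int(round(m0 * n))
    if k < 1 or abs(k - m0 * n) > 1e-9:
        raise ValueError("this sketch assumes m0 * n is a positive integer")
    # Pairwise distances between queries and data points, shape (n_queries, n).
    dists = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=-1)
    # Distances from each query to its k nearest data points.
    knn = np.sort(dists, axis=1)[:, :k]
    # With unit weights, m0 * sum(weights) = k, so (3) is a power mean over N_k(x).
    return np.mean(knn ** r, axis=1) ** (1.0 / r)

# Three clustered points and one outlier (hypothetical data).
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
value = empirical_dtm(points, [[0.0, 0.0]], m0=0.5)  # uses the k = 2 nearest points
```

Because only the `k` nearest points enter the average, the far-away outlier has no effect on the value at the origin, which is the robustness property the text refers to.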

## 3 A novel topological layer based on weighted persistent landscapes

In this section, we present a detailed algorithm to obtain the novel topological layer for a neural network. Let , , denote input, persistent diagram induced from , the topological layer, respectively. Broadly speaking, the construction of our topological layer consists of two steps: computing persistent diagram from the input and constructing topological layer from the persistent diagram.

### 3.1 Computation of diagram: X→DX

For computing the persistence diagram from the input data, we first need to define the filtration. Given a simplicial complex and a function , we use the filtration . There can be several choices for the simplicial complex and the function . One popular choice for the simplicial complex is the Rips complex, where the complex is the power set of and the function is . Another option is when is a grid (so that there is a natural cubical complex ) and the function is defined on , to extend the function to the complex by . Then, we compute the persistence diagram of the filtration and denote this by .
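As a small illustration of a Rips-type filtration function, the sketch below assigns to each simplex the largest pairwise distance among its vertices. This max-distance convention is an assumption made here (scaling conventions, such as dividing by two, vary between references), and `rips_value` is a hypothetical helper name.

```python
import itertools
import numpy as np

def rips_value(points, simplex):
    """Filtration value of a simplex: largest pairwise distance among its vertices."""
    pairs = itertools.combinations(simplex, 2)
    # A single vertex has no pairs, so its value defaults to 0.
    return max((float(np.linalg.norm(points[i] - points[j])) for i, j in pairs),
               default=0.0)

points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
triangle = (0, 1, 2)
edges = list(itertools.combinations(triangle, 2))

value_triangle = rips_value(points, triangle)
value_edges = [rips_value(points, e) for e in edges]
# Monotonicity: every face enters the filtration no later than the simplex itself.
is_monotonic = all(v <= value_triangle for v in value_edges)
```

The monotonicity check mirrors the requirement that the value on a face never exceeds the value on the simplex containing it, which is what makes the sublevel sets nested subcomplexes.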

One appealing choice for is the distance to measure (DTM) function (see Section 2.3). As the resultant persistent diagrams from the DTM function are robust to noise (Chazal et al., 2011, 2017), the DTM function has been widely used in topological data analysis (Anai et al., 2019; Xu et al., 2019). Nonetheless, despite this and other favorable properties, to the best of our knowledge, the DTM function has not been adopted by previous studies in deep learning.

We detail two possible scenarios, where the input data can be used as either the data points or the weights. First, the input data is considered as the empirical data points, and then the empirical DTM in (3) with weights ’s becomes

$$
\hat{d}_{m_0}(x) = \left(\frac{\sum_{X_i\in N_k(x)} \varpi'_i\,\|X_i-x\|^{r}}{m_0\sum_{i=1}^{n}\varpi_i}\right)^{1/r},
\tag{4}
$$

where and are determined as in (3).

Second, the input data is considered as the weights corresponding to fixed points , and then the empirical DTM in (3) with data points ’s and weights ’s becomes

$$
\hat{d}_{m_0}(x) = \left(\frac{\sum_{X_i\in N_k(x)} X'_i\,\|Y_i-x\|^{r}}{m_0\sum_{i=1}^{n}X_i}\right)^{1/r},
\tag{5}
$$

where and are determined as in (3).

### 3.2 Construction of topological layer: DX→htop

Our topological layer is defined upon a parametrized mapping which takes any persistent diagram to be projected onto . Our construction is no longer afflicted by the artificial bending due to separate logarithmic transformations (Hofer et al., 2017), yet still guarantees the crucial information in the persistence diagram to be well preserved by harnessing persistent landscapes (2) with more favorable theoretical properties which we will address in Section 5. Furthermore, insignificant points with low persistence are likely to be ignored systematically without introducing additional nuisance parameters (Bubenik and Dłotko, 2017).

Let denote . Given a persistent diagram of a certain homological dimension, we compute -th order persistent landscape function on the fixed interval for every . Then, we compute the weighted average with a weight parameter , for all . Next, we set a resolution , and sample equal-interval points from to obtain . Consequently, we have defined a mapping which is a (vectorized) finite-sample approximation of the weighted persistent landscapes at the resolution , at fixed, predetermined locations. Finally, we consider a parametrized differentiable map which takes and is differentiable with respect to as well. Now, the projection of with respect to the mapping defines a single structure element for our topological input layer. We summarize the procedure in Algorithm 1.

The projection is continuous at every . Also note that it is differentiable with respect to and , regardless of resolution level . In what follows, we delineate some guidelines for encoding each parameter.

: The weight parameter can be initialized as equal weight, i.e. , and will be re-determined in the way that a certain landscape that conveys significant information has more weight. In general, lower-order landscapes tend to be more significant than higher-order landscapes (mostly low persistence), but the optimal weights may vary from task to task and thus should be determined after the training process. We use the softmax layer and treat as -th entry of the softmax output.

: Likewise, some birth-death pairs ’s, encoded as ’s in (1), may contain more crucial information about topological features of the input data structure than others. Roughly speaking, this is equivalent to saying that certain mountains (or their ridges or valleys) in the landscape are especially important. Hence, the parametrized map should be able to reflect this by design. In general, this can be done by an affine transformation with scale and translation parameters, followed by an extra nonlinearity and normalization if necessary. We list two possible choices below.

• Affine transformation: with scale and translation parameter , and .

• Logarithmic transformation: with same , .

Note that other constructions of are also possible so long as they satisfy the sufficient conditions described above. Finally, since each structure element corresponds to a single node in layer, given a collection of the structure elements obtained by Algorithm 1 we concatenate them altogether to form our topological layer.
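The construction of a single structure element can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the landscape weights are obtained from a softmax over free logits (as described above), the grid consists of `m` equally spaced points on a hypothetical interval `[t_min, t_max]`, and the map applied to the sampled landscape is taken to be the affine form `sigma . (x - mu)`, which is one plausible reading of the affine option and not necessarily the exact parametrization used by the authors.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def structure_element(diagram, omega_logits, sigma, mu, t_min, t_max, m):
    """One node of the layer: a parametrized map of the sampled weighted landscape."""
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    k_max = len(omega_logits)
    grid = np.linspace(t_min, t_max, m)
    # Tent functions of equation (1), one row per birth-death pair.
    tents = np.maximum(0.0, np.minimum(grid[None, :] - diagram[:, :1],
                                       diagram[:, 1:] - grid[None, :]))
    # Rows of lam are lambda_1, ..., lambda_{k_max} sampled on the grid.
    ordered = -np.sort(-tents, axis=0)[:k_max]
    lam = np.zeros((k_max, m))
    lam[: ordered.shape[0]] = ordered
    # Weighted average landscape with softmax weights.
    avg = softmax(omega_logits) @ lam
    # Assumed affine map: scale sigma and translation mu, both vectors of length m.
    return float(np.dot(sigma, avg - mu))

diagram = [(0.0, 4.0), (1.0, 3.0)]  # hypothetical toy diagram
node = structure_element(diagram, omega_logits=[0.0, 0.0],
                         sigma=np.ones(5), mu=np.zeros(5),
                         t_min=0.0, t_max=4.0, m=5)
```

A full layer would evaluate several such elements with different parameter sets on the same diagram and concatenate the resulting scalars. In practice the same computation would be written in an automatic differentiation framework so that the weights, `sigma`, and `mu` can be trained by backpropagation.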

###### Definition 3.1 (Topological layer (TopLayer) based on weighted persistent landscapes).

For , let denote the set of parameters for the -th element and . Given and resolution , we define our topological layer by a parametrized mapping with of such that

$$
D \mapsto \big(S_{\eta_i}(D;\nu)\big)_{i=1}^{n_h}.
\tag{6}
$$

Note that this is nothing but a concatenation of topological structure elements (nodes) with different parameter sets (thus is our layer dimension). The topological layer defined above is trainable via backpropagation as each is differentiable with respect to . Figure 1 provides some real data examples of persistent landscapes for 1-dimensional features. As shown in Figure 2, topological features are expected to be robust against external noise.

## 4 Differentiability

This section is devoted to the analysis of the differential behavior of the proposed topological layer with respect to its inputs (or outputs from previous layer), by computing the derivatives . Since , this can be done by combining two derivatives and . We have extended Poulenard et al. (2018) so that we can compute above derivatives for general persistent homology under arbitrary filtration in our setting. We present the result in Theorem 4.1.

###### Theorem 4.1.

Let be the filtration function. Let be a map from each birth-death point to a pair of simplices . Suppose that is locally constant at , and are differentiable with respect to ’s, and is differentiable. Then, is differentiable and

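$$
\frac{\partial h_{top}}{\partial X_j}
= \sum_i \frac{\partial f(\beta_i)}{\partial X_j} \sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial b_i}
+ \sum_i \frac{\partial f(\delta_i)}{\partial X_j} \sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial d_i}.
$$

###### Proof.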

Let be the simplicial complex, and suppose all the simplices are ordered in the filtration so that the values of are nondecreasing, i.e. if comes earlier than then . Note that the map from each birth-death point to a pair of simplices is simply the pairing returned by the standard persistence diagram (Carlsson et al., 2005). Let be the homological feature corresponding to , then the birth simplex is the simplex that forms in , and the death simplex is the simplex that causes to collapse in . For example, if were a -dimensional feature, then is the edge in that forms the loop corresponding to , and is the triangle in which allows the loop corresponding to to be contracted in .

Now, and , and from being locally constant on ,

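$$
\frac{\partial b_i}{\partial X_j} = \frac{\partial f(\xi(b_i))}{\partial X_j} = \frac{\partial f(\beta_i)}{\partial X_j},
\qquad
\frac{\partial d_i}{\partial X_j} = \frac{\partial f(\xi(d_i))}{\partial X_j} = \frac{\partial f(\delta_i)}{\partial X_j}.
\tag{7}
$$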

Therefore, the derivatives of the birth value and the death value are the derivatives of the filtration function evaluated at the corresponding pair of simplices. And is the collection of these derivatives, hence applying (7) gives

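$$
\frac{\partial D_X}{\partial X}
= \left\{\left(\frac{\partial b_i}{\partial X_j}, \frac{\partial d_i}{\partial X_j}\right)\right\}_{(b_i,d_i)\in D_X,\ X_j\in X}
= \left\{\left(\frac{\partial f(\beta_i)}{\partial X_j}, \frac{\partial f(\delta_i)}{\partial X_j}\right)\right\}_{\xi^{-1}(\beta_i,\delta_i)\in D_X,\ X_j\in X}.
\tag{8}
$$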

For computing , note that can be computed using the chain rule as

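$$
\frac{\partial h_{top}}{\partial b_i}
= \frac{\partial S_{\theta,\omega}}{\partial b_i}
= \frac{\partial (g_\theta \circ \overline{\Lambda}_\omega)}{\partial b_i}
= \nabla g_\theta \circ \frac{\partial \overline{\Lambda}_\omega}{\partial b_i}
= \sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \frac{\partial \overline{\lambda}_\omega(l\nu)}{\partial b_i},
$$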

where we use as shorthand notation for input of the function . Then, applying gives

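$$
\frac{\partial h_{top}}{\partial b_i} = \sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial b_i}.
\tag{9}
$$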

Similarly,

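$$
\frac{\partial h_{top}}{\partial d_i} = \sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial d_i}.
\tag{10}
$$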

And therefore, is the collection of these derivatives, i.e.,

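$$
\frac{\partial h_{top}}{\partial D_X}
= \left\{\left(\sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial b_i},\
\sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial d_i}\right)\right\}_{(b_i,d_i)\in D_X}.
\tag{11}
$$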

Hence, by combining (8) and (11), can be computed as

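$$
\begin{aligned}
\frac{\partial h_{top}}{\partial X_j}
&= \sum_i \frac{\partial h_{top}}{\partial b_i}\frac{\partial b_i}{\partial X_j} + \sum_i \frac{\partial h_{top}}{\partial d_i}\frac{\partial d_i}{\partial X_j} \\
&= \sum_i \frac{\partial f(\beta_i)}{\partial X_j} \sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial b_i}
+ \sum_i \frac{\partial f(\delta_i)}{\partial X_j} \sum_{l=1}^{m} \frac{\partial g_\theta}{\partial x_l} \sum_{k=1}^{K_{max}} \omega_k \frac{\partial \lambda_k(l\nu)}{\partial d_i}.
\end{aligned}
$$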

Note that are nothing but piecewise constant functions and are easily computed even in explicit forms. Also can be easily realized by using an automatic differentiation framework such as TensorFlow or PyTorch.
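As a worked illustration of the piecewise constant derivatives mentioned above, differentiating the tent function in equation (1) for a single pair $p=(b,d)$ gives the following. This is a sketch that holds away from the breakpoints $t=b$, $t=(b+d)/2$, $t=d$ and away from ties between tent functions, which is an assumption added here for simplicity.

$$
\frac{\partial \Lambda_p(t)}{\partial b} =
\begin{cases}
-1, & t\in\left(b, \frac{b+d}{2}\right) \\
0, & \text{otherwise,}
\end{cases}
\qquad
\frac{\partial \Lambda_p(t)}{\partial d} =
\begin{cases}
1, & t\in\left(\frac{b+d}{2}, d\right) \\
0, & \text{otherwise.}
\end{cases}
$$

Since $\lambda_k(t)$ in equation (2) coincides with the tent function that attains the $k$-th largest value at $t$, its derivative with respect to $b_i$ (or $d_i$) equals the corresponding expression above when the pair $(b_i,d_i)$ attains that value, and is zero otherwise.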

### 4.1 DTM function

Here we provide a specific example of computing when is the DTM filtration which has not been explored in previous approaches. We first consider the case of (4) where ’s are data points, as in Proposition 4.1. See Appendix C.2 for the proof.

###### Proposition 4.1.

When ’s and satisfy that are different for each , then is differentiable with respect to and

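$$
\frac{\partial f(\varsigma)}{\partial X_j}
= \frac{\varpi'_j\,\|X_j-y\|^{r-2}\,(X_j-y)\,I\big(X_j\in N_k(y)\big)}{\big(\hat{d}_{m_0}(y)\big)^{r-1}\, m_0 \sum_{i=1}^{n}\varpi_i},
$$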

where is an indicator function and . In particular, is differentiable a.e. with respect to Lebesgue measure on .
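The gradient formula of Proposition 4.1 can be checked numerically in a simple special case. The sketch below assumes uniform weights (all equal to 1), an integer number `k` of nearest neighbors so that `m0` times the sum of weights equals `k`, and pairwise distinct distances from the evaluation point `y`. The helper names `dtm_at`, `dtm_grad`, and `numeric_grad` are hypothetical.

```python
import numpy as np

def dtm_at(points, y, k, r=2.0):
    """Empirical DTM at y with unit weights, plus the indices of the k nearest points."""
    dist = np.linalg.norm(points - y, axis=1)
    nearest = np.argsort(dist)[:k]
    return np.mean(dist[nearest] ** r) ** (1.0 / r), nearest

def dtm_grad(points, y, k, r=2.0):
    """Closed-form gradient with respect to each data point (unit-weight case)."""
    value, nearest = dtm_at(points, y, k, r)
    diff = points - y
    dist = np.linalg.norm(diff, axis=1)
    grad = np.zeros_like(points)  # points outside N_k(y) get zero gradient
    grad[nearest] = (dist[nearest, None] ** (r - 2.0)) * diff[nearest] \
        / (value ** (r - 1.0) * k)
    return grad

def numeric_grad(points, y, k, r=2.0, h=1e-6):
    """Central finite differences, used only to validate the closed form."""
    grad = np.zeros_like(points)
    for j in range(points.shape[0]):
        for c in range(points.shape[1]):
            plus, minus = points.copy(), points.copy()
            plus[j, c] += h
            minus[j, c] -= h
            grad[j, c] = (dtm_at(plus, y, k, r)[0] - dtm_at(minus, y, k, r)[0]) / (2 * h)
    return grad

# Hypothetical data with distinct distances 1, 2, 3, 4, and about 7.07 from y.
points = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0], [0.0, -4.0], [5.0, 5.0]])
y = np.zeros(2)
analytic = dtm_grad(points, y, k=3)
numeric = numeric_grad(points, y, k=3)
```

The two gradients agree up to finite-difference error, and the rows for points outside the `k` nearest neighbors are zero, matching the indicator in the formula.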

Similarly, we consider the case of (5) where ’s are weights, as in Proposition 4.2. See Appendix C.3 for the proof.

###### Proposition 4.2.

When ’s and satisfy that are different for each , then is differentiable with respect to and

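$$
\frac{\partial f(\varsigma)}{\partial X_j}
= \frac{\|Y_j-y\|^{r}\,I\big(Y_j\in N_k(y)\big) - m_0\big(\hat{d}_{m_0}(y)\big)^{r}}{r\,\big(\hat{d}_{m_0}(y)\big)^{r-1}\, m_0 \sum_{i=1}^{n}X_i},
$$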

where . In particular, is differentiable a.e. with respect to Lebesgue measure on and .

Computation of , are simpler and can be done in a similar fashion. In the experiments, we set .

## 5 Stability Analysis

A key property of the topological layer is stability; its discriminating power should remain stable against non-systematic noise or perturbation of input data. In this section, we shall provide our theoretical results on the stability properties of the proposed layer defined in (6). In what follows, we address the stability for each structure element with respect to change in persistent diagrams.

###### Theorem 5.1.

Given a Lipschitz function with Lipschitz constant and resolution , for two persistent diagrams ,

$$
\big|S_{\theta,\omega}(D;\nu) - S_{\theta,\omega}(D';\nu)\big| \le L_g\, m^{1/2}\, d_B(D, D').
$$

The proof of Theorem 5.1 is given in Appendix C.4. Theorem 5.1 shows that is stable with respect to the bottleneck distance (Definition 2.2). It should be noted that our result only requires Lipschitz continuity of .
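A bound of this shape can be illustrated numerically. The sketch below is not the proof from the appendix; it assumes an affine map given by a vector `sigma` (so its Lipschitz constant is the Euclidean norm of `sigma`), fixed landscape weights that sum to one, and a perturbation that moves every birth and death value by at most `eps`. The identity matching then shows that the bottleneck distance is at most `eps`, so `eps` is used as an upper bound in place of the exact bottleneck distance. All names and numerical values are hypothetical.

```python
import numpy as np

def sampled_weighted_landscape(diagram, omega, grid):
    """Weighted average of the first len(omega) landscapes, sampled on grid."""
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    tents = np.maximum(0.0, np.minimum(grid[None, :] - diagram[:, :1],
                                       diagram[:, 1:] - grid[None, :]))
    ordered = -np.sort(-tents, axis=0)[: len(omega)]
    lam = np.zeros((len(omega), len(grid)))
    lam[: ordered.shape[0]] = ordered
    return np.asarray(omega) @ lam

rng = np.random.default_rng(0)
m, eps = 50, 0.05
grid = np.linspace(0.0, 1.0, m)
omega = np.array([0.5, 0.3, 0.2])  # nonnegative weights summing to one
sigma = rng.normal(size=m)         # affine map x -> sigma . x

D = np.array([[0.1, 0.6], [0.2, 0.9], [0.4, 0.7], [0.5, 0.8]])
D_noisy = D + rng.uniform(-eps, eps, size=D.shape)

avg = sampled_weighted_landscape(D, omega, grid)
avg_noisy = sampled_weighted_landscape(D_noisy, omega, grid)

sup_gap = np.max(np.abs(avg - avg_noisy))         # pointwise landscape change
gap = abs(sigma @ avg - sigma @ avg_noisy)         # change in the structure element
bound = np.linalg.norm(sigma) * np.sqrt(m) * eps   # L_g * m^{1/2} * eps
```

In this setting the pointwise change `sup_gap` never exceeds `eps`, and the change in the output `gap` never exceeds `bound`, consistent with the inequality in Theorem 5.1.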

The next corollary shows that under certain conditions our approach may provide an improved bound compared to Hofer et al. (2017).

###### Corollary 5.1.

Let denote the number of points in the persistent diagram . Then, the ratio of our stability bound in Theorem 5.1 to that in Hofer et al. (2017) is strictly upper bounded by

$$
\frac{C_{g_\theta,T,\nu}}{1 + C_{D,D'}\,(n_D - 1)},
$$

where are constants to be specified in the proof.

See Appendix C.5 for the proof. Corollary 5.1 implies that our stability bound is tighter than that of Hofer et al. (2017) at polynomial rates. Hence, for complex data structures where we would possibly get many birth-death pairs in each , our proposed layer guarantees a tighter stability property.

In particular, for our DTM-function-based filtration using (4) or (5), Theorem 5.1 can be turned into a stability result with respect to the input .

###### Theorem 5.2.

Suppose is used for the DTM function. Let a differentiable function and resolution be given, and let be a distribution. For the case when ’s are data points, i.e. when (4) is used as the DTM filtration of , define as the empirical distribution defined as . For the case when ’s are weights, i.e. when (5) is used as the DTM filtration of , define as the empirical distribution defined as . Let be the persistence diagram of the DTM filtration of , and be the persistence diagram of the DTM filtration of , as in (4) when ’s are data points, or as in (5) when ’s are weights. Then,

$$
\big|S_{\theta,\omega}(D_X;\nu) - S_{\theta,\omega}(D_P;\nu)\big| \le L_g\, m^{1/2}\, m_0^{-1/2}\, W_2(P_n, P).
$$

See Section C.6 for the proof. Theorem 5.2 implies that if the empirical distribution from our input approximates the true distribution well with respect to the Wasserstein distance , then our topological layers constructed on those observed points are stable with respect to small perturbations of the Wasserstein distance. This means the topological information embedded in the proposed layer is robust against small noise, data corruption, or outliers.

We have also discussed the stability result for the Vietoris-Rips or the Čech complex in Section B.

## 6 Experiments

To demonstrate the effectiveness of the proposed approach, we study classification problems on three different datasets: MNIST handwritten digits, 3D object, and CIFAR-10. To fairly evaluate the benefits of using our proposed method, we keep the network architecture as simple as possible so that we can focus on benefits from our topological layer modules. Even though we do not have any restrictions on where to place the proposed layer, in the experiments we put our topological layer only at the beginning of the network to make a fair comparison with Hofer et al. (2017). In the experiments, we want to explore the benefits of our layer through the following questions: 1) does it make the network more robust and reliable against noise, etc.? and 2) does it improve the overall generalization capability compared to vanilla models? We intentionally use a small number of training data (1000) so that the convergence rates could be included in the evaluation criteria. We refer to Appendix C.8 for details about each simulation and model architectures.

### Robustness on MNIST handwritten digits

We first perform the classification task on images of handwritten digits using the MNIST dataset. Each digit has its own distinctive topological information which can be encoded into the proposed topological layer as illustrated in Figure 1.

Topological layer. To compute persistent diagrams, we proceed with (5) where we define fixed points on grid and use a set of grayscale values as a weight vector for the fixed points with the smoothing parameter . Then we construct each structure element of our topological layer with , by using the affine transformation described in Section 3.2. We use both 0- and 1-dimensional features.

Noise and corruption process. The corruption process is designed by a random omission of pixel values in the raw input; we randomly remove a certain percentage of signal pixels from each sample (so it has fewer data points). In the noise process we add a certain amount of uniformly-distributed noise signals over each example. The example of the corrupted and noisy image is illustrated in Figure 2.

Baselines & Simulation. As our baseline methods, we employ a 2-layer vanilla MLP, a 2-layer CNN, and the topological signature method by Hofer et al. (2017) (which we will refer to as SLayer). For the SLayer and our proposed approach (TopLayer), we augment the vanilla MLP and CNN with the features learned from these topological methods. To observe the marginal benefit of robustness provided by our method with respect to noise and corruption, we compute the average test accuracy across various noise and corruption rates applied to the training dataset.

Result. In Figure 3 we observe that although the improvement in accuracy over baseline methods is somewhat marginal, overall the proposed approach (TopLayer) is more robust to noise and corruption than the SLayer by Hofer et al. (2017) or plain MLP/CNN. Particularly in the presence of noise, the decrease in accuracy remains consistently small for TopLayer, whereas a steep drop is observed for the other baselines.

### 3D shape classification

In this experiment, we consider the problem of 3D object classification. Specifically, we consider eight 3D geometric shapes inspired by Thurston’s geometrization conjecture, as shown in Figure 4. The final data are represented as 3D point clouds with added noise (noise rate of 0.25). The detailed data-generating process is described in Appendix C.8.

Topological layer. We put our raw input on the predefined grid space (where the grid consists of equi-spaced points) such that and extend the DTM function to the complex as described in Section 3.1, (4). Other configurations and parameter settings remain the same as before. This time we only use 1- and 2-dimensional features, as they turned out to be influential.

Baselines & Simulation. As baseline methods, we use a 2-layer vanilla MLP, an MLP augmented with the SLayer (Hofer et al., 2017) just as in the previous experiment on MNIST, and PointNet (Qi et al., 2016), which is the state-of-the-art deep neural network model for 3D point-cloud data. We also employ Deep Sets (Zaheer et al., 2017, Section 4.1.3), which is known to perform well on point cloud sets, where we use two permutation equivariant layers. We compare the test accuracy of the MLP augmented with our TopLayer against these baseline methods on the 3D object dataset.

Result. As shown in Table 1, our method achieves a comparable level of test accuracy to the state-of-the-art PointNet, even with a relatively small network complexity. Our method performs on par with Hofer et al. (2017), and slightly better than Deep Sets. It is worth noting the difference between our method and the well-known baselines: both PointNet and Deep Sets internally generate numerous features that are translation- and rotation-invariant based on specific rules, whereas we explicitly encode such information as TDA features. Starting from the vanilla MLP, it is sufficient to add only a small number of additional nodes (10) for the TopLayer to boost the accuracy significantly (from 75% to 99%).

### CIFAR-10 dataset classification

Here we perform image classification on the CIFAR-10 dataset, where 32 × 32 images are grouped into 10 classes. We use grayscale images for this experiment.

Topological layer. We use two types of filtrations together for this dataset. First, we compute a filtration where we use the 32 × 32 input values as weights of the DTM function, as we did in the MNIST case. Second, to capture the minutiae, we use a 2D-grid filtration where we compute the persistent homology of superlevel sets, directly taking the 32 × 32 input as 2D-function values.

Baselines & Simulation. The CIFAR-10 dataset is typically studied via variants of CNNs, as they deliver the best performance. We employ a 4-layer CNN and CNN + SLayer as our baselines. We use noise and corruption rates of 0.15.

Result. Although the overall accuracy suffers from the small sample size and the lack of color information, adding topological layers to the network generally increases accuracy. The proposed layers yield roughly a 5% improvement in accuracy over what the CNN alone achieves.

## 7 Discussion

In this study, we have presented a novel topological layer based on weighted persistent landscapes, with which we can exploit topological features effectively. For the DTM filtration, we guarantee differentiability of the proposed layer with respect to its inputs for general persistent homology. Hence, our study provides the first general topological layer that can be placed anywhere in a deep learning network. We also present new stability theorems that verify the robustness and efficiency of our approach. It is worth noting that our analytical results extend to silhouettes (Chazal et al., 2015, 2014b).

The results in this paper pave the way for various interesting future work, as they bridge the gap between modern TDA tools and deep learning. In a forthcoming applied paper, we are preparing extensive experiments with more complex real-world datasets, where we try silhouette-based layers as well. Furthermore, although it is intuitively appealing to place the topological layer at the beginning of the network, as it directly exploits useful geometric features of the input data structure, there may be situations in which placing the proposed layer in the middle of the architecture is more beneficial. We will explore this issue as well.

## Appendix A Simplicial complex

A simplicial complex can be seen as a high-dimensional generalization of a graph. Given a set $V$, an (abstract) simplicial complex is a set $K$ of finite subsets of $V$ such that $\alpha \in K$ and $\beta \subset \alpha$ implies $\beta \in K$. Each set $\alpha \in K$ is called a simplex of $K$. The dimension of a simplex $\alpha$ is $\dim \alpha = |\alpha| - 1$, and the dimension of the simplicial complex is the maximum dimension of any of its simplices. Note that a simplicial complex of dimension $1$ is a graph.
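As a small illustration of this definition, the sketch below checks the closure property (every nonempty subset of a simplex is again a simplex) and computes the dimension of a complex given as a list of vertex tuples. The function names `is_simplicial_complex` and `dimension` are hypothetical, and the empty set is ignored by convention here, which is an assumption.

```python
from itertools import combinations


def is_simplicial_complex(simplices):
    """Check that every nonempty proper subset of each simplex is present."""
    K = {frozenset(s) for s in simplices}
    for s in K:
        for k in range(1, len(s)):
            for face in combinations(s, k):
                if frozenset(face) not in K:
                    return False
    return True


def dimension(simplices):
    """Dimension of the complex: largest simplex cardinality minus one."""
    return max(len(s) for s in simplices) - 1


# A filled triangle: 3 vertices, 3 edges, 1 two-dimensional simplex.
triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
# A path graph on three vertices: a complex of dimension 1.
graph = [(0,), (1,), (2,), (0, 1), (1, 2)]
# Not a simplicial complex: the edge (0, 2) of the triangle is missing.
broken = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 1, 2)]

print(is_simplicial_complex(triangle), dimension(triangle))
print(is_simplicial_complex(graph), dimension(graph))
print(is_simplicial_complex(broken))
```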

When approximating the topology of the underlying space by observed samples, a common choice is the Čech complex, defined next. Below, for any $x \in \mathbb{X}$ and $r > 0$, we let $B_{\mathbb{X}}(x, r)$ denote the closed ball centered at $x$ with radius $r$.

###### Definition A.1 (Čech complex).

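Let $\mathcal{X} \subset \mathbb{X}$ be finite and $r > 0$. The (weighted) Čech complex is the simplicial complex

$$\check{C}\mathrm{ech}^{\mathbb{X}}_{\mathcal{X}}(r) := \{\sigma \subset \mathcal{X} : \cap_{x \in \sigma} B_{\mathbb{X}}(x, r) \neq \emptyset\}. \tag{12}$$

The superscript $\mathbb{X}$ will be dropped when understood from the context.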

Another common choice is the Vietoris-Rips complex, also referred to as the Rips complex, where simplices are built based on pairwise distances among their vertices.

###### Definition A.2.

The Rips complex $R_{\mathcal{X}}(r)$ is the simplicial complex defined as

$$R_{\mathcal{X}}(r) := \{\sigma \subset \mathcal{X} : d(x_i, x_j) < 2r, \ \forall x_i, x_j \in \sigma\}. \tag{13}$$

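Note that the Čech complex and the Rips complex have the following interleaving inclusion relationship:

$$\check{C}\mathrm{ech}_{\mathcal{X}_n}(r) \subset R_{\mathcal{X}_n}(r) \subset \check{C}\mathrm{ech}_{\mathcal{X}_n}(2r). \tag{14}$$

In particular, when $\mathbb{X}$ is a Euclidean space, the constant $2$ can be tightened to $\sqrt{2}$:

$$\check{C}\mathrm{ech}_{\mathcal{X}_n}(r) \subset R_{\mathcal{X}_n}(r) \subset \check{C}\mathrm{ech}_{\mathcal{X}_n}(\sqrt{2}\,r). \tag{15}$$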
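The pairwise-distance rule in (13) is simple enough to sketch directly. The following pure-Python example builds the Rips complex of a small Euclidean point set up to a chosen dimension; the function name `rips_complex` and the `max_dim` parameter are assumptions introduced for illustration, and the brute-force enumeration is only suitable for tiny inputs.

```python
from itertools import combinations
import math


def rips_complex(points, r, max_dim=2):
    """Simplices (as index tuples) whose vertices are pairwise closer than 2r."""
    n = len(points)
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(2, max_dim + 2):       # k vertices give a (k-1)-simplex
        for sigma in combinations(range(n), k):
            if all(
                math.dist(points[i], points[j]) < 2 * r
                for i, j in combinations(sigma, 2)
            ):
                simplices.append(sigma)
    return simplices


# Equilateral triangle with side length 1.
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

# 2r = 1.1 > 1: all three edges and the triangle are present.
print(rips_complex(pts, 0.55))
# 2r = 0.8 < 1: only the three vertices remain.
print(rips_complex(pts, 0.4))
```

The same triangle also illustrates why the two complexes differ. Its circumradius is $1/\sqrt{3} \approx 0.577$, so the three closed balls of radius $0.55$ have no common point and the 2-simplex is absent from the Čech complex at $r = 0.55$, although it belongs to the Rips complex. It does appear in the Čech complex at radius $\sqrt{2} \cdot 0.55 \approx 0.778$, in agreement with (15).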

## Appendix B Stability for Vietoris-Rips and Čech filtration

When we use the Vietoris-Rips or Čech filtration, our result can be turned into a stability result with respect to points in Euclidean space. Let $\mathbb{X}, \mathbb{Y} \subset \mathbb{R}^d$ be two bounded sets.

The next corollary restates our stability theorem with respect to points in $\mathbb{R}^d$.

###### Corollary B.1.

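Let $\mathcal{X}, \mathcal{Y}$ be any $\epsilon$-coverings of $\mathbb{X}, \mathbb{Y}$, and let $\mathcal{D}_X, \mathcal{D}_Y$ denote the persistent diagrams induced from the Rips or Čech filtration on $\mathcal{X}, \mathcal{Y}$, respectively. Then we have

$$\left| S_{\theta,\omega}(\mathcal{D}_X; \nu) - S_{\theta,\omega}(\mathcal{D}_Y; \nu) \right| \le 2 L_g m^{1/2} \left( d_{GH}(\mathbb{X}, \mathbb{Y}) + 2\epsilon \right). \tag{16}$$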

The proof is given in Appendix C.7. Corollary B.1 implies that if the observed data points are of sufficiently good quality, in the sense that $\epsilon$ is small, then the topological layers constructed on those points are stable with respect to small perturbations of the true representation under proper persistent homologies. Here, $\epsilon$ can be interpreted as the uncertainty from incomplete sampling. This means that the topological information embedded in the proposed layer is robust against small sampling noise or data corruption by missingness.

Moreover, since the Gromov-Hausdorff distance is upper bounded by the Hausdorff distance, the result in Corollary B.1 also holds when we use $d_H(\mathbb{X}, \mathbb{Y})$ in place of $d_{GH}(\mathbb{X}, \mathbb{Y})$ on the right-hand side of (16).
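The Hausdorff distance is straightforward to evaluate for finite point sets in a common Euclidean space, whereas the Gromov-Hausdorff distance involves an infimum over isometric embeddings. The sketch below is a minimal brute-force illustration of the former; the function name `hausdorff` is hypothetical, and the quadratic-time loop is meant only for small examples.

```python
import math


def hausdorff(X, Y):
    """Hausdorff distance between two finite point sets in Euclidean space."""

    def directed(A, B):
        # Largest distance from a point of A to its nearest point in B.
        return max(min(math.dist(a, b) for b in B) for a in A)

    return max(directed(X, Y), directed(Y, X))


X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]

# Every point of X also lies in Y, but (1, 2) is at distance 2 from X.
print(hausdorff(X, Y))
```

Substituting such a value for the Gromov-Hausdorff term in (16) yields a looser but directly computable bound.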

###### Remark 1.

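In fact, when we have very dense data that have been well sampled uniformly over the true representation so that $\epsilon \to 0$, our result converges to the following:

$$\left| S_{\theta,\omega}(\mathcal{D}_X; \nu) - S_{\theta,\omega}(\mathcal{D}_Y; \nu) \right| \le 2 L_g \left( \frac{T}{\nu} \right)^{1/2} d_{GH}(\mathbb{X}, \mathbb{Y}). \tag{17}$$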

## Appendix C Proofs

### C.1 Proof of Theorem 4.1

Consider the simplicial complex, and suppose all of its simplices are ordered in the filtration so that the values of $f$ are nondecreasing, i.e., if a simplex $\sigma$ comes earlier than a simplex $\tau$, then $f(\sigma) \le f(\tau)$. Note that the map $\xi$ from each birth-death point $(b_i, d_i)$ to a pair of simplices $(\beta_i, \delta_i)$ is simply the pairing returned by the standard persistence algorithm (Carlsson et al., 2005). For the homological feature corresponding to $(b_i, d_i)$, the birth simplex $\beta_i$ is the simplex that creates the feature in the filtration, and the death simplex $\delta_i$ is the simplex that causes the feature to collapse. For example, if the feature is $1$-dimensional, then $\beta_i$ is the edge that completes the corresponding loop, and $\delta_i$ is the triangle whose addition allows that loop to be contracted.

Now, $b_i = f(\xi(b_i)) = f(\beta_i)$ and $d_i = f(\xi(d_i)) = f(\delta_i)$, and since $\xi$ is locally constant on $X$,

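$$\frac{\partial b_i}{\partial X_j} = \frac{\partial f(\xi(b_i))}{\partial X_j} = \frac{\partial f(\beta_i)}{\partial X_j}, \qquad \frac{\partial d_i}{\partial X_j} = \frac{\partial f(\xi(d_i))}{\partial X_j} = \frac{\partial f(\delta_i)}{\partial X_j}. \tag{18}$$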

Therefore, the derivatives of the birth value and the death value are the derivatives of the filtration function evaluated at the corresponding pair of simplices. Since $\partial \mathcal{D}_X / \partial X$ is the collection of these derivatives, applying (18) gives

 ∂DX∂X={(∂bi∂Xj,∂di∂Xj)}(bi,di)∈DX,Xj∈X={(∂f(βi)∂Xj,∂f(δi)∂Xj