TopoReg: A Topological Regularizer for Classifiers
Abstract
Regularization plays a crucial role in supervised learning. A successfully regularized model strikes a balance between a perfect description of the training data and the ability to generalize to unseen data. Most existing methods enforce a global regularization in a structure-agnostic manner. In this paper, we initiate a new direction and propose to enforce the structural simplicity of the classification boundary by regularizing over its topological complexity. In particular, our measurement of topological complexity incorporates the importance of topological features (e.g., connected components, handles, and so on) in a meaningful manner, and provides direct control over spurious topological structures. We incorporate the new measurement as a topological loss in training classifiers. We also propose an efficient algorithm to compute the gradient. Our method provides a novel way to topologically simplify the global structure of the model, without having to sacrifice too much of the flexibility of the model. We demonstrate the effectiveness of our new topological regularizer on a range of synthetic and real-world datasets.
Chao Chen, Department of Computer Science, City University of New York (CUNY), New York, USA, chao.chen.cchen@gmail.com
Xiuyan Ni, Department of Computer Science, City University of New York (CUNY), New York, USA, xni2@gradcenter.cuny.edu
Qinxun Bai, Hikvision Research America, Santa Clara, CA, USA, qinxun.bai@gmail.com
Yusu Wang, Department of Computer Science & Engineering, Ohio State University (OSU), Columbus, OH, USA, yusu@cse.ohiostate.edu
Preprint. Work in progress.
1 Introduction
Regularization plays a crucial role in supervised learning. A successfully regularized model strikes a balance between a perfect description of the training data and the ability to generalize to unseen data. A common intuition for the design of regularizers is the Occam’s razor principle, where a regularizer enforces a certain simplicity of the model in order to avoid overfitting. Classic regularization techniques include functional norms such as $L_1$ [25], $L_2$ (Tikhonov) [33], and RKHS [38] norms. Such norms produce a model with relatively less flexibility, which is thus less likely to overfit.
A particularly interesting category of methods is inspired by geometry. These methods design new loss terms to enforce a geometric simplicity of the classifier. Some methods stipulate that similar data should have similar scores according to the classifier, and enforce the smoothness of the classifier function [5, 47, 4]. Others directly pursue a simple geometry of the classifier boundary, i.e., the submanifold separating different classes [10, 41, 30, 29]. These geometry-based regularizers are intuitive and have been shown to be useful in many supervised and semi-supervised learning settings. However, regularizing the total smoothness of the classifier (or that of the classification boundary) is not always flexible enough to balance the tug of war between overfitting and generalization accuracy. The key issue is that these measurements are usually structure-agnostic. For example, in Figure 1 (b) and (c), the classifier may either overfit noise (as in (b)), or become too smooth and lose generalization accuracy as well (as in (c)).
In this paper, we propose a new direction to measure the “simplicity” of a classifier: instead of using geometry such as total curvature or smoothness, we directly enforce the “simplicity” of the classification boundary by regularizing over its topological complexity. (Here, we take a similar functional view as Bai et al. [4] and consider the classifier boundary as the 0-valued level set of the classifier function $f$; see Figure 2 (b) for an example.) Furthermore, our measurement of topological complexity incorporates the importance of topological features, e.g., connected components and handles, in a meaningful manner, and provides direct control over spurious topological structures. This new structural simplicity can be combined with other regularizing terms (say, geometry-based ones or functional norms) to train a better classifier. See Figure 1 (a) for an example, where the classifier computed with topological regularization (together with a normalized kernel logistic regression classifier) achieves a better balance between overfitting and classification accuracy.
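[Figure 1: panels (a)-(d). Panel (a) shows the classifier trained with topological regularization; panel (b) shows a classifier that overfits noise; panel (c) shows a classifier that is too smooth.]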
To design a good topological regularizer, there are two key challenges:
(1). As mentioned above, we want to measure and incorporate the importance of topological features. For example, in Figure 2 (a), we see that there are three connected components in the current classification boundary. However, the “importance” of the two small components is not the same: Figure 2 (b) shows the functional graph of the current classifier, where intuitively the left component (the left valley in (b)) is more likely to be a spurious feature, as it takes less perturbation to remove it than the right one. Leveraging several recent developments in the field of computational topology [21, 6, 7], we quantify such “robustness” of each topological feature and define our topological loss as the sum of the squared robustness over all topological structures of the classification boundary.
(2). A bigger challenge is to show that gradients can be computed for the topological loss function we design. We show in Section 3 that this can indeed be done by discretizing the domain and using a piecewise linear approximation of the classifier function. Our topological loss is differentiable almost everywhere, and it can be easily combined with various classifiers. In particular, we apply it to a kernel logistic regression model and show in Section 4 how the new regularizer outperforms others on various synthetic and real-world datasets.
In summary, our contributions are as follows:

- We propose the novel view of regularizing the topological complexity of a classifier, and present the first work on developing such a topological loss function;
- We derive the gradient of the topological loss;
- We instantiate the regularizer on a kernel classifier; and
- We provide experimental evidence of the effectiveness of the proposed method on several synthetic and real-world datasets.
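[Figure 2: (a) a classification boundary with three connected components; (b) the graph of the classifier function, whose zero-valued level set is the classification boundary; (c) a handle, captured by a 1D topological feature.]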
1.1 Related Work
Topological information as features. Topological information has been used in machine learning. In particular, the topological summary called the persistence diagram/barcode (to be defined in Section 2) carries global structural information of the data. It has been used in unsupervised learning, e.g., clustering [15, 34]. In the supervised setting, topological information has been used as powerful features. The major challenge is that the metric between such topological summaries of different data is not the standard Euclidean one. Adams et al. [2] proposed to directly vectorize such information. Bubenik [9] proposed to map the topological summary into a Banach space so that statistical reasoning can be carried out [14]. To fully leverage the topological information, various kernels [37, 27, 26, 13, 48] have been proposed to approximate their distance. Hofer et al. [23] proposed to use the topological information as input for deep convolutional neural networks. Perhaps the closest to our work is [42], which computes the topological information of the classification boundary. All these methods use topological information as an observation/feature of the data. To the best of our knowledge, our method is the first to leverage the topological information as a prior for training the classifier.
Topological constraints in optimization. In the context of computer vision, topological information has been incorporated as constraints in discrete optimization. Connectivity constraints can be used to improve the image segmentation quality, especially when the objects of interest have elongated shapes. However, topological constraints, although intuitive, are highly complex and computationally infeasible to enforce fully in the optimization procedure [43, 35]. One has to resort to various approximation schemes [45, 16, 40, 36].
2 Level Set, Topology, and Robustness
Given a $D$-dimensional feature space, $\mathcal{X} \subset \mathbb{R}^D$, we first focus on the binary classification problem. (Footnote 1: For the multilabel classification, we will use multiple one-vs-all binary classifiers; see Section 3.1.) The classifier function is a scalar function $f: \mathcal{X} \to \mathbb{R}$, and the prediction for any training/testing data $x \in \mathcal{X}$ is $\mathrm{sign}(f(x))$. For a smooth $f$, a point $x$ is critical if the gradient of $f$ vanishes at $x$. We are interested in describing the topology and geometry of the classification boundary of $f$, i.e., the boundary between the positive and negative classification regions. The level set of $f$ at a fixed threshold $t$ is the subspace $f^{-1}(t) = \{x \in \mathcal{X} \mid f(x) = t\}$. For a generic $t$, such that $f^{-1}(t)$ does not pass through a critical point, this level set is a $(D-1)$-dimensional manifold. The classification boundary is simply the zero-valued level set $f^{-1}(0)$. We further use $\mathcal{X}_{\le t} = f^{-1}((-\infty, t])$ to denote the sublevel set of $\mathcal{X}$ w.r.t. $f$, which consists of all points whose values are equal to or below $t$. Symmetrically, the superlevel set $\mathcal{X}_{\ge t} = f^{-1}([t, \infty))$ contains all points with function values equal to or greater than $t$. Obviously, $f^{-1}(t) = \mathcal{X}_{\le t} \cap \mathcal{X}_{\ge t}$. See Figure 2(a) and (b), where the red curve represents $f^{-1}(0)$, while the darker region represents the sublevel set $\mathcal{X}_{\le 0}$. For convenience, we denote by $S_f$ the classification boundary, i.e., $S_f = f^{-1}(0)$.
Topological structures. We will use homology classes and homology groups to describe the topological structures. We only provide a very informal treatment below to illustrate the intuition, and refer interested readers to [32, 20] for more formal definitions. Given a topological space $X$, we consider its $p$-th dimensional homology group $H_p(X)$ over $\mathbb{Z}_2$ coefficients; elements of this group are called homology classes. Very informally, the 0th homology group is the vector space spanned by connected components (0-dimensional topological features) in $X$; the 1st homology group is spanned by “independent” loops (1D topological features) in $X$; while 2D topological features are 2-dimensional voids (such as those enclosed by a topological sphere, a torus, and so on). See Figure 2 (c), where a handle is captured by a 1D topological feature. Our topological loss function (to be introduced in Section 3) can be defined on topological features of any dimension. However, the current implementation only incorporates 0-dimensional features (i.e., connected component information) for computational efficiency. Hence below, we consider only the 0-dimensional topological features in $H_0(S_f)$ when describing the robustness measure. For convenience, we denote by $\mathcal{C}(S_f)$ the set of all connected components of $S_f$.
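[Figure 3: panels (a)-(d). Panel (a) shows the level set (black curves) together with the minimum (yellow point) and the saddle (green point) associated with the right component; panel (d) shows when the right-hand-side component is created and killed during the sweep.]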
Robustness. Recall the example in Figure 2(a), where the classification boundary has three components (i.e., $|\mathcal{C}(S_f)| = 3$). We aim to assign importance to these three components (i.e., basis elements of $H_0(S_f)$). Note that by simply inspecting the level set itself, it is not clear which components are important. However, if we consider the function $f$ as a whole, the graph of which is shown in Figure 2(b), we note that intuitively the left component is less important than the right one in the following sense: less perturbation of the function $f$ is needed so that the left component disappears from the level set $f^{-1}(0)$. We measure a perturbation $\hat{f}$ of $f$ using the max-norm, i.e., $\|f - \hat{f}\|_\infty = \max_{x \in \mathcal{X}} |f(x) - \hat{f}(x)|$.
In the example of Figure 3(a), to remove the right component from the level set (black curves), there are two possible ways:
(A). Remove the component completely by lifting the function value of all points in the right valley up to $+\epsilon$, for an infinitesimally small positive value $\epsilon$. Let the new function be $f_a$. It is easy to see that the perturbation we need in this case is $\|f - f_a\|_\infty = |f(p)| + \epsilon$, where $p$ (yellow point) is the minimum of this valley.
(B). Merge the two components by finding a path connecting them and lowering the function values along the path to $-\epsilon$. To keep the perturbation small, the best path is the one whose highest point is lowest in value. Such a path is the one passing through the saddle $q$ (green point), with $f(q)$ being the highest value along it. Let $f_b$ be the perturbed function: in this case, the perturbation we need is $\|f - f_b\|_\infty = |f(q)| + \epsilon$.
The least perturbation of $f$ needed to remove this component is the smaller of the above two quantities, which tends to $\min\{|f(p)|, |f(q)|\}$ as $\epsilon \to 0$. We note that the amount of perturbation of $f$ needed in this example depends on the function values of two critical points, the minimum $p$ and the (index-1) saddle $q$. It turns out that this is not a coincidence: this pairing $(p, q)$ is in fact a so-called persistence pairing computed based on the theory of persistent homology.
Persistent homology [22, 49] is a fundamental recent development in the field of computational topology, underlying many topological data analysis methods. Again, we only provide an intuitive description here. Given a space $\mathcal{X}$ and a continuous function $f: \mathcal{X} \to \mathbb{R}$, we sweep the domain $\mathcal{X}$ in increasing $f$ values. This gives rise to the following growing sequence of sublevel sets, for thresholds $t_1 < t_2 < \cdots < t_m$:
$$\mathcal{X}_{\le t_1} \subseteq \mathcal{X}_{\le t_2} \subseteq \cdots \subseteq \mathcal{X}_{\le t_m} = \mathcal{X}.$$
We call it the sublevel set filtration of $\mathcal{X}$ w.r.t. $f$. As we sweep, sometimes new topological features (homology classes), say, a new component or a handle, are created. Sometimes an existing one is killed: say, a component either disappears or is merged into another one, or a void is filled. These changes only happen when we sweep through a critical point. In Figure 3(d), we see the time when the right-hand-side component is created and killed. Persistent homology tracks these topological changes and pairs up critical points into a collection of persistence pairings $\Pi_f = \{(p, q)\}$. Each pair $(p, q)$ consists of the critical points where the same topological feature is created and killed. Their function values $f(p)$ and $f(q)$ are referred to as the birth time and death time of this feature. The corresponding collection of pairs of (birth time, death time) is called the persistence diagram, formally, $\mathrm{Dg}(f) = \{(f(p), f(q)) \mid (p, q) \in \Pi_f\}$. For each persistence pairing $(p, q)$, its persistence is defined to be $|f(q) - f(p)|$, which measures the lifetime (and thus importance) of the corresponding topological feature w.r.t. $f$. An example is given in Figure 4.
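[Figure 4: panels (a) and (b). Panel (b) shows a persistence diagram; the points inside the red box correspond to the components of the zero-valued level set.]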
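The pairing of minima with the values at which their components merge can be illustrated in the simplest setting: 0D sublevel set persistence of a function given on the vertices of a graph. The sketch below is only an illustration under stated assumptions, not the level set zigzag computation used in this paper. It uses a union-find structure and the usual "elder rule" (when two components merge, the one with the higher minimum dies). The function name `sublevel_persistence_0d` and its input format are hypothetical.

```python
from typing import List, Tuple


def sublevel_persistence_0d(
    values: List[float], edges: List[Tuple[int, int]]
) -> List[Tuple[float, float]]:
    """Return sorted (birth, death) pairs of sublevel-set components.

    values[i] is the function value at vertex i; edges are undirected.
    Components that never merge are reported with death = +infinity.
    """
    n = len(values)
    parent = list(range(n))
    # birth[r] is the minimum value of the component whose root is r.
    birth = list(values)

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # An edge enters the filtration together with its higher endpoint.
    ordered = sorted(edges, key=lambda e: max(values[e[0]], values[e[1]]))
    pairs: List[Tuple[float, float]] = []
    for u, v in ordered:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # the edge closes a loop; irrelevant for 0D features
        t = max(values[u], values[v])
        # Elder rule: keep the component with the lower minimum as ru.
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru
        if birth[rv] < t:  # skip pairs with zero persistence
            pairs.append((birth[rv], t))
        parent[rv] = ru
    for r in {find(i) for i in range(n)}:
        pairs.append((birth[r], float("inf")))
    return sorted(pairs)


if __name__ == "__main__":
    # A path graph with three local minima (values 0, 1 and -1).
    vals = [0.0, 2.0, 1.0, 3.0, -1.0]
    path = [(0, 1), (1, 2), (2, 3), (3, 4)]
    print(sublevel_persistence_0d(vals, path))
```

Each finite pair records a local minimum (birth) and the saddle-like value at which its component merges into an older one (death); their difference is the persistence of that component.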
To capture all the topological features in the level sets of all different threshold values, we use an extension of the aforementioned sublevel set persistence, called the level set zigzag persistence [11]. Intuitively, we sweep the domain in increasing function values and now track topological features of the level sets, instead of the sublevel sets. The resulting set of persistence pairings and persistence diagram have analogous meanings: each pair of critical points $(p, q)$ corresponds to the creation and killing of some connected component in the level set, and the corresponding pair $(f(p), f(q))$ gives the birth and death times of this feature.
Most importantly, for a classifier function $f$, let $\Pi_f$ and $\mathrm{Dg}(f)$ be the level set zigzag persistence pairs and persistence diagram that are relevant only to 0D topological features. It follows from results of [7] that:
Theorem 2.1.
There is a one-to-one correspondence between the set of connected components in $S_f$ and the subset of points $\{(b, d) \in \mathrm{Dg}(f) \mid b < 0 < d\}$.
In Figure 4(b), points in the red box correspond to the two components in the zero-valued level set of $f$. With this correspondence, for each 0D topological feature (connected component) $c$ of $S_f$, let $(p_c, q_c)$ be its pairing of critical points. The corresponding point $(f(p_c), f(q_c))$ belongs to $\mathrm{Dg}(f)$. For example, let $c$ be the right-hand-side component in Figure 3(a). We have $p_c = p$ and $q_c = q$. Our earlier discussion (cases (A) and (B) above) suggests that the minimum perturbation of $f$ needed to remove this feature from the level set is $\min\{|f(p_c)|, |f(q_c)|\}$. We thus define the robustness formally as follows:
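Definition 1.
Each persistence point $(b, d) = (f(p_c), f(q_c))$ with $b < 0 < d$ represents the birth and death times (function values) of a corresponding 0D topological feature (connected component) $c$ of the classification boundary. Its robustness is defined as $\rho(c) = \min\{|b|, |d|\} = \min\{|f(p_c)|, |f(q_c)|\}$.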
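As a small worked illustration of the definition, the sketch below takes a list of (birth, death) pairs, keeps those whose interval straddles zero (the components present in the zero-valued level set), and evaluates their robustness. The helper names and the example diagram are hypothetical; the sum of squared robustness computed at the end is the quantity the introduction describes as the topological loss.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (birth, death) function values


def robustness(birth: float, death: float) -> float:
    """Robustness of a boundary component: distance to zero of the closer value."""
    return min(abs(birth), abs(death))


def boundary_features(diagram: List[Point]) -> List[Point]:
    """Keep only points born below zero and killed above zero."""
    return [(b, d) for (b, d) in diagram if b < 0.0 < d]


def total_squared_robustness(diagram: List[Point]) -> float:
    """Sum of squared robustness over all components of the zero level set."""
    return sum(robustness(b, d) ** 2 for b, d in boundary_features(diagram))


if __name__ == "__main__":
    # Hypothetical diagram: only the first two points straddle zero.
    dgm = [(-0.2, 0.7), (-1.5, 0.4), (0.1, 0.9), (-0.8, -0.3)]
    for b, d in boundary_features(dgm):
        print((b, d), robustness(b, d))
    print(total_squared_robustness(dgm))
```

In this example the component with pair (-0.2, 0.7) is the less robust one: lifting its minimum by slightly more than 0.2 removes it, whereas merging it across the saddle would require lowering values by 0.7.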
Remark 1.
In general, the level set zigzag persistence takes $O(n^3)$ time to compute, where $n$ is the total complexity of the discretized representation of the domain $\mathcal{X}$. However, since we only consider 0D features, we can compute it more efficiently in $O(n \log n)$ time by computing the extended persistence of the so-called Reeb graph of $f$ (see [3, 17]). Finally, we can naturally extend the above definition by considering persistence pairs and the diagram corresponding to the birth and death of higher-dimensional topological features, e.g., handles, voids, etc.
3 Topological Regularizer
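In this section, we formalize the topological loss based on the robustness introduced in the previous section. We also derive the gradient of this loss. We conclude with details on the computation and on the extension to multilabel classifiers. Given a data set $\mathcal{D} = \{(x_n, t_n)\}_{n=1}^{N}$ and a classifier $f(x, \omega)$ parameterized by $\omega$, we define the loss as the sum of the per-data loss and our topological loss, the latter weighted by $\lambda$:
$$L(f, \mathcal{D}) = \sum_{(x, t) \in \mathcal{D}} \ell(f(x, \omega), t) + \lambda L_T(f(\cdot, \omega)). \qquad (3.1)$$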
The standard per-data loss $\ell(f(x, \omega), t)$ encourages the prediction to be close to the true label $t$, e.g., the cross-entropy loss, quadratic loss, or hinge loss. Our topological loss, $L_T$, is the accumulated robustness of all topological features of the classification boundary $S_f$. Recall that each topological structure of the classification boundary, $c \in \mathcal{C}(S_f)$, is associated with two critical points $p_c$ and $q_c$, and its robustness is $\rho(c) = \min\{|f(p_c)|, |f(q_c)|\}$. For convenience, we denote by $r_c$ the one among the two critical points that determines the actual robustness, i.e., $r_c = \arg\min_{x \in \{p_c, q_c\}} |f(x)|$. We define the topological loss in Equation (3.1) as the total squared robustness:
$$L_T(f) = \sum_{c \in \mathcal{C}(S_f)} \rho(c)^2 = \sum_{c \in \mathcal{C}(S_f)} f(r_c, \omega)^2.$$
Gradient. Computing the gradient directly is challenging, because there is no closed-form solution for the critical points of any nontrivial function. (Footnote 2: It was observed that even a simple mixture of isotropic Gaussians can have exponentially many critical points [19, 12].) To circumvent the issue, we resort to a standard approach in topological data analysis: assume a discretization of the domain into a triangular mesh or its higher-dimensional counterparts (Figure 5). We evaluate the classifier function at the vertices of the mesh and linearly interpolate the function over higher-dimensional elements, e.g., edges, triangles, tetrahedra, etc. With such a piecewise linear approximation of the classifier function $f$, we redefine critical points as vertices at which topological changes of the sub/super/level set filtration happen. We can show that for any generic function $f$, within a sufficiently small neighborhood, the relevant critical point $r_c$ of any level set topological feature $c$ (the one of $p_c$ and $q_c$ whose function value is closer to zero) is unique and fixed. In other words, $r_c$ remains constant when taking the derivative with regard to $f$ or $\omega$. There exist degenerate cases. For example, when the two critical points $p_c$ and $q_c$ have $|f(p_c)| = |f(q_c)|$, the choice of $r_c$ is not unique and the derivative does not exist. However, such singular cases are rare and lie within a measure-zero subset of the space of all eligible piecewise linear functions.
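With such a piecewise linear setting, using the chain rule, we have the main theorem (proof omitted).
Theorem 3.1.
The topological loss is differentiable almost everywhere. Its gradient is
$$\nabla_\omega L_T(f) = \sum_{c \in \mathcal{C}(S_f)} 2 f(r_c, \omega) \frac{\partial f(r_c, \omega)}{\partial \omega}.$$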
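Intuition of the gradient. The gradient contributed by each topological structure $c$, $2 f(r_c, \omega)\, \partial f(r_c, \omega) / \partial \omega$, where $r_c$ is the relevant critical point of $c$, is essentially the direction that pushes the function value of $r_c$ away from zero. Note that $\partial f(r_c, \omega) / \partial \omega$ is the gradient direction that maximally increases $f(r_c, \omega)$. When $f(r_c, \omega) < 0$, the coefficient $2 f(r_c, \omega)$ is negative, so the gradient decreases the value of $f(r_c, \omega)$ in the maximal direction; moving along the gradient direction is equivalent to pulling down the function value of $r_c$. When $f(r_c, \omega) > 0$, the coefficient is positive, so the gradient increases the value of $f(r_c, \omega)$; moving along the gradient direction pushes up the function value of $r_c$. Pushing the closer one among $f(p_c)$ and $f(q_c)$ away from zero is the best way to increase the robustness of $c$.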
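Since training minimizes the loss, the parameters move against this gradient, which has the opposite effect of the ascent direction described above: the value at the relevant critical point is pulled toward zero and the robustness of the feature shrinks. The first-order calculation below is a sketch of this step for a single feature; the step size $\eta$ and the shorthand $g$ are symbols introduced here for illustration, and the relevant critical vertex $r_c$ is assumed to stay fixed during the step.

```latex
% Assumptions (illustrative): one feature c with relevant critical vertex r_c
% held fixed, f(r_c,\omega) \neq 0, step size \eta > 0, and
% g := \partial f(r_c,\omega) / \partial\omega \neq 0.
\begin{align*}
\omega' &= \omega - \eta \cdot 2 f(r_c,\omega)\, g, \\
f(r_c,\omega') &\approx f(r_c,\omega) + g^{\top}(\omega' - \omega)
               = f(r_c,\omega)\bigl(1 - 2\eta \lVert g \rVert^{2}\bigr), \\
\lvert f(r_c,\omega') \rvert &< \lvert f(r_c,\omega) \rvert
  \quad \text{to first order, whenever } 0 < \eta < 1/\lVert g \rVert^{2}.
\end{align*}
```

Repeated descent steps therefore reduce the robustness of a low-robustness component until it disappears from the zero-valued level set, which is the simplification effect the regularizer is designed to have.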
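By Equation (3.1), the gradient of the overall loss is
$$\nabla_\omega L(f, \mathcal{D}) = \sum_{(x, t) \in \mathcal{D}} \nabla_\omega \ell(f(x, \omega), t) + \lambda \nabla_\omega L_T(f(\cdot, \omega)). \qquad (3.2)$$
Next we calculate the gradient of the loss function using a specific classifier.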
3.1 Instantiating on Kernel Machines
In principle, our topological loss can be incorporated with any classifier. In this paper, we combine it with a kernel logistic regression classifier to demonstrate its advantage in practice. We first present details assuming a binary classifier. We will extend it to multilabel settings.
In kernel logistic regression, the prediction function is $f(x, \omega) = \sigma(\phi(x)^T \omega)$, in which $\sigma$ is the logistic sigmoid function. The $N$-dimensional feature vector $\phi(x) = (k(x, x_1), \ldots, k(x, x_N))^T$ consists of the Gaussian kernel distances between $x$ and the $N$ training data. The per-data loss is the standard cross-entropy loss, and the gradient of the loss for each training datum $(x_n, t_n)$ is $(f(x_n, \omega) - t_n)\, \phi(x_n)$. See [8] for detailed derivations.
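The pieces of the kernel logistic regression model can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: the kernel parameterization $\exp(-\|x - y\|^2 / (2\,\mathrm{width}^2))$ and all function names are assumptions made here, labels are taken to be 0 or 1, and the gradient is the standard cross-entropy gradient for logistic regression as derived in [8].

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], width: float) -> float:
    """Gaussian kernel; this exact parameterization is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def feature_vector(x: Sequence[float], train: List[Sequence[float]],
                   width: float) -> List[float]:
    """phi(x): kernel values between x and every training point."""
    return [gaussian_kernel(x, xn, width) for xn in train]


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(x: Sequence[float], train: List[Sequence[float]],
            omega: Sequence[float], width: float) -> float:
    """f(x, omega) = sigmoid(phi(x)^T omega), a value in (0, 1)."""
    phi = feature_vector(x, train, width)
    return sigmoid(sum(w * p for w, p in zip(omega, phi)))


def cross_entropy_grad(x: Sequence[float], t: float,
                       train: List[Sequence[float]],
                       omega: Sequence[float], width: float) -> List[float]:
    """Gradient of the cross-entropy loss at (x, t): (f(x) - t) * phi(x)."""
    phi = feature_vector(x, train, width)
    y = sigmoid(sum(w * p for w, p in zip(omega, phi)))
    return [(y - t) * p for p in phi]


if __name__ == "__main__":
    train_pts = [(0.0, 0.0), (1.0, 0.0)]
    omega0 = [0.0, 0.0]
    print(predict((0.0, 0.0), train_pts, omega0, width=1.0))
    print(cross_entropy_grad((0.0, 0.0), 1.0, train_pts, omega0, width=1.0))
```

With all parameters at zero the prediction is 0.5 everywhere, so every point lies on the decision boundary; the gradient then points toward increasing the score of positively labeled points.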
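Next we derive the gradient for the topological loss. First we need to modify the classifier slightly. Notice that the range of $f$ is between zero and one, and the prediction is $\mathrm{sign}(f(x, \omega) - 1/2)$. To fit our setting, in which the zero-valued level set is the classification boundary, we use the surrogate function $\tilde{f}(x, \omega) = \phi(x)^T \omega$ as the input for the topological loss; its zero-valued level set coincides with the set where $f = 1/2$. Notice that $\partial \tilde{f}(x, \omega) / \partial \omega = \phi(x)$. We have the gradient of the topological loss
$$\nabla_\omega L_T(\tilde{f}) = \sum_{c \in \mathcal{C}(S_{\tilde{f}})} 2\, \tilde{f}(r_c, \omega)\, \phi(r_c) = \sum_{c \in \mathcal{C}(S_{\tilde{f}})} 2\, (\phi(r_c)^T \omega)\, \phi(r_c). \qquad (3.3)$$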
Substituting Equation (3.3) into Equation (3.2) gives us the gradient of the loss.
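For a surrogate that is linear in the parameters, the topological gradient reduces to a weighted sum of feature vectors evaluated at the relevant critical vertices. The sketch below assumes that the persistence computation has already returned those vertices and that their feature vectors are given as plain lists; the function name and the toy numbers are hypothetical, and the critical vertices are treated as fixed, as in the piecewise linear setting above.

```python
from typing import List, Sequence, Tuple


def topo_loss_and_grad(
    critical_features: List[Sequence[float]], omega: Sequence[float]
) -> Tuple[float, List[float]]:
    """Topological loss and gradient for a surrogate g(x) = phi(x)^T omega.

    critical_features holds phi(r_c) for the relevant critical vertex r_c of
    each boundary component c (assumed to be provided by a persistence
    computation, and held fixed while differentiating).
    """
    loss = 0.0
    grad = [0.0] * len(omega)
    for phi in critical_features:
        g = sum(w * p for w, p in zip(omega, phi))  # surrogate value at r_c
        loss += g * g                               # squared robustness
        for j, p in enumerate(phi):
            grad[j] += 2.0 * g * p                  # chain rule term
    return loss, grad


if __name__ == "__main__":
    feats = [[1.0, 0.5], [0.2, -1.0]]
    omega = [0.3, -0.4]
    loss, grad = topo_loss_and_grad(feats, omega)
    print(loss, grad)
    # One small descent step reduces the loss when the vertices stay fixed.
    step = 0.1
    omega_next = [w - step * g for w, g in zip(omega, grad)]
    print(topo_loss_and_grad(feats, omega_next)[0])
```

In a full training loop this gradient would be scaled by the weight of the topological loss and added to the summed per-data gradients, with the critical vertices recomputed at every iteration.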
Computation. Our overall algorithm repeatedly computes the gradient of the loss function and updates the parameters $\omega$ accordingly, until it converges. At each iteration, to compute the gradient of the topological loss, we compute the persistent homology for the current classifier function $f$. The output includes all pairs of critical points. We identify the relevant critical point $r_c$ for every component $c$ of the classification boundary.
For a 2D domain, we use a triangular mesh and evaluate the classifier function values at all vertices (Figure 5). This piecewise linear approximation is fed into the persistence computation algorithm. As mentioned in Remark 1, the computation takes $O(n \log n)$ time, in which $n$ is the size of the mesh. For high-dimensional data, this approach is computationally prohibitive; the mesh size grows exponentially with the dimension of the domain. Instead, we construct a discretization based on the $k$-nearest neighbor (KNN) graph of the training data: all training data constitute the vertices, and all edges of the KNN graph constitute the edges of the discretization. Since we are focusing on 0D topological features, only vertices and edges are relevant. So the algorithm is quasilinear in the size of the KNN graph, and thus $O(N \log N)$ for $N$ training data, assuming a constant $k$ for the KNN construction. Our choice is justified by experimental evidence that the KNN graph provides a sufficient approximation for the topological computation in the context of clustering [15, 34].
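A KNN-based discretization only needs a vertex set (the training points) and an undirected edge set. The following sketch builds the edges by brute force, which costs quadratic time and is meant only to make the construction concrete; for larger data one would substitute an approximate nearest-neighbor library. The function name is hypothetical, and ties between equidistant neighbors are broken by index.

```python
from typing import List, Sequence, Tuple


def knn_graph_edges(points: List[Sequence[float]], k: int) -> List[Tuple[int, int]]:
    """Undirected edges (i, j), i < j, linking each point to its k nearest neighbors."""
    n = len(points)
    edges = set()
    for i in range(n):
        # Squared Euclidean distances from point i to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # symmetrize the relation
    return sorted(edges)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (3.0,), (7.0,)]
    print(knn_graph_edges(pts, k=1))
```

Evaluating the classifier function at the vertices of this graph gives exactly the input (vertex values plus edges) that a 0D persistence computation requires.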
Multilabel settings. For multilabel classification with $K$ classes, we use the multinomial logistic regression classifier $f^k(x, W)$, $k = 1, \ldots, K$, with parameters $W = [\omega_1, \ldots, \omega_K]$. The per-data loss is again the standard cross-entropy loss. For the topological loss, we create $K$ different scalar functions $h^k(x, W) = \max_{j \neq k} f^j(x, W) - f^k(x, W)$. If $h^k(x, W) < 0$, we classify $x$ as label $k$. The 0-valued level set of $h^k$ is the classification boundary between label $k$ and all others. Summing the total robustness over all the different $h^k$'s gives us the multilabel topological loss. The overall loss function is
$$L(f, \mathcal{D}) = \sum_{(x, t) \in \mathcal{D}} \ell(f(x, W), t) + \lambda \sum_{k=1}^{K} \sum_{c \in \mathcal{C}(S_{h^k})} \rho(c)^2,$$
in which $S_{h^k}$ is the zero-valued level set of $h^k$. We omit the derivation of the gradients due to space constraints. The computation is similar to the binary-labeled setting, except that at each iteration, we need to compute the persistence diagrams for all the $K$ functions.
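One way to realize the per-class scalar functions is a one-vs-all margin: for each class, take the best competing score minus the score of that class. This particular form is an assumption made for illustration (the symbols in the source text are not recoverable with certainty), and the function names are hypothetical. With this choice, a point is assigned to a class exactly when that class's margin is negative, and the zero level set of each margin separates that class from all others.

```python
from typing import List


def one_vs_all_margins(scores: List[float]) -> List[float]:
    """Margin of each class: best competing score minus the class's own score.

    Assumes at least two classes. A negative margin means the class wins.
    """
    margins = []
    for k, s_k in enumerate(scores):
        best_other = max(s for j, s in enumerate(scores) if j != k)
        margins.append(best_other - s_k)
    return margins


def predict_label(scores: List[float]) -> int:
    """Label whose margin is smallest (negative when there is a unique winner)."""
    margins = one_vs_all_margins(scores)
    return min(range(len(scores)), key=lambda k: margins[k])


if __name__ == "__main__":
    class_scores = [0.2, 0.5, 0.3]  # e.g. class probabilities at one point
    print(one_vs_all_margins(class_scores))
    print(predict_label(class_scores))
```

Each margin function can then be evaluated on the vertices of the discretization and passed to the same 0D persistence computation as in the binary case.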
4 Experiments
We test our method (TopoReg) on multiple synthetic and real-world datasets. While our method removes spurious topological structures, it does not directly control the smoothness/simplicity of the classifier. We therefore add to our overall loss function (Equation 3.1) an additional norm term, which complements our loss by smoothing the model slightly. The weights of the norm term and of the topological loss, as well as the Gaussian kernel width, are tuned via cross-validation. Computing topological information requires a discretization of the domain. For 2D data, we normalize the data to fit the unit square $[0, 1]^2$, and discretize the square into a mesh with 300 × 300 vertices. For high-dimensional data, we use the KNN graph with a fixed $k$.
Baselines. We compare our method with several baseline methods. We compare with standard kernel methods such as SVM with RBF kernels (SVM) and Kernel Logistic Regression (KLR), both with functional norms ($L_1$ and $L_2$) as regularizers. We also compare with two state-of-the-art methods based on geometric regularizers, namely, the Euler’s Elastica classifier (EE) [29] and the Classifier with Differential Geometric Regularization (DGR) [4]. All relevant hyperparameters are tuned using cross-validation.
For every dataset and each method, we randomly divide the data into 6 folds. Then we use each of the 6 folds in turn as the testing set, while doing a 5-fold cross-validation on the remaining 5 folds to find the best hyperparameters. Once the best hyperparameters are found, we train on all of the remaining 5 folds and test on the testing set. We calculate the mean and standard deviation of the error over the six test folds. We also report the average error over each category.
Data. In order to thoroughly evaluate the behavior of our model and validate the quality of the KNN-based discretization (see Section 5), we created synthetic data with various noise levels. We create two-moons (Moons) and two-circles (Circles) datasets with both feature and label noise. We also create synthetic datasets with points normally distributed around two and three clusters (Blob2 and Blob3). We also evaluate our method on real-world data. We use several UCI datasets with various sizes and dimensions to test our method [28]. In addition, we use the kidney renal clear cell carcinoma (KIRC) dataset [44] extracted from The Cancer Genome Atlas project (TCGA) [1]. The features of the dataset are the protein expressions measured on the MD Anderson Reverse Phase Protein Array Core platform (RPPA).
The results are reported in Table 1. We also report the average performance over each category (AVERAGE). The two numbers next to each dataset name are the data size $N$ and the dimension $D$, respectively. The average running time over all the datasets for our method is 2.08 seconds.
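The evaluation protocol is a nested cross-validation, and its index bookkeeping can be sketched as follows. The text does not say whether the inner 5-fold split reuses the five remaining outer folds or reshuffles them; the sketch assumes the former. The function names, the seed argument, and the round-robin fold assignment are illustrative choices.

```python
import random
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, validation indices)


def make_folds(n: int, k: int, seed: int) -> List[List[int]]:
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv_splits(
    n: int, outer_k: int = 6, seed: int = 0
) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """Yield (train_all, test, inner_splits) for each outer test fold.

    inner_splits are the cross-validation splits used for hyperparameter
    selection; train_all is used for the final fit before testing.
    """
    folds = make_folds(n, outer_k, seed)
    for o in range(outer_k):
        test = folds[o]
        rest = [folds[j] for j in range(outer_k) if j != o]
        inner: List[Split] = []
        for v, val in enumerate(rest):
            train = [i for j, fold in enumerate(rest) if j != v for i in fold]
            inner.append((train, val))
        train_all = [i for fold in rest for i in fold]
        yield train_all, test, inner


if __name__ == "__main__":
    for train_all, test, inner in nested_cv_splits(30, seed=1):
        print(len(train_all), len(test), len(inner))
```

Keeping the test fold out of every inner split is what prevents the hyperparameter search from seeing the data on which the error is finally reported.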
Table 1: Mean ± standard deviation of the test error (in percent) for each method.
Synthetic
Dataset | SVM | EE | DGR | KLR | TopoReg
Circles (100,2) | 40.85 ± 9.23 | 47.71 ± 8.45 | 57.06 ± 2.94 | 40.85 ± 5.74 | 38.76 ± 7.00
Moons (500,2) | 19.80 ± 1.70 | 19.00 ± 1.76 | 19.01 ± 1.16 | 18.83 ± 3.58 | 18.63 ± 3.89
Blob3 (100,2) | 42.09 ± 10.93 | 62.48 ± 13.81 | 41.24 ± 12.31 | 40.00 ± 10.12 | 39.02 ± 10.92
Blob2 (500,5) | 7.61 ± 1.53 | 8.41 ± 3.35 | 7.41 ± 1.78 | 7.80 ± 1.67 | 7.20 ± 2.64
AVERAGE | 27.59 | 34.4 | 31.18 | 26.87 | 25.90
UCI
Dataset | SVM | EE | DGR | KLR | TopoReg
SPECT (267,22) | 18.68 ± 6.12 | 16.38 ± 6.88 | 23.92 ± 3.69 | 18.31 ± 5.89 | 17.54 ± 7.63
Congress (435,16) | 4.59 ± 2.31 | 4.59 ± 2.17 | 4.80 ± 2.70 | 4.12 ± 3.56 | 4.58 ± 2.54
Molec. (106,57) | 19.79 ± 6.10 | 17.25 ± 6.74 | 16.32 ± 7.81 | 19.10 ± 9.86 | 12.62 ± 10.74
Cancer (286,9) | 28.64 ± 5.31 | 28.68 ± 7.01 | 31.42 ± 7.13 | 29.00 ± 5.03 | 28.31 ± 6.98
Vertebral (310,6) | 23.23 ± 3.70 | 17.15 ± 5.40 | 13.56 ± 2.80 | 12.56 ± 2.26 | 12.24 ± 2.25
Energy (768,8) | 0.65 ± 0.70 | 0.91 ± 1.14 | 0.78 ± 0.64 | 0.52 ± 0.64 | 0.52 ± 0.64
AVERAGE | 15.93 | 14.16 | 15.13 | 13.94 | 11.80
TCGA
Dataset | SVM | EE | DGR | KLR | TopoReg
KIRC (243,166) | 32.56 ± 10.04 | 31.38 ± 11.61 | 35.50 ± 9.13 | 31.38 ± 11.55 | 26.81 ± 8.85
5 Discussion and Future Work
Our method generally outperforms existing methods on the datasets in Table 1. On synthetic data, we found that TopoReg has a bigger advantage on relatively noisy data. This is expected. Our method provides a novel way to topologically simplify the global structure of the model, without having to sacrifice too much of the flexibility of the model. Meanwhile, to cope with large noise, other baseline methods have to enforce an overly strong global regularization in a structure-agnostic manner. To evaluate the quality of the KNN approximation for topological computation, we also compared the performance of the mesh-based and KNN-based implementations on 2D synthetic datasets, where both methods are available. We observe that the KNN-based approach performs only slightly worse than the mesh-based implementation. This gives us more confidence in our KNN-based method when applying it to high-dimensional data.
Our topological loss and its gradient are agnostic of the underlying classifier. In the future, we would like to extend the proposed method to convolutional neural network (CNN) classifiers, for which powerful regularizers [24, 39] have been invented but are much less understood [46].
Scalability. Topological information characterizes the data in a very global manner. One concern is the scalability of the computation of the gradient of our topological loss. Using various implementation techniques, such as the KNN-based discretization and the focus on 0D topological features, our method has been very efficient on the datasets we tested. For even larger datasets, the major bottleneck is the computation of the kernel matrix. To extend to CNNs, we expect that a reasonable computational speed can be achieved by adopting a state-of-the-art library for approximate KNN queries [31] and a state-of-the-art implementation for topological computation [18].
References
 [1] TCGA: The Cancer Genome Atlas.
 [2] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier. Persistence images: A stable vector representation of persistent homology. The Journal of Machine Learning Research, 18(1):218–252, 2017.
 [3] P. K. Agarwal, H. Edelsbrunner, J. Harer, and Y. Wang. Extreme elevation on a 2-manifold. Discrete and Computational Geometry (DCG), 36(4):553–572, 2006.
 [4] Q. Bai, S. Rosenberg, Z. Wu, and S. Sclaroff. Differential geometric regularization for supervised learning of classifiers. In International Conference on Machine Learning, pages 1879–1888, 2016.
 [5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
 [6] P. Bendich, H. Edelsbrunner, and M. Kerber. Computing robustness and persistence for images. IEEE transactions on visualization and computer graphics, 16(6):1251–1260, 2010.
 [7] P. Bendich, H. Edelsbrunner, D. Morozov, A. Patel, et al. Homology and robustness of level and interlevel sets. Homology, Homotopy and Applications, 15(1):51–72, 2013.
 [8] C. M. Bishop. Pattern Recognition and Machine Learning, volume 4. springer New York, 2006.
 [9] P. Bubenik. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015.
 [10] X. Cai and A. Sowmya. Level learning set: A novel classifier based on active contour models. In Proc. European Conf. on Machine Learning (ECML), pages 79–90. 2007.
 [11] G. Carlsson, V. de Silva, and D. Morozov. Zigzag persistent homology and real-valued functions. In Proc. 25th Annu. ACM Sympos. Comput. Geom., pages 247–256, 2009.
 [12] M. Á. Carreira-Perpiñán and C. K. Williams. On the number of modes of a Gaussian mixture. In International Conference on Scale-Space Theories in Computer Vision, pages 625–640. Springer, 2003.
 [13] M. Carrière, M. Cuturi, and S. Oudot. Sliced wasserstein kernel for persistence diagrams. In International Conference on Machine Learning, pages 664–673, 2017.
 [14] F. Chazal, M. Glisse, C. Labruère, and B. Michel. Convergence rates for persistence diagram estimation in topological data analysis. In International Conference on Machine Learning (ICML), pages 163–171, 2014.
 [15] F. Chazal, L. J. Guibas, S. Y. Oudot, and P. Skraba. Persistence-based clustering in Riemannian manifolds. Journal of the ACM (JACM), 60(6):41, 2013.
 [16] C. Chen, D. Freedman, and C. H. Lampert. Enforcing topological constraints in random field image segmentation. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2089–2096. IEEE, 2011.
 [17] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Extending persistence using Poincaré and Lefschetz duality. Foundations of Computational Mathematics, 9(1):79–103, 2009.
 [18] T. K. Dey, D. Shi, and Y. Wang. Simba: an efficient tool for approximating Rips-filtration persistence via simplicial batch-collapse. arXiv preprint arXiv:1609.07517, 2016.
 [19] H. Edelsbrunner, B. T. Fasy, and G. Rote. Add isotropic gaussian kernels at own risk: More and more resilient modes in higher dimensions. Discrete & Computational Geometry, 49(4):797–822, 2013.
 [20] H. Edelsbrunner and J. Harer. Computational Topology: an Introduction. AMS, 2010.
 [21] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 454–463. IEEE, 2000.
 [22] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.
 [23] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pages 1633–1643, 2017.
 [24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [25] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Learning sparse Bayesian classifiers: multiclass formulation, fast algorithms, and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell., 32, 2005.
 [26] G. Kusano, Y. Hiraoka, and K. Fukumizu. Persistence weighted Gaussian kernel for topological data analysis. In International Conference on Machine Learning, pages 2004–2013, 2016.
 [27] R. Kwitt, S. Huber, M. Niethammer, W. Lin, and U. Bauer. Statistical topological data analysis - a kernel perspective. In Advances in Neural Information Processing Systems, pages 3070–3078, 2015.
 [28] M. Lichman. UCI machine learning repository, 2013.
 [29] T. Lin, H. Xue, L. Wang, B. Huang, and H. Zha. Supervised learning via Euler’s elastica models. Journal of Machine Learning Research, 16:3637–3686, 2015.
 [30] T. Lin, H. Xue, L. Wang, and H. Zha. Total variation and Euler’s elastica for supervised learning. Proc. International Conf. on Machine Learning (ICML), 2012.
 [31] Y. A. Malkov and D. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016.
 [32] J. R. Munkres. Elements of algebraic topology. CRC Press, 1984.
 [33] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78. ACM, 2004.
 [34] X. Ni, N. Quadrianto, Y. Wang, and C. Chen. Composing tree graphical models with persistent homology features for clustering mixed-type data. In International Conference on Machine Learning, pages 2622–2631, 2017.
 [35] S. Nowozin and C. H. Lampert. Global connectivity potentials for random field models. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 818–825. IEEE, 2009.
 [36] M. R. Oswald, J. Stühmer, and D. Cremers. Generalized connectivity constraints for spatio-temporal 3D reconstruction. In European Conference on Computer Vision, pages 32–46. Springer, 2014.
 [37] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multiscale kernel for topological machine learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4741–4748, 2015.
 [38] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
 [39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [40] J. Stühmer, P. Schröder, and D. Cremers. Tree shape priors with connectivity constraints using convex relaxation on general graphs. In ICCV, volume 13, pages 1–8, 2013.
 [41] K. Varshney and A. Willsky. Classification using geometric level sets. Journal of Machine Learning Research, 11:491–516, 2010.
 [42] K. R. Varshney and K. N. Ramamurthy. Persistent topology of decision boundaries. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 3931–3935. IEEE, 2015.
 [43] S. Vicente, V. Kolmogorov, and C. Rother. Graph cut based image segmentation with connectivity priors. In Computer vision and pattern recognition, 2008. CVPR 2008. IEEE conference on, pages 1–8. IEEE, 2008.
 [44] Y. Yuan, E. M. Van Allen, L. Omberg, N. Wagle, A. Amin-Mansour, A. Sokolov, L. A. Byers, Y. Xu, K. R. Hess, L. Diao, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nature biotechnology, 32(7):644, 2014.
 [45] Y. Zeng, D. Samaras, W. Chen, and Q. Peng. Topology cuts: A novel min-cut/max-flow algorithm for topology preserving segmentation in n–d images. Computer vision and image understanding, 112(1):81–90, 2008.
 [46] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
 [47] D. Zhou and B. Schölkopf. Regularization on discrete spaces. In Pattern Recognition, pages 361–368. Springer, 2005.
 [48] X. Zhu, A. Vartanian, M. Bansal, D. Nguyen, and L. Brandl. Stochastic multiresolution persistent homology kernel. In IJCAI, pages 2449–2457, 2016.
 [49] A. Zomorodian and G. Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.