Let's take a Walk on Superpixels Graphs: Deformable Linear Objects Segmentation and Model Estimation
Abstract
While robotic manipulation of rigid objects is quite straightforward, coping with deformable objects is an open issue. More specifically, tasks like tying a knot, wiring a connector or even surgical suturing deal with the domain of Deformable Linear Objects (DLOs). In particular, the detection of a DLO is a nontrivial problem, especially under clutter and occlusions (as well as self-occlusions). The pose estimation of a DLO results in the identification of its parameters related to a designed model, e.g. a basis spline. It follows that the standalone segmentation of a DLO might not be sufficient to conduct a full manipulation task. This is why we propose a novel framework able to perform both semantic segmentation and b-spline modeling of multiple deformable linear objects simultaneously, without strict requirements about the environment (i.e. the background). The core algorithm is based on biased random walks over the Region Adjacency Graph built on a superpixel oversegmentation of the source image. The algorithm is initialized by a Convolutional Neural Network that detects the DLO's endcaps. An open source implementation of the proposed approach is also provided to ease the reproduction of the whole detection pipeline, along with a novel cables dataset in order to encourage further experiments.
1 Introduction
Plenty of manipulation tasks deal with objects that can be modeled as nonrigid linear – or more generally tubular – structures. In case the task has to be executed by a robot in an unstructured environment, particular effort must be devoted to effectiveness, reliability and efficiency of the automated perception subsystem.
Tying knots, for example, is a common though hard to automate activity. In particular, in surgical operations like suturing, grasping and knot-tying are very important and repetitive subtasks ([javdani2011modeling][jackson2015automatic][saha2007manipulation]). Similar knot-tying and path planning procedures, e.g. knot untangling, are also relevant to contexts like service, collaborative and rescue robotics ([lui2013tangled][nair2017combining][schulman2013tracking][hopcroft1991case]). As for industrial scenarios, one of the hardest tasks involving flexible linear objects is wire routing in assembly processes ([remde1999picking][yue2002manipulating][alvarez2016approach][koo2008development]).
This paper focuses primarily on the industrial field and extends the original concepts introduced in the WIRES
In this paper we propose a novel computer vision algorithm for generic Deformable Linear Object (DLO) detection. As highlighted in Fig. 1, the proposed algorithm yields a twofold representation of the detected objects, namely a b-spline model for each target alongside a segmentation of the whole image. This twofold representation helps to address both relatively simple application settings dealing with cable detection and more complex endeavours calling for the estimation of cable bend. The proposed algorithm consists of two distinct modules: the first, which may be considered as a preprocessing stage, detects the endcap regions of DLOs by exploiting off-the-shelf Convolutional Neural Networks ([redmon2017yolo9000, huang2016speed]); the second module, instead, is the core of this work and allows for identifying DLOs based on the coarse position of their endpoints in images featuring complex backgrounds as well as occlusions (e.g. cables crossing other cables) and self-occlusions (e.g. a cable crossing itself, possibly several times).
The algorithm exploits an oversegmentation of the source image into superpixels to build a Region Adjacency Graph (RAG [tremeau2000regions]). This representation makes it possible to detect the area enclosing each target object by efficiently analyzing only meaningful regions (i.e. superpixels) rather than the whole pixel set. The task is accomplished by an iterative procedure capable of finding the best path (or walk) through the RAG between two seed points by analyzing several local and global features (e.g. visual similarity, overall curvature, etc.). This iterative procedure yields a directed graph of superpixels conducive to vectorization and b-spline approximation. As better explained in the remainder of the paper, our approach is mostly unsupervised (i.e. only the endpoint detection CNN is trained in a supervised manner) and relies on just a few parameters that can be easily tuned manually (or estimated based on the characteristics of the object's material, e.g. elasticity or plasticity). As we shall see in Sec. 4, our algorithm outperforms other known approaches that one may apply to try solving the addressed task.
2 Related Work
Object Type      Curvature        Intersections  Bifurcations
Cables           Spline Model     A              N
Fingerprints     Low and Bounded  N              A
Guidewires       Spline Model     N              N
Pavement Cracks  Random           A              A
Power Lines      Spline Model     A              N
Roads            Model Based      A              A
Ropes            Spline Model     A              N
Surgery Threads  Spline Model     A              N
Vessels          Low and Bounded  N              A
Although the literature concerning DLOs is more focused on manipulation than on perception (see e.g. [saha2007manipulation]), we may refer to the broader topic of Curvilinear Objects Segmentation to highlight related works as well as suitable alternatives to evaluate our proposal comparatively. As described in [bibiloni2016survey], the aforementioned topic pertains to several kinds of objects, as summarized in Table 1. Our target objects are Cables, which, however, share similar properties in terms of Intersections and Bifurcations with the other categories highlighted in bold in Table 1.
As far as Cables are concerned, visual perception is typically addressed in fairly simple settings. In [jiang2011robotized] Augmented Reality markers are deployed to track endpoints. In other works, like [remde1999picking] and [camarillo2008vision], detection relies on background removal.
Moving to the domain of knot-tying with Ropes, the basic approach still turns out to be background removal, like in [hopcroft1991case], or its 3D counterpart, plane removal, like in [lui2013tangled] and [schulman2016learning]. All these methods produce a raw set of points on which a region growing algorithm is run to attain a vectorization of the target object. A different approach is used in [schulman2013tracking], where the model of the object is registered to the 3D point cloud in real time in order to avoid the segmentation step. As described in [nair2017combining], deep learning can also be used to track a deformable object: deep features are associated with rope configurations so as to establish a direct mapping toward energy configurations without any explicit modeling step. This approach, however, is unlikely to work effectively in the presence of complex and/or unknown backgrounds.
In the medical field, Surgery Threads detection is the same kind of problem, albeit at a smaller scale. The literature dealing with this domain is also more focused on manipulation than on detection issues, either assuming the latter as solved [moll2006path] or addressing it by hand-labeled markers [javdani2011modeling]. A more scalable approach is proposed in [padoy20113d] and [jackson2015automatic], where the authors borrow the popular Frangi Filter [frangi1998multiscale] from the field of vessel segmentation in order to enhance the curvilinear structure of suture threads and produce a binary segmentation from which a spline model can be estimated.
Although Table 1 would suggest that Ropes, Guidewires, Power Lines and Surgery Threads exhibit more commonalities with Cables, application domains like Vessels Segmentation or Road detection can provide interesting insights and solutions for curvilinear object detection. Akin to most object detection problems, the most successful approaches to Vessels Segmentation leverage deep learning. In [liskowski2016segmenting] the authors trained a Convolutional Neural Network (CNN) with hundreds of thousands of labeled images in order to obtain a very effective detector. This supervised approach, however, mandates the availability of a huge training set and lots of man-hours, a combination hardly affordable in real-world industrial settings. Similar considerations apply to other remarkably effective vessel segmentation approaches based on supervised learning like [melinvsvcak2015retinal] and [li2016cross]. Yet, in this realm several methods exploiting 2D filtering procedures are very popular and may be applied within a cable detection pipeline for industrial applications. The first interesting approach was developed by Frangi et al. [frangi1998multiscale] (hereinafter Frangi algorithm
3 Algorithm Description
The basic idea underlying our algorithm is to detect DLOs as suitable walks within an adjacency graph built on superpixels. We provide first an overview of the approach with the help of Fig. 2, which illustrates the following main steps of the whole pipeline.

0. Endpoints Detection: The first step consists in detecting endpoints. This is numbered as zero because it may be thought of as an external process not tightly linked to the rest of the algorithm. Indeed, any external algorithm capable of producing detections around the targets may be deployed in this step.

1. Superpixel Segmentation: The source image is segmented into adjacent subregions (superpixels) in order to build a set of segments exhibiting a far smaller cardinality than the whole pixel set. Moreover, an adjacency graph is created on top of this segmentation in order to keep track of the neighborhood of each superpixel. Finally, the superpixels containing the 2D detections obtained in the previous step are marked as seeds.

2. Start Walks: From each seed we can start an arbitrary number of walks by moving into adjacent superpixels as defined by the adjacency graph.

3. Extend Walks: For each walk we move forward along the adjacency graph by iteratively choosing the best next superpixel (see Fig. 2(3)) among the neighbours of the current one.

4. Terminate Walks: When a walk reaches another seed (or lies in its neighborhood) it is marked as closed. Due to the iterative nature of the computation, a maximum number of extending steps is allowed for each search to ensure a bounded time complexity.

5. Discard Unlikely Walks: As a set of random walks is started in Step 2, we keep only the most likely ones and mark the others as outliers.
[Fig. 3: (1) Example of crossing and self-crossing wires; (2) Visual likelihood; (3) Curvature likelihood; (4) Distance likelihood.]
In the remainder of this Section we will describe in detail the different concepts, methods and computations needed to realize the whole pipeline. In particular, in Sec. 3.1 we address Superpixels together with the Region Adjacency Graph; in Sec. 3.2 we define walks and how they can be built iteratively by analyzing local and global features; in Sec. 3.3 we propose a method to start the random walks by exploiting an external Object Detector based on a CNN and, to conclude, in Sec. 3.4 we describe how to deploy walks to attain a semantic segmentation of the image as depicted in Fig. 1.
3.1 Superpixel Segmentation and Adjacency Graph
The main aim of Superpixel Segmentation is to replace the rigid structure of the pixel grid with a higher-level subdivision into more meaningful primitives called superpixels. These primitives group regions of perceptually similar raw pixels, thereby potentially reducing the computational complexity of further processing steps. As regards the Cable Detection and Segmentation problem (i.e. DLOs with a thickness larger than a single pixel, in the majority of applications), our assumption is that the target wire can be represented as a subset of similar adjacent superpixels. Thus, the overall problem can be seen as a simple iterative search through the superpixel set subject to model-driven constraints (e.g. avoiding solutions with implausible curvature or non-uniform visual appearance).
Superpixel Segmentation algorithms can be categorized into graph-based approaches (e.g. the method proposed by Felzenszwalb et al. [felzenszwalb2004efficient]) and gradient-ascent methods (e.g. Quick Shift [vedaldi2008quick]). In our experiments we found the state-of-the-art algorithm referred to as SLIC [achanta2012slic] to perform particularly well in terms of both speed and accuracy. Accordingly, we deploy SLIC in our Superpixel Segmentation stage. SLIC is an adaptation of the k-means clustering algorithm to pixels represented as 5D vectors $[l, a, b, x, y]$, with $(l, a, b)$ denoting color channels in the CIELAB space and $(x, y)$ image coordinates. During the clustering process the compactness of each cluster can be either increased or reduced to the detriment of visual similarity. In other words, we can easily choose to assign more importance to the visual consistency of superpixels or to their spatial uniformity. Fig. 4 (b),(c) show two segmentations provided by SLIC according to different settings for the visual consistency vs. spatial uniformity tradeoff.
Superpixel Segmentation then allows us to build a Region Adjacency Graph (RAG in short) according to the method described in [tremeau2000regions]. Thus, a generic image $I$ can be partitioned into disjoint non-empty regions $R_i$ (i.e. the superpixels) such that $I = \bigcup_i R_i$. Accordingly, an undirected weighted graph is given by $G = (V, E)$, where $V$ is the set of nodes $v_i$, each corresponding to a region $R_i$, and $E$ is the set of edges $e_{i,j}$, such that $e_{i,j} \in E$ if $R_i$ and $R_j$ are adjacent. A graphical representation of this kind of graph is shown in Fig. 4(d), where black dots represent nodes and black lines represent edges. It is quite straightforward to observe that in Fig. 4(d) there exists a walk through the graph, highlighted by white dots, which covers a target DLO (i.e. the red cable in the middle).
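As a minimal sketch of the adjacency relation just described, the snippet below builds an unweighted RAG from a superpixel label grid. The function name `build_rag`, the plain nested-list input and the choice of 4-connectivity are assumptions made for illustration; they are not taken from the Ariadne implementation, and edge weights are omitted.

```python
from collections import defaultdict


def build_rag(labels):
    """Return {region_id: set of adjacent region_ids} for a 2D label grid.

    Two regions are considered adjacent when they share at least one
    horizontal or vertical pixel border (4-connectivity, an assumption
    of this sketch).
    """
    rows, cols = len(labels), len(labels[0])
    adjacency = defaultdict(set)
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            adjacency[a]  # touch the key so isolated regions still get a node
            # Looking only right and down visits every pixel border once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = labels[rr][cc]
                    if a != b:
                        adjacency[a].add(b)
                        adjacency[b].add(a)
    return dict(adjacency)


# Toy label image with four superpixels (ids 0..3).
labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
]
rag = build_rag(labels)
```

In this toy grid regions 0 and 3 touch only diagonally, so under the 4-connectivity assumption they are not linked by an edge.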
It is worth pointing out that our approach is similar to a Region Growing algorithm, with a seed point corresponding to the cable's tip and the search space bounded by the RAG. The main difference with respect to a classical Region Growing approach is that we restrict the search along a walk by applying several model-based constraints rather than relying on visual similarity only. In particular, the shape of the walk is considered by assigning geometric primitives to the elements of the adjacency graph, i.e. 2D points for our nodes and 2D segments for our edges, as further described in Sec. 3.2. In simple terms, the geometric consistency of the curve superimposed on a walk is analyzed to choose the next node during the iterative search, and all unlikely configurations are discarded.
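[Fig. 4: panels (a) to (d); (b) and (c) show two SLIC segmentations obtained with different settings, (d) shows the Region Adjacency Graph.]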
3.2 Walking on the Adjacency Graph
Formally, a walk over a graph $G = (V, E)$ is a sequence of alternating vertices and edges $(v_{i_0}, e_{i_0,i_1}, v_{i_1}, \dots, e_{i_{l-1},i_l}, v_{i_l})$, where an edge $e_{i,j}$ connects nodes $v_i$ and $v_j$, and $l$ is the length of the walk. The definition of walk is more general than that of path or trail over the graph because it admits repeated vertices, a common situation when dealing with self-crossing cables. It is important to notice that the Region Adjacency Graph shown in Fig. 4(d) is a simple-connectivity relationship graph or, equivalently, its connectivity is of order $k = 1$, meaning that only directly connected regions are mapped into the graph. We can also build RAGs with order $k > 1$, thereby allowing, for example, second or third order connectivity. All this translates into the possibility to jump, during the walk, also to a vertex not directly connected to the current region. This turns out to be very useful, for example, to deal with intersections like that depicted in Fig. 3, which shows neighbouring vertices of both first and second order.
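To illustrate higher-order connectivity, the following sketch collects all vertices reachable within a given number of hops with a breadth-first search and records the order at which each one is first reached. The function name `neighbours_up_to_order` and the dictionary-based graph are hypothetical choices for this example, not part of the original software.

```python
from collections import deque


def neighbours_up_to_order(adjacency, start, k):
    """Return {vertex: order} for every vertex within k hops of start."""
    order = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if order[v] == k:
            continue  # do not expand beyond the requested order
        for u in adjacency[v]:
            if u not in order:
                order[u] = order[v] + 1
                queue.append(u)
    del order[start]  # the start vertex is not its own neighbour
    return order


# Toy chain graph 0 - 1 - 2 - 3.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
second_order = neighbours_up_to_order(chain, 0, 2)
```

With order 2, vertex 2 becomes a candidate for a walk standing on vertex 0 even though the two are not directly adjacent, which is the kind of jump needed to step over a crossing.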
For the sake of simplicity, we can define a generic walk as $\pi_n = (v_1, \dots, v_n)$, i.e. an ordered subset of vertices, without considering edges. Under the hypothesis that the walk is superimposed on a portion of the object, the problem is to extend the walk in such a way that the next node also belongs to the sought DLO. An exemplar situation is illustrated in Fig. 3, where the current path ends with $v_n$ and we wish to choose, among the 8 candidate vertices, the best one to extend the walk. Considering the new path $\pi_{n+1} = (v_1, \dots, v_n, v_{n+1})$, i.e. the current path with the addition of a candidate vertex $v_{n+1}$, we cast the problem as the estimation of the likelihood of the new path given the current one, which we denote as $P(\pi_{n+1} \mid \pi_n)$. Moreover, we estimate this likelihood based on visual similarity, curvature smoothness and spatial distance features, and assume these features to be independent:
$$P(\pi_{n+1} \mid \pi_n) = P_V(\pi_{n+1} \mid \pi_n) \, P_C(\pi_{n+1} \mid \pi_n) \, P_D(\pi_{n+1} \mid \pi_n) \tag{1}$$
The three terms $P_V$, $P_C$ and $P_D$ are referred to as Visual, Curvature and Distance likelihood, respectively, and are computed as follows.
Visual Likelihood
The Visual likelihood $P_V$ measures the visual similarity between the previous path and the path achievable by adding a candidate node. Assuming an evenly coloured DLO, we can compute this similarity by matching only the last node of the current path, $v_n$, with the candidate $v_{n+1}$. Although, in principle, it may be possible to use any arbitrary visual matching function, as highlighted in [ning2010interactive] we found the Color Histogram of the superpixels associated with the vertices to be a good feature to compare two image regions. Denoting as $H_n$ and $H_{n+1}$ the normalized color histograms (in the HSV color space) of the two regions associated with $v_n$ and $v_{n+1}$, respectively, we can compute their distance with the intersection equation: $d(H_n, H_{n+1}) = \sum_b \min(H_n(b), H_{n+1}(b))$. Then we normalize this distance in the range $[0, 1]$ using the Bradford normal distribution:
(2) 
where the shape parameter of the distribution makes it possible to control the weight assigned to the visual similarity information in the overall computation of the likelihood (Eq. 1). Fig. 3(2) plots the visual likelihoods computed for the different neighbours of the current vertex, which suggest two nodes as the most likely superpixels to extend the walk.
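The snippet below is a rough sketch of a visual term of this kind. The Bradford density $f(x; c) = c / (\ln(1 + c)(1 + c x))$ on $[0, 1]$ is the standard definition, but how it is applied here is an assumption: the sketch feeds it the dissimilarity $1 - \text{intersection}$, so that identical histograms receive the highest value. The function names and the shape value `c=5.0` are illustrative only.

```python
import math


def histogram_intersection(h_a, h_b):
    """Intersection of two normalised histograms (1.0 means identical)."""
    return sum(min(a, b) for a, b in zip(h_a, h_b))


def bradford_pdf(x, c):
    """Bradford density on [0, 1] with shape parameter c > 0."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    return c / (math.log(1.0 + c) * (1.0 + c * x))


def visual_likelihood(h_a, h_b, c=5.0):
    """Score the visual agreement of two superpixel histograms.

    Assumption of this sketch: the Bradford density is evaluated on the
    dissimilarity 1 - intersection, so similar regions score higher.
    """
    dissimilarity = 1.0 - histogram_intersection(h_a, h_b)
    dissimilarity = min(1.0, max(0.0, dissimilarity))  # guard against rounding
    return bradford_pdf(dissimilarity, c)


same = visual_likelihood([0.5, 0.5], [0.5, 0.5])
different = visual_likelihood([1.0, 0.0], [0.0, 1.0])
```

A larger shape value makes the density drop faster, which gives the visual term more weight in the product of Eq. 1.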
Curvature Likelihood
The Curvature likelihood $P_C$ is concerned with estimating the most likely configuration of a DLO's curvature. Following the intuitions of Predoehl et al. [predoehl2013statistical], for each new node we can assume that the object's curvature changes smoothly along the walk. To quantify this smoothness criterion we exploit the product of the von Mises distributions of the angles between two successive vertices. As introduced in Sec. 3.1, by extending the model of our adjacency graph with geometric primitives we can assign to each vertex a 2D point corresponding to the centroid of the associated superpixel, as well as a unit vector to each edge by considering the segment joining two consecutive centroids. As shown in Fig. 5, this allows for measuring the angle difference between two consecutive edges. Based on the angle differences between consecutive edges, the overall von Mises distribution used to assess the smoothness of the curvature of a target DLO is given by:
(3) 
where each factor is the von Mises distribution evaluated at a vertex. An exemplar estimation is shown in Fig. 3(3): two vertices appear to be the most likely candidates to extend the walk, as they minimize the curvature changes of the target walk.
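One possible reading of this smoothness term is sketched below: the heading of each edge is computed from consecutive centroids, and a zero-mean von Mises density is evaluated on the turning angle at every interior vertex. Whether the original formulation uses the turning angle itself or its variation along the walk is not recoverable from the text, so this choice, the function names and the concentration value `m=4.0` are assumptions.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def von_mises_pdf(theta, m):
    """Zero-mean von Mises density with concentration m."""
    return math.exp(m * math.cos(theta)) / (2.0 * math.pi * bessel_i0(m))


def curvature_likelihood(centroids, m=4.0):
    """Product of von Mises densities of the turning angles along a walk."""
    headings = [
        math.atan2(y1 - y0, x1 - x0)
        for (x0, y0), (x1, y1) in zip(centroids, centroids[1:])
    ]
    likelihood = 1.0
    for a, b in zip(headings, headings[1:]):
        # The cosine inside the density is periodic, so no angle wrapping is needed.
        likelihood *= von_mises_pdf(b - a, m)
    return likelihood


straight = [(0, 0), (1, 0), (2, 0), (3, 0)]
bent = [(0, 0), (1, 0), (1, 1), (0, 1)]
smooth_score = curvature_likelihood(straight)
bent_score = curvature_likelihood(bent)
```

A straight walk has zero turning angle at every vertex and therefore scores higher than a walk with two right-angle turns.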
Distance Likelihood
The Distance likelihood $P_D$ is the term concerned with the spatial distance of the next vertex in the walk. This term is mainly introduced to force the iterative procedure to prefer the nearest available vertex without ruling out the chance to pick a farther vertex instead, for example when we want to deal with an intersection (see Fig. 3(1)). Thus, similarly to the Visual likelihood, we normalize the distance in pixels between two nodes according to the Bradford normal distribution:
(4) 
with the shape parameter tuned so that the distribution decays slowly: the nearest vertices are preferred, but the farthest ones are not discarded. Fig. 3(4) highlights how, thanks to this normalization, second order neighbours ($k = 2$) are not excessively penalized with respect to first order ones ($k = 1$) and hence have the chance to be picked in case they exhibit a high visual similarity and/or yield a particularly smooth walk.
Estimation of the most likely walk
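The likelihood $P(\pi_{n+1} \mid \pi_n)$ of the walk $\pi_{n+1}$ obtained by appending a candidate vertex $v_{n+1}$ to the current walk $\pi_n$ can therefore be computed for all considered neighbours in order to pick the most likely vertex, $v^{*}$, to extend the walk, with:
$$v^{*} = \arg\max_{v_{n+1} \in \mathcal{N}(v_n)} P(\pi_{n+1} \mid \pi_n) \tag{5}$$
where $\mathcal{N}(v_n)$ denotes the set of considered neighbours of the current vertex $v_n$.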
Considering again the example in Fig. 3, the vertex selected to extend the walk is the farthest from the current one, as it shows a high visual likelihood as well as a high curvature likelihood.
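The selection step can be sketched as a product of the three terms followed by an arg max over the candidates. The function name `pick_next_vertex` and all numeric values below are made up for illustration; they only mimic the situation described for Fig. 3, where a far vertex wins thanks to its visual and curvature scores.

```python
def pick_next_vertex(candidates, visual, curvature, distance):
    """Return the candidate maximising the product of the three likelihoods.

    visual, curvature and distance map each candidate vertex id to the
    corresponding likelihood term (independence is assumed, as in Eq. 1).
    """
    def score(v):
        return visual[v] * curvature[v] * distance[v]

    return max(candidates, key=score)


# Hypothetical scores: "near" is close but looks different and bends sharply,
# "far" is a second-order neighbour that matches colour and direction.
visual = {"near": 0.2, "far": 0.9}
curvature = {"near": 0.3, "far": 0.8}
distance = {"near": 0.9, "far": 0.5}
best = pick_next_vertex(["near", "far"], visual, curvature, distance)
```

Here the far candidate scores 0.36 against 0.054 for the near one, so the walk jumps to it despite the lower distance term.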
3.3 Starting and Terminating Walks
As described in Sec. 3, walks need to be initialized with seed superpixels located at the DLOs' endpoints. To this end, we deployed a Convolutional Neural Network to detect endpoints. In particular, we fine-tuned the publicly available YOLO v2 model [redmon2017yolo9000], pretrained on ImageNet, on the images from our Electrical Cable Dataset (see Sec. 4.1), performing several data augmentations. As already mentioned, the endpoint detection module may be seen as an external process with respect to our core algorithm and, as such, in the comparative experimental evaluation we will use the same set of endpoints obtained by YOLO v2 to initialize all considered methods.
As illustrated in Fig. 2(1), the endpoint detection step predicts a set of bounding boxes around the actual endpoints. For each such prediction we find the superpixel containing the central point of the box. The graph vertex corresponding to this superpixel is marked as a seed to start a new walk. As no prior information concerning the direction of the best walk across the target DLO is available, multiple walks are actually started, in particular along each possible direction (see Fig. 2(3)). It is worth pointing out that the considered directions are those defined by the seed vertex and all its neighbours in the graph, which, as discussed in Sec. 3.2, can be both first order ($k = 1$) and higher order ($k > 1$) neighbours. Each started walk is then iteratively extended according to the procedure described in Sec. 3.2.
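The seed selection described above amounts to a lookup of the superpixel under each box centre. The sketch below assumes boxes given as `(x_min, y_min, x_max, y_max)` in pixel coordinates and a superpixel map stored as a nested list indexed `[row][column]`; both conventions and the function name are assumptions of this example.

```python
def seeds_from_boxes(superpixel_map, boxes):
    """Return the superpixel ids lying under the centre of each box."""
    seeds = []
    for x_min, y_min, x_max, y_max in boxes:
        cx = (x_min + x_max) // 2  # column of the box centre
        cy = (y_min + y_max) // 2  # row of the box centre
        superpixel = superpixel_map[cy][cx]
        if superpixel not in seeds:  # two boxes may fall on the same region
            seeds.append(superpixel)
    return seeds


superpixel_map = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
seeds = seeds_from_boxes(superpixel_map, [(0, 0, 1, 1), (2, 2, 3, 3)])
```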
As for the criteria to terminate walks, first of all we set a maximum number of iterations to extend a walk. We also terminate a walk if it reaches another seed vertex in the adjacency graph. More precisely, as depicted in Fig. 2(5), we terminate a walk if the distance between the current vertex and a seed is smaller than a given radius threshold. Thus, given a seed, all walks started from that seed will eventually terminate and we have to pick only the optimal one. To this end, we exploit again the Curvature analysis described in Sec. 3.2 and use a formulation similar to Eq. 3 to pick the smoothest path (i.e. the walk with the highest value of the Curvature likelihood).
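The two termination criteria can be sketched as a bounded loop with a proximity check against the other seeds. The callable `choose_next`, the function name `extend_walk` and the default values of `max_steps` and `radius` are placeholders introduced for this example; in practice `choose_next` would implement the likelihood-based selection of Sec. 3.2.

```python
import math


def extend_walk(start, seeds, centroids, choose_next, max_steps=100, radius=10.0):
    """Extend a walk from a seed until it closes or the step budget ends.

    Returns (walk, closed), where closed is True when the last vertex lies
    within `radius` pixels of a seed other than the starting one.
    """
    walk = [start]
    for _ in range(max_steps):
        nxt = choose_next(walk)
        if nxt is None:  # no admissible neighbour left
            break
        walk.append(nxt)
        for seed in seeds:
            if seed != start and math.dist(centroids[nxt], centroids[seed]) < radius:
                return walk, True
    return walk, False


# Toy setup: four superpixels on a line, seeds at both ends.
centroids = {0: (0, 0), 1: (10, 0), 2: (20, 0), 3: (30, 0)}


def step_right(walk):
    """Placeholder policy: always move to the next vertex id."""
    return walk[-1] + 1 if walk[-1] < 3 else None


closed_walk = extend_walk(0, [0, 3], centroids, step_right, radius=5.0)
early_stop = extend_walk(0, [0, 3], centroids, step_right, max_steps=2, radius=5.0)
```

Once every walk started from a seed has terminated, the closed ones can be ranked, for instance by their curvature score, to keep a single walk per cable.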
3.4 Segmentation and Model Estimation
The simplest technique to segment the image is to assign different labels to the superpixels belonging to the different walks, alongside a background label for those superpixels not included in any walk. Besides, we estimate a B-Spline approximation (with the algorithm described in [dierckx1995curve]) for each walk based on the centroids of its superpixels. Then, given the B-Spline model, we can refine the segmentation by building a pixel mask. In particular, by evaluating the smoothing polynomial we can densely sample the points belonging to each parametrised curve and then adjust the thickness of the segmented output based on the mean size of the superpixels belonging to the walk. The accurate segmentation provided by this procedure is shown in Fig. 1, where the right image contains the colored B-Splines built over the walks represented in the middle image (the color is estimated by averaging the color of the corresponding superpixels).
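The simplest labelling strategy mentioned at the start of this section can be sketched as follows. The function name, the nested-list superpixel map and the rule that the first walk keeps a superpixel shared at a crossing are assumptions of this example; the B-Spline refinement is not covered.

```python
def label_superpixels(superpixel_map, walks, background=0):
    """Turn walks into a per-pixel label image.

    Walk number i (1-based) labels all of its superpixels with i; every
    other superpixel gets the background label. At crossings the first
    walk keeps the shared superpixel (an assumption of this sketch).
    """
    owner = {}
    for label, walk in enumerate(walks, start=1):
        for superpixel in walk:
            owner.setdefault(superpixel, label)
    return [[owner.get(sp, background) for sp in row] for row in superpixel_map]


# Toy map with one pixel per superpixel; two walks cross at superpixel 1.
superpixel_map = [
    [0, 1, 2],
    [3, 4, 5],
]
segmentation = label_superpixels(superpixel_map, [[0, 1, 2], [4, 1]])
```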
4 Experimental results
4.1 Software and Dataset
We provide an open source software framework called Ariadne available online.
The Dataset is made up of two separate parts: the first consists of 60 cable images with homogeneous backgrounds (white, wood, colored papers, etc.); the second includes 10 cable images with complex backgrounds. In Fig. 7, the first 3 rows deal with homogeneous backgrounds whilst the others deal with complex ones.
For each image in the dataset we provide a hand-labeled binary mask superimposed over each target cable separately and an overall mask that is their union. Furthermore, we provide a discretization of the B-Spline for each cable in every image, consisting of a set of 2D points in pixel coordinates, which is useful to have a lighter model of the cable and to easily track its endings. Further details can be found on the dataset website.
4.2 Segmentation results
To test our approach we compared it to the popular Curvilinear Object Detectors discussed in Sec. 2: the Frangi 2D Filter [frangi1998multiscale], the Ridge Algorithm [staal2004ridge] and the more generic ELSD [puatruaucean2012parameterless]. For each algorithm we produce a mask associated with the detected curvilinear structures and compare it, by means of the Intersection over Union (IoU), to the ground truth provided by our dataset. Table 2 reports the IoU weighted over the images, where the weight is proportional to the number of cables present in each image. The first row refers to images featuring a homogeneous background (60 images with a total of 395 cables), the second to those having a complex background (10 images and a total of 40 cables). We tuned the hyperparameters of the three competing methods trying to choose the best configuration in order to cope with both the simpler and the harder dataset images. As for our algorithm, we used: a 3D Color Histogram with 8 bins for each channel as Visual similarity feature; a von Mises distribution to compute the Curvature likelihood; and a degree $k = 3$ for the adjacency graph (i.e. we search in a neighborhood of level 3 during our walk construction, as described in Sec. 3.2). For both our approach and the competitors we exploit the information about the endpoints provided by the initialization step: for Ariadne we use this information to start random walks, for the competitors to discard many outliers.
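A possible way to compute such a score is sketched below. The exact weighting formula is not recoverable from the text, so the snippet assumes a cable-count weighted mean, i.e. the sum of per-image IoU values multiplied by the number of cables, divided by the total number of cables. Function names and toy masks are illustrative.

```python
def iou(pred, truth):
    """Intersection over Union of two binary masks given as 2D lists of 0/1."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


def weighted_iou(per_image_iou, cables_per_image):
    """Mean IoU over images, weighted by the number of cables in each image."""
    total = sum(cables_per_image)
    return sum(v * n for v, n in zip(per_image_iou, cables_per_image)) / total


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
single = iou(pred, truth)
dataset_score = weighted_iou([single, 1.0], [3, 1])
```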
The results reported in Table 2 show how our approach outperforms the other methods by a large margin, although it is fair to point out that the competitors are generic curvilinear detectors and not specific DLO detectors. It is also worth highlighting that our method is remarkably robust with respect to complex backgrounds and that this is achieved without any training or fitting procedure that would hinder applicability to unknown scenarios. Thus, our approach could be used in any real industrial application without requiring prior knowledge of the environment.
Finally, Fig. 7 presents some qualitative results obtained by our approach. Moreover, an additional qualitative evaluation is provided in the supplementary material, where interactive examples of the Ariadne software are shown. In that material, as a naive proof of concept, we also tested Ariadne in similarly challenging contexts like Roads and Rivers segmentation.
                        Ariadne  Frangi [frangi1998multiscale]  Ridge [staal2004ridge]  ELSD [puatruaucean2012parameterless]
Homogeneous Background  0.754    0.406                          0.293                   0.225
Complex Background      0.583    0.063                          0.023                   0.147
4.3 Timings and failure cases
Ariadne is an iterative approach, with the number of iterations depending on the length of the DLO. Thus, as depicted in Fig. 6, we can estimate an average iteration time of about and an average segmentation time of about . We also point out that these measurements were obtained with the actual Python implementation. Moreover, Fig. 8(a),(b) shows the two main failure cases that we found for Ariadne. In (a), as two DLOs (blue and green) are adjacent and exhibit very similar colour and curvature, the walk may jump on the wrong cable. In (b), as the distance between the DLO and the camera varies greatly, the density of Superpixels is not constant and the walk can cover only a portion of the sought object or even completely fail.
5 Concluding Remarks
We presented an effective unsupervised approach to segment DLOs in images. This segmentation method may be deployed in industrial applications involving wire detection and manipulation. Our approach requires an external detector to localize cable terminals, as otherwise we would have to start walks at every superpixel, which would be almost unworkable, although not impossible. So far we deploy an external detector which provides only the approximate position of the endpoints. We are currently working to develop a smarter endpoint detector capable of inferring also the orientation of the cable terminal, in order to dramatically shrink the number of initial walks. Another future development concerns building a much larger Electrical Cable dataset equipped with ground truth information suitable to train a specific CNN aimed at cable segmentation and to compare this supervised approach to Ariadne. It is worth pointing out that, in specific settings known in advance, a supervised approach may turn out to be particularly effective: in such circumstances Ariadne could be used to vastly ease and speed up the manual labeling procedure required to obtain the training images, by replacing the initial object detector with the interactive intervention of the user.
References
Footnotes
 This work was supported by the European Commission's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 601116.
 In [camarillo2008vision] the authors deal with a thin flexible manipulator which may be described as a cable due to the high number of degrees of freedom.
 https://github.com/ntnubioopt/libfrangi
 https://github.com/kapcom01/Curviliniar_Detector
 https://github.com/m4nh/ariadne
 https://github.com/m4nh/cables_dataset