Using Curvilinear Features in Focus for Registering a Single Image to a 3D Object

Hatem A. Rashwan, Sylvie Chambon, Pierre Gurdjos, Géraldine Morin and Vincent Charvillat
University of Toulouse, IRIT
Abstract

In the context of 2D/3D registration, this paper introduces an approach for matching features detected in two different modalities, photographs and 3D models, through a common 2D representation. More precisely, 2D images are matched with a set of depth images representing the 3D model. After introducing the concept of curvilinear saliency, related to curvature estimation, we propose a new ridge and valley detector for depth images rendered from a 3D model. A variant of this detector is adapted to photographs, in particular by applying it at multiple scales and by combining it with the principle of focus curves. Finally, a registration algorithm is proposed for determining the correct viewpoint of the 3D model and thus the pose. It relies on histogram of gradients (HOG) features adapted to the features manipulated in 2D and in 3D, together with repeatability scores. The presented results highlight the quality of the detected features, in terms of repeatability, as well as the interest of the approach for registration and pose estimation.

I Introduction

Many computer vision and robotics applications take 2D content as input but, increasingly, 3D content is simultaneously available and popular. In order to benefit from both modalities, 2D/3D matching is necessary. In medical imaging, registration of pre-operative 3D volume data with intra-operative 2D images is increasingly needed to help physicians diagnose complicated diseases easily and quickly [1]. In robotics, 2D/3D matching is essential for many tasks that require the 3D pose of an object of interest, such as 3D navigation or object grasping [2]. The main goal of 2D/3D registration is to find the transformation of the 3D model that defines the pose for a query 2D image. Thus, a typical 2D/3D registration problem consists of two mutually interlocked subproblems: point correspondence and pose estimation.

To match 2D photographs directly to 3D models or point clouds, most systems rely on detecting and describing features in both the 2D and 3D data and then matching these features [6, 7]. Recently, some approaches rely on learning with a supervised classifier [3, 4]. These methods produce very interesting results; however, they require huge amounts of viewpoint-annotated images to train the classifiers. What makes matching 3D features of an object to 2D features of one of its photographs difficult is that the appearance of the object strongly depends on intrinsic characteristics of the object, like texture and color/albedo, as well as on extrinsic characteristics related to the acquisition, like the camera pose and the lighting conditions. Consequently, some approaches manually define correspondences between the query image and the 3D model, such as [5]. These manual methods can be robust, but such a manual selection quickly becomes impractical for large image sets; moreover, in this paper, we focus on automated approaches. Note that some systems are able to acquire photographs and scan a 3D model simultaneously, but such systems are limited to specific applications. Other methods solve the problem by distinguishing two subproblems: choosing a common representation of the data and then finding the correspondences. These methods transform the initial 2D/3D registration problem into a 2D/2D matching problem by rendering multiple 2D images of the 3D models from different viewpoints, such as [8, 9, 10].

Consequently, the first task of 2D/3D registration is to find an appropriate representation of 3D models in which reliable features can be extracted from both the 2D and 3D data. In [8], synthetic images of the 3D model are rendered, while depth images are rendered in [9]. More recently, [10] proposes average shading gradients: this rendering technique for a 3D model averages the shading gradients over all lighting directions to cope with the unknown lighting of the query image. The advantage of representing the 3D model by a set of depth images is that it expresses the model shape independently of color and texture information. Therefore, representing the 3D model by a set of depth images is the best option for this work, see Fig. 1. In this case, features extracted from depth images are only related to shape information.

Fig. 1: To compare 2D images with 3D models, we use a collection of images of the 3D model rendered from different viewpoints, and then detect points of interest (curvilinear saliency) with a common definition for depth images and intensity images. Each depth image is compared with the original 2D image based on these detected points, and the proposed algorithm outputs the depth image whose viewpoint is the most similar to that of the 2D image.

Then, the second difficulty of 2D/3D registration consists in proposing how to match entities between the two modalities in this common representation. The matching can be partial [11] or dense, based on local or global characteristics [12]. In [8], silhouettes extracted from synthetic images are matched to silhouettes extracted from the color images. However, this method cannot take into account most of the occluding contours that are useful for accurate pose estimation. In turn, in [10], image gradients are matched with their 3D representation. Since image gradients are still affected by image textures and background, this technique can fail to estimate the correct correspondences. A key requirement on these features, as in classic 2D matching between real images, is that they be computed with a high degree of repeatability. In our case, similarly to the definition in [13], the repeatability of a feature is defined as the frequency with which a feature detected in the depth image is found within a few pixels of the same location in the corresponding intensity image (assuming the features do not move or follow only a small displacement). Then, since we suppose that a single photograph of an object of interest is acquired, in a textured environment, we focus on comparing pre-processed features of color images with features of a set of rendered images of the 3D model, more precisely a set of depth images, see Fig. 1.
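For illustration, a minimal sketch of such a repeatability score is given below; the function name and the pixel tolerance are ours and do not reproduce the exact evaluation protocol of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def repeatability(points_depth, points_image, tol=2.0):
    """Fraction of feature points detected in a rendered depth image that have
    a detected point in the photograph within `tol` pixels.
    points_* are (N, 2) arrays of pixel coordinates."""
    if len(points_depth) == 0 or len(points_image) == 0:
        return 0.0
    dists, _ = cKDTree(points_image).query(points_depth, k=1)
    return float(np.mean(dists <= tol))
```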

More precisely, the 3D object is given by a set of 3D depth surfaces, which describe how the original object surface is foreshortened by a perspective viewing, and the image is given by the intensity surface. Since the depth and intensity surfaces have a different order of representation, the two surfaces cannot be directly matched. Thus, bringing both rendered depth images and photographs into a common representation, such as a gradient and edge representation, allows establishing a robust sparse 2D-to-3D matching [10]. We propose to extract gradient-based features corresponding to the object's shape in both depth and intensity images, regardless of illumination and texture changes. In other words, as 2D photographs (intensity images) are affected by background, textures and lighting changes, we take these difficulties into account by reducing the influence of non-redundant information (i.e., color and texture) on the features extracted from photographs. In depth images, we extract features that highlight the geometric characteristics of the object. For photographs, we refine the detected features by selecting salient points acquired by a camera in focus. These points are a function of the degree of blur (blurriness) in the image: the detected points are analyzed by measuring the blur amount at every feature point. Finally, what we call focus points should capture the approximate shape and discard the other components such as textures.

To summarize, the contributions of this paper, as shown in Fig. 1, are:

  1. A ridge and valley detector for depth images rendered from a 3D model. We name it curvilinear saliency (CS) as it is related to curvature estimation (a function of the eigenvalues derived from the Hessian matrix). This representation directly relates to the discontinuities of the object's geometry and, by nature, the extracted features should be robust to texture and lighting changes.

  2. A variant of this detector adapted to photographs. This curvilinear saliency detector is applied at multiple scales, searching over all scales and all image locations in order to identify scale-invariant interest points. To reduce the influence of structures due to texture and background regions, we introduce the extraction of focus curvilinear saliency features, which correspond to ridges that are not affected by blur.

  3. A registration algorithm for determining the correct viewpoint of the 3D model and thus the pose. This method is based on histogram of gradients, HOG, features [14], adapted to the features manipulated in 2D and in 3D, and on the introduction of repeatability scores. More precisely, the HOG descriptor is computed on both depth images (i.e., curvilinear features extracted with curvilinear saliency detection) and photographs (i.e., curvilinear features in focus extracted with multi-scale curvilinear saliency detection) and it combines the curvilinear saliency value with the orientation of the curvature. The repeatability score measures the set of repeatable points detected both in a photograph and in the rendered depth images.

After presenting the related work and a reminder on differential geometry (sections II and III), we introduce the 3D model representation (section IV) and then the image representation used for 2D/3D matching (section V). We then describe how we achieve robustness to background and texture by using the same principle as in the detection of focus curves (section VI). We illustrate how this new global approach for 2D/3D matching yields more repeatable features than the state of the art (section VII). Finally, we present 2D/3D registration results (section VIII) and pose estimation results (section IX), highlighting the interest of the proposed approach in these applications, before concluding (section X).

II Related Work

As mentioned earlier, a typical 2D/3D registration problem consists of two subproblems: feature correspondence and pose estimation (i.e., alignment). Thus, the related work is divided into three parts related to these subproblems: 1) detecting features in a 2D photograph, 2) detecting features in a 3D model and, finally, 3) matching 2D features to 3D features to estimate the 3D pose.

II-A Classic 2D feature detection

In this section, we examine whether the literature offers a classic 2D detector suitable for obtaining points comparable to points detected, with the same principle or the same tool, in 3D. We suppose that, for this purpose, it is necessary to detect features related to points of interest on the structure of the object, not on its texture or on lighting changes. In 2D, edge detection [15], based on first-order derivative information, is the earliest technique. It can detect any kind of edges, even low-contrast edges that are due to texture rather than structure. The second technique is to detect interest points [16], i.e., point-like features in an image, by analyzing the eigenvalues of the structure tensor (i.e., first-order derivatives) at each point: the two eigenvalues have to be maximal to highlight a point of interest. Again, this technique does not take into account difficulties due to texture, lighting changes or scale changes. Alternatively, blob detection [18, 20, 19, 21] provides a complementary description of image structures in terms of regions, as opposed to point-like algorithms. These methods are based on the Hessian matrix (i.e., second-order derivatives) because of its independence from zero- and first-order changes and its good performance in computation time and accuracy. More recently, multi-scale approaches have been introduced, such as the generalization of the Harris and Laplacian detectors [22] or the well-known SIFT, Scale Invariant Feature Transform [19]. In [23], SURF, Speeded Up Robust Features, a detector also based on Hessian matrix analysis, was introduced to be faster than SIFT and other multi-scale techniques by using an approximation of the Laplacian and algorithmic tricks. All these techniques are robust to lighting changes, rotations and translations, which makes the feature detection invariant to viewpoint changes. However, they rely entirely on texture and/or intensity changes to find the features.

Curvature detection is one of the most important second-order derivative-based techniques used for extracting structural properties. Recently, [17] proposed a detector based on curvature, expressed as the change of the image gradient along the tangent, to obtain a scalar approximating the curvature. In addition, [24] presented the PCBR, Principal Curvature-Based Regions, detector, which uses the maximum or minimum eigenvalue of the Hessian matrix to find the principal curvature in a multi-scale space. As mentioned in [24], the maximum eigenvalue yields a high value only for dark lines on a light background, whereas the minimum eigenvalue detects light lines on a dark background. By definition, this is a restrictive way to select features and it does not guarantee that the maximal number of reliable features is selected.

To conclude this review, curvature features have several advantages over more traditional intensity-based features [17], especially for extracting the local structure around points of interest. In particular, curvature features are invariant to viewpoint changes and to transformations that do not change the shape of the surface.

II-B Classic 3D feature detection

In this subsection, our goal is to determine whether some 3D detector from the literature can be directly adapted to 2D, in order to obtain comparable points of interest. Feature extraction for 3D models/scenes can be classified into point-based and image-based approaches. Most point-based methods rely on SIFT in 3D, by proposing an adaptation of the initial SIFT [25, 26]. These approaches are interesting but they are not dedicated to 2D/3D registration, and so they do not aim at detecting similar features in 2D and in 3D. Other methods propose to find curves that have special properties in terms of the differential geometry of the surface. For example, in [27], curves are supposed to be located at parabolic lines, which occur at points of extremal curvature. These curves capture important object properties close to the object surface, but they do not vanish along the surface when the viewpoint changes.

In image-based approaches, the 3D model is first rendered to form images or geometric buffers; then, image processing methods, such as edge detection, are applied [28]. In [29], 2D SIFT is applied to images of a rendered 3D mesh because this multi-scale representation extracts features that are supposed to be related to local extrema of the surface variation. Apparent ridges (AR), a set of curves whose points are local maxima of a view-dependent curvature on the surface, were introduced in [30]. In that paper, the view-dependent curvature corresponds to the variation of the surface normal with respect to a viewing screen plane. However, apparent ridges often produce false edges that are not related to occluding contours, which are important for the pose estimation of most objects. Mesh saliency measures the importance of regions of 3D models using Gaussian-weighted mean curvatures at multiple scales [31]. However, mesh saliency is based on mesh images that are also affected by lighting conditions. Average Shading Gradients, ASG, were proposed in [10]. This rendering technique is based on averaging gradients over all lighting directions to cope with unknown lighting conditions.

In conclusion, only [10] can be used in a 2D/3D registration context; we describe this method in section IV and compare our work with it in section VII.

II-C 2D/3D registration and pose estimation

In the computer vision literature, the problem of automatically aligning 2D photographs with an existing 3D model of the scene has been investigated in depth over the past fifteen years. In the general case, the proposed solution is an image-to-model registration that estimates the 3D pose of the object. The 2D-to-3D registration problem is approached in the literature through indirect and direct methods [32].

Indirect registration methods proceed either by 3D-to-3D registration or by finding appropriate registration parameters, such as the standard Iterative Closest Point, ICP, algorithm [65]. These techniques are more global and do not really provide point-to-point correspondences.

For direct registration methods, in [26], correspondences are obtained by matching SIFT descriptors between SIFT points extracted in the images and in the 3D models. However, establishing reliable correspondences may be difficult because the sets of points detected in 2D and in 3D are not always similar, in particular because of the variability of the illumination conditions during the 2D and 3D acquisitions. In the same context, in [45], the authors assume that the object in the input image has no or poor internal texture. Methods relying on higher-level features, such as lines [33], planes [34] and building bounding boxes [35], are generally suitable for Manhattan World scenes and hence applicable only in such environments. Skyline-based methods [36], as well as methods relying on a predefined 3D model [39], are likewise of limited applicability. Recently, the histogram of gradients, HOG, descriptor [44, 41] or a fast version of HOG [9] have also been used to extract features from rendered views and real images. All of these approaches give interesting results; however, they do not evaluate the repeatability between the set of points detected in an intensity image and those detected in an image rendered from the 3D model. Finally, in [10], 3D corner points are detected using the 3D Harris detector and average shading gradient images are rendered around each point. For a query image, corner points are similarly detected in multi-scale. Then, the gradients computed on patches around each point are matched with the database of average shading gradient images using the HOG descriptor. This method still relies on extracting gradients of photographs affected by textures and background, which can yield erroneous correspondences. Consequently, the authors propose a refinement stage based on RANdom SAmple Consensus, RANSAC [56], to improve the final pose estimation.

In this paper, structural cues (e.g., curvilinear shapes) based on curvilinear saliency are extracted instead of only considering silhouettes, since they are more robust to intensity, color, and pose variations. In fact, they have the advantage of representing both outer and inner (self-occluding) contours, which also characterize the object and are useful for estimating the pose. In order to merge curvilinear saliency values and curvature orientation in the same descriptor, the HOG descriptor, widely used in the literature and well suited to describing the object shape, is employed. Finally, HOG features and repeatability scores are used to match the query image with a set of depth images rendered from a 3D model.
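The following sketch illustrates how a HOG descriptor computed on curvilinear saliency maps can be combined with a repeatability score to rank the rendered views; the weighting `alpha`, the HOG parameters and the function names are illustrative assumptions, not the exact formulation used in the experiments.

```python
import numpy as np
from skimage.feature import hog

def view_score(cs_query, cs_depth, rep_score, alpha=0.5):
    """Similarity between the query photograph and one rendered depth view,
    combining a HOG descriptor computed on the curvilinear saliency maps
    (the histogram bins accumulate curvature orientation weighted by CS value)
    with a repeatability score.  Both maps must have the same size."""
    d_q = hog(cs_query, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), feature_vector=True)
    d_v = hog(cs_depth, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), feature_vector=True)
    hog_sim = float(np.dot(d_q, d_v) /
                    (np.linalg.norm(d_q) * np.linalg.norm(d_v) + 1e-12))
    return alpha * hog_sim + (1.0 - alpha) * rep_score

# The retained viewpoint is the rendered depth image that maximizes view_score.
```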

In the rest of the paper, after the reminder on differential geometry, we describe the 3D model representation and the image representation that are introduced in order to compare and to match the 3D and the 2D data. In particular, we explain how these representations provide robustness to background details and texture, before illustrating the interest of the proposal with experimental results on feature detection, registration between 2D images and 3D models, and pose estimation.

III Reminder on Differential Geometry

Notations

In the sequel, these notations are used:

  • : the gradient vector of a scalar-valued function .

  • : the partial first-order derivative of a vector-valued function w.r.t. variable .

  • Similarly, : the partial second-order derivatives of w.r.t. variables and .

Moreover, we assume a calibrated perspective camera where the image point coordinates are given with respect to the normalized image frame, i.e., as obtained from the equation , where are pixel coordinates and is the usual upper triangular calibration matrix [63].

III-A Definitions: Differential of a Map – Tangent Plane – Gauss Map

Let be a regular surface (see the definition of a regular surface in [66, p. 281-286]) whose parameterization is given by the differentiable map with

(1)

To each is associated a map , called the differential of at and defined as follows [62, p. 128]. Let be a vector and let be a differentiable curve satisfying and . By the chain rule, the curve in is also differentiable. We then define

It provides a linear (i.e., first-order) approximation to when the increment is small enough. This is illustrated in Fig. 2(a). The vector subspace has dimension 2 and is a plane consisting of all tangent vectors of at . It is called the tangent plane of at and denoted by .

It can be proved [62, p129] that the above definition does not depend on the choice of . Furthermore, the fact that entails that is linear in . In particular, in the canonical bases of and , we have

involving the Jacobian matrix of at

(2)

with and as columns. This also shows that the vector subspace has indeed dimension 2.

Let be a point of . Let be the differentiable map that assigns to the coordinate vector on the unit sphere representing the unit normal of at and computed as

(3)

This map is called the Gauss map of .

The Gauss map is a mapping between the two surfaces and and the definition of differential is extended to that case. The differential of the Gauss map of at point is the map defined as follows. Let be a vector on the tangent plane of at and let be a differentiable curve on the surface satisfying and . By the chain rule, the curve in is also differentiable; we then define

It expresses how behaves — how it curves — in the vicinity of . This is illustrated in Fig. 2(b). Again, it can be proved [62, p129] that the above definition of does not depend on the choice of one possible curve .

(a)
(b)
Fig. 2: (a) A map parameterizing a regular surface and its differential at point along direction . (b) The map N is the Gauss map (a mapping between the surface and the unit sphere ) and its differential at along .

Similar to the differential of a map seen above, the fact that entails that is linear in . The vector subspace has dimension 2: it is the plane consisting of all tangent vectors to the unit sphere at point , called the tangent plane of at . Therefore the domain of values of is . Actually, it can readily be seen that and are parallel planes, so the differential of is usually defined as .

III-B Curvatures. Fundamental forms of a surface

Let be a unit vector representing a direction on the tangent plane of at . Let be the curve obtained by slicing with the normal section of at along , i.e., the plane through parallel to both and (in fact, this holds for any plane through parallel to , due to Meusnier's theorem [38, p. 482]: "All curves lying on a surface and having at a given point the same tangent line have at this point the same normal curvatures."). The normal curvature of in (unit) direction is the curvature of at , which can be given by [62, p. 144]:

(4)

It expresses how behaves — how it curves — in the vicinity of . An important remark is that the radius of curvature of at is equal to .

The principal curvatures , of at can be defined as the extrema of function (4) with respect to directions , subject to the constraint . The corresponding directions are called principal directions of at . It is well known that "the coefficients , are decisive parameters that fully describe local surface shape up to the second order modulo a rigid movement" [64]. This means that the principal curvatures are invariant to the surface parameterization.

Now consider a new 3D coordinate system with as origin. Any vector on the tangent plane can be written as

(5)

where is the Jacobian matrix of defined in (2) and are the so-called local coordinates of . From now on, we will put as subscript to relate a vector to its local coordinates, e.g., .

III-B1 First fundamental form of a surface

Given any , the norm of any vector on writes

(6)

where

The quadratic form on

(7)

is called the first fundamental form of [62, p94].

III-B2 Second fundamental form of a surface

Let be a direction of the tangent plane , given in local 2D coordinates. On the one hand, it can be shown [62, p156] that the differential of the Gauss map of at along writes in standard 3D coordinates

On the other hand, let us denote by the differential of at along , expressed in local 2D coordinates, i.e., such that . Then we have

(8)

where

(9)

The proof can be found in [62, p156].

The quadratic form

(10)

is called the second fundamental form of [62, p143]. It directly follows from this that Eq. (4) can be expressed in local coordinates and writes

(11)

For any (non-unit) vector on , in order to obtain a unit direction, one needs to divide (11) by the square of expression (6). Hence, the normal curvature in the unit direction can be given by

(12)

III-B3 Closed-form solutions for principal curvatures

The principal curvatures , of at can be defined as the extrema of function (12) with -coordinates as variables.

Seeing (12) as a generalized Rayleigh quotient, it is known [37, p18] that has an extremum at only if is a root of or, equivalently, only if is an eigenvalue of the matrix , which is not symmetric but always has real eigenvalues [38, p500]. As a result, the principal curvatures are the two eigenvalues () of the matrix . The principal 3D directions are where are the corresponding eigenvectors.

Now we state a proposition that we derive from the above results, which will be used in our work.

Proposition 1

The principal curvature () at associated to the unit principal 3D direction is equal to the absolute magnitude of the differential of the Gaussian map at this point i.e.,

(13)

Proof. Since the Euclidean norm is invariant to changes of Euclidean coordinates, without loss of generality, let us choose a new parameterization , for some new height function , w.r.t. a 3D orthonormal frame whose origin is and whose -plane coincides with the tangent plane . From now on, we add the symbol to distinguish the new representations from the old ones, except for the principal curvatures, which are independent of the parameterization. Let us recall that , where is the associated eigenvector, and note that the new first fundamental matrix, , is then the identity. As a result, starting from the fact that is an eigenvalue of the matrix , we have:

 

IV 3D model representation

The work most closely related to what is proposed in this paper is the Average Shading Gradient (ASG) approach of [10]. After introducing how the object surface can be represented, we highlight the differences between the two approaches.

Object surface : Denote by the surface of some observed object associated to a parameterization , where varies over the restricted image domain of a given camera which is delimited by the occluding contour of the object. Under perspective projection, every visible 3D point of (seen from the camera viewpoint), with vector , is assumed to be in one-to-one correspondence with the 2D image point with vector , such that and . As a result, we get

(14)

Let denote the Gaussian map of which assigns, on the unit sphere, to each point of the unit normal of defined by where, using (14), writes

(15)

It can be shown that the Jacobian matrix of writes

(16)

where the columns of have the form

(17)

standing for either and or .

IV-A Average Shading Gradient (ASG) Feature [10]

Plötz et al. assumed in [10] that the image intensity function obeys the Lambertian shading function

(18)

for a parallel light source . This means that the reflectance describing the object material is assumed to be Lambertian with constant albedo (a general shading function is , where is the albedo at object point ). In addition, the background is assumed to be constant (e.g., a plane).

The authors propose as feature in the intensity image the magnitude of the gradient of the shading function. To register the intensity image to the 3D (untextured) model, the idea is to generate virtual images viewing the object from different camera pose candidates. Nevertheless, it is clearly impossible to render any such virtual image obeying the shading function (18) without prior information about the lighting direction and so about . Therefore, the authors propose to replace the gradient magnitude feature, in the virtual images, by a feature corresponding to the average value of the gradient magnitude computed over all light directions, the so-called average shading gradient magnitude. Denoting the magnitude of the gradient of the shading function (18), the magnitude of the average shading gradient is:

(19)

where the vector , cf. (18), varies over the unit sphere in and is the volume element.

The nice contribution of Plötz et al. is, by applying Jensen’s inequality, to derive the following closed-form bound on

(20)

with . It is reported by the authors to behave like a very good approximation of . This is the elegant way the authors get rid of the unknown lighting direction in (18).

IV-B Curvilinear Saliency Features (CS)

As already mentioned, our goal is to find a common representation of the 3D model and the 2D image in order to be able to match them. We first show how the 3D model can be represented or studied from different points of view and how these different viewpoints can be analyzed and compared to a 2D image. For that purpose, we represent the observed 3D object by a set of synthetic depth maps generated from camera locations distributed on concentric spheres encapsulating the object, obtained by sampling elevation and azimuth angles as well as distances from the camera to the object. A depth map (or depth image) associates to every image point the -coordinate, w.r.t. the camera frame, of the object 3D point (14) that projects onto it.
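A minimal sketch of this viewpoint sampling is given below; the angular ranges, the number of samples and the function name are illustrative, and the actual depth rendering requires a z-buffer rasterizer that is not shown here.

```python
import numpy as np

def sample_viewpoints(radii=(2.0, 3.0), n_azimuth=12, n_elevation=5):
    """Camera centres on concentric spheres around the object, obtained by
    sampling azimuth and elevation angles and sphere radii (values here are
    illustrative).  Each centre defines a camera looking at the object's
    centroid, from which a depth image is rendered with a z-buffer."""
    centres = []
    for r in radii:
        for el in np.linspace(-np.pi / 3, np.pi / 3, n_elevation):
            for az in np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False):
                centres.append([r * np.cos(el) * np.cos(az),
                                r * np.cos(el) * np.sin(az),
                                r * np.sin(el)])
    return np.asarray(centres)
```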

Let denote the depth surface, that is, the 3D surface whose graph parameterization is (note the difference with (14))

Which features should be extracted in the depth map? We aim at detecting depth "discontinuities" by searching for points on having a high principal curvature in one direction and a low principal curvature in the orthogonal direction. We call Curvilinear Saliency features of a surface the loci of such points. Basically, they correspond to the ridges and valleys of this surface. In this work, we use the difference of the principal curvatures to describe the ridges and valleys.

Principal curvatures and directions: Consider a point . Let denote the Gaussian map of assigning to the unit normal of at , such that

(21)

with and .

As the two columns of the Jacobian matrix of are and , the first fundamental form of can be computed as

and the second fundamental form of can be computed as

(22)

where is the Hessian matrix of i.e., with the second-order partial derivatives of w.r.t. and as elements.

The principal curvatures of at coincide with the eigenvalues () of , which are always real. In the tangent plane , the local coordinates of the principal directions of at are given by the eigenvectors of , so the principal directions in 3D write . As Koenderink writes in [64], "it is perhaps not superfluous to remark here that the simple (eigen-)interpretation in terms (by neglecting ) of is only valid in representations where ", which is the condition for the point to be a local extremum.

Thanks to Proposition 1, we know that the principal curvature at associated to the principal 3D direction is equal to the absolute magnitude of the change of the normal

(23)

where denotes the differential of at in direction . We will make use of this result for the image representation, cf. §V. Now let us explain why we propose as a feature the difference where .

Curvilinear feature: Without loss of generality, let and be the principal curvatures computed as ordered eigenvalues of so that . We aim at detecting points lying on "elongated" surface parts. In this work, we detect points at which this difference is high:

(24)

We call (24) the curvilinear saliency (CS) feature. Curvilinear means a feature that belongs to a curved line. The rest of this paragraph justifies such a choice.

Given a point on , let be the Cartesian coordinates on the tangent plane w.r.t. the 2D frame whose origin is and whose orthonormal basis is formed by the principal directions . As a result, can now locally be associated to the new parameterization , for some height function . In that case, it can be readily seen that is the identity matrix, and so is exactly the Hessian matrix of . For some small enough, consider the two planes parallel to at distances from , and the curves . It can be shown [38, p500] that the first-order approximation of the intersections of with the two parallel planes is the union of two conics (one real and one virtual) with equations . This union is known as the Dupin indicatrix when written in canonical form (i.e., by replacing by ). The real Dupin conic characterizes the local shape of and gives local information on the first-order geometry of the surface, at least at points where the conic is non-degenerate. It specializes to a parabola if the Gauss curvature vanishes, i.e., , to an ellipse if , and to a hyperbola if , see Fig. 3. Points are said to be parabolic, elliptic or hyperbolic, respectively.

Fig. 3: The real conics of the Dupin indicatrix.

Let us now focus on the Dupin central conics i.e., the real ellipse and real hyperbola. We do not consider the case of a parabola since, in real experiments, the condition will never be verified exactly.

Various measures can describe such a conic shape [64]. We introduce the quantity (24) that provides a unified way of treating ellipses and hyperbolas using the following nice interpretation.

Let the semi-major and semi-minor axes of the Dupin central conic be and respectively, where and are the radii of curvature of the curves obtained through the normal section of at along the principal directions.

Proposition 2

in (24) is the squared ratio between the eccentricity of the Dupin conic and its minor axis (due to lack of space, the straightforward proof is omitted):

(25)

where

The eccentricity can be interpreted as the fraction of the distance along the semi-major axis at which the focus lies. The quantity is normalized by additionally dividing the squared eccentricity by the squared semi-minor axis. Note that (25) works for Dupin ellipses as well as Dupin hyperbolas. The curvilinear saliency is large when , that is, in the presence of distant foci and so a highly elongated ellipse or a "squashed" hyperbola, see Fig. 3. This occurs e.g., when the point is located on a depth "discontinuity". In turn, when , the conic approaches a circle and the distance between foci becomes very small.
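For completeness, a short verification of Proposition 2 in the elliptic case, writing the semi-axes of the Dupin ellipse in terms of the ordered principal curvatures (our notation, with both curvatures positive):

```latex
% Dupin ellipse \kappa_1 x^2 + \kappa_2 y^2 = 1 with \kappa_1 \ge \kappa_2 > 0:
% semi-minor axis b = 1/\sqrt{\kappa_1}, semi-major axis a = 1/\sqrt{\kappa_2}.
e^2 = 1 - \frac{b^2}{a^2} = 1 - \frac{\kappa_2}{\kappa_1},
\qquad
\frac{e^2}{b^2} = \kappa_1\left(1 - \frac{\kappa_2}{\kappa_1}\right) = \kappa_1 - \kappa_2 .
% The analogous computation for the Dupin hyperbola (\kappa_2 < 0 < \kappa_1)
% also gives e^2/b^2 = \kappa_1 - \kappa_2.
```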

A simple way to compute the curvilinear feature: After algebraic manipulations, it can be shown that where

(26)
Proposition 3

The squared curvilinear feature can be computed as

(27)
(28)

where is the mean curvature of and its Gaussian curvature.

Proof. Let and be the ordered eigenvalues (,) computed from as defined in (26), to have , . As the two eigenvalues of write

(29)

we have and . Since as well as

(30)

the squared curvilinear saliency is then defined.  

However, relying on the highest or smallest principal curvature alone is not adequate for defining accurate ridges [42]. In Fig. 4, we show the different detections obtained using the minimum or the maximum principal curvature, as proposed by [24]. The maximum principal curvature provides a high response only for dark lines on a light background, while the minimum gives a high response only for light lines on a dark background. The difference of the principal curvatures, , improves robustness as it responds in both settings.
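As an illustration, a per-pixel sketch of this computation on a depth map, using the standard fundamental-form expressions for a graph surface z(x, y); the finite-difference scheme and the pre-smoothing are our implementation choices, not the paper's exact pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def curvilinear_saliency(depth, sigma=1.0):
    """CS = k1 - k2 for the graph surface (x, y, z(x, y)) of a depth map.
    The principal curvatures come from the first and second fundamental forms
    of a Monge patch: k1 - k2 = 2 * sqrt(H^2 - K), with H the mean curvature
    and K the Gaussian curvature."""
    z = gaussian_filter(depth.astype(np.float64), sigma)
    zy, zx = np.gradient(z)          # first-order derivatives
    zxy, zxx = np.gradient(zx)       # second-order derivatives
    zyy, _ = np.gradient(zy)
    E, F, G = 1.0 + zx ** 2, zx * zy, 1.0 + zy ** 2
    w = np.sqrt(1.0 + zx ** 2 + zy ** 2)
    e, f, g = zxx / w, zxy / w, zyy / w
    denom = E * G - F ** 2
    H = (e * G - 2.0 * f * F + g * E) / (2.0 * denom)   # mean curvature
    K = (e * g - f ** 2) / denom                        # Gaussian curvature
    return 2.0 * np.sqrt(np.maximum(H ** 2 - K, 0.0))
```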

V Image Representation

V-A Proposed Curvilinear Features for Images

Let denote the value of the image intensity function at image point . Similar to the work of [10], we assume the Lambertian shading model (18). Let the intensity image be treated as an intensity surface defined by the vector function

(31)

Recall that the unit normal is , where is defined in (15), and so only depends on the depth and its derivatives up to order 1.

We now want to detect features in the intensity surface and check whether they are good candidates to be matched to curvilinear features detected in the depth surface , w.r.t. a given camera pose. The key issue here is that features detected in can be matched to features detected in on the condition that both are based on measurements with the same order of derivation in , in order to yield a "compatible" matching that ensures repeatability. The fact that depends on and its derivatives up to order 1 entails that the detection of features in must rely on order-1 variations of the surface , e.g., on its differential along some adequate direction.

Consider a point on the image surface. Let be the differential of at . Given a unit direction in the image -plane, we have where is the Jacobian matrix of and and , where

(32)

standing for either and or . It is an order-1 measurement of the image surface variation at and is compatible with our curvilinear measurements of the depth surface (i.e., with the same order of the derivatives of ).

In order to get a scalar measurement, we define the unit vectors and by rotating by . For , we also define

(33)

which is the differential of along unit direction in the image plane. It can be easily seen that is the eigenvector of

(34)

associated with the largest eigenvalue . It is worth noting the similarity between the expression of the principal curvature computed for the depth surface, cf. (13), and the formula (33). Also, note that the matrix in (V-A) is that of the first fundamental form of . Clearly, the maximum and minimum values of the quadratic form correspond to the two eigenvalues of the first fundamental form matrix given in (V-A).

By an approach similar to that of §IV-B, we can propose as a feature the difference where .

Proposition 4

Let be the two eigenvalues of the first fundamental form matrix of , ordered in descending order. Then, we have

(35)

Proof. We can deduce the ordered eigenvalues of from those of , i.e., and , so and , which ends the proof.
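A minimal sketch of this image feature, under our reading of Proposition 4 as the difference of the two eigenvalues of the first fundamental form matrix of the graph surface (x, y, I(x, y)); for such a surface the difference reduces to the squared gradient magnitude, which the code makes explicit.

```python
import numpy as np

def image_curvilinear_feature(intensity):
    """Difference of the two eigenvalues of the first fundamental form matrix
    [[1 + Ix^2, Ix*Iy], [Ix*Iy, 1 + Iy^2]] of the intensity surface (x, y, I).
    For this graph surface the eigenvalues are 1 + |grad I|^2 and 1, so the
    difference simplifies to the squared gradient magnitude."""
    img = intensity.astype(np.float64)
    Iy, Ix = np.gradient(img)
    trace = 2.0 + Ix ** 2 + Iy ** 2
    det = 1.0 + Ix ** 2 + Iy ** 2
    disc = np.sqrt(np.maximum(trace ** 2 - 4.0 * det, 0.0))
    lam1, lam2 = 0.5 * (trace + disc), 0.5 * (trace - disc)
    return lam1 - lam2   # numerically equals Ix**2 + Iy**2
```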

Again, as in §IV-B, we can describe the local shape of at by means of the eccentricity of a conic, here given by the quadratic form . How can we interpret this conic? The first-order Taylor expansion for infinitesimal changes in the vicinity of yields

(36)

For any unit direction in the -plane, the quadratic form returns the linear part of growth in arc length from to . Therefore, we have

(37)
Fig. 4: Curvilinear saliency of two shapes (columns 1, 5) with minimum (2, 6), maximum (3, 7) and the difference between maximum and minimum eigenvalues (4, 8).

An important remark, which we highlight here and which is not mentioned in [10], is the following. The ASG feature defined in (20) is actually the Frobenius norm of the Jacobian matrix of the map , see (16), up to a constant . Clearly, this describes the second-order behaviour of the surface, through the variation of the normal in the immediate vicinity of a point. Using the results in (15), (16) and (17), we can claim that the extracted feature in the virtual image only depends on and its derivatives up to order 2. It is consistent (regarding the considered orders of the derivatives of ) with the feature detected in the intensity image, where , with given in (32).
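Following this remark, an ASG-like rendering feature can be sketched as the per-pixel Frobenius norm of the Jacobian of the normal map; the omitted constant factor and the finite-difference scheme are our assumptions, not the exact rendering of [10].

```python
import numpy as np

def asg_like_feature(normal_map):
    """Per-pixel Frobenius norm of the Jacobian of the normal map
    (an (H, W, 3) array of unit normals rendered from the 3D model), which is,
    up to a constant factor omitted here, the closed-form ASG bound."""
    dn_dy, dn_dx = np.gradient(normal_map.astype(np.float64), axis=(0, 1))
    return np.sqrt((dn_dx ** 2 + dn_dy ** 2).sum(axis=-1))
```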

V-B Multi-Curvilinear Saliency (MCS)

Multi-scale analysis helps to detect important structures as well as small details. In consequence, in this paper, we compute the curvilinear saliency images in a multi-scale space. To build the scale pyramid, an edge-preserving smoothing approach, the anisotropic diffusion filter [47], is used in order to avoid oversmoothing. In fact, this filter tries to separate the structural components (i.e., sharp edges) from the other components (i.e., textures) by preserving the strongest edges in the image.

Contrary to depth images, which represent textureless 3D shapes, intensity images combine shape and texture components. Consequently, the curvilinear saliency (CS) estimated from intensity images is affected by textured regions. Our idea is to put forward the assumption that multi-scale analysis can discriminate between keypoints due to shape (those with a high CS value in the image) and keypoints due to texture. At a coarse level, the detected edges are reliable but poorly localized, and small details are missed. At a fine level, details are preserved, but detection suffers greatly from clutter in textured regions. In addition, the CS values of small details and textures are high at the fine levels, whereas these values become lower at the coarser levels. To combine the strengths of each scale, the CS value of each pixel is analyzed over scales. If this value is higher than a threshold at all scales, the maximum curvilinear saliency value of this pixel over all scales is kept. This threshold is a function of the number of smoothed images, : when is small, is large, and vice versa. However, if the CS value is lower than at one level, the pixel is considered to belong to a texture (or a small detail) and is removed from the final multi-scale curvilinear saliency image. Adding this multi-scale step should help to reduce the impact of texture on the point of interest detection. However, in the next section, we propose to use the principle of focus map estimation in order to increase the robustness to the background and to the presence of texture.
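A minimal sketch of this per-pixel fusion rule is given below; it uses Gaussian smoothing in place of the anisotropic diffusion filter of [47] and a fixed threshold instead of one depending on the number of scales, so it is only indicative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multi_scale_cs(image, cs_fn, n_scales=4, base_sigma=1.0, threshold=0.05):
    """Per-pixel fusion over scales: a pixel is kept only if its CS value
    exceeds `threshold` at every scale (otherwise it is treated as texture or
    a small detail); kept pixels receive their maximum CS value over scales.
    `cs_fn` computes a CS map from a 2D image."""
    img = image.astype(np.float64)
    stack = [cs_fn(img)]
    for k in range(1, n_scales):
        stack.append(cs_fn(gaussian_filter(img, base_sigma * k)))
    stack = np.stack(stack, axis=0)
    keep = np.all(stack > threshold, axis=0)
    return np.where(keep, stack.max(axis=0), 0.0)
```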

VI Robustness to background and to texture

Before introducing the proposed improvement, we briefly present existing works about texture detection and, in particular, about focus curve estimation.

VI-A Extraction of texture: state of the art

Various methods, such as [57, 58, 59, 60], have been proposed for extracting the texture from a natural image. In these approaches, a given image is separated into two components while preserving edges. In [57], Aujol et al. proposed a variational model based on total-variation (TV) energy for extracting the structural part. In [58], the authors proposed an algorithm in the field of scale-space theory: a rolling guidance method based on an associated depth image that automatically refines the scale of the filtering in order to preserve edges. A structure-preserving image smoothing approach is introduced in [59]. This method locally analyzes the second-order statistics of a patch around each pixel, using a 7-dimensional feature vector that contains intensity, orientation and pixel coordinates. Finally, under the condition that the images contain smooth illumination changes and Lambertian surfaces, [60] proposed an intrinsic image decomposition model which explicitly determines a separate texture layer, as well as the shading and reflectance layers; the method is based on surface normal vectors generated from an RGB-D image. All these works first smooth the intensity image as a pre-processing stage and then extract the shape from that image relying on prior knowledge. To sum up, most of these structure-texture decomposition methods are analogous to the classical low-pass/high-pass filter decomposition of signal processing. However, even if it is correct to consider that the structure part of an image contains strong edges, the texture can also contain medium and high frequencies, so it can only be partially removed. Another possibility is to consider focusness, related to the degree of focus.

Usually, focusness is defined as inversely proportional to the degree of blur (blurriness) [52]. It is a very valuable tool for depth recovery [50], but also for blur magnification or image quality assessment. Blur is usually measured in regions containing edges, since edges appear in images as blurred luminance transitions of unknown blur scale [43]. The blur estimate can then be propagated to the rest of the image. Since blur has many different causes, this task is challenging and many methods for focus map estimation have been proposed in the literature. In [43], the authors classify blur as focal blur, induced by finite depth of field, or as penumbra or shading blur, and then estimate the blur scale. In the matting method of [50], the blur ratio for every pixel corresponds to the ratio between the gradients of the input image and of the re-blurred image. In [55], the authors use K-nearest-neighbor (KNN) matting interpolation under the assumption that depth is locally uniform. However, blur can also appear around edges caused by shadows, and glossy highlights can also produce errors in focus estimation. To remove errors induced by these other sources of blur, [50] used cross bilateral filtering and an estimation of sharpness bias, and [52] used a multi-scale scheme. In the field of saliency detection, another way to estimate the amount of blur consists in computing the defocus blur between two Difference-of-Gaussian (DoG) images at multiple scale levels [52]; an optimization algorithm then minimizes the difference between the blurriness of a pixel and the weighted average of the blurriness of its neighbors. In a Markov Random Field (MRF) formulation, a local contrast prior based on comparing local contrast and local gradient is also introduced in [51]. In [53], the authors propose to use the ratio between principal components, that is, a weighted mixture of the spectral components, where the weights are proportional to the energy in each spectral component. Some algorithms also use the analysis of the localized Fourier (Gabor filtering) spectrum [54]. In addition, smoothness constraints and image color edge information are taken into account to generate a defocus map that preserves discontinuities at the transitions between objects.

Following all these aforementioned approaches, we observe that most existing algorithms [50, 52, 53] measure the blur amount using the ratio between edge responses at two different scale levels (i.e., the original image and a re-blurred image). Consequently, we propose to use the ratio between two curvilinear saliency images, which contain robust edges at different scales, to determine the blur amount, based on the method developed in [50]. For the multi-scale aspect, our approach is inspired by the principles explained in [52].

VI-B Removing background with focus curves: state of the art

Based on the mapping between the depth of a point light source and the focus level of its image, Shape From Defocus (SFD) approaches recover the 3D shape of a scene from focused images that represent the focus level of each point in the scene [48]. We can also notice that focus (defocus) maps can be used as an alternative to depth maps, as in existing Adobe tools [50, 52, 54]. Consequently, it seems interesting to introduce what we call the detection of "focus curves", which capture blurriness in images. More precisely, focus curves mean that we estimate the scale of blur at the curvilinear saliency features of the original image, and we suppose that these features are only related to discontinuities.

Focal blur occurs when a point is out of focus, as illustrated in Fig. 5. When the point is at the focus distance, , from the lens, all the rays from it converge to a single sharp sensor point. Otherwise, when , these rays generate a blurred region in the sensor area. The blur pattern generated in this way is called the circle of confusion (CoC), whose diameter is denoted .

Fig. 5: A thin lens model for image blur as proposed in [50].
Fig. 6: Overview of our blur estimation approach: here, and are respectively the convolution and the gradient operators, and is the standard deviation of a re-blurring Gaussian function. The black dashed line denotes the edge location, as proposed in [50].

In [49, 50, 52], the defocus blur can be modeled as a convolution of a sharp image with the point spread function (PSF) as shown in Fig. 6. The PSF is usually approximated by a Gaussian function , where the standard deviation measures the blur amount and is proportional to the diameter of the CoC:

where are the focus distance, the defocus distance and the focal length, respectively, as shown in Fig. 5. A blurred edge is then given by

(38)

where is an ideal edge and is the step function. The terms and correspond to the amplitude and the offset of the edge, respectively. Note that the edge is located at .

In [50], the blur estimation method was described for the 1D case. The gradient of the re-blurred edge is:

(39)

where is the standard deviation of the re-blur Gaussian kernel. Thus, the gradient magnitude ratio between the original and the re-blurred edges is:

(40)

It can be proved that the ratio is maximum at the edge location () and the maximum value is given by:

(41)

Finally, given the maximum value at the edge locations, the unknown blur amount can be calculated using:

(42)
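A compact sketch of this edge-based blur estimate, in the spirit of [50]; the re-blurring scale, the clipping and the dense evaluation are our implementation choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_at_edges(image, sigma0=1.0, eps=1e-6):
    """Blur amount following the re-blurring idea of [50]: the ratio
    R = |grad i| / |grad i_reblurred| is maximal at an edge, where
    R = sqrt((sigma^2 + sigma0^2) / sigma^2), hence sigma = sigma0 / sqrt(R^2 - 1).
    Evaluated densely here; it is only meaningful near strong edges."""
    img = image.astype(np.float64)
    gy, gx = np.gradient(img)
    grad = np.hypot(gx, gy)
    ry, rx = np.gradient(gaussian_filter(img, sigma0))
    grad_rb = np.hypot(rx, ry)
    ratio = np.clip(grad / (grad_rb + eps), 1.0 + eps, None)
    return sigma0 / np.sqrt(ratio ** 2 - 1.0)
```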

VI-C Focus curves based on Curvilinear Saliency – Multi Focus Curves (MFC)

We suppose that using focus can help to remove the background and that using multi-scale analysis can help to reduce the influence of texture, in the same way as in section V-B. Thus, we propose to use the curvilinear saliency computation instead of the edge response to estimate the focus curves of an input image. In addition, we propose to estimate focus curves at multiple scales rather than at one scale, as proposed in [50], to obtain scale-invariant focus salient curves, and we combine the information obtained from the different blurring scales.

Assume the original pixel in an image is blurred as . Then, to obtain the curvilinear saliency, the structure tensor is calculated as:

(43)

If the Hessian matrix is expressed with eigenvectors and eigenvalues, we obtain:

(44)
(45)

The curvilinear saliency can be described as:

(46)

In particular, the curvilinear saliency can be directly computed as:

(47)

The re-blurred curvilinear saliency image, denoted , at multiple scales can then be defined as:

(48)

where is the number of scales.

Consequently, the ratio between the original and re-blurred curvilinear saliency is:

(49)

Within the neighborhood of a pixel, the response reaches its maximum when and , thus:

(50)

Finally, given the maximum value at each scale level, the unknown blur amount can be calculated using

(51)

For scales, we compute focus curve maps by using the ratio between the curvilinear saliency of the first level (i.e., the original image) and that of the next scale levels. Following the same remarks as in section V-B, we define Multi Focus Curves (MFC) as the fusion of all the focus curves into one map, keeping only the pixels that have a high focus value at all the scales.
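A minimal sketch of this fusion, reusing the blur formula above once per re-blurring scale; the focusness normalisation and the threshold are illustrative assumptions, not the exact fusion rule of the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multi_focus_curves(image, cs_fn, sigmas=(1.0, 2.0, 4.0),
                       focus_threshold=0.5, eps=1e-6):
    """Fuse focus curves over re-blurring scales: for each scale, the ratio
    between the CS map of the original image and that of the re-blurred image
    gives a blur estimate; a pixel is kept only if its focusness (inverse
    blur, normalised here in an illustrative way) is high at every scale."""
    img = image.astype(np.float64)
    cs0 = cs_fn(img)
    focus_stack = []
    for s in sigmas:
        cs_blur = cs_fn(gaussian_filter(img, s))
        ratio = np.clip(cs0 / (cs_blur + eps), 1.0 + eps, None)
        blur = s / np.sqrt(ratio ** 2 - 1.0)
        focus_stack.append(1.0 / (1.0 + blur))   # illustrative focusness measure
    focus_stack = np.stack(focus_stack, axis=0)
    keep = np.all(focus_stack > focus_threshold, axis=0)
    return np.where(keep, cs0, 0.0)
```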