A Computational Model for Amodal Completion

Abstract

This paper presents a computational model to recover the most likely interpretation of the 3D scene structure from a planar image, where some objects may occlude others. The estimated scene interpretation is obtained by integrating global and local cues, and it provides both the complete disoccluded objects that form the scene and their ordering according to depth. Our method first computes several distal scenes that are compatible with the proximal planar image. To compute these hypothesized scenes, we propose a perceptually inspired object disocclusion method, which works by minimizing Euler’s elastica and by incorporating the relatability of partially occluded contours and the convexity of the disoccluded objects. Then, to estimate the preferred scene, we rely on a Bayesian model and define probabilities that take into account the global complexity of the objects in the hypothesized scenes as well as the effort of bringing these objects into their relative positions in the planar image, which is also measured by an elastica-based quantity. The model is illustrated with numerical experiments on both synthetic and real images, showing its ability to reconstruct the occluded objects and the preferred perceptual order among them. We also present results on images of the Berkeley dataset with provided figure-ground ground-truth labeling.

1 Introduction

Visual completion is a pervasive process in our daily life that works by hallucinating contours and surfaces in the scene when there is no physical magnitude for them. Whenever we look at an image, our brain unconsciously reconstructs the 3D scene by completing partially occluded objects while inferring their relative depth order within the scene (see Figure ?). In Figure ?, for instance, our brain prefers to interpret the scene as four disks partially occluded by four rectangles instead of, e.g., the more straightforward description of eight quarters of a disk and four rectangles fitting together.

In this paper, we are interested in computationally modeling this perceptual phenomenon, recovering what the brain infers about the structure and the relative depth of the objects composing the scene from a planar image. To simplify the analysis of our approach, we focus on scenes where objects appear at two different depths, some occluding the others. The current approach can handle scenes with both partially occluded and fully visible objects. Our contribution is twofold: firstly, we propose a computational method relying on perceptual findings related to amodal visual completion to compute the disoccluded objects that form the possible 3D interpretations or configurations that arise from a planar image; and secondly, we propose a Bayesian probabilistic model that chooses, among these possible interpretations of a planar image, the most plausible one, thereby accounting for the human experience of visual completion. The disocclusion method works by minimizing Euler’s elastica and incorporates the concepts of relatability of the occluded contours and convexity of the disoccluded objects. Roughly speaking, two contours are relatable if they can be connected with a smooth contour without inflection points [28] (see Figure ?). An equivalent and more precise definition is given in Section 3 (see Def. ?), but let us note here that, under this property, two contours can be relatable no matter how far apart their corresponding end points are. Once the objects composing the scene are disoccluded, we follow a Bayesian approach and give definitions for the prior probability and the likelihood, measured, respectively, by the object complexities and an elastica-based quantity. As a consequence, our probability model takes into account the shape of the objects in the hypothesized scenes as well as the effort of bringing these objects into their relative positions in the visual image.

The structure of the paper is as follows. In Section 2 we review the related work and the fundamentals of visual completion. Section 3 presents the proposed approach: in particular, Section 3.1 presents the object disocclusion method, while Section 3.2 details the probabilistic model. Section 4 explains the numerical algorithm, while Section 5 provides experimental results. Finally, we present our conclusions in Section 6.

2 Related research

The visual completion phenomenon has been intensively investigated during the past fifty years and is still an active area of research [23]. It is now widely acknowledged that occlusion patterns evoke both local and global completion processes, but how the final perceptual outcome is reached is still not well understood. Local completion has been related to the good continuation of visible contours [57], while global completion is driven by the simplicity principle [30]: the assumption that the visual system favors interpretations characterized by phenomenal simplicity, such as symmetry, repetition, familiarity or context properties [48], and regularity. This principle typically leads toward the simplest completed shape, even though the good continuation principle may be violated. Figure ? shows an example where two different amodal completions occur depending on whether a global cue such as symmetry is incorporated or only more local cues are used, while in Figure ? both interpretations coincide. Some authors (e.g., [40] and references therein) have noticed that features favoring completion through good continuation are read out more quickly (in the very first second) than features favoring completion through symmetry (which are incorporated in the following 9 seconds). The incorporation of different cues was also studied by Rubin [46], who showed experimentally that local and global occlusion cues affect the perception of amodal completion at different stages of visual processing. As for global cues, the author focused on relatability and surface similarity, cues that seem to be used instantaneously at the first stages of occlusion perception. Relatability (see Figure ?) was first introduced by Kellman and Shipley [28], who noticed that it is a necessary global condition for completion to occur. Then, for the perception of amodal completion, Rubin proposed that the detection of local cues such as T-junctions (see Figure 1) generates a local pattern of activation, which launches a process of contour propagation that is either enhanced or stopped depending on whether or not other global cues such as relatability or surface similarity hold.

Figure 1: Example of a T-junction arising in presence of partial occlusion.

The evidence that the visual system generates multiple interpretations, and that visual completion is the result of a competition between them, was discussed by van Lier et al. [55]. Figure ? shows that global and local processes may sometimes diverge; since it is not always the same process that prevails, a theory based on either local or global principles alone cannot hold. Later, van Lier et al. [54] proposed an integrative model of global and local aspects of occlusion. In the present work, the global cues that we use are relatability and convexity (which are considered instantaneously at the first stages of occlusion perception). As discussed in the following section, we leave the incorporation of other global properties, such as global self-similarities, for future work.

In computer vision, the computational translation of the visual completion phenomenon is commonly referred to as disocclusion or inpainting. A pioneering contribution to the recovery of image plane geometry was made by Nitzberg, Mumford and Shiota [43]. The authors proposed a variational model for segmenting the image into objects ordered according to their depth in the scene, providing the so-called 2.1D sketch of the image. The minimization of the functional should be able to find the occluding and the occluded objects, while finding the occluded boundaries. The energy functional is defined in terms of region simplification and completion of occluded contours. Contour completion is achieved by linking signatures of occlusion, the T-junctions (see Figure 1), with Euler’s elastica, so that the completion tends to respect the principle of good continuation [24]. Despite its theoretical importance, the complexity of minimizing this energy keeps the approach far from practical applications. On the other hand, it was one of the main sources of inspiration for the first inpainting algorithms. A first formulation was proposed by Masnou and Morel [36], who interpolated the data into the incomplete region by minimizing an energy functional based on the elastica. A similar approach was followed in [3] and in the work of Chan and Shen [9]. Those methods belong to the so-called geometry-oriented methods, where images are modeled as functions with some degree of smoothness, expressed, for instance, in terms of the curvature of the level lines [36] or the total variation of the image [8]. Binary inpainting tools for images are also used to disocclude shapes and can thus be considered geometry-oriented methods. Binary inpainting can be based on diffusion processes followed by thresholding, known as threshold dynamics (e.g. [38]). Threshold dynamics interpolations also usually minimize a geometric functional, based on the length, area, or curvature of the shape contours [38].

This work focuses on a computational model that, given an image of a 3D scene, automatically outputs the interpretation of the scene that human perception prefers, in terms of the depth configuration of the scene objects together with their completion in case of occlusions. A related and inspiring work in the literature is the proposal of van Lier et al. [54]. They proposed to choose the preferred scene interpretation based on the minimum complexity or description code, taking into account local and global aspects of occlusion. Their model assumes that the most likely interpretation is the one that minimizes the sum of the complexities of three components of the visual pattern: (i) the internal structure, related to each of the visible shapes separately, (ii) the external structure, related to the positional relation between these shapes, and (iii) the virtual structure, related to the occluded parts of the shapes. The perceptual complexity of each of these three components is expressed in terms of structural information theory (SIT) [31], a formal coding model that encodes complexity in terms of descriptive parameters. However, van Lier et al. do not automatically complete the occluded objects, and the complexities are manually estimated from line drawings; thus their approach cannot be directly applied to images in a computer vision task. The same authors noticed in [52] that the global minimum principle can be cast in a Bayesian framework (see [29] and references therein) by properly defining prior and conditional probabilities. In this paper, based on a Bayesian framework, we propose a fully automatic method that can be applied to any image decomposed into shapes.

3 The model

Our model is grounded in two elements: a disocclusion method that computes the objects composing the different potential scenes compatible with the given planar image, and a probabilistic model that quantitatively justifies which scene configuration is the preferred one. As we are considering two-depth images, there are three possible interpretations or hypotheses of the real 3D scene, namely: object A occluding object B, B occluding A, or A and B fitting together forming a mosaic (see Figure ?). We will denote them by , , and . Let us remark that sometimes the objects in the third interpretation (i.e., A and B fitting together) coincide with those of one of the others, both perceptually and using our algorithm (an example is shown in Figure ?), or the objects in all three hypotheses may even coincide (an example is shown in Table 3). Even if the objects forming the scene coincide in different hypotheses, the depth ordering is not the same in each hypothesis. In Section 5 we will provide some experiments analyzing this phenomenon, which is related to the well-known optical illusion of relative depth perception of the objects (see Figure ?). In this work, when the mosaic hypothesis coincides with one of the other hypotheses, we assume that both objects appear at the same depth (see Figure ?), and the associated probabilities then decide which hypothesis has the highest likelihood. In the perception community, the observed image is often called the proximal stimulus (e.g., the left image in Figure ?), and each of the hypothesized interpretations is called the distal stimulus.

Our method first computes several distal interpretations of the scene which are compatible with the proximal planar image. Rubin [46] studied the role of T-junctions, relatability and surface similarity in the amodal completion phenomenon and in illusory contour perception. The author proposed that T-junctions, being a local cue for occlusion, are used to launch the completion process when contours are relatable [28]. Then, the Gestalt law of good continuation plays an important role. This motivates us to use Euler’s elastica in order to smoothly continue the contours. Let us recall that, given two T-junctions at points and , with tangents and to the respective terminating stems (also called T-stems, see Figure 1), Euler solved in 1744 [41] the problem of joining them with a smooth continuation curve minimizing

where and the minimum is taken among all the curves joining and with tangents and , respectively, denotes the curvature of , its arc length, and is a positive constant. The parameter plays a geometric role by setting the expected underlying a priori regularity. With a larger value of the parameter, the energy favors completion with straight lines (minimal length). Otherwise, smooth curves of low curvature are favored even if their length increases. Figure ? illustrates the effect of the parameter: as it decreases, the disoccluded shape converges to a disk; in the limit case, the energy to be minimized is the Willmore energy (with the boundary constraints on the tangents). The elastica energy is not lower semicontinuous [4] and some relaxed versions have been proposed [4], which are compatible with Kanizsa’s amodal completion theory [26]. The elastica has frequently been used to solve different computer vision problems (e.g. [43] among others). In a recent work that proposes a computational method for modal completion [22], the elastica is a key ingredient for obtaining illusory contours. A method for both modal and amodal completion which uses geodesics in the group of rotations and translations was proposed in [11]. In this work, the elastica is used in two ways. We propose in Section 3.1 an elastica-based object disocclusion method which incorporates the relatability of partially occluded contours and the convexity of the disoccluded objects. The elastica is also used in Section 3.2 to select the most probable disoccluded scene.
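To illustrate how the length and curvature terms trade off, the energy can be approximated on a polyline. The sketch below is not the authors' implementation: it assumes the common form of the elastica, namely the integral along the curve of a positive length weight plus the squared curvature, and the names `elastica_energy` and `length_weight` are hypothetical.

```python
import numpy as np

def elastica_energy(points, length_weight):
    """Discrete elastica of an open polyline (assumed form: sum of
    length_weight * length + squared curvature * local length)."""
    pts = np.asarray(points, dtype=float)
    edges = np.diff(pts, axis=0)                 # consecutive edge vectors
    seg_len = np.linalg.norm(edges, axis=1)      # edge lengths
    # Signed turning angle at each interior vertex.
    cross = edges[:-1, 0] * edges[1:, 1] - edges[:-1, 1] * edges[1:, 0]
    dot = np.sum(edges[:-1] * edges[1:], axis=1)
    turning = np.arctan2(cross, dot)
    # Arc length attributed to each interior vertex.
    local_len = 0.5 * (seg_len[:-1] + seg_len[1:])
    # curvature ~ turning / local_len, so curvature^2 * local_len is:
    bending = np.sum(turning ** 2 / local_len)
    return length_weight * seg_len.sum() + bending

# A straight segment has no bending cost; a half circle of radius 2 does.
line = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]
theta = np.linspace(0.0, np.pi, 201)
arc = np.stack([2.0 * np.cos(theta), 2.0 * np.sin(theta)], axis=1)
print(elastica_energy(line, 1.0), elastica_energy(arc, 1.0))
```

For the straight segment only the weighted length remains, which is consistent with the remark that a larger weight favors straight completions.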

3.1 Elastica-based object disocclusion

For disoccluding the objects, we focus on the completion that takes place in the first time instants of observation [40] or when the local processes dominate due to limited regularities in the object or low saliency of the symmetry cues versus the good continuation [48]. In the perception community, this completion is usually called local completion. The disocclusion method we propose integrates global and local cues: Global cues such as relatability and convexity are incorporated in the initial step of our algorithm, followed by local ones such as smooth continuation.

We propose to disocclude partially occluded objects with a binary inpainting algorithm that simulates the minimization of the elastica. Disocclusion, also known as image completion or inpainting, is the recovery of missing or corrupted parts of an image in a given region so that the reconstructed image looks natural. Most available methods for inpainting can be divided into two groups: geometry-oriented [36] and texture-oriented methods [13]. How to combine methods of these two types is still an open question [6]. Since we are interested in recovering objects or shapes, we focus on geometry-oriented methods, where images are usually modeled as functions with some degree of smoothness, expressed, for instance, in terms of the curvature of the level lines or the total variation of the image. Taking advantage of this structure, these methods interpolate the inpainting domain by continuing the geometric structure of the image (its level lines or edges), usually as the solution of a (geometric) variational problem or by means of a partial differential equation.

In this paper, we are concerned solely with the shape of the objects. Thus, we work with segmented objects and perform a geometric inpainting of the binary images that represent them. More precisely, we disocclude each object in each hypothesis separately, taking the hypothesized occluding object as the inpainting mask. The object is automatically completed in such a way that its boundary minimizes a relaxed version of the elastica. To that end, the object to be completed is represented as a binary image (given by the object segmentation) and its completion is performed through a threshold dynamics algorithm, which consists mainly of a diffusion process followed by a thresholding. In our case, the minimization algorithm iteratively alternates one step of the Grzibovskis-Heintz scheme [19], which decreases the curvature term of the energy, one step of the standard Merriman-Bence-Osher scheme [38], which decreases the length term, and a thresholding step, as proposed by Esedoglu et al. in [47]. We present the pseudo-code and more details in Algorithm ?. Figure ? shows an example illustrating how the parameter affects the disoccluded shape. When the parameter is large, more weight is given to the length of the curve and straight lines are favored. As it decreases, the disoccluded shape converges to a disk, avoiding singularities of the curvature even at the cost of a greater length. On the other hand, the parameter needs to be adapted to the resolution of the proximal stimulus, which translates into a smaller or larger curvature of curved boundaries, in order to obtain the same underlying shape regularity. An example is shown in Figure ?: circles with a larger radius need a larger value of the parameter to obtain the same regularity of the disoccluded shape. The reason is the following: the curvature of a smooth plane curve is defined as the inverse of the radius of the osculating circle (the unique circle that most closely approximates the curve near the point). Therefore, there is a relationship between the numerical curvature of the disoccluded objects and the a priori regularity imposed through the parameter: the larger the parameter, the larger the expected radius of the osculating circle.
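The "diffusion followed by thresholding" idea can be sketched in a few lines. The example below is a simplified illustration only: it implements just a Merriman-Bence-Osher style loop (diffuse the binary image, threshold at one half, restore the known pixels) and omits the Grzibovskis-Heintz curvature step of the full scheme. The function names, the explicit heat step used in place of a Gaussian convolution, and the default step sizes and iteration counts are all assumptions.

```python
import numpy as np

def heat_step(u, dt):
    """One explicit heat-equation step (5-point Laplacian, replicated borders).
    Stable for dt <= 0.25."""
    p = np.pad(u, 1, mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u
    return u + dt * lap

def mbo_inpaint(shape, mask, n_iter=200, n_diff=20, dt=0.2):
    """Complete a boolean `shape` inside the boolean inpainting `mask`.
    Pixels outside the mask are treated as known and never change."""
    known = ~mask
    u = shape.astype(float)               # the values inside the mask are the initialization
    for _ in range(n_iter):
        v = u.copy()
        for _ in range(n_diff):           # diffusion
            v = heat_step(v, dt)
        new = (v >= 0.5).astype(float)    # thresholding
        new[known] = shape[known]         # keep the visible part of the object
        if np.array_equal(new, u):        # stationary: stop early
            break
        u = new
    return u.astype(bool)

# Toy case: a one-pixel occluder in the middle of a solid square gets filled.
shape = np.zeros((21, 21), dtype=bool)
shape[3:18, 3:18] = True
mask = np.zeros_like(shape)
mask[10, 10] = True
shape[10, 10] = False
completed = mbo_inpaint(shape, mask)
```

Because the result depends on the values initially placed inside the mask, a sketch like this behaves differently for different initializations, which is the motivation for a perceptually informed starting point.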

In Section 3.2 we define prior probabilities and likelihoods which take into account global and local properties of the shape. As a consequence, the Bayesian approach is able to choose the most likely amodal completion not only among the different hypotheses on the scene configuration for a fixed disocclusion parameter (as shown in Section 3.2), but also among several disocclusions associated with different values of the parameter, and therefore to integrate some global completion properties such as symmetry or repetitions. For instance, Figure ? presents the certainty of the different hypotheses, as defined in Section 3.2, for different values of the parameter. However, for the experimental results in Section 5, where the non-occluded parts of the shapes do not have constant curvature, the parameter has been fixed; an efficient method to compute its best value is beyond the scope of the present work.



Initialization of the inpainting mask

Since the elastica energy is not convex, the inpainting result depends on the initial condition inside the inpainting mask. Let us illustrate it with a simple example. In Figure ? we show the inpainting results (shown in the second row) obtained by minimizing the elastica with different initializations, namely, initializing the mask with white, black, random (black and white chosen randomly from a uniform distribution) or with our proposal, which is explained in the remainder of this section. Notice how the proposed initialization gives a better result (according to the Gestalt laws of perception) and produces a completion that maintains the tangents at the endpoints of the disoccluded boundary.


In order to automatically compute an initialization of the inpainting problem sufficiently close to what humans perceive as disoccluded objects by amodal completion, we incorporate perceptual cues such as relatability of object contours [28] and convexity of the disoccluded objects.

The notion of relatability (see Figure ?) was introduced by Kellman and Shipley [28] in an attempt to define the conditions under which visual completion occurs. Let us recall the definition of relatability.

In [50], the authors showed that this definition is equivalent to the existence of a smooth contour without inflection points connecting and , and that the interpolating curve does not turn through a total angle of more than .
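A relatability check of this kind can be sketched computationally. The version below relies on a common geometric reading of the Kellman and Shipley criterion, which is an assumption here and not necessarily the exact definition used by the authors: the linear extensions of the two contours must meet, and a curve going from one to the other must not turn through more than a right angle. The function `relatable` and its argument convention (each tangent points from the visible contour into the occluder) are hypothetical.

```python
import numpy as np

def relatable(p1, t1, p2, t2, tol=1e-9):
    """p1, p2: contour end-points; t1, t2: tangents pointing into the occluder."""
    p1, t1, p2, t2 = (np.asarray(v, dtype=float) for v in (p1, t1, p2, t2))
    t1 = t1 / np.linalg.norm(t1)
    t2 = t2 / np.linalg.norm(t2)
    # A curve leaves p1 along t1 and arrives at p2 along -t2;
    # reject it if that requires turning by more than 90 degrees.
    if np.dot(t1, -t2) < -tol:
        return False
    d = p2 - p1
    cross = t1[0] * t2[1] - t1[1] * t2[0]
    if abs(cross) < tol:
        # Parallel tangents: relatable only if collinear and facing each other.
        collinear = abs(t1[0] * d[1] - t1[1] * d[0]) < tol
        return bool(collinear and np.dot(t1, d) > 0)
    # Solve p1 + s*t1 = p2 + u*t2; both extensions must be traversed forward.
    s = (d[0] * t2[1] - d[1] * t2[0]) / cross
    u = (d[0] * t1[1] - d[1] * t1[0]) / cross
    return bool(s >= -tol and u >= -tol)

print(relatable((0, 0), (1, 0), (5, 0), (-1, 0)))   # collinear edges
print(relatable((0, 0), (1, 0), (3, 3), (0, -1)))   # right-angle corner
print(relatable((0, 0), (1, 0), (5, 2), (-1, 0)))   # parallel but offset
```

Note that the test never looks at the distance between the end-points, in line with the remark that contours can be relatable regardless of how far apart they are.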

Since (non-occluded) objects in the world tend to be convex [5], we favor the convexity of the disoccluded object by taking advantage of the following well-known property of convex sets.


The automatic initialization of the binary image inside the inpainting mask is illustrated in Figure ?. In practice, our algorithm considers all the end-points of the object contours (given in this case by the level lines) arriving at the inpainting mask, together with their tangents (illustrated in Figure ? by a line passing through them), and computes all the possible pairs of relatable contours (shown in Figures ? and ?). To compute these tangents we use the Line Segment Detector [56]. Then, for each pair of relatable contours and for each end-point with its tangent, we consider the two half-spaces bounded by the tangent line and assign a vote to the one in which the known object lies. Figure ? and Figure ? display the image gathering these votes in the inpainting mask, for the shapes shown in Figure ? and Figure ?, respectively (the shape to disocclude is shown in white and the inpainting mask in gray). Let us remark that, in order to better illustrate our perceptually inspired initialization, in Figure ? and Figure ? (respectively, in Figure ? and Figure ?) we only show the computed values inside the inpainting mask. Finally, we binarize the image containing the votes with a threshold based on a rank order filter of these votes. We sort the votes in increasing order and start with the threshold given by the value ranked at the th percentile. If no new connected components appear in the initialization with this threshold, we keep it. Otherwise, we decrease the threshold (taking the preceding ordered value) and repeat the process until no new connected components appear. Two different examples of this binary image are shown in Figure ? and ?; they are the initialization of the binary inpainting algorithm. Figure ? shows an example where the threshold on the votes corresponds to the th percentile, while in Figure ? the threshold was automatically decreased to the th percentile in order to obtain an initialization with a single connected component.
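The voting step described above can be sketched as follows. This is a simplified illustration under stated assumptions: each end-point is given as a hypothetical triple (end-point, tangent, a point known to belong to the object), coordinates are (row, column), the percentile value is a free parameter because the one used in the paper is not recoverable from the text, and the loop that lowers the threshold until no new connected components appear is omitted.

```python
import numpy as np

def half_plane_votes(mask, endpoints):
    """Accumulate, inside the boolean `mask`, one vote per end-point for the
    half-plane (bounded by the tangent line) that contains the known object."""
    rows, cols = np.indices(mask.shape)
    votes = np.zeros(mask.shape, dtype=int)
    for (pr, pc), (tr, tc), (qr, qc) in endpoints:
        # Signed side of every pixel with respect to the tangent line.
        side = tc * (rows - pr) - tr * (cols - pc)
        # Signed side of the reference object point.
        ref = tc * (qr - pr) - tr * (qc - pc)
        votes += (side * ref >= 0)
    votes[~mask] = 0
    return votes

def initial_guess(votes, mask, percentile):
    """Binarize the votes with a rank-order (percentile) threshold."""
    thr = np.percentile(votes[mask], percentile)
    return mask & (votes >= thr)

# Toy case: a horizontal bar (rows 3..6) hidden by a vertical occluder.
mask = np.zeros((10, 10), dtype=bool)
mask[:, 4:6] = True
endpoints = [((3, 4), (0, 1), (5, 2)),   # upper contour, object lies below it
             ((6, 4), (0, 1), (5, 2))]   # lower contour, object lies above it
votes = half_plane_votes(mask, endpoints)
init = initial_guess(votes, mask, percentile=90)
```

In this toy case the pixels that collect both votes are exactly the strip joining the two visible parts of the bar, which is the convex completion one would expect.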

3.2 Elastica-based probabilistic model

In this section, we follow a Bayesian approach [29] in order to choose, among all possible interpretations of the scene, the most plausible one. We propose definitions for the prior and the conditional probabilities which take into account the global complexity of the objects in the hypothesized scenes as well as the effort of bringing these objects into their relative positions in the visual image. As a consequence, the result of this probability model indicates that the simplest interpretation is the one that most likely results from the amodal completion process, as was also suggested by [52].

Inspired by the work of van Lier et al. [54], our probabilistic model takes into account the complexity of the objects. Each hypothesized scene is formed both by completely visible objects and by disoccluded objects (computed using the method described in Section 3.1); their respective global complexities are used to define the prior probability of the hypothesized scene under analysis. The likelihood, i.e. the conditional probability of the given image (proximal stimulus) given a certain hypothesis (distal stimulus), is defined through an Euler’s elastica-based quantity that measures two attributes: the effort of bringing these objects into their relative positions given in the image and the smoothness of the disoccluded boundaries. Our probabilistic model provides a formalization that allows the proposal of van Lier et al. [54] (based on manually estimating complexity from line drawings) to be verified computationally, directly on images, giving a probabilistic interpretation of the visual completion process.

We propose to predict and justify the preferred interpretation by maximizing the responsibility or a posteriori probability, given by Bayes’ rule as

over the hypothesized interpretations , where is the proximal stimulus or given image. As the quotient remains the same for all hypotheses in the maximization process, we propose to select the preferred hypothesis by

Given the underlying hypothesis and the proximal image , we define the conditional probability as

where and stand for common and disoccluded boundaries, respectively (see Figure ? for an example with two hypotheses) and is a normalization constant. Formula measures the responsibility that hypothesis takes for explaining the proximal stimulus as well as the deviation of from .

With the first integral we compute the difficulty of bringing the two objects together in order to get the perceived image, taking into account only the known boundary of the objects. For example, it is easier to obtain configuration ? than ?: in the first case only two points need to coincide, and we perceive the same image independently of which two points coincide, whereas in the other case a larger boundary needs to coincide in order to perceive exactly that configuration. The second integral takes into account the regularity of the occluded boundary of the shape to define the probability of obtaining a particular stimulus. For example, in Figure ? we can move the disk to many different positions behind the square and still obtain the same image we are observing, but in Figure ? the possible movements are more limited, as the perceived image would change drastically. Let us remark that, due to the way we disocclude the objects, the resulting disoccluded boundaries are always smooth; if we had different models of disocclusion, this term would help to distinguish among them (in addition to the prior term). For instance, with our disocclusion model based on the elastica we are not able to recover the occluded object in Figure ?(b) or the objects A in Figure ?. This probability distribution also appeared in Mumford [41] and Williams and Jacobs [59], who characterized the probability distribution of the shape of boundary completions based on the paths followed by a particle undergoing a stochastic motion, a directional random walk. It turns out that the elastica can be interpreted as the mode of the probability distribution underlying this stochastic process restricted to curves with prescribed boundary behavior (the maximum likelihood curve with which to reconstruct hidden contours).

The prior probabilities are defined as

where is a normalizing constant, and and are the (disoccluded) objects in the hypothesized interpretation . The factor denotes the complexity of the object or shape at depth . In the case that the object at one depth is formed by more than one connected component the complexity is computed separately for each connected component and their sum constitutes the complexity of . We use the definition of complexity of a shape defined by Chen and Sundaram in [10],

which takes into account global properties of the shape such as the global distance entropy (), the local angle entropy (), the perceptual smoothness (), and a measure of shape randomness (). The global distance is defined in [10] as the distance of boundary points to the centroid of the shape. The local angle is the angle formed by the two segments joining three consecutive boundary points. The perceptual smoothness is computed using the local angle (the closer to the angle , the smoother the shape). Finally, the shape randomness is the maximum difference between two random traces obtained from the two most distant points of the boundary. Therefore, the prior probability considers global properties such as shape contour symmetries and repetitions.
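As an illustration of one of these ingredients, the global distance entropy can be approximated by the Shannon entropy of a histogram of centroid distances. The sketch below is only indicative: the normalization by the maximum distance and the number of bins (`n_bins`) are assumptions, and [10] should be consulted for the exact definition.

```python
import numpy as np

def global_distance_entropy(boundary, n_bins=16):
    """Entropy (in bits) of the distribution of distances from the boundary
    points to the centroid of those points."""
    pts = np.asarray(boundary, dtype=float)
    dist = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    hist, _ = np.histogram(dist / dist.max(), bins=n_bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
ellipse = np.stack([3.0 * np.cos(theta), np.sin(theta)], axis=1)
print(global_distance_entropy(circle), global_distance_entropy(ellipse))
```

A circle concentrates all distances in a single bin and gets zero entropy, while the elongated ellipse spreads them over several bins, so it is scored as more complex by this term.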

Let us note that, with these definitions, our whole model for amodal completion is able to choose not only among the different hypotheses for a fixed disocclusion parameter, but also among several disocclusions associated with different values of the parameter, and therefore to take into account global completion properties such as symmetry or repetitions. Figure ? gives an example illustrating this computational ability, showing the probabilities associated with the different disocclusion results obtained for different values of the parameter.

Both normalization constants are defined, respectively, as the inverse of the maximum value, over all the hypotheses, of the elastica and of the object complexity.

Let us comment on the term in our definition . When visual completion occurs while propagating the stem, (e.g., hypothesis in Figure ?; also, hypothesis in Figure ?), the common boundaries between the objects are reduced to the T-junctions. In this case: and thus . Let us notice that in the distal stimulus, since we are considering closed objects, belongs to both objects. Therefore, in the hypothesis where the objects are interpreted as being fit-together (e.g., hypothesis in Figure ?; also, hypothesis in Figure ?), a disoccluded boundary appears which coincides with (i.e., ). Let us also comment on the effect of the regularity of . Figure ? presents three different proximal stimuli or images. The numerical computation of the term associated to each of the three images will decrease from left to right in the fit-together (or mosaic) interpretation, the same behavior applies to the complexity-related terms and . Therefore, the visual completion will become more and more evident and the interpretation of two complex pieces fitting together will become perceptually less favorable.

Let us finally remark that we are not considering all possible configurations [18] but only the ones favored by relatability, convexity, and good continuation. On the other hand, even if global cues such as symmetry or repetitions are taken into account in our probability model, we do not incorporate them in the disocclusion algorithm. In the future, we plan to integrate it with other disocclusion strategies (such as, e.g., exemplar-based methods [2] or [21]) that allow these global properties to be modeled and make it possible to obtain, e.g., the objects A in Figure ?.

4 Algorithm and implementation details

Algorithm ? shows the steps of the whole numerical algorithm. Let us detail it. Our algorithm needs a decomposition of the given image into objects and object parts, which are interpreted as projections of real 3D objects on the image plane. This decomposition can be obtained from the classical decomposition into level sets or bi-level sets, or by segmenting the image according to some criterion. In this paper, for the synthetic images, we use the decomposition into bi-level sets, which are defined as usual by , where is the image domain and is a finite strictly increasing sequence; for the real images, we use the segmented shapes from the Berkeley segmentation dataset [34]. In this way we obtain the two objects that appear in the image. From these two objects, the algorithm considers three hypotheses: the first object occluding the distal object that corresponds to the second proximal object, the second object occluding the distal object that corresponds to the first proximal object, and both objects fitting together. By applying the disocclusion method of Section 3.1 with each of the two objects in turn as the inpainting mask, we compute the two complete occlusion hypotheses. Then, to these two hypotheses, we always add the hypothesis of the mosaic interpretation (which is obtained when we do not apply the disocclusion algorithm). For each hypothesis we compute the prior and conditional probabilities from the definitions in Section 3.2. Finally, we compute the perceptually preferred hypothesis by (Equation 3).
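The final selection step can be sketched as a maximum a posteriori choice. In this minimal example the priors and likelihoods are made-up numbers and the hypothesis labels and the function `preferred_hypothesis` are hypothetical; it only shows that the normalizing evidence is common to all hypotheses and therefore does not affect which one is selected.

```python
def preferred_hypothesis(priors, likelihoods):
    """Return the hypothesis maximizing prior * likelihood, together with
    the normalized posterior of every hypothesis."""
    scores = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(scores.values())          # same constant for all hypotheses
    posteriors = {h: s / evidence for h, s in scores.items()}
    best = max(scores, key=scores.get)       # argmax is unchanged by the evidence
    return best, posteriors

# Illustrative values only.
priors = {"A occludes B": 0.5, "B occludes A": 0.3, "mosaic": 0.2}
likelihoods = {"A occludes B": 0.2, "B occludes A": 0.6, "mosaic": 0.1}
best, posteriors = preferred_hypothesis(priors, likelihoods)
print(best, posteriors)
```

Reporting the normalized posteriors alongside the winner is a design choice that makes it easy to see how close the competing interpretations are.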

In Algorithm ? we describe the threshold dynamics method we use for disocclusion, and in Algorithm ? we present the algorithm for computing the conditional probability .

Let us add some details regarding Algorithm ?. The Gaussian convolution has been computed using Lindeberg’s discrete scale-space method and its implementation described in [44]; that is, we use the fact that the Gaussian convolution is the solution of the heat equation for a diffusion time (set to in our experiments, to guarantee the prescribed upper and lower bounds depending on the curvature of the visible shape [38]), so we only need to discretize partial derivatives. We refer to [44] for more details on the discretization. Parameter needs to be close to but less than [47].
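The equivalence between Gaussian convolution and heat diffusion can be illustrated with a minimal sketch. It is not the discretization of [44]: it assumes a plain explicit Euler scheme with the 5-point Laplacian on a unit pixel grid and replicated borders, and the names `heat_step` and `diffuse` as well as the step size are hypothetical choices for the example.

```python
import math


def heat_step(u, dt):
    """One explicit Euler step of u_t = Laplacian(u) on a unit pixel grid.

    Uses the 5-point Laplacian with replicated borders (zero flux).
    The scheme is stable for dt <= 0.25.
    """
    rows, cols = len(u), len(u[0])
    out = [row[:] for row in u]
    for i in range(rows):
        for j in range(cols):
            up = u[max(i - 1, 0)][j]
            down = u[min(i + 1, rows - 1)][j]
            left = u[i][max(j - 1, 0)]
            right = u[i][min(j + 1, cols - 1)]
            out[i][j] = u[i][j] + dt * (up + down + left + right - 4.0 * u[i][j])
    return out


def diffuse(u, t, dt=0.2):
    """Approximate the heat flow of u up to diffusion time t."""
    if t <= 0 or not 0 < dt <= 0.25:
        raise ValueError("need t > 0 and 0 < dt <= 0.25")
    steps = math.ceil(t / dt)
    dt = t / steps  # shrink the step so that steps * dt == t
    for _ in range(steps):
        u = heat_step(u, dt)
    return u


# Characteristic function of a single pixel, diffused up to time 1.
u0 = [[0.0] * 5 for _ in range(5)]
u0[2][2] = 1.0
u1 = diffuse(u0, t=1.0)
```

With zero-flux borders the total mass of the image is preserved while its maximum decreases, which is the smoothing behavior expected from a Gaussian convolution.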

Regarding Algorithm ?, the discrete boundaries of each shape are computed as external boundaries and using 4-connectivity. On the other hand, in order to compute the curvature of a discrete curve (or boundary) , we use the method of [49] and compute

where is the signed distance function to the boundary . We use forward derivatives to compute the gradient, backward derivatives for the divergence. The discrete signed distance function is computed using the algorithm explained in [37].
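The curvature computation can be sketched as follows. The sketch assumes that the curvature is taken as the divergence of the normalized gradient of the signed distance function (the standard level-set expression), with the distance negative inside the shape; the sign convention, the function name `curvature_from_sdf` and the small constant `eps` are assumptions of this example. For simplicity the signed distance of a disk is written analytically instead of using the algorithm of [37].

```python
import math


def curvature_from_sdf(d, eps=1e-12):
    """Return div(grad d / |grad d|) for a signed distance grid d.

    Forward differences are used for the gradient and backward
    differences for the divergence; the first row and column are left at 0.
    """
    rows, cols = len(d), len(d[0])
    nx = [[0.0] * cols for _ in range(rows)]
    ny = [[0.0] * cols for _ in range(rows)]
    # Normalized gradient with forward differences (zero past the last row/column).
    for i in range(rows):
        for j in range(cols):
            gx = d[i][j + 1] - d[i][j] if j + 1 < cols else 0.0
            gy = d[i + 1][j] - d[i][j] if i + 1 < rows else 0.0
            norm = math.sqrt(gx * gx + gy * gy) + eps
            nx[i][j] = gx / norm
            ny[i][j] = gy / norm
    # Divergence with backward differences.
    kappa = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            kappa[i][j] = (nx[i][j] - nx[i][j - 1]) + (ny[i][j] - ny[i - 1][j])
    return kappa


# Signed distance to a circle of radius 10 centered in a 41 x 41 grid.
size, center, radius = 41, 20, 10.0
d = [[math.hypot(j - center, i - center) - radius for j in range(size)]
     for i in range(size)]
kappa = curvature_from_sdf(d)
```

On the circle itself, for instance at `kappa[20][30]`, the value is close to the inverse of the radius, as expected for a disk.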

Finally, the prior probability is computed using with the complexity measure given by . We consider as boundary points for computing all the pixels that form the boundary of an object. In the case of an object formed by more than one connected component, we compute the complexity of every connected component, and the final complexity measure is the sum of the individual complexities. For details about how to compute and we refer to Section 3.2 and to [10].
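The additive treatment of objects with several connected components can be sketched as follows. The per-component measure used here, `boundary_pixel_count`, is only a placeholder and not the complexity measure of Section 3.2; the use of 4-connectivity for the components and all function names are assumptions of this example.

```python
from collections import deque

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connectivity


def connected_components(mask):
    """Return the 4-connected components of a binary mask as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for i in range(rows):
        for j in range(cols):
            if not mask[i][j] or seen[i][j]:
                continue
            # Breadth-first flood fill starting from an unvisited object pixel.
            queue, comp = deque([(i, j)]), []
            seen[i][j] = True
            while queue:
                r, c = queue.popleft()
                comp.append((r, c))
                for dr, dc in NEIGHBORS:
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and mask[rr][cc] and not seen[rr][cc]):
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            components.append(comp)
    return components


def boundary_pixel_count(component):
    """Placeholder complexity: pixels with at least one 4-neighbor outside."""
    pixels = set(component)
    return sum(
        1 for (r, c) in component
        if any((r + dr, c + dc) not in pixels for dr, dc in NEIGHBORS)
    )


def total_complexity(mask, component_complexity=boundary_pixel_count):
    """Sum the complexity of every connected component of the object."""
    return sum(component_complexity(comp) for comp in connected_components(mask))


# Object made of a 3 x 3 block and an isolated pixel.
mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
]
complexity = total_complexity(mask)
```

Any other per-component measure can be passed through the `component_complexity` argument without changing the summation.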

5 Experimental results

The proposed method has been tested with both synthetic and real images. Parameter , which sets the underlying a priori regularity (see comments on its role in Section 3.1), has been fixed to for all the experiments in order to keep the algorithm as general as possible. There are two exceptions, namely, Proximal 2 of Table 1 and Example 4 of Table 11, where is fixed to 1.2 and 1.7, respectively, due to the larger size of the circular shapes. As explained in Section 3.1, there is a relationship between the numerical curvature of the disoccluded objects and the a priori regularity imposed through the parameter : the larger the , the larger the expected radius of the osculating circle locally approximating the curve.

The experiments of this section are organized as follows. In Sections 5.1, 5.2 and 5.3 we present the experiments that agree with our perception, while in Section 5.4 we show and discuss the experiments that failed. The synthetic experiments, described in Section 5.1, are shown in Tables 1, 2 and 3, while the experiments on real images (described in Section 5.2) are shown in Tables 4, 5, 6 and 7. Table 5 shows our results on images of the Berkeley dataset with provided figure-ground ground-truth labeling [15]. Table 9 in Section 5.3 shows the ability of our method to also decide on (perceptually) fully visible objects over a background. Finally, Tables 10, 11 and 12 present the synthetic and real results where our method did not agree with human perception. Each row of each table shows a complete experiment.

Let us recall that our method assumes the proximal stimulus to be decomposed into objects and object parts (which can be interpreted as projections of real 3D objects on the image plane). Since the images in the synthetic experiments are formed by objects of a single uniform color, this already gives a segmentation, and we apply our algorithm directly. For the real experiments, we use a segmentation of the image. In particular, we have taken segmented images from the Berkeley segmentation dataset [34] and from [20].

5.1 Synthetic images

Tables 1, 2 and 3 show some experiments on synthetic images. Each row of each table shows a complete experiment. We first present the proximal image (piecewise constant) on the left, followed by the three hypotheses (each one separated by a gray box), together with the values and proportional to the conditional probability and the prior probability, respectively, and the probability value . Let us remark that we have normalized the probabilities in such a way that . The probability value of the preferred hypothesis is highlighted in boldface. For the first two hypotheses, and , we display the objects at depth 1 on the left, and the disoccluded objects (at depth 2) on the right. Let us recall that the objects at depth 1 are considered the inpainting mask for disoccluding the objects at depth 2. Finally, the last column is the hypothesis where the two objects fit together at the same depth.
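The way the displayed values combine into a preferred hypothesis can be sketched as follows. The sketch assumes, following Bayes' rule, that the posterior of each hypothesis is proportional to the product of its likelihood value and its prior value, and that the posteriors are normalized to sum to one. The function names and the numerical values are hypothetical and are not taken from the tables.

```python
def posterior(likelihoods, priors):
    """Normalize likelihood * prior so that the posteriors sum to one."""
    if len(likelihoods) != len(priors):
        raise ValueError("one likelihood and one prior per hypothesis are needed")
    products = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one hypothesis needs a positive weight")
    return [x / total for x in products]


def preferred_hypothesis(likelihoods, priors):
    """Return the index of the most probable hypothesis and all posteriors."""
    post = posterior(likelihoods, priors)
    best = max(range(len(post)), key=post.__getitem__)
    return best, post


# Hypothetical values for the three hypotheses of one experiment:
# index 0 and 1 are the two occlusion hypotheses, index 2 the mosaic one.
best, post = preferred_hypothesis(
    likelihoods=[0.6, 0.1, 0.3],
    priors=[0.5, 0.2, 0.3],
)
```

Because only the ratios between hypotheses matter, the likelihood and prior values need only be known up to a common constant within each experiment.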

Let us comment on the results in Tables 2 and 3. In Table 2, the third hypothesis is not shown because it coincides with due to the fact that the disocclusion algorithm does not change the objects being disoccluded. Besides, Table 3 shows a synthetic experiment where the three hypotheses coincide in terms of the obtained disoccluded objects: the disocclusion algorithm applied in the first two hypotheses does not change the objects and thus , and the posterior probability is the same for all three hypotheses. Let us remark that, even if the objects forming the scene coincide in different hypotheses, the depth ordering is not the same in each hypothesis. Let us single out and explain one example in Figure ?, where our method produces . However, as for depth order, is interpreted as two objects at the same depth (and having the real relative size which is observed in the proximal image) while can be interpreted as three quarters of a disk which is closer to the observer, plus a square which can be of bigger size but farther away from the three-quarters-of-a-disk shape and whose boundary partially coincides with part of the boundary of the three-quarters-of-a-disk shape. Notice that this situation is related to the well-known ambiguity in depth of some proximal stimuli, which sometimes causes optical illusions of relative depth perception such as those in the images displayed in Figure ?.

These experiments show that our method agrees with human perception. The perception literature acknowledges that, in a T-junction, the occluder is the surface on the T-head side while the surfaces on the T-stem side continue behind the occluder [46]. This phenomenon is validated by our method: the preferred hypothesis, highlighted in boldface, is the one that is obtained by continuation of the T-stems. Let us comment on the results corresponding to Proximal 6 and 7 of Table 1, which include quite similar shapes with equal occlusion signatures but different common boundaries among the shapes. In Proximal 7 and 8 (and also in Proximal 9 of Table 2), the local perception cue at the T-junctions indicates that there is an occluded disk which continues behind an incomplete square (the occluder). Our method is able to choose the corresponding preferred hypothesis, as shown by the probability values . In Proximal 12 of Table 2, according to the T-junctions we should prefer (as the method chooses), but because of symmetry most of us perceptually prefer . In this case, according to the prior, should be preferred (as symmetries are valued positively), but as the disoccluded and common boundaries are so large in , the value is much bigger, and is selected as preferred. Finally, let us comment on Proximal 4, which is a well-known and perceptually controversial example. The preferred hypothesis for Proximal 4 is the one that agrees with the T-junction cues and the one reported in [54] to be the most preferred by the subjects participating in their psychophysics experiments. However, in this case the posterior probabilities for and are quite close and, according to our personal experience (e.g., by incorporating our knowledge about the world and objects of similar shapes), some people prefer the interpretation according to . Both experiments, Proximal 4 and 12, are examples where hypotheses and correspond, respectively, to a local and a global completion of the occluded object. In both cases, our algorithm favors local completion, that is, a completion that produces good continuation instead of the global one, which produces a more symmetric object (notice that the local completion in both cases produces an object that is symmetric with respect to one axis). As commented in Section 2, both kinds of completion interplay, and the prevalence of one of them depends on the observation time [40] and on the saliency of good continuation versus symmetry in the completion [48].

Table 1: Synthetic experiments. Each row shows a different experiment: the original image (proximal stimulus) is shown on the left, followed by the three different hypotheses (each one separated by a gray box). For the first two hypotheses, for , we show: the object at depth 1 (left) and the disoccluded object at depth 2 (right). Notice that the object at depth 1 acts as a mask for disoccluding the object at depth 2. In the case of the third hypothesis, , both objects are considered to be at the same depth and completely visible in the original image (no disocclusion is applied). In the lower part of each hypothesis we show the values , (proportional, respectively, to the likelihood and prior probabilities), and the posterior probability . The probability value of the preferred hypothesis is highlighted in boldface.
Table 2: Synthetic experiments. Each row shows a different experiment: the original image (proximal stimulus) is shown on the left, followed by two different hypotheses (each one separated by a gray box). For each hypothesis we show: the object at depth 1 (left) and the disoccluded object at depth 2 (right). Notice that the object at depth 1 acts as a mask for disoccluding the object at depth 2. In the lower part of each hypothesis we show the values , (proportional, respectively, to the likelihood and prior probabilities), and the posterior probability . The probability value of the preferred hypothesis is highlighted in boldface. The third hypothesis, , where both objects are considered to be at the same depth and completely visible in the original image (no disocclusion is applied), is not shown here because it coincides with (due to the fact that the disocclusion algorithm does not change the objects being disoccluded in ). More details are given in the text.
Table 3: Synthetic experiment where the three hypotheses coincide (since the disocclusion applied in the first two hypotheses does not change the objects). Thus, we have , and the posterior probability is the same for all three hypotheses. As for depth order, is interpreted as two objects at the same depth (and having the real relative size which is observed in the proximal image) while can be interpreted as a gray square which is closer to the observer, plus a white rectangle which can be of bigger size but farther away from the square and whose boundary partially coincides with part of the boundary of the square. Finally, can be interpreted as a white rectangle which is closer to the observer, plus a gray square which can be of bigger size but farther away from the rectangle and whose boundary partially coincides with part of the boundary of the rectangle.

5.2 Real images

In this section we show some results on real images from the Berkeley dataset [34] and the dataset provided in [20]. For all experiments, we present the real image and the proximal stimulus, which is a segmentation of the real image (one of the segmentations provided in the databases), followed by the three hypotheses (each one separated by a gray box), together with the values and proportional to the conditional probability and the prior probability, respectively, and the probability value . The probability value of the preferred hypothesis is highlighted in boldface. For the first two hypotheses, and , we display the objects at depth 1 on the left, and the disoccluded objects (at depth 2) on the right. Let us recall that the objects at depth 1 are considered the inpainting mask for disoccluding the objects at depth 2. Finally, the last column is the hypothesis where the two objects fit together at the same depth.

We start by illustrating that our method is robust to different segmentations of the same image. Table 4 shows a real image with a bear holding a branch and two different segmentations (representing the proximal stimuli). Both segmentations are from the ground truth available in [34]. Segmentation 1 reflects that some flowers are partially occluding the bear and increasing the complexity of the bear shape; the flowers do not appear in Segmentation 2 and thus the bear shape has a lower complexity (its complexity is , while in the previous case, Segmentation 1, was ). Notice that the values are not comparable between the two experiments (only among different hypotheses within the same experiment) because they use a different normalizing constant (see Section 3.2 for further details). Finally, the most preferred interpretation of the image is the same for the two segmentations, i.e., a branch partially occluding a bear for both stimuli.

Table 4: Experiments with real images from [34]. The left-most image is a segmentation of the original image (shown in the first row). Both experiments correspond to the same image but consider different segmentations (both extracted from the ground truth segmentation available in [34]). The most preferred interpretation of the image coincides in both experiments, i.e., a branch partially occluding a bear. Notice that the values are not comparable between the two experiments (only among different hypotheses within the same experiment) because they use a different normalizing constant (see Section 3.2 for further details).

In Table 5 we present results on images of the Berkeley dataset for which a figure-ground ground truth labeled by humans is provided. Then, Table 6 shows experimental results on real images from [20] and Table 7 shows results on images from the Berkeley Segmentation database [34]. Each row shows a different experiment: the two left-most images are, respectively, the original image and a segmentation of it; they are followed by the three different hypotheses (each one separated by a gray box). For the images in Table 5, superimposed on the original image, we display the provided figure-ground ground truth [15] as a boundary in two colors, namely, black and white. The black side of the boundary indicates the object that is behind, while the white side indicates the frontal object.

Table 5: Experiments with real images from the Berkeley dataset. Each row shows a different experiment: the two left-most images are, respectively, the original image and a segmentation of it. They are followed by the three different hypotheses (each one separated by a gray box). The lines superimposed on the original image indicate the figure/ground ground-truth labels (from [15]) for each boundary in the segmentation: the white boundary indicates the figure side and the black one the ground side. For the first two hypotheses, for , we show: the object at depth 1 (left) and the disoccluded object at depth 2 (right). Notice that the object at depth 1 acts as a mask for disoccluding the object at depth 2. In the case of the third hypothesis, , both objects are considered to be at the same depth and completely visible in the original image (no disocclusion is applied). In the lower part of each hypothesis we show the values , (proportional, respectively, to the likelihood and prior probabilities), and the posterior probability . The probability value of the preferred hypothesis is highlighted in boldface.
Table 6: Experiments with real images from [20]. Each row shows a different experiment: the two left-most images are, respectively, the original image and a segmentation of it; they are followed by the three different hypotheses (each one separated by a gray box). More details on the results shown for each hypothesis are given in Table .
Table 7: Experiments with real images from [34]. Each row shows a different experiment: the two left-most images are, respectively, the original image and a segmentation of it; they are followed by the three different hypotheses (each one separated by a gray box). More details on the results shown for each hypothesis are given in Table .