# Learning an Interactive Segmentation System

###### Abstract

Many successful applications of computer vision to image or video manipulation are interactive by nature. However, parameters of such systems are often trained neglecting the user. Traditionally, interactive systems have been treated in the same manner as their fully automatic counterparts. Their performance is evaluated by computing the accuracy of their solutions under some fixed set of user interactions. This paper proposes a new evaluation and learning method which brings the user in the loop. It is based on the use of an active robot user - a simulated model of a human user. We show how this approach can be used to evaluate and learn parameters of state-of-the-art interactive segmentation systems. We also show how simulated user models can be integrated into the popular max-margin method for parameter learning and propose an algorithm to solve the resulting optimisation problem.

## 1 Introduction

Problems in computer vision are known to be extremely hard, and very few fully automatic vision systems exist which have been shown to be accurate and robust under all sorts of challenging inputs. In the past, these conditions confined most vision algorithms to the laboratory environment. The last decade, however, has seen computer vision finally come out of the research lab and into the real world consumer market. This sea change has occurred primarily on the back of a number of interactive systems which allow users to help the vision algorithm reach the correct solution by giving hints. Some successful examples are systems for image and video manipulation, and interactive 3D reconstruction tasks. Image stitching and interactive image segmentation are two of the most popular applications in this area. Understandably, interest in interactive vision systems has grown in the last few years, which has led to a number of workshops and special sessions in vision, graphics, and user-interface conferences (e.g. ICV07 and NIPS09).

The performance of an interactive system strongly depends on a number of factors, one of the most crucial being the user. This user dependence makes interactive systems quite different from their fully automatic counterparts, especially when it comes to learning and evaluation. Surprisingly, there has been little work in computer vision or machine learning devoted to learning interactive systems. This paper tries to bridge this gap.

We choose interactive image segmentation to demonstrate the efficacy of the ideas presented in the paper. However, the theory is general and can be used in the context of any interactive system. Interactive segmentation aims to separate a part of the image (an object of interest) from the rest. It is treated as a classification problem where each pixel can be assigned one of two labels: foreground (fg) or background (bg). The interaction comes in the form of sets of pixels which the user marks with brushes as belonging either to fg or to bg. We will refer to each user interaction in this scenario as a brush stroke.

This work addresses two questions: (1) How to evaluate any given interactive segmentation system? and (2) How to learn the best interactive segmentation system? Observe that the answer to the first question gives us an answer to the second. One may imagine a learning algorithm generating a number of possible segmentation systems. This can be done, for instance, by changing parameter values of the segmentation algorithm. We can then evaluate all such systems, and output the best one.

We demonstrate the efficacy of our evaluation methods by learning the parameters of the state-of-the-art system for interactive image segmentation and its variants. We then go further, and show how the max-margin method for learning parameters of fully automated structured prediction models can be extended to do learning with the user in the loop. To summarize, the contributions of this paper are: (1) The study of the problems of evaluating and learning interactive systems. (2) The proposal and use of a user model for evaluating and learning interactive systems. (3) The first thorough comparison of state-of-the-art segmentation algorithms under an explicit user model. (4) A new algorithm for max-margin learning with user in the loop.

#### Organization of the paper

In Section 2, we discuss the problem of system evaluation. In Section 3, we give details of our problem setting, and explain the segmentation systems we use for our evaluation. Section 4 explains the naïve line-search method for learning segmentation system parameters. In Section 5, we show how the max-margin framework for structured prediction can be extended to handle interactions, and show some basic results. The conclusions are given in Section 6.

## 2 Evaluating Interactive Systems

Performance evaluation is one of the most important problems in the development of real world systems. There are two choices to be made: (1) The data sets on which the system will be tested, and (2) the quality or accuracy measure. Traditional computer vision and machine learning systems are evaluated on preselected training and test data sets. For instance, in automatic object recognition, one minimizes the number of misclassified pixels on datasets such as PASCAL VOC [7].

In an interactive system, these choices are much harder to make because of the presence of an active user in the loop. Users behave differently, prefer different interactions, may have different error tolerances, and may also learn over time. The true objective function of an interactive system, although intuitive, is hard to express analytically: the user wants to achieve a satisfying result easily and quickly. We will now discuss a number of possible solutions, some of which are well known in the literature.

### 2.1 Static User Interactions

This is one of the most commonly used methods in papers on interactive image segmentation [4, 18, 6]. It uses a fixed set of user-made interactions (brush strokes) associated with each image of the dataset. These strokes are mostly chosen by the researchers themselves and are encoded using image trimaps. These are pixel assignments with foreground, background, and unknown labels (see Figure 1(b)). The system to be evaluated is given these trimaps as input, and its accuracy is measured by computing the Hamming distance between the obtained result and the ground truth. This scheme of evaluation does not consider how users may change their interaction by observing the current segmentation results. Evaluation and learning methods which work with a fixed set of interactions will be referred to as static in the rest of the paper.

Although the static evaluation method is easy to use, it suffers from a number of problems: (1) The fixed interactions might be very different from the ones made by actual users of the system. (2) Different systems prefer different types of user hints (interaction strokes), and thus a fixed set of hints might not be a good way of comparing two competing segmentation systems. For instance, geodesic distance based approaches [3, 9, 18] prefer brush strokes which are equidistant from the segmentation boundary, as opposed to graph cuts based approaches [5, 16]. (3) The evaluation does not take into account how the accuracy of the results improves with more user strokes. For instance, one system might only need a single user interaction to reach the ground truth result, while the other might need many interactions to get the same result. Still, both systems will have equal performance under this scheme. These problems of static evaluation make it a poor tool for judging the relative performance of newly proposed segmentation systems.

### 2.2 User Studies

A user study involves the system being given to a group of participants who are required to use it to solve a set of tasks. The system which is easiest to use and yields the correct segmentation in the least amount of time is considered the best. Examples are [13], where a full user study was carried out, and [3], where an advanced user did the best possible job with each system for a few images.

While user studies overcome most of the problems of a static evaluation, they introduce new ones: (1) User studies are expensive and need a large number of participants to be of statistical significance. (2) Participants need to be given enough time to familiarize themselves with the system. For instance, an average driver steering a Formula 1 car for the first time might be no faster than with a normal car. However, after gaining experience with the car, one would expect the driver to be much faster. (3) Each system has to be evaluated independently by participants, which makes it infeasible to use this scheme in a learning scenario where we are trying to find the optimal parameters of the segmentation system among thousands or millions of possible ones.

### 2.3 Evaluation using Crowdsourcing

Crowdsourcing has attracted a lot of interest in the machine learning and computer vision communities. This is primarily due to the success of a number of money [19], reputation [24], and community [17] based incentive schemes for collecting training data from users on the web. Crowdsourcing has the potential to be an excellent platform for evaluating interactive vision systems such as those for image segmentation. One could imagine asking Mechanical Turk [1] users to cut out different objects in images with different systems. The one requiring the least number of interactions on average might be considered the best. However, this approach too suffers from a number of problems, such as fraud prevention. Furthermore, as with user studies, it cannot be used for learning when thousands or even millions of systems have to be compared.

### 2.4 Evaluation with an Active User Model

In this paper we propose a new evaluation methodology which overcomes most of the problems described above. Instead of using a fixed set of interactions, or an army of human participants, our method only needs a model of user interactions. This model is a simple algorithm which – given the current segmentation, and the ground truth – outputs the next user interaction. This user model can be coded up using simple rules, such as “give a hint in the middle of the largest wrongly labelled region in the current solution”, or alternatively, can be learnt directly from the interaction logs obtained from interactive systems deployed in the market. There are many similarities between the problem of learning a user model and the learning of an agent policy in reinforcement learning. Thus, one may exploit reinforcement learning methods for this task. Pros and cons of evaluation schemes are summarized in Table 1.

Method | user in loop | user can learn | interaction | effort model | parameter learning | time | price |
---|---|---|---|---|---|---|---|
User model | yes | yes | yes | yes | this paper | fast | low |
Crowd sourcing | yes | yes | yes | yes | conceivable | slow | a bit |
User study | yes | yes | yes | yes | infeasible | slow | very high |
Static learning | no | no | no | no | used so far | fast | very low |
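The evaluation loop behind the active user model of Section 2.4 can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: the function names (`simulate`, `make_toy_system`, `first_error_user`), the nearest-stroke toy segmenter, and the scan-order user are all hypothetical stand-ins chosen so that the example is self-contained.

```python
def hamming_error(seg, gt):
    """Number of pixels whose label differs from the ground truth."""
    return sum(s != g for rs, rg in zip(seg, gt) for s, g in zip(rs, rg))


def simulate(segment, user, gt, n_strokes):
    """Alternate between the system and the simulated user.

    segment: maps a dict {(row, col): label} of strokes to a segmentation.
    user:    maps (current segmentation, ground truth) to the next pixel
             to mark, or None when nothing is left to correct.
    Returns the Hamming error before any stroke and after each stroke.
    """
    strokes = {}
    seg = segment(strokes)
    errors = [hamming_error(seg, gt)]
    for _ in range(n_strokes):
        hint = user(seg, gt)
        if hint is None:
            break
        strokes[hint] = gt[hint[0]][hint[1]]  # the user always gives the true label
        seg = segment(strokes)
        errors.append(hamming_error(seg, gt))
    return errors


def make_toy_system(height, width):
    """Hypothetical segmenter: each pixel copies the label of its nearest stroke."""
    def segment(strokes):
        if not strokes:
            return [[0] * width for _ in range(height)]

        def label(r, c):
            nearest = min(strokes, key=lambda p: abs(p[0] - r) + abs(p[1] - c))
            return strokes[nearest]

        return [[label(r, c) for c in range(width)] for r in range(height)]
    return segment


def first_error_user(seg, gt):
    """Hypothetical user: marks the first wrongly labelled pixel in scan order."""
    for r, row in enumerate(gt):
        for c, g in enumerate(row):
            if seg[r][c] != g:
                return (r, c)
    return None


gt = [[0, 1, 1, 0]]
print(simulate(make_toy_system(1, 4), first_error_user, gt, n_strokes=3))
```

Because the loop only needs a callable system and a callable user, the same harness can compare many parameter settings without any human participant, which is the property the table above summarises as fast and cheap.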

## 3 Image Segmentation: Problem Setting

### 3.1 The Database

We use the publicly available GrabCut database of 50 images, in which ground truth segmentations are known [2]. In order to perform large scale testing and comparison, we down-scaled all images to a fixed maximum size while keeping the original aspect ratio (we confirmed by visual inspection that the quality of segmentation results is not affected by this down-scaling operation). For each image, we created two different static user inputs: (1) A “static trimap” computed by dilating and eroding the ground truth segmentation by a fixed number of pixels (this kind of input is used by most systems for both comparison to competitors and learning of parameters, e.g. [4, 18]). (2) A “static brush” consisting of a few user made brush strokes per image which very roughly indicate foreground and background (the magenta and cyan strokes in Fig. 1(c) give an example). All this data is visualized in Figure 1. Note that in Sec. 3.3 we will describe a third, “dynamic trimap” input called the robot user, where we simulate the user.
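The construction of a static trimap by dilating and eroding the ground truth can be sketched as follows. The square structuring element, the `radius` argument, and the label encoding (0 = background, 1 = foreground, 2 = unknown) are assumptions made for this illustration; the paper's actual band width and implementation are not reproduced here.

```python
def dilate(mask, radius):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square window."""
    h, w = len(mask), len(mask[0])
    return [[int(any(mask[rr][cc]
                     for rr in range(max(0, r - radius), min(h, r + radius + 1))
                     for cc in range(max(0, c - radius), min(w, c + radius + 1))))
             for c in range(w)]
            for r in range(h)]


def erode(mask, radius):
    """Binary erosion as dilation of the complement.

    Pixels outside the image are ignored, so the image border itself
    does not erode the foreground.
    """
    inverted = [[1 - v for v in row] for row in mask]
    return [[1 - v for v in row] for row in dilate(inverted, radius)]


def static_trimap(gt, radius):
    """0 = sure background, 1 = sure foreground, 2 = unknown band."""
    sure_fg = erode(gt, radius)
    maybe_fg = dilate(gt, radius)
    return [[1 if f else (2 if m else 0) for f, m in zip(row_f, row_m)]
            for row_f, row_m in zip(sure_fg, maybe_fg)]


print(static_trimap([[0, 0, 1, 1, 1, 0, 0]], radius=1))
```

A larger `radius` gives a looser trimap, a smaller one a tighter trimap, which is the distinction between loose and tight static inputs used later in the line-search experiments.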

### 3.2 The Segmentation Systems

We now describe the four interactive segmentation systems we use in the paper: “GrabCutSimple” (GCS), “GrabCut” (GC), “GrabCutAdvanced” (GCA), and “GeodesicDistance” (GEO).

GEO is a very simple system. We first learn Gaussian Mixture Model (GMM) based color models for fg/bg from user made brush strokes. We then simply compute the shortest path in the likelihood ratio image as described in [3] to get a segmentation.

The other three systems are all built on graph cut. They all work by minimizing the energy function:

$$E(\mathbf{y}) = \sum_{p \in \mathcal{V}} E_p(y_p) + \sum_{(p,q) \in \mathcal{E}} E_{pq}(y_p, y_q) \tag{1}$$

Here $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is an undirected graph whose nodes correspond to pixels, and $y_p \in \{0, 1\}$ is the segmentation label of image pixel $p$ with color $x_p$, where 0 and 1 correspond to the background and the foreground respectively. We define $\mathcal{G}$ to be an 8-connected 2D grid graph.

The unary terms are computed as follows. A probabilistic model is learnt for the colors of background ($y_p = 0$) and foreground ($y_p = 1$) using two different GMMs $\Pr(x \mid 0)$ and $\Pr(x \mid 1)$. $E_p(y_p)$ is then computed as $-\log \Pr(x_p \mid y_p)$, where $x_p$ contains the three color channels of pixel $p$. An important concept of GrabCut [16] is to update the color models based on the whole segmentation. In practice we use a few iterations.

The pairwise term incorporates both an Ising prior and a contrast-dependent component and is computed as

$$E_{pq}(y_p, y_q) = |y_p - y_q| \left( w_i + w_c \exp\left(-\beta \|x_p - x_q\|^2\right) \right)$$

where $w_i$ and $w_c$ are weights for the Ising and contrast-dependent pairwise terms respectively, and $\beta$ is a parameter, $\beta = 0.5\, w_\beta / \langle \|x_p - x_q\|^2 \rangle$, where $\langle \cdot \rangle$ denotes expectation over an image sample [16]. We can scale $\beta$ with the parameter $w_\beta$.

To summarize, the models have two linear free parameters, $w_i$ and $w_c$, and a single non-linear one, $w_\beta$. The system GC minimizes the energy defined above, and is very close to the original GrabCut system [16]. GrabCutSimple (GCS) is a simplified version, where color models (and unary terms) are fixed up front; they are learnt from the initial user brush strokes (see Sec. 3.2) only. GCS will be used in max-margin learning and to check the active user model, but it is not considered a practical system.
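To make the roles of the three free parameters concrete, the sketch below evaluates an energy of this kind for a given labelling. It assumes a pairwise term of the form (Ising weight + contrast weight × exp(−β · squared colour difference)) on every 8-connected edge whose endpoints carry different labels, with β set to half the scale parameter divided by the mean squared colour difference. These forms, and the names `w_i`, `w_c`, `w_beta`, follow the usual GrabCut convention and are assumptions of this example; the unary costs are passed in precomputed rather than derived from GMMs.

```python
import math

# Half of the 8-neighbourhood, so that each undirected edge is visited once.
HALF_NEIGHBOURS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def grid_edges(height, width):
    """Yield every edge of the 8-connected grid graph exactly once."""
    for r in range(height):
        for c in range(width):
            for dr, dc in HALF_NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width:
                    yield (r, c), (rr, cc)


def colour_sqdist(image, p, q):
    """Squared Euclidean distance between the colours of pixels p and q."""
    a, b = image[p[0]][p[1]], image[q[0]][q[1]]
    return sum((u - v) ** 2 for u, v in zip(a, b))


def contrast_beta(image, w_beta):
    """beta = 0.5 * w_beta / mean squared colour difference over all edges."""
    dists = [colour_sqdist(image, p, q)
             for p, q in grid_edges(len(image), len(image[0]))]
    mean = sum(dists) / len(dists) if dists else 0.0
    return 0.5 * w_beta / mean if mean > 0 else 0.0


def energy(labels, unary, image, w_i, w_c, w_beta):
    """Unary costs plus Ising and contrast penalties on label changes.

    unary[r][c][y] is the (precomputed) cost of giving pixel (r, c) label y.
    """
    beta = contrast_beta(image, w_beta)
    total = sum(unary[r][c][y]
                for r, row in enumerate(labels) for c, y in enumerate(row))
    for p, q in grid_edges(len(labels), len(labels[0])):
        if labels[p[0]][p[1]] != labels[q[0]][q[1]]:
            total += w_i + w_c * math.exp(-beta * colour_sqdist(image, p, q))
    return total
```

The energy is linear in `w_i` and `w_c` but not in `w_beta`, which is why only the first two can be handled by the max-margin learner of Section 5 while all three can be tuned by line search.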

Finally, “GrabCutAdvanced(GCA)” is an advanced GrabCut system performing considerably better than GC. Inspired by recent work [14], foreground regions are 4-connected to a user made brush stroke to avoid deserted foreground islands. Unfortunately, such a notion of connectivity leads to an NP-hard problem and various solutions have been suggested [23, 15]. However, all these are either very slow and operate on super-pixels [15] or have a very different interaction mechanism [23]. We simply remove deserted foreground islands in a postprocessing step.
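The post-processing step of GCA can be illustrated with a simple flood fill: starting from the foreground brush strokes, keep every foreground pixel that is 4-connected to a stroke and relabel everything else as background. This is a sketch under the assumption that strokes are given as pixel coordinates; the function name `remove_islands` is hypothetical.

```python
from collections import deque


def remove_islands(seg, fg_strokes):
    """Keep only foreground pixels 4-connected to a foreground stroke pixel."""
    h, w = len(seg), len(seg[0])
    keep = [[0] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with stroke pixels that are currently foreground.
    for r, c in fg_strokes:
        if seg[r][c] == 1 and not keep[r][c]:
            keep[r][c] = 1
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and seg[nr][nc] == 1 and not keep[nr][nc]:
                keep[nr][nc] = 1
                queue.append((nr, nc))
    return keep


seg = [[1, 1, 0, 1],
       [0, 1, 0, 1]]
print(remove_islands(seg, fg_strokes=[(0, 0)]))
```

The fill runs in time linear in the number of pixels, so it adds negligible cost compared with the graph cut itself.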

### 3.3 The Robot User

We now describe the different active user models tested and deployed by us. Given the ground truth segmentation $y^k$ and the current segmentation solution $y$, the active user model is a policy $s$ which specifies which brush stroke to place next. Here, $u^{k,t}$ denotes the user interaction history of image $x^k$ up to time $t$, which the policy may also take into account. We have investigated various options for this policy: (1) Brush strokes at random image positions. (2) Brush strokes in the middle of the wrongly labelled region (center). For the second strategy, we find the largest connected region of the binary mask which is given by the absolute difference between the current segmentation and the ground truth. We then mark a circular brush stroke at the pixel which is inside this region and furthest away from its boundary. This is motivated by the observation that users tend to find it hard to mark pixels at the boundary of an object because they have to be very precise.
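The second strategy can be sketched as below. The example makes several simplifying assumptions that are not taken from the paper: regions are 4-connected, distance to the region boundary is measured in city-block steps by a breadth-first peel, ties are broken by the smallest (row, column), and only the stroke centre and its label are returned (the circular brush itself is omitted).

```python
from collections import deque

NEIGHBOURS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connectivity (assumption)


def largest_error_region(seg, gt):
    """Largest connected component of the mask seg != gt, as a pixel list."""
    h, w = len(gt), len(gt[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or seg[r][c] == gt[r][c]:
                continue
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                pr, pc = queue.popleft()
                region.append((pr, pc))
                for dr, dc in NEIGHBOURS:
                    nr, nc = pr + dr, pc + dc
                    if (0 <= nr < h and 0 <= nc < w and not seen[nr][nc]
                            and seg[nr][nc] != gt[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(region) > len(best):
                best = region
    return best


def centre_stroke(seg, gt):
    """Return (row, col, label) of the next stroke, or None if seg == gt."""
    region = set(largest_error_region(seg, gt))
    if not region:
        return None
    depth, queue = {}, deque()
    # Pixels touching the outside of the region (or the image border) have depth 1.
    for r, c in sorted(region):
        if any((r + dr, c + dc) not in region for dr, dc in NEIGHBOURS):
            depth[(r, c)] = 1
            queue.append((r, c))
    # Peel inwards: each step away from the boundary adds one to the depth.
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBOURS:
            n = (r + dr, c + dc)
            if n in region and n not in depth:
                depth[n] = depth[(r, c)] + 1
                queue.append(n)
    r, c = max(sorted(depth), key=depth.get)  # deepest pixel, ties by position
    return r, c, gt[r][c]
```

Note that this user never looks at the segmentation system itself, only at the current result and the ground truth, which is what makes it cheap enough to run inside a learning loop.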

We also tested user models which take the segmentation algorithm explicitly into account. This is analogous to users who have learnt how the segmentation algorithm works and thus interact with it accordingly. We consider user models which mark a circular brush stroke at the pixel (1) with the lowest min marginal (sensit), (2) which results in the largest change in labeling (roi size), or (3) which decreases the Hamming error by the biggest amount (Hamming). For the last strategy, we consider each pixel as the circle center and choose the one where the Hamming error decreases most. This is very expensive, but in some respects is the best solution. (One could do even better by looking ahead two or more brush strokes and then selecting the optimal one; however, the search space grows exponentially with the number of look-ahead steps.) “Hamming” acts as a very “advanced user”, who knows exactly which interactions (brush strokes) will reduce the error by the largest amount. It is questionable whether a real user is actually able to find the optimal position, and a user study might be needed to verify this. On the other hand, the “centre” user model behaves as a “novice user”.

Fig. 1(c) shows the result of a robot user interaction, where cyan and magenta are the initial fixed brush strokes (called “static brush trimap”), and the red and blue dots are the robot user interactions. The robot sets brushes of a maximum fixed size (here 4 pixel radius). Away from the true object boundary, the maximum brush size is used. At the boundary, the brush size is scaled down so that the brush does not straddle the boundary.

Fig. (d) shows the performance of the 5 different user models (robot users) over a range of brushes. Here we used the GCS system, since it is computationally infeasible to apply the (sensit; roi; Hamming) user models to the other systems. GCS can be used because it allows efficient computation of solutions by recycling computation during the optimization [11]. In the other systems this is not possible, since the unaries change with every brush stroke, and hence we have to treat the system as a black box.

As expected, the random user performs badly. Interestingly, the robot users minimizing the energy (roi, sensit) also perform badly. Both “Hamming” and “centre” are considerably better than the rest. It is interesting to note that “centre” is actually only marginally worse than “Hamming”. It has to be said that for other systems, e.g. GEO, this conclusion might not hold, since GEO is much more sensitive to the location of the brush stroke than a system based on graph cut, as [18] has shown. To summarize, “centre” is the robot user strategy which simulates a “novice user” and is computationally feasible, since it does not look at the underlying system at all. Also, for GCS, “centre” performed nearly the same as the optimal strategy “Hamming”. Hence, for the rest of the paper we always stick to the user (centre), which we call from here onwards our robot user.

### 3.4 The Error Measure

For a static trimap input there are many different ways of obtaining an error rate, see e.g. [4, 10]. In a static setting, most papers use the number of misclassified pixels (Hamming distance) between the ground truth segmentation and the current result. We call this measure “$er_b$”, i.e. the Hamming error for brush $b$. One could consider variations, e.g. [10] weights distances to the boundary differently, but we have not investigated this here. Fig. (d) shows how the Hamming error behaves with each interaction.

For learning and evaluation we need an error metric giving us a single score for the whole interaction. One choice is the “weighted” Hamming error averaged over a fixed number of brush strokes B. In particular we choose the error “$Er$” as $Er = \frac{1}{B} \sum_{b=1}^{B} f(er_b)$, where $er_b$ is the Hamming error after brush $b$ and $f(er_b) = er_b$. Note that, to ensure a fair comparison between systems, B must be the same number for all systems. Another choice for the quality metric, which matches more closely what the user wants, is described as follows. We use a sigmoid function of the form:

(2) |

Observe that this weighting function encodes two facts: all errors below a lower threshold are considered negligible, and large errors never weigh more than a fixed maximum. The first reason for this setting is that visual inspection showed that, for most images, an error below this threshold corresponds to a visually pleasing result. Of course this is highly subjective: a missing limb in the segmentation of a cow might be a numerically small error but is visually unpleasing, whereas an incorrectly segmented low-contrast area might be a numerically larger error but is visually not disturbing. The second reason, for having a maximum weight, is that users do not discriminate between two systems giving large errors. Thus errors of 50% and 55% are equally penalized.
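Both aggregate measures can be written as one averaging function with a pluggable weighting. In the sketch below, the plain average uses the identity weighting, while `saturating_weight` is a hypothetical stand-in for the sigmoid: it is zero below a threshold and levels off at a cap. The tanh shape and the default values of `negligible`, `cap`, and `scale` are assumptions chosen for illustration and are not the values used in the paper.

```python
import math


def average_error(errors, weight=lambda e: e):
    """Mean of weight(er_b) over the B per-brush Hamming errors."""
    return sum(weight(e) for e in errors) / len(errors)


def saturating_weight(error, negligible=1.0, cap=2.0, scale=5.0):
    """Hypothetical sigmoid-like weighting of a per-brush error (in percent).

    Errors up to `negligible` cost nothing; large errors approach `cap`,
    so two very bad results are penalised almost equally.
    """
    if error <= negligible:
        return 0.0
    return cap * math.tanh((error - negligible) / scale)


per_brush_errors = [4.0, 2.0, 0.0]            # er_b for b = 1..B, illustrative values
print(average_error(per_brush_errors))         # plain average
print(average_error(per_brush_errors, saturating_weight))
```

Since B enters only through the length of the error list, comparing two systems fairly requires passing lists of the same length, as noted above.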

Due to runtime limitations for parameter learning, we want to run
the robot user for only a limited number of brushes.
Thus we start by giving an initial set of brush strokes
which are used to learn the colour models. At the same time, we want
that most images reach an error level of about . When we
start with a static brush trimap we get for of images an
error rate smaller than and for smaller than ,
with the GCA system. We also confirmed that the initial static brush
trimap does not affect the learning considerably. (We started
the learning from no initial brushes and let it run for 60 brush
strokes. The learned parameters were similar to those obtained when
starting from 20 brushes.)

## 4 Interactive Learning by line-search

Systems with few parameters can be trained by simple line (or grid) search. Our systems, GC and GCA, have only 3 free parameters: the Ising weight $w_i$, the contrast weight $w_c$, and the contrast scale $w_\beta$. Line search is done by fixing all but one free parameter and simulating the user interaction process for different discrete values of the free parameter over a predefined range. The optimal value from the discrete set is chosen to minimize the leave-one-out (LOO) estimator of the test error (that is, cross validation with as many folds as there are data points). Not only do we prevent overfitting, but we can also efficiently compute the Jackknife estimator of the variance [25, ch. 8.5.1]. This procedure is done for all parameters in sequence, with a sensible starting point for all parameters. We do one sweep only. One important thing to notice is that our dataset was big enough (and our parameter set small enough) not to suffer from over-fitting. We see this by observing that training and test error rates are virtually the same for all experiments. In addition to the optimal value, we obtain the variance for setting this parameter. Roughly speaking, this variance tells us how important it is to have this particular value. For instance, a high variance means that parameters different from the selected one would also perform well. Note that since our error function (Eq. 2) is defined for both static and dynamic trimaps, the above procedure can be performed for all three different types of trimaps: “static trimap”, “static brush”, “dynamic brush”.
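One step of this line search can be sketched as follows. The example assumes that the per-image errors for every candidate value have already been computed by running the robot user, and that they are passed in as a table `errors[value][image]`; the function name and this table layout are hypothetical. The jackknife variance is computed over the values selected in the leave-one-out folds.

```python
def loo_line_search(errors):
    """Line search over one parameter with leave-one-out selection.

    errors: dict mapping each candidate value to a list of per-image errors.
    Returns (value chosen on all images, LOO test error, jackknife variance
    of the selected value).
    """
    values = list(errors)
    n = len(errors[values[0]])
    totals = {v: sum(errors[v]) for v in values}
    best_all = min(values, key=lambda v: totals[v])

    picks, held_out = [], []
    for i in range(n):
        # Select the value using all images except image i ...
        pick = min(values, key=lambda v: totals[v] - errors[v][i])
        picks.append(pick)
        # ... and test it on the held-out image.
        held_out.append(errors[pick][i])

    mean_pick = sum(picks) / n
    jackknife_var = (n - 1) / n * sum((p - mean_pick) ** 2 for p in picks)
    return best_all, sum(held_out) / n, jackknife_var


table = {1.0: [0.0, 4.0], 2.0: [3.0, 0.0]}  # illustrative numbers only
print(loo_line_search(table))
```

Repeating this step for each parameter in turn, with the others held fixed, gives the single sweep described above; a large jackknife variance signals that neighbouring values perform about as well as the selected one.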

Table 2 summarises all the results, and Fig. 3 illustrates some results during training and test (the caption explains the details of the plots). One can observe that the three different trimaps suggest different optimal parameters for each system, and are differently certain about them. This leads to a key contribution of this study: a system which is interactive at test time also has to be trained in an interactive way. We see from the test plots that a system trained with the “dynamic trimap” indeed does better than one trained with either the “static brush” or the “static trimap”.

Let us look closer at some learnt settings. For system GCA and
parameter (see Table 2 (first row), and
Fig. (a)a we observe that the optimal value
in a dynamic setting is lower () than in any of the static
settings. This is surprising since one would have guessed that the
true value of lies somewhere in between a loose and very tight
trimap. Interestingly in [18], the authors had
learned a parameter by averaging the performance from two static
trimaps. From the above study, one might have concluded the static
“tight trimap” might give good insights about the choice of
parameters. However, when we now consider the training of the
parameter in the GCA system, we see that such a conclusion
would be wrong, since the “tight trimap” reaches a very different
minimum than the dynamic trimap. (The high uncertainty of the
“tight trimap” learning also indicates that this value cannot be
trusted very much.) To summarize, conclusions about the optimal
parameter setting of an interactive system should be drawn from a
large set of interactions and cannot be made by looking solely at a
few (here two) static trimaps.

Trimap | | | |
---|---|---|---|
dynamic brush | 0.03 ± 0.03 | 4.31 ± 0.17 | 2.21 ± 3.62 |
static trimap | 0.07 ± 0.09 | 4.39 ± 4.40 | 9.73 ± 7.92 |
static brush | 0.22 ± 0.52 | 0.47 ± 8.19 | 3.31 ± 2.13 |

For the sake of completeness, we have the same numbers for the GC system in Table 3. We see the same conclusions as above. One interesting thing to notice here is that the pairwise terms (esp. ) are chosen higher than in GCA. This is expected, since without post-processing a lot of isolated islands may be present which are far away from the true boundary. So post-processing automatically removes these islands. The effect is that in GCA the pairwise terms can now concentrate on modeling the smoothness on the boundary correctly. However, in GC the pairwise terms have to additionally make sure that the isolated regions are removed (by choosing a higher value for the pairwise terms) in order to compensate for the missing post-processing step.

Trimap | | | |
---|---|---|---|
dynamic brush | 0.24 ± 0.03 | 4.72 ± 1.16 | 1.70 ± 1.11 |
static trimap | 0.07 ± 0.09 | 4.39 ± 4.40 | 4.85 ± 6.29 |
static brush | 0.57 ± 0.90 | 5.00 ± 0.17 | 1.10 ± 0.96 |

It is interesting to note that for the error metric , we get slightly different values, see Table 4. For instance, we see that for GCA with our active user. This is not too surprising, since it says that larger errors are more important (this is what does). Hence, it is better to choose a larger value of .

In Figure 3d of the paper we plot the actual segmentation error and not the error measure for . In Table 4, we have collected all final error measure values. It is visible from the table that the dynamically adjusted parameters only perform better in terms of the instantaneous error but not in terms of the cumulative error measure.

static brush | static trimap | dynamic brush | |
---|---|---|---|

[shown] | |||

In order to get a complete picture, we provide the full set of plots for the line search experiments. We report results for the two systems GCA and GC on three parameters , and and two error weighting functions in Figures 6 and 7.

#### Novice vs Advanced User

When comparing different interactive systems, we have to decide, whether the system is designed for an advanced or a novice user.

In a user study, one has full control over selecting advanced or novice users. This can be done by changing the amount of introduction given to the participants. However, this process is lengthy and therefore infeasible for learning.

In our robot user paradigm, we can simulate users with different levels of experience. We run the (center) user model to simulate a novice user and evaluate four different systems. The results are shown in Fig. 4. The order of the methods is as expected, GCA is best, followed by GC, then GCS, and GEO. GEO performs badly since it does no smoothing at the boundary, compared to the other systems.

## 5 Interactive Max-margin Learning

The grid-search method used in Section 4 can be used for learning models with few parameters only. Max-margin methods deal with models containing large numbers of parameters and have been used extensively in computer vision. However, they work with static training data and cannot be used with an active user model. In this Section, we show how the traditional max-margin parameter learning algorithm can be extended to incorporate an active user.

### 5.1 Static SVMstruct

Our exposition builds heavily on [20] and the references therein. The SVMstruct framework [22] allows us to adjust the linear parameters $w$ of the segmentation energy $E_w(y, x)$ (Eq. 1) from a given training set of $K$ images $x^k$ and ground truth segmentations $y^k$ by balancing between empirical risk and regularisation by means of a trade-off parameter $C$. (We write images as vectors for simplicity; all involved operations respect the 2d grid structure absent in general vectors.) A (symmetric) loss function $\ell$ measures the degree of fit between two segmentations $y$ and $y^*$; we use the Hamming loss. The current segmentation is given by $y^* = \arg\min_y E_w(y, x)$. We can write the energy function as an inner product between feature functions $\psi(y, x)$ and our parameter vector $w$: $E_w(y, x) = w^\top \psi(y, x)$. With the two shortcuts $\delta\psi^k_y$ (the difference between the features of a segmentation $y$ and those of the ground truth $y^k$) and $\ell^k_y = \ell(y, y^k)$, the margin rescaled objective [21] reads

(3) | |||||

sb.t. |

In fact, the convex function can be rewritten as a sum of a quadratic regulariser and a maximum over an exponentially sized set of linear functions each corresponding to a particular segmentation . Which energy functions fit under the umbrella of SVMstruct? In principle, in the cutting-planes approach [22] to solve Eq. 3, we only require efficient and exact computation of and . For the scale of images i.e. , submodular energies of the form allow for efficient minimisation by graph cuts. As soon as we include connectivity constraints as in Eq. 1, we can only approximately train the SVMstruct. However some theoretical properties seem to carry over empirically [8].
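For orientation, the textbook margin-rescaled structural SVM for an energy-minimisation model can be written as below. This is the generic form from the SVMstruct literature, stated under assumed notation ($w$ for the parameters, $\xi_k$ for the slack variables, $C$ for the trade-off, $\psi$ for the features, $\ell$ for the loss, $n$ for the number of pixels); it is offered as a reading aid and is not claimed to match the paper's Eq. 3 symbol for symbol.

```latex
% Constrained form: the ground truth y^k must have lower energy than any
% other labelling y by a margin that grows with the loss of y.
\min_{w,\; \xi \ge 0} \quad \tfrac{1}{2}\|w\|^2 + \frac{C}{K} \sum_{k=1}^{K} \xi_k
\qquad \text{s.t.} \qquad
w^\top \psi(y, x^k) - w^\top \psi(y^k, x^k) \;\ge\; \ell(y, y^k) - \xi_k
\quad \forall k,\ \forall y \in \{0,1\}^n

% Equivalent unconstrained form: a quadratic regulariser plus, per image,
% a maximum over exponentially many functions that are linear in w.
o(w) = \tfrac{1}{2}\|w\|^2 + \frac{C}{K} \sum_{k=1}^{K}
\max_{y \in \{0,1\}^n} \Big[ \ell(y, y^k) - w^\top \big( \psi(y, x^k) - \psi(y^k, x^k) \big) \Big]
```

In this form, the inner maximisation is a loss-augmented energy minimisation, $\arg\min_y \, w^\top \psi(y, x^k) - \ell(y, y^k)$. With the Hamming loss, the loss term decomposes over pixels and only modifies the unary potentials, so a submodular energy stays submodular and the maximiser can still be found by graph cuts.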

### 5.2 Dynamic SVMstruct with “Cheating”

The SVMstruct does not capture the user interaction part. Therefore, we add a third term to the objective that measures the amount of user interaction where is a binary image indicating whether the user provided the label of the corresponding pixel or not. One can think of as a partial solution fed into the system by the user brush strokes. In a sense implements a mechanism for the SVMstruct to cheat, because only the unlabeled pixels have to be segmented by our procedure, whereas the labeled pixels stay clamped. In the optimisation problem, we also have to modify the constraints such that the only segmentations compatible with the interaction are taken into account. Our modified objective is given by:

(4) | |||||

sb.t. | |||||

For simplicity, we choose the amount of user interaction or cheating
to be the maximal -reweighted number of labeled
pixels , with uniform weights
. Other formulations based on the average
rather than on the maximal amount of interaction proved feasible
but less convenient. We denote the set of all user interactions for
all images by .
The compatible label set
is given by
where is the ground truth labeling. Note that
is convex in for all values of and efficiently
minimisable by the cutting-planes algorithm. However the dependence
on is horribly difficult – we basically have to
find the smallest set of brush strokes leading to a correct segmentation.
Geometrically, setting one halves the number
of possible labellings and therefore removes half of the label constraints.
The optimisation problem (Eq. 5) can
be re-interpreted in two different ways:

Firstly, we can define a modified energy
with additional cheating potentials
for and otherwise allowing to treat the
SVMstruct with cheating as an ordinary SVMstruct with modified energy
function and extended
weight vector .

A second (but closely related) interpretation starts from the fact that the true label can be regarded as a feature vector of the image. (In fact, it is probably the most informative feature one can think of; the corresponding predictor is given by the identity function.) Therefore, it is feature selection in a very particular feature space. There is a direct link to multiple kernel learning, a special kind of feature selection.
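Returning to the quantities used in the cheating objective, the clamping of user-labelled pixels can be made concrete with three small helpers. The sketch assumes segmentations and interaction masks are flat 0/1 lists and that the uniform weights are all equal to 1 by default; the function names and that default are hypothetical choices for this example.

```python
def is_compatible(y, u, y_true):
    """True if y agrees with the ground truth on every user-labelled pixel."""
    return all(yi == ti for yi, ui, ti in zip(y, u, y_true) if ui)


def interaction_amount(all_u, weights=None):
    """Maximal (weighted) number of labelled pixels over all images."""
    def amount(u):
        ws = weights if weights is not None else [1.0] * len(u)
        return sum(w * ui for w, ui in zip(ws, u))
    return max(amount(u) for u in all_u)


def n_compatible_labellings(u):
    """Each labelled pixel halves the number of admissible segmentations."""
    return 2 ** (len(u) - sum(u))


y_true = [1, 1, 0]
u = [1, 0, 0]                      # the user labelled only the first pixel
print(is_compatible([1, 0, 1], u, y_true), n_compatible_labellings(u))
```

The last helper makes the geometric remark explicit: labelling one more pixel removes exactly half of the remaining label constraints.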

### 5.3 Optimisation with strategies

We explored two approaches to minimise the objective of Eq. 4. Based on its discrete derivative with respect to the user interactions, we tried coordinate descent schemes. Due to the strong coupling of the variables, only very short steps were possible (in the end, we can only safely flip a single pixel at a time to guarantee descent). Conceptually, this process of optimisation is decoupled from the user interaction process, in which the removal of already known labels from the cheating does not make sense. At every stage of interaction, a user acts according to a strategy $s$. The notion of strategy or policy is also at the core of a robot user. In order to capture the sequential nature of the human interaction, and assuming a fixed strategy $s$, we relax Eq. 4 to

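[Eq. 5: the objective of Eq. 4, with the user interaction for each image constrained to be the result of repeatedly applying the fixed strategy.]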

Here, the user interaction is obtained by repeated application of the strategy, written with the function concatenation operator. Note that we still cannot properly optimise Eq. 5. However, as a proxy, we develop Eq. 5 forward, starting from an initial parameter estimate. In every step, we interleave the optimisation of the convex objective and the inclusion of a new user stroke, which yields the final parameter estimate after the last step.
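
The forward development described above can be sketched as a loop that alternates between refitting the parameters and letting a robot user add one stroke. This is only a schematic toy under stated assumptions: `fit_weights` is a closed-form stand-in for the convex SVMstruct step, `segment` is a one-parameter threshold model rather than GCS, and `strategy` labels the first wrongly segmented pixel rather than implementing any robot user from the paper. All names and data are hypothetical.

```python
def segment(image, w, mask, truth):
    """Threshold segmentation; user-labeled pixels stay clamped to the truth."""
    return [t if u else int(v > w) for v, u, t in zip(image, mask, truth)]


def strategy(current, mask, truth):
    """Toy robot user: label the first wrong pixel; never remove labels."""
    new_mask = list(mask)
    for i, (y, t) in enumerate(zip(current, truth)):
        if y != t:
            new_mask[i] = 1
            break
    return new_mask


def fit_weights(images, truths, masks, w_prev):
    """Stand-in for the convex step (NOT SVMstruct): midpoint between the
    class means of the pixels that are still unlabeled."""
    fg, bg = [], []
    for image, truth, mask in zip(images, truths, masks):
        for v, t, u in zip(image, truth, mask):
            if not u:
                (fg if t else bg).append(v)
    if not fg or not bg:
        return w_prev  # nothing left to separate; keep the previous estimate
    return (sum(fg) / len(fg) + sum(bg) / len(bg)) / 2


def learn_with_robot_user(images, truths, n_strokes, w_init=0.0):
    """Interleave parameter fitting with one new simulated stroke per step."""
    masks = [[0] * len(t) for t in truths]
    w = fit_weights(images, truths, masks, w_init)
    history = [w]
    for _ in range(n_strokes):
        masks = [
            strategy(segment(x, w, m, t), m, t)
            for x, m, t in zip(images, masks, truths)
        ]
        w = fit_weights(images, truths, masks, w)
        history.append(w)
    return w, masks, history


images = [[2, 5, 6, 1, 9]]
truths = [[0, 1, 0, 0, 1]]
w_final, masks_final, history = learn_with_robot_user(images, truths, n_strokes=2)
print(history)      # [5.0, 6.0, 6.0]
print(masks_final)  # [[0, 1, 0, 0, 0]]
```

The strategy only ever adds labels, mirroring the remark that removing already known labels makes no sense in the interaction process.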

### 5.4 Experiments

We ran our optimisation algorithm with GCS on 5-fold cross-validation (CV) train/test splits of the GrabCut images. We used unary potentials (GMM and flux) as well as two pairwise potentials (Ising and contrast), together with the center robot user and a fixed number of strokes. Fig. 5b shows how the relative weight of the linear parameters varies over time. At the beginning, smoothing (a high Ising weight) is needed, whereas later edges are most important (a high contrast weight). The SVMstruct objective also changes: Fig. 5c makes clear that the data fit term decreases over time while regularisation increases. However, looking at the test error in Fig. 5a (averaged over the 5 folds), we see only very little difference between the performance of the initial and the final parameter estimates. Our explanation is that GCS is too simple, as it does not include connectivity or unary iterations. In addition to the Gaussian Mixture Model (GMM) based color potentials, we also experimented with flux potentials [12] as a second unary term. Fig. 5b shows one example where we included a flux unary potential; we get almost identical behavior without flux unaries.
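
The evaluation protocol (5-fold train/test splits, test error averaged over folds) can be sketched as below. How images were assigned to folds is not stated in the text, so the interleaved assignment here is an assumption, and both function names are hypothetical.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumption: item i goes to fold i % k (interleaved assignment).
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test


def mean_over_folds(errors):
    """Average a list of per-fold test errors."""
    return sum(errors) / len(errors)


splits = list(k_fold_splits(10, k=5))
print(splits[0])  # ([1, 2, 3, 4, 6, 7, 8, 9], [0, 5])
```

Each image appears in exactly one test fold, so the averaged test error covers the whole data set once.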

## 6 Conclusion

This paper showed how user interaction models (robot users) can be used to train and evaluate interactive systems. We demonstrated the power of this approach on the problem of parameter learning in interactive segmentation systems. We showed how simple grid search can be used to find good parameters for different segmentation systems under an active user interaction model. We also compared the performance of the static and dynamic user interaction models. With more parameters, grid search becomes infeasible, which naturally leads to the max-margin framework.

We introduced an extension to SVMstruct, which allows it to incorporate user interaction models, and showed how to solve the corresponding optimisation problem. However, crucial parts of state-of-the-art segmentation systems include (1) non-linear parameters, (2) higher-order potentials (e.g. enforcing connectivity) and (3) iterative updates of the unary potentials, ingredients that cannot be handled directly inside the max-margin framework. In future work, we will try to tackle these challenges to enable learning of optimal interactive systems.

## References

- [1] Amazon Mechanical Turk. https://www.mturk.com.
- [2] GrabCut database. http://research.microsoft.com/en-us/um/cambridge/projects/visionimagevideoediting/segmentation/grabcut.htm.
- [3] X. Bai and G. Sapiro. A geodesic framework for fast interactive image and video segmentation and matting. In ICCV, 2007.
- [4] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation using an adaptive GMMRF model. In ECCV, 2004.
- [5] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.
- [6] O. Duchenne, J.-Y. Audibert, R. Keriven, J. Ponce, and F. Ségonne. Segmentation by transduction. In CVPR, 2008.
- [7] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge. http://www.pascal-network.org/challenges/VOC.
- [8] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, 2008.
- [9] L. Grady. Random walks for image segmentation. PAMI, 28:1–17, 2006.
- [10] P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. In CVPR, 2008.
- [11] P. Kohli and P. Torr. Efficiently solving dynamic Markov random fields using graph cuts. In ICCV, 2005.
- [12] V. Lempitsky and Y. Boykov. Global optimization for shape fitting. In CVPR, 2007.
- [13] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum. Lazy snapping. SIGGRAPH, 23:303–308, 2004.
- [14] J. Liu, J. Sun, and H.-Y. Shum. Paint selection. In SIGGRAPH, 2009.
- [15] S. Nowozin and C. H. Lampert. Global connectivity potentials for random field models. In CVPR, 2009.
- [16] C. Rother, V. Kolmogorov, and A. Blake. GrabCut - interactive foreground extraction using iterated graph cuts. SIGGRAPH, 23(3):309–314, 2004.
- [17] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77:157–173, 2008.
- [18] D. Singaraju, L. Grady, and R. Vidal. P-brush: Continuous valued MRFs with normed pairwise distributions for image segmentation. In CVPR, 2009.
- [19] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Internet Vision Workshop at CVPR, 2008.
- [20] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, 2008.
- [21] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML, 2004.
- [22] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector learning for interdependent and structured output spaces. In ICML, 2004.
- [23] S. Vicente, V. Kolmogorov, and C. Rother. Graph cut based image segmentation with connectivity priors. In CVPR, 2008.
- [24] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, pages 319–326, 2004.
- [25] L. Wasserman. All of Statistics. Springer, 2004.