Saliency-Guided Perceptual Grouping Using Motion Cues in Region-Based Artificial Visual Attention
Region-based artificial attention constitutes a framework for bio-inspired attentional processes at an intermediate level of abstraction for use in computer vision and mobile robotics. Segmentation algorithms produce regions of coherently colored pixels. These serve as proto-objects on which the attentional processes determine image portions of relevance. A single region, which does not necessarily represent a full object, constitutes the focus of attention. For many post-attentional tasks, however, such as identifying or tracking objects, single segments are not sufficient. Here, we present a saliency-guided approach that groups regions that potentially belong to the same object based on proximity and similarity of motion. We compare our results to object selection by thresholding saliency maps and to a further attention-guided strategy.
Keywords: Artificial attention, region-based saliency, motion saliency, proto-objects.
Many artificial and natural systems must be able to visually select objects in dynamic scenes to perform cognitive tasks or actions. Selective visual attention, a concept from psychology, enables systems to distribute resources in a way that the relevant portions of a visual scene are processed efficiently. Stimuli that are unimportant in the current situation are disregarded or processed with low priority.
In biological systems, the focus of attention—the location in the visual field on which the available resources are focused—is determined with regard to bottom-up saliency and top-down influences. The latter are guided by knowledge with respect to the current task, whereas bottom-up saliency refers to local contrasts in the image that render certain locations conspicuous. Different feature dimensions contribute to bottom-up saliency, such as stimulus intensity or local orientation features. Activation from these different channels propagates towards a common saliency map; their integration is a prerequisite for the percept of a coherent object.
Various approaches have been proposed to implement attention mechanisms for technical systems. The popular model by Itti et al.  uses Difference-of-Gaussian and Gabor filters applied to image pyramids. Local contrasts regarding the features color, intensity, and orientation (also motion and flicker in some versions, see e.g. ) are computed by comparing pixels from finer scales to pixels of coarser scales. Weighting in the combination of the different feature dimensions can be used as a means of top-down influence. That is, a representation of a search target can be learned as a set of weights for the combination, which are then applied in the search (e.g., see ). Other approaches for guiding attention in technical systems rely on frequency domain representations (see e.g., ), use statistical methods (see e.g., ), or are region-based.
Region-based approaches perform an initial image segmentation to group similar pixels to coherent regions and then determine the focus of attention (FOA) based on these regions. Different segmentation methods have been applied in different models, such as region growing , superpixel methods , or colorspace quantization . Some approaches allow the integration of top-down influences based on template regions [11, 12] or complexes of multiple regions .
An important feature is motion, as it indicates changes within dynamic scenes which may require the system to reorient. It therefore has been integrated in several attention systems applying pixel-based , frequency domain , or region-based methods  in a spatiotemporal context. The attention model by Tsotsos et al. has been used to model attention towards motion in a biologically plausible hierarchical manner . This includes the progression from sensitivities for simple local translation to those for more complex patterns such as expanding or approaching stimuli. An elaborate survey on artificial attention systems and their biological foundations may be found in , whereas an extensive comparison of different technical systems was performed in .
Typically, the output of technical attention systems is a master saliency map that is a retinotopic mapping of the integrated activation from different feature channels and top-down influences. Additional modulation of this map may result from mechanisms such as inhibition of return (IOR) which suppresses locations (or features) that have been previously attended. The maximum activation at a certain time in the saliency map determines the FOA. Hence, the FOA is the point in the visual field that corresponds to the location of the saliency peak.
Many post-attentional tasks, however, require objects or their boundaries instead of a single point. For example, if the focus of attention shall be used to select an object and pass it to a typical object tracker, its boundaries are needed to obtain a suitable image patch. Processes within the attention system also benefit from more complex proto-objects, for example, when IOR is to be applied at the object level (to suppress attending of a whole object instead of an area that may or may not cover the object). The region-based approach is a step towards this, because instead of a single location the shape of the perceptual entity in focus is known. However, because these systems are based on segmentation with regard to some homogeneity criterion (e.g. similar color) and objects in general are not homogeneous (e.g. textured objects), a region can capture full objects only in rare cases.
A common method to obtain objects after the attentional process is to select image portions that exceed a certain saliency threshold (see e.g.,  or ). This requires prior knowledge about what threshold will select a reasonable object. In cases where only a part of an object is salient, the thresholding procedure will select only this part and not the full object. To avoid such difficulties, Walther and Koch  used a method that performs a segmentation at the FOA considering the low-level features which are already computed in the process of saliency computation. Their result is the approximate shape and extent of the object at the FOA.
For region-based attention, where a segmentation of the scene already exists, no such method, which groups the pre-attentional segments to form an object at the FOA, is available yet. Only recently have multi-region objects been considered in region-based attention by ; however, this method requires a given multi-region template which is then, on a sustained basis, attended throughout a sequence. A mechanism which is able to group regions around the FOA to form a coherent object independently of such a template is desirable for the mentioned reasons.
In general, features of segments provide no unambiguous information about their relationship with regard to objects. In textured objects, features such as color, orientation, or symmetry carry little information about their membership in objects. Some features, however, such as a shared motion pattern or a common depth, are highly indicative of regions belonging to the same object, at least in many situations. Thus, these features are well-suited to group individual regions to represent full objects.
In this paper, we extend the region-based spatiotemporal saliency model which was presented in . The FOA obtained with this model is a single region which in most cases does not represent a full object. Here, we add a further step which, starting at the FOA, groups regions with similar motion to form coherent objects.
Note that this concept differs from motion segmentation, which is a traditional topic in computer vision (see  for a review). In this line of research, the goal is to segment objects based on their motion for the whole scene. The concept we employ here, and which was also used by Walther and Koch  whose approach is not limited to motion, is based on determining a relevant location first by means of saliency computation. An object representation is extracted only at this location. This is in line with studies of biological attention which show that attention is a prerequisite for representing objects (see e.g. ).
2 From Motion Saliency to Multi-region Objects
The concept for grouping regions at the FOA is shown in figure 2: The saliency map contains one most salient region which is fed forward to select the region that represents the FOA. This region functions as a seed for the process that iteratively merges similarly moving regions. The spatiotemporal low-level features are looked up for the seed region and its immediate neighbors. If the absolute difference between the seed and a neighbor does not exceed a given threshold, the region is added to the group that represents the object and becomes a seed for a subsequent step. If for a neighbor the difference exceeds the threshold, the merging is stopped there. In the following, we formally describe the calculation of the motion feature as well as the saliency calculation and the merging process. The general data flow in the system is outlined in figure 1.
2.1 Motion Feature Magnitudes
The motion feature is calculated as described in  by applying the methods of region-based attention  on spatiotemporal slices. Spatiotemporal slices (xt and yt) are extracted from a pixel volume which is obtained by stacking input frames (10 per volume in our experiments). After this step, three stacks of 2-dimensional images are available: the xy stack, which contains the original input frames, the xt stack, and the yt stack. The xt and yt stacks encode the spatiotemporal behavior of objects, such as the horizontal and vertical components of motion, respectively. An exemplary volume, spatiotemporal slices, and how moving objects appear in them are shown in figure 2. For continuous input, such volumes are formed and processed one after the other.
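The stacking and slicing step can be sketched with NumPy array transposes; the axis conventions below (t first, then y, then x) are our assumption, not taken from the paper:

```python
import numpy as np

def build_slice_stacks(frames):
    """Stack input frames into a pixel volume and extract spatiotemporal slices.

    frames: list of n grayscale images, each of shape (h, w).
    Returns the xy stack (the original frames), the xt stack (one slice
    per image row) and the yt stack (one slice per image column).
    """
    volume = np.stack(frames, axis=0)      # shape (n, h, w): axes t, y, x
    xy_stack = volume                      # n slices of shape (h, w)
    xt_stack = volume.transpose(1, 0, 2)   # h slices of shape (n, w)
    yt_stack = volume.transpose(2, 0, 1)   # w slices of shape (n, h)
    return xy_stack, xt_stack, yt_stack

frames = [np.random.rand(48, 64) for _ in range(10)]  # ten frames per volume
xy, xt, yt = build_slice_stacks(frames)
```

Each xt slice fixes one image row and shows how its pixels move over time; each yt slice does the same for one image column.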
Region lists are created for each slice by applying a color segmentation procedure which groups similarly colored pixels to form regions (please refer to  or  for details regarding such segmentation methods, which have been used in region-based attention). Additionally, a label map that maps each position in the slice to a region is generated.
The motion feature is based on the concept of spatiotemporal receptive fields that respond to motion. The angle of an edge (or, in our case, a region) on a spatiotemporal slice is related to the motion of the corresponding object (see [23, 14] for details regarding this concept).
Thus, we determine the orientation for each region on a spatiotemporal slice. This is done by determining the first (lowest t-coordinate) and last (highest t-coordinate) row of pixels of each region. In these rows the center pixels p_first and p_last are located and the region's spatiotemporal angle α is obtained as

α = atan2*(t_last − t_first, x_last − x_first),

where atan2* calculates the arctangent and adjusts the result to give the angle between the line through p_first and p_last and the positive x axis. If α is close to 90°, the corresponding object did not move with regard to the motion component represented in the slice (xt or yt). Motion towards the left (or upwards, respectively) produces angles between 90° and 180°, while motion towards the right (or downwards, respectively) results in values between 0° and 90°.
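A minimal sketch of this angle computation, assuming region pixels are given as (t, x) coordinates on a slice (the adjusted-arctangent convention below, mapping a static region to 90°, is our reading of the text):

```python
import math

def spatiotemporal_angle(region_pixels):
    """Angle between the line through the centers of a region's first and
    last pixel rows (lowest/highest t) and the positive x axis, in degrees.

    region_pixels: list of (t, x) coordinates belonging to the region.
    90 deg means no motion; (90, 180] leftward, [0, 90) rightward motion.
    """
    t_min = min(t for t, _ in region_pixels)
    t_max = max(t for t, _ in region_pixels)
    first = [x for t, x in region_pixels if t == t_min]
    last = [x for t, x in region_pixels if t == t_max]
    cx_first = sum(first) / len(first)   # center of the first row
    cx_last = sum(last) / len(last)      # center of the last row
    # atan2(dt, dx) yields the angle to the positive x axis in [0, 180]
    return math.degrees(math.atan2(t_max - t_min, cx_last - cx_first))
```

For example, a region whose row centers stay at the same x over time yields 90° (no motion), while a region drifting towards larger x yields an angle below 90°.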
This spatiotemporal feature is used to calculate the region-based saliency on each spatiotemporal slice as described in the next section. Eventually the spatiotemporal saliency is brought back into the image domain to select the FOA.
2.2 Motion Saliency
The bottom-up motion saliency of a region R_i is computed as the sum of the distance-weighted, normalized differences between the angle associated with R_i and that of every other region, that is,

S(R_i) = Σ_{j≠i} w_{i,j} · |α_i − α_j| / 180°,

where the weight w_{i,j} (0 … 1) decreases with the distance between the centers of R_i and R_j and thus models that closer neighbors contribute more to a region's saliency. This motion saliency is obtained independently for the xt and yt stacks and is then transformed back into the image (xy) domain by applying algorithm 1.
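The pairwise contrast can be sketched as follows; the concrete distance fall-off 1/(1 + d) is a placeholder assumption, since the paper does not spell out the weighting function here:

```python
import math

def motion_saliency(regions):
    """Distance-weighted pairwise angle contrast for regions on one slice.

    regions: list of (center, angle) pairs; center = (t, x) coordinates,
    angle in degrees (90 = static). The weight decreases with the distance
    between region centers, so closer neighbors contribute more.
    """
    saliencies = []
    for i, (ci, ai) in enumerate(regions):
        s = 0.0
        for j, (cj, aj) in enumerate(regions):
            if i == j:
                continue
            d = math.dist(ci, cj)
            w = 1.0 / (1.0 + d)            # closer -> larger weight (assumed)
            s += w * abs(ai - aj) / 180.0  # normalized angle difference
        saliencies.append(s)
    return saliencies
```

A single region moving against static neighbors thus receives the highest saliency, since it contrasts with every other angle.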
The algorithm loops over all pixels within the bounding box of each region (which is obtained during the segmentation process). For each pixel, the label map is used to check whether it belongs to the current region. If this check is positive, the spatiotemporal saliencies are looked up in the xt and yt stacks at the corresponding positions. These values are summed up for the region and the result is normalized by the region size.
The result is a motion saliency value for every region that includes xt and yt contributions. Here, motion is the only contributing feature. However, spatial features such as color, orientation, and size, as well as top-down influences in the form of search targets or feature weighting (see ), can also be integrated.
The FOA for a frame is obtained as the region with the maximum overall saliency.
2.3 Grouping Regions at the FOA
In order to group similarly moving regions at the FOA to form a coherent object, we apply a strategy that starts with the most salient region as a seed and selects its immediate neighbors (this relationship is established during the segmentation) as candidates, which are added to a list of open regions. These are tested and, if they fulfill a set of conditions, they are added to the list that represents the object and their neighbors are added to the open list. Once a region has been tested, it is removed from the open list; the procedure terminates when the open list is empty.
The conditions to be fulfilled for a candidate region to be added to the object representation are the following:

(a) The absolute difference between the averaged spatiotemporal angles of the seed and the candidate must not exceed a threshold. The averaged angles are obtained by looping over all pixels of a region and summing and normalizing the xt and yt contributions as shown in algorithm 2. The threshold is fixed in our implementation.

(b) The averaged spatiotemporal angles of both the seed and the candidate must differ from the spatiotemporal angle of 90° (no motion) by at least a fixed threshold, to exclude contributions from noise.

(c) The size of the group of a candidate merged with all regions currently in the object (current size) must be less than a fixed factor times the current maximum size that has been discovered for the object on previous frames (this check is not performed for the first frame of each volume).
Thus, besides proximity, which is enforced by the neighborhood relation, condition (a) represents the check for a similar motion signature in the spatiotemporal domain. Condition (b) is the requirement for a minimum deviation of the spatiotemporal angle from that of a perfectly static object. The parameters can be regarded as thresholds below which the magnitudes of the spatiotemporal angles are considered not reliable. Condition (c) is a sanity check that ensures that an object cannot dramatically increase its size from one frame to the next. Note that this check assumes that the FOA focuses on the same object for the whole volume. Given the fact that we use short volumes of ten frames, this is usually the case.
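The seed-driven merging with conditions (a) and (b) can be sketched as a breadth-first region growing; the threshold values are placeholders, not the paper's settings, and the size sanity check (c) is omitted for brevity:

```python
from collections import deque

def group_regions_at_foa(foa, angle, neighbors, t_a=5.0, t_b=5.0):
    """Grow a multi-region object from the most salient region (the FOA).

    angle[r]: averaged spatiotemporal angle of region r in degrees
              (90 deg = no motion).
    neighbors[r]: set of regions adjacent to r (from the segmentation).
    t_a, t_b: thresholds for conditions (a) and (b); placeholder values.
    """
    obj = {foa}
    open_list = deque((foa, n) for n in neighbors[foa])
    while open_list:
        seed, cand = open_list.popleft()
        if cand in obj:
            continue
        similar = abs(angle[seed] - angle[cand]) <= t_a        # condition (a)
        moving = (abs(angle[seed] - 90.0) >= t_b
                  and abs(angle[cand] - 90.0) >= t_b)          # condition (b)
        if similar and moving:
            obj.add(cand)  # accepted region becomes a seed for its neighbors
            open_list.extend((cand, n) for n in neighbors[cand] if n not in obj)
    return obj
```

A chain of regions with gradually drifting angles can thus be merged step by step, while a static region (angle near 90°) stops the growth.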
Depending on the application, a frame-based inhibition of return (IOR) may be useful to attend further objects within a frame (as we demonstrate in section 3.2). To achieve this, after the first object at the FOA is selected, the saliency of the corresponding regions is tuned down and the procedure is repeated to select the next object on the current frame. Such an object-based IOR is highly advantageous compared to a region-based IOR. This can be seen in figure 3c where the second cycle discovers a new object while in figure 3b only another part of the same object is revealed. Note that inhibiting object selection over a range of frames requires location-based inhibition which could be implemented by selecting and inhibiting regions and their projected future locations (using the present information about their motion). Such a mechanism is not implemented in the current version.
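The two-cycle, object-based IOR can be sketched as follows; the grouping step is abstracted into a precomputed mapping from each region to its object, and zeroing the saliency stands in for "tuning it down":

```python
def attend_with_ior(saliency, object_of, cycles=2):
    """Select one object per cycle: pick the most salient region, take its
    grouped object, then inhibit all of the object's regions (object-based
    IOR) before the next cycle.

    saliency: dict region -> saliency value
    object_of: dict region -> frozenset of regions grouped with it
               (stands in for the motion-based grouping at the FOA)
    """
    sal = dict(saliency)
    attended = []
    for _ in range(cycles):
        foa = max(sal, key=sal.get)       # most salient region
        obj = object_of[foa]
        attended.append(obj)
        for r in obj:                     # inhibit the whole object
            sal[r] = 0.0
    return attended
```

With region-based IOR, only the FOA region itself would be inhibited, so the second cycle could land on another part of the same object; inhibiting the whole group avoids this.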
An important aspect of the proposed concept is that correspondence between a perceptual entity (region) and a low-level feature (here spatiotemporal angle, i.e., motion) is established “on demand” starting at the FOA. Other features, such as depth, may be integrated in the same manner, restricting their computation to the locations which have proven to be salient.
3 Experimental Results
In the tests described in the following sections, we run our algorithm with the parameters as described in section 2.3. The region-growing segmentation procedure, which is used as part of the computation of the motion feature and the motion saliency, is controlled by a set of thresholds. These define the allowed color difference between seed and candidates as well as between the current border and candidates. The concrete values of these thresholds depend on the hue–saturation–intensity values found in the images (e.g., different thresholds at different hues). However, the parameter set was kept constant for all tests. We compare our results with object selections by thresholding saliency maps from the motion saliency approach  on which the FOA determination in the proposed system is based. Furthermore, we use saliency maps generated by the model by Itti et al. [4, 3] (the ezvision software, obtained from http://ilab.usc.edu/toolkit in June 2013, was used to produce results for the saliency model by Itti et al. [4, 3] and the method of Walther and Koch ). In addition to the default version, which uses color, flicker, intensity, orientation, and motion channels (CFIOM), we generated saliency maps using only the motion channel (M), because we are dealing only with moving objects.
In the thresholding procedure, a threshold value is incremented stepwise and locations in the saliency maps with higher values are considered to belong to the object. This is then compared with ground truth from frames in which the moving foreground objects have been manually marked. The true positive rate is plotted against the false positive rate for the different saliency models, providing a rough context for judging the performance of the proposed algorithm and of the approach from Walther and Koch , which also aims at selecting objects at the FOA.
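This evaluation amounts to sweeping an ROC-style curve over each saliency map. A minimal sketch, assuming a binary ground-truth mask per frame:

```python
import numpy as np

def roc_curve(sal_map, ground_truth, steps=50):
    """Sweep a threshold over a saliency map and compute true/false
    positive rates against a binary ground-truth mask of the moving
    foreground objects."""
    tpr, fpr = [], []
    for thr in np.linspace(sal_map.min(), sal_map.max(), steps):
        selected = sal_map > thr                       # thresholded selection
        tp = np.logical_and(selected, ground_truth).sum()
        fp = np.logical_and(selected, ~ground_truth).sum()
        tpr.append(tp / max(ground_truth.sum(), 1))
        fpr.append(fp / max((~ground_truth).sum(), 1))
    return fpr, tpr
```

Plotting tpr against fpr for each model yields the comparison curves; a fixed object-selection method contributes a single operating point on the same axes.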
3.1 Test 1: One Moving Object
In the first scene, there is a single moving foreground object (a fish), which is reliably attended by the region-based motion saliency approach. We refer to , where this scene has also been processed with the motion saliency procedure; the results show that only small parts of the object are selected. In figures 4d and 4dd, results from the proposed procedure are shown; results from Walther and Koch  are shown in figures 4e and 4ee. The input and ground truth are shown in figure 4a, and saliency maps from  and  in figures 4b and 4c, respectively.
It can be seen that the proposed approach sometimes captures the entire object, but in general it is rather conservative, which is also reflected in a false positive rate close to zero (see figure 4f). The algorithm from  selects the full target more often (higher true positive rate), but also selects a substantial amount of background, which is reflected in a larger false positive rate. A reason for the rough and frequently too large shape boundaries of this method might be the low resolution of the internal feature maps used by the model.
Figure 4f also shows that the object selection methods on average perform similarly to their related saliency approaches at certain thresholds. Still, the object selection approaches are advantageous in practice when the optimal threshold is unknown or must be chosen conservatively.
3.2 Test 2: Two Moving Objects
The second test scene features two moving foreground objects (two fish). All parameters were kept the same as in the previous test, except that the proposed algorithm is allowed a second cycle per frame to make use of the object-based IOR (see figure 3). The model of Walther and Koch  was also applied twice for each frame. The timing of the attention system was adjusted so that—with the help of shape-based IOR (see )—two objects were selected on each frame. Again it can be seen (selections shown in figures 5d, 5dd, 5e, and 5ee; performance plot in figure 5f) that the proposed region-based method is more likely to miss parts of the targets, whereas the method by  returns the targets merged with some surrounding background. It also sometimes selects the plant, which is not a moving foreground object, probably because that method is not limited to objects in motion and the plant is highly contrasting in intensity.
4 Conclusion and Outlook
A conclusion drawn from our tests is that the region-based grouping strategy, which considers shared motion patterns, yields rather conservative results. In some frames (see figure 5ee) objects are selected in close agreement with their actual shape. The evaluated scenes were rather simple, containing no ego motion of the system. An additional test was performed with an object which is static with respect to the image, in front of a dynamic background (relative motion; the same scene can be seen in figure 10 in ). The test revealed that such situations are rather difficult for the current implementation because the background regions, which are crossed by the target, show noisy spatiotemporal angles (note that condition (a) described in section 2.3 had to be ignored, because the target was static). The model performed with 17 % true positives and no false positives. However, the model by Walther and Koch  also showed a weak performance for this scenario (26 % TP and 23 % FP). For future improvement of our system, we are investigating the optimization of segmentation methods for spatiotemporal slices and more stable calculations of the spatiotemporal angles.
In figure 6, we show results of object tracking that has been initialized with such a selection from the proposed system. A direction for future work is to determine the reliability of a selection and only feed good representations to higher-level processes such as object detection or tracking.
Further improvements could be gained by including additional features. In agreement with the proposed concept, depth could be estimated locally (e.g., by determining local stereo correspondences) starting at the FOA.
-  Rensink, R.A.: Change Blindness: Implications for the Nature of Visual Attention. Vision and Attention (2001) 169–188
-  Treisman, A.M., Gelade, G.: A Feature-Integration Theory of Attention. Cognitive Psychology 12(1) (1980) 97–136
-  Itti, L., Koch, C., Niebur, E.: A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE PAMI 20 (1998) 1254–1259
-  Itti, L.: Automatic Foveation for Video Compression Using a Neurobiological Model of Visual Attention. IEEE Trans. Img. Proc. 13(10) (2004) 1304–1318
-  Kouchaki, Z., Nasrabadi, A.M.: A Nonlinear Feature Fusion by Variadic Neural Network in Saliency-Based Visual Attention. In: VISAPP. Volume 1. (2012) 457–461
-  Cui, X., Liu, Q., Metaxas, D.N.: Temporal Spectral Residual: Fast Motion Saliency Detection. In: ACM Multimedia’09. (2009) 617–620
-  Bruce, N.D.B., Tsotsos, J.K.: Saliency Based on Information Maximization. In: NIPS. (2005)
-  Aziz, M.Z., Mertsching, B.: Fast and Robust Generation of Feature Maps for Region-Based Visual Attention. IEEE Trans. Img. Proc. 17 (2008) 633–644
-  Perazzi, F., Krahenbuhl, P., Pritch, Y., Hornung, A.: Saliency filters: Contrast Based Filtering for Salient Region Detection. In: IEEE CVPR. (2012) 733–740
-  Cheng, M.M., Zhang, G.X., Mitra, N.J., Huang, X., Hu, S.M.: Global Contrast Based Salient Region Detection. In: IEEE CVPR. (2011) 409–416
-  Aziz, M.Z., Mertsching, B.: Visual Search in Static and Dynamic Scenes Using Fine-Grain Top-Down Visual Attention. In: ICVS. Volume 5008. (2008) 3–12
-  Tünnermann, J., Mertsching, B.: Region-Based Artificial Visual Attention in Space and Time. Cognitive Computation (2013) doi: 10.1007/s12559-013-9220-5.
-  Tünnermann, J., Born, C., Mertsching, B.: Top-Down Visual Attention with Complex Templates. In: VISAPP. Volume 1. (2013) 370–377
-  Belardinelli, A., Pirri, F., Carbone, A.: Attention in Cognitive Systems. Springer (2009) 112–123
-  Tünnermann, J., Mertsching, B.: Continuous Region-Based Processing of Spatiotemporal Saliency. In: VISAPP. Volume 1. (2012) 230–239
-  Tsotsos, J.K., Liu, Y., Martinez-Trujillo, J.C., Pomplun, M., Simine, E., Zhou, K.: Attending to Visual Motion. Computer Vision and Image Understanding 100(1-2) (2005) 3–40
-  Frintrop, S., Rome, E., Christensen, H.I.: Computational Visual Attention Systems and Their Cognitive Foundations. ACM Transactions on Applied Perception 7(1) (2010) 1–39
-  Borji, A., Itti, L.: State-of-the-art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1) (2013) 185–207
-  Walther, D., Koch, C.: Modeling Attention to Salient Proto-Objects. Neural Networks 19(9) (2006) 1395–1407
-  Zappella, L., Lladó, X., Salvi, J.: Motion Segmentation: A Review. In: Proceedings of the 2008 Conference on Artificial Intelligence Research and Development: Proceedings of the 11th International Conference of the Catalan Association for Artificial Intelligence. (2008) 398–407
-  Aziz, M.Z., Shafik, M.S., Mertsching, B., Munir, A.: Color Segmentation for Visual Attention of Mobile Robots. In: Proceedings of the IEEE Symposium on Emerging Technologies, 2005. (2005) 115–120
-  Backer, M., Tünnermann, J., Mertsching, B.: Parallel k-Means Image Segmentation Using Sort, Scan and Connected Components on a GPU. In Keller, R., Kramer, D., Weiss, J.P., eds.: Facing the Multicore-Challenge III. Volume 7686 of Lecture Notes in Computer Science. Springer (2013) 108–120
-  Adelson, E.H., Bergen, J.R.: Spatiotemporal Energy Models for the Perception of Motion. J. Opt. Soc. Am. A 2(2) (1985) 284–299
-  Kalal, Z., Matas, J., Mikolajczyk, K.: P-N Learning: Bootstrapping Binary Classifiers by Structural Constraints. In: IEEE CVPR. (2010) 49–56