Fast and Accurate Light Field Saliency Detection through Feature Extraction

Fast and Accurate Light Field Saliency Detection through Feature Extraction

Abstract

Light field saliency detection—important due to utility in many vision tasks—still lack speed and can improve in accuracy. Due to the formulation of the saliency detection problem in light fields as a segmentation task or a “memorizing” tasks, existing approaches consume unnecessarily large amounts of computational resources for (training and) testing leading to execution times is several seconds. We solve this by aggressively reducing the large light-field images to a much smaller three-channel feature map appropriate for saliency detection using an RGB image saliency detector. We achieve this by introducing a novel convolutional neural network based features extraction and encoding module. Our saliency detector takes s to process a light field of size in a CPU and is significantly faster than existing systems, with better or comparable accuracy. Our work shows that extracting features from light fields through aggressive size reduction and the attention results in a faster and accurate light-field saliency detector.

\name

Sahan Hemachandra , Ranga Rodrigo, Chamira Edussooriya \addressDepartment of Electronic and Telecommunication Engineering,
University of Moratuwa, Sri Lanka {keywords} Light fields, saliency detection, feature extractor, fast algorithms, convolutional neural networks.

1 Introduction

Light fields capture both spatial and angular information of light emanating from a scene compared to spatial-only information captured by images. The additional angular information available with light fields paves the way for novel applications such as post-capture refocusing [11, 3] and depth-based filtering [2, 10], which are not possible with images. Furthermore, light fields support numerous computer vision tasks which are traditionally based on images and videos [16, 17].

Saliency detection is a prerequisite for many computer vision tasks such as semantic segmentation, image retrieval, and scene classification. Saliency detection using light fields provides better accuracy compared to what is provided by RGB images, in particular, for challenging scenes having similar foreground and background, and complex occlusions [8, 19]. However, data available with light fields are significantly higher than data available with a single RGB image, e.g., a light field having sub-aperture images contains times more data (with the same resolution). Therefore, computational time of light field saliency detection algorithms are substantially higher compared to those of RGB image saliency detection algorithms [19].

We can categorize existing light field saliency detectors in to three classes depending on the input: focal stack and all focus images[15], RGB-D images and light fields[18][21]. Recent algorithms of these categories predominately use convolutional neural networks (CNNs) to learn the relationship between the image features and saliency of light fields. Even though, the available light field datasets are limited in size, we can freely augment focal stack and RGB-D data for the first two classes. On the other hand, inability to freely augment light field images prevents training deep CNNs from scratch for the third class. These constraints demand the use of pre-trained networks, of course, followed by fine tuning.

In this paper, we propose a novel feature extraction and encoding (FEE) module for fast light field saliency detection by employing an two-dimensional (2-D) RGB image saliency detection algorithm. Our FEE module takes light field as the input (so, belongs to the third class), and provides an RGB encoded feature map. The proposed FEE module comprises of a CNN with five convolutional layers. We employ the 2-D saliency detector proposed in [24] with our FEE module. Furthermore, we employ the LYTRO ILLUM saliency detection dataset [18] for the training and testing the performance of our light field saliency detector. Experimental results obtained with five-fold cross validation confirms that our saliency detector provides a significant improvement in computational time with accuracy comparable or better than state-of-the-art light field saliency detectors [18, 21].

2 Related Works

2.1 Saliency Detection on Light Fields

Light field saliency detection [8] improves the accuracy of saliency detection in challenging scenes having similar foreground/background and complex occlusions. This improvement achieves in [8] exploiting the refocusing capability available with light fields which provides focusness, depths, and objectness cues. [19] employs depth map, all focus image and focal stack available with a light field for saliency detection. [20] further exploits light field flow fields over focal slices and multi-view sub-aperture images improve the accuracy in saliency detection by enhancing depth contrast. [6] employs a dictionary learning based method to combine various light field features for a universal saliency detection framework using sparse coding. This method handles various types of input data by building saliency and non-saliency dictionaries using focusness cues of focus stack as features for light fields. All these methods works on super-pixel level features of light fields, and do not exploit high-level semantic information properly in order to have robust performance in complex scenarios.

2.2 Deep Learning for Saliency Detection

There is rich body of work in saliency detection in RGB images: pyramidal, feature-based, recurrent network based, and attention based. Most non-recurrent methods use VGG-16- or VGG-19-like feature extractors [12] pre-trained on ImageNet dataset for feature extraction. Pyramidal saliency detectors [22, 5, 24] have the advantage of the ability to use information from multiple layers. Some that build up on CNN feature computers defer the actual saliency detection to latter layers or combine features from many layers [23, 9]. Methods that employ recurrent networks generally work well [14, 4] with the possible disadvantage of slowness. RGB saliency detectors greatly benefit from attention models, by focusing on features that truly capture saliency without the interference of unnecessary features.

Although these methods show success in RGB images, they are unsuitable for direct use with light field images because their architecture and input are not specifically designed to extract the geometry information of light fields embedded in angular dimensions. This information is vital to improve the quality of predicted saliency maps.

2.3 Light Field Saliency Detection with Deep Learning

Recent advances in light field saliency detection successfully use deep networks. However, in general, the light field propagates further into the network, which is a major hindrance to speed. [15] has introduced a two-stream neural network architecture with two VGG-19 feature extractors and ConvLSTM based attention module to process the all focus image and focal stack to generate the saliency maps. Similarly, [21] has used a multi-task collaborative network(MTCN) for light field saliency detection with two streams for central view image and multi-view images by exploring the spatial, depth and edge information. [18] has introduced a “model angular changes block” to process light field images with a modified version of Deeplabs v-2 segmentation network(LFNet), which is a computationally heavy backbone, considering the similarity between the segmentation and saliency detection. On the other hand, the suitability of a semantic segmentation network, not specifically trained on light fields, may affect accuracy. All these methods have the inherent disadvantage of slowness due to use of heavy segmentation networks, Several feature extractors, recurrent blocks and several streams.

3 Light-Field Saliency Detection Architecture

Speeding-up light-field saliency detection require avoiding computationally heavy one or more backbones and predominantly working in bulky light-field features maps. On the other hand, inability to freely augment light field images prevent training deep light field saliency detectors from scratch. These constraints demand using a pre-trained network (of course, followed by fine tuning). There are well-known pre-trained networks that detect saliency in 2-D RGB images. In this paper we propose a FEE module that can be integrated into 2-D saliency detectors without any architectural changes to the base model, to extract and encode the features in light fields. Fig. 1 shows an overview of the architecture our system. The input to this neural network is a light field of size in the form of a micro-lens image array of of size , where and . Here, denotes the spatial resolution and denotes the angular resolution. Then the extracted feature maps can be fed into the 2-D saliency detector to get the saliency maps. This whole network can be trained end-to-end manner after the integration.

LF fieature extraction and encoding (FEE) block

VGG-16

CPFE

CA

Conv 5-3

Conv 4-3

Conv 3-3

conv.

US

Low-level features

Conv 1-2

Conv 2-2

conv.

SA

conv.

Loss

Groud truth

Figure 1: System architecture: light field (LF) feature extraction and encoding (FEE) block receives the LFs and computes features. Spatial attention block (SA) and channel-wise attention block (CA) receives low level (Conv 1-2 and Conv 2-2) and high level (Conv3-3, Conv 4-3 and Conv 5-3) features, respectively. VGG-16, or a similar block, produces these feature maps. Note that LF processing happens only in the light field (LF) FEE block. CA block gives attention to more informative kernel outputs. CPFE: context aware pyramid feature extraction.

3.1 2-D Saliency Detector

Task of saliency detection in regular images is similar to binary semantic segmentation, and for this task requires both high level contextual information and low level spatial structural information. However, all of the high-level and low-level features are not suitable for saliency detection, and some features might even cause interference [24]. An attention mechanism can avoid such situations.

The 2-D saliency detector proposed by [24] is such a system which we select as the saliency detector. This work especially uses channel wise attention module (CA) for high-level feature maps and spatial attention (SA) module for low-level feature maps with edge preserving loss function to preserve the edges of a saliency map. Along with the CA and SA modules, the pyramid feature network of the architecture leads to the state-of-the-art accuracy for RGB image saliency detection. However, using a light field to feed the input to a 2-D saliency detector is ineffective as angular information of the light field gets lost. We solve this problem by using a carefully designed novel light field FEE module integrated in to the input of the network. Due to the limited space, we do not describe the architecture of the 2-D saliency detector, and the we refer the reader to [24] for more details.

3.2 Novel Feature Extraction Block

The 2-D saliency detector accepts inputs with resolution of and produces saliency maps with resolutions of . Starting from this, our FEE module must extract and encode the pixel-wise angular information stored in an light field and produce and RGB image. In order to do that, by arranging a light field as a 2-D image of size , we run a kernel with the stride of to exploit the angular information related to each pixel as mentioned in [18]. Here, we consider the light fields in the LYTRO ILLUM [18], where and leading to and . Because our light filed saliency detector shown in Fig. 1 processes light fields only in the FEE module and prevents subsequent processing in the 2-D saliency detector, we can achieve significant saving of computational time.

The FEE module (the very first block in in Fig. 1) is the key component that leads to significant speed improvements. It aggressively down samples the LF and encodes it with features suitable to be fed to a regular CNN. As the input LF is a micro lens array image, adjacent pixels in the first block comprises the first pixel of each of the 81 sub-aperture images. Therefore, by using a stride of we capture the same pixel for all the sub-aperture images at each convolution step. Following this, we designed the hidden convolutional layers choosing layer-size parameters to be compatible with the VGG-16 1 network with decreasing number of filters at each layer to encode the light field in to a feature map of resolution.

4 Experimental Results

We employ the LYTRO ILLUM [18] dataset in the experiments with a computing platform comprising of an Intel Core i9-9900K (3.60 GHz) CPU, GB RAM and Nvidia RTX-2080TI GPU. Note that even though two other light field saliency datasets, HFUT-Lytro [20] and LFSD [7], are available, they are not suitable for evaluation of our light field saliency detector because of the differences in light field representations. There are light fields in the LYTRO ILLUM dataset, and we compare the performance the proposed light field saliency detector with the state-of-the-art light field saliency detectors LFNet [18] and MTCN [21] in terms of the accuracy achieved with five-fold cross validation and computational time.

4.1 Implementation and Training of the Proposed Light Field Saliency Detector

To facilitate the proposed FEE module to encode a light field into feature map, we crop the initial micro-lens array image of size into four images of size , cropping with diffident borders. This leads to a dataset of light fields and we incorporate data augmentation, such as random rotations of and , random brightness, saturation and contrast changing, and random shuffling of the colour channels without affecting the angular information available with a light field. We train our saliency detector in three steps: first training 2-D saliency detector [24] using DUTS-TR [13] dataset, then training the FEE module with the overall architecture shown in Fig 1 using the light field dataset with the 2-D saliency detector frozen for 10 epochs, and lastly training both FEE module and the 2-D saliency detector for another 40 epochs. For all the training, we employ the SGD optimizer with a momentum of , decay of , and initial learning rate of with a batch size . We use the loss function [24]

where is the predicted saliency map, is the ground truth saliency map, is the batch size, and  [24].

4.2 Comparison with State-of-the-Art Light Field Saliency Detectors

We employ the evaluation metrics measure (with as suggested in[1], mean absolute error (MAE), and measure, where is a weighting function, to compare the performance of the saliency detectors. We present the performance achieved with the proposed, LF Net [18] and MTCN [21] light field saliency detectors in Table 1. Accordingly, performance of our saliency detector is superior compared to LFNet while is slightly behind compared to MTCN in terms of all the three metrics. We show the saliency maps of four light fields obtained with the proposed and LFNet saliency detectors in Fig. 2 for qualitative comparison. Our saliency maps are closer to the ground truth compared to those of LFNet.

Metric LFNet [18] MTCN [21] Ours
0.8116 0.8729 0.8558
0.7540 0.8534 0.7671
MAE 0.0551 0.0483 0.0541
Table 1: Comparison with state-of-the-art saliency detectors. Our results surpass LFNet [18], and are slightly behind MTCN [21].
Method i9-9900K RTX-2080TI
LFNet[18] 10.4813 s 0.5321 s
MTCN[21] - 0.3989 s
Ours 0.4175 s 0.2381 s
approximated value
Table 2: Computational time required to process a light field: our saliency detector is significantly faster than state-of-the-art light field saliency detectors.

We present the computational time required by each light field saliency detector to process a light field in the LYTRO ILLUM dataset. Our saliency detector is times faster than the LFNet in the CPU implementation, and require and less time compared to LFNet and MTCN, respectively, for GPU implementation. Here, we present an approximated value for MTCN obtained based on the computational time reported in [21] ( s) for an implementation using a Nvidia Tesla P100 GPU.

(a) Center SAI
(b) GT
(c) LFNet [18]
(d) Ours
Figure 2: Comparison of saliency maps: (a) centre sub-aperture image (SAI) of the light field, (b) ground truth (GT), (c) LFNet results [18], (d) our results. Our saliency maps are closer to the ground truth compared to those of LFNet.

5 Conclusion and Future Work

We proposed a fast and accurate light field saliency detector that feeds carefully computed light field features to a saliency detector with an attention mechanism. It is fast and runs on an i9 CPU at approximately light fields/s and on a 2080TI GPU at 4 light fields/s. Its accuracy surpasses most of the existing methods, and is only slightly inferior to a very recent work. The speed is due to faster feature extraction constraining the light-field processing to just this FEE module and using a single stream without resorting to recurrent networks. The high accuracy is due to the light-field saliency specific feature extractor and the use of an attention mechanism. Our works brings light field saliency detection closer to real-time implementations which would enable, e.g., cameras to refocus on objects of interest. Future directions include making the network faster and more accurate by changing or improving the 2-D detector backbone and FEE module. Adapting this method to other computer vision tasks which benefit from the angular information embedded in the light fields and lack reasonably-sized datasets—such as, material recognition, segmentation, and object detection—which use 2-D-input neural networks would also be interesting.

Footnotes

  1. VGG-16 is just one choice of the back bone. Other backbones, e.g., ResNets are also suitable.

References

  1. R. Achanta, S. Hemami, F. Estrada and S. Susstrunk (2009) Frequency-tuned salient region detection. In IEEE Conf. Comput. Vision and Pattern Recogn, pp. 1597–1604. Cited by: §4.2.
  2. D. Dansereau and L. T. Bruton (2007-Feb.) A 4-D dual-fan filter bank for depth filtering in light fields. IEEE Trans. Signal Process. 55 (2), pp. 542–549. Cited by: §1.
  3. D. G. Dansereau, O. Pizarro and S. B. Williams (2015-Feb.) Linear volumetric focus for light field cameras. ACM Trans. Graph. 34 (2), pp. 15:1–15:20. Cited by: §1.
  4. J. Kuen, Z. Wang and G. Wang (2016) Recurrent attentional networks for saliency detection. In Proc. of the IEEE Conf. comput. Vision and Pattern Recogn., pp. 3668–3677. Cited by: §2.2.
  5. G. Li and Y. Yu (2016) Deep contrast learning for salient object detection. In Proc. of the IEEE Conf. Comput. Vision and Pattern Recogn., pp. 478–487. Cited by: §2.2.
  6. N. Li, B. Sun and J. Yu (2015) A weighted sparse coding framework for saliency detection. In Proc. of IEEE Conf. Comput. Vision and Pattern Recogn., pp. 5216–5223. Cited by: §2.1.
  7. N. Li, J. Ye, Y. Ji, H. Ling and J. Yu (2014) Saliency detection on light field. In Proc. of IEEE Conf. Comput. Vision and Pattern Recogn., pp. 2806–2813. Cited by: §4.
  8. N. Li, J. Ye, Y. Ji, H. Ling and J. Yu (2017-Aug.) Saliency detection on light field. IEEE Trans. Pattern Anal. Mach. Intell. 39 (8), pp. 1605–1616. Cited by: §1, §2.1.
  9. N. Liu, J. Han and M.H. Yang (2018) Picanet: learning pixel-wise contextual attention for saliency detection. In Proc. of the IEEE Conf. Comput. Vision and Pattern Recogn., pp. 3089–3098. Cited by: §2.2.
  10. N. Liyanage, C. Wijenayake, C. Edussooriya, A. Madanayake, P. Agathoklis, L. Bruton and E. Ambikairajah (2020-Feb.) Multi-depth filtering and occlusion suppression in 4-D light fields: algorithms and architectures. Signal Process. 167, pp. 1–13. Cited by: §1.
  11. R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz and P. Hanrahan (2005) Light field photography with a hand-held plenoptic camera. Computer Science Technical Report Stanford Univ.. Cited by: §1.
  12. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learning Represent., pp. 1–14. Cited by: §2.2.
  13. L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In Proc. of the IEEE Conf. Comput. Vision and Pattern Recogn., pp. 136–145. Cited by: §4.1.
  14. L. Wang, L. Wang, H. Lu, P. Zhang and X. Ruan (2016) Saliency detection with recurrent fully convolutional networks. In Proc. European Conf. Comput Vision, pp. 825–841. Cited by: §2.2.
  15. T. Wang, Y. Piao, X. Li, L. Zhang and H. Lu (2019-10) Deep learning for light field saliency detection. In Proc. of the IEEE/CVF Int. Conf. Comput. Vision, pp. 8838–8848. Cited by: §1, §2.3.
  16. J. Yu (2017) A light-field journey to virtual reality. IEEE MultiMedia 24 (2), pp. 104–112. Cited by: §1.
  17. N. Zeller, F. Quint and U. Stilla (2017) From the calibration of a light-field camera to direct plenoptic odometry. IEEE J. Sel. Topics Signal Process. 11 (7), pp. 1004–1019. Cited by: §1.
  18. J. Zhang, Y. Liu, S. Zhang, R. Poppe and M. Wang (2020) Light field saliency detection with deep convolutional networks. IEEE Trans. Image Process. 29 (), pp. 4421–4434. Cited by: §1, §1, §2.3, §3.2, 1(c), Figure 2, §4.2, Table 1, Table 2, §4.
  19. J. Zhang, M. Wang, J. Gao, Y. Wang, X. Zhang and X. Wu (2015) Saliency detection with a deeper investigation of light field.. In IJCAI, pp. 2212–2218. Cited by: §1, §2.1.
  20. J. Zhang, M. Wang, L. Lin, X. Yang, J. Gao and Y. Rui (2017) Saliency detection on light field: a multi-cue approach. ACM Trans. Multimedia Comput., Commun., and Appl. 13 (3), pp. 1–22. Cited by: §2.1, §4.
  21. Q. Zhang, S. Wang, X. Wang, Z. Sun, S. Kwong and J. Jiang (2020) A multi-task collaborative network for light field salient object detection. IEEE Trans. Circuits Syst. Video Technol.. Cited by: §1, §1, §2.3, §4.2, §4.2, Table 1, Table 2, §4.
  22. X. Zhang, T. Wang, J. Qi, H. Lu and G. Wang (2018) Progressive attention guided recurrent network for salient object detection. In Proc. of the IEEE Conf. Comput. Vision and Pattern Recogn., pp. 714–722. Cited by: §2.2.
  23. R. Zhao, W. Ouyang, H. Li and X. Wang (2015) Saliency detection by multi-context deep learning. In Proc. of the IEEE Conf. Comput. Vision and Pattern Recogn., pp. 1265–1274. Cited by: §2.2.
  24. T. Zhao and X. Wu (2019) Pyramid feature attention network for saliency detection. In Proc. of IEEE Conf. Comput. Vision and Pattern Recogn., pp. 3085–3094. Cited by: §1, §2.2, §3.1, §3.1, §4.1.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
420068
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description