Core Sampling Framework for Pixel Classification
The intermediate map responses of a Convolutional Neural Network (CNN) contain information about an image that can be used to extract contextual knowledge about it. In this paper, we present a core sampling framework that is able to use these activation maps from several layers as features to another neural network using transfer learning to provide an understanding of an input image. Our framework creates a representation that combines features from the test data and the contextual knowledge gained from the responses of a pretrained network, processes it, and feeds it to a separate Deep Belief Network. We use this representation to extract more information from an image at the pixel level, hence gaining understanding of the whole image. We experimentally demonstrate the usefulness of our framework using a pretrained VGG-16 model to perform segmentation on the BAERI dataset of Synthetic Aperture Radar (SAR) imagery and the CAMVID dataset.
Pixel-wise prediction/classification has many applications in scene understanding. It helps to generate a finer-grained segmentation than existing techniques that segment at a coarser level. The rise of machine learning has led to novel, automatic segmentation techniques that require very little user input. Deep learning, in particular, has enabled this, as it makes it possible to learn data representations without supervision.
Core sampling has been used in engineering and science to understand the properties of natural materials, the climatic record from ice cores, etc. Convolutional Neural Networks (CNNs), which are particularly useful for image understanding, work on parts of the input locally, usually at different pyramidal levels. In doing so, the network gathers both low-level and high-level information about an input image. The lower layers encode the pixels, while the higher layers provide representations of the objects composed of those pixels, which eventually help in understanding the entire image. Pixel-wise classification and image understanding can be improved with the local and global information encoded in the different layers of a CNN; this information, stacked at different pyramidal levels, can be viewed as a core sample that enables better understanding of an image. A CNN consists of different layers that convolve the input image and/or parts of the image with filters. Training a CNN involves determining what these filters need to be in order to get the desired output for a given input. We can obtain map responses by feeding test images through these (pre)trained layers. Fig. 3 shows some of the map responses that we get when we feed an input image of a cow through the various layers of a CNN.
In this paper, we present a core sampling framework that is able to use these activation maps from several layers as features to another neural network using transfer learning to provide an understanding of an input image. Our framework creates a representation that combines features from the test data and the contextual knowledge gained from the responses of a pretrained network, processes it and feeds it to a separate Deep Belief Network. We use this representational model to extract more information from an image at the pixel level, thereby gaining understanding of the whole image.
Image pyramids have already been used to extract or learn more information from images as well as for image segmentation. In a sense, the responses from the layers of a CNN can also be viewed as different pyramidal levels of the image, viewed at different locations of that image. The notion of a hypercolumn, introduced by Hariharan et al., builds upon this. The term hypercolumn describes a column of map responses from the convolution layers of a network, aligned such that each value in the column is the output of a different map for an individual pixel, as shown in Fig. 4. In our framework, we accumulate such hypercolumns from each pixel of every training image and use them as training data for a deep belief network.
Transfer learning allows the use of the knowledge gained from solving one problem to improve the solution to another. Our core sampling framework makes use of transfer learning, where the core sample acquired from a previously trained network is used for training a second network. This strategy helps increase the overall performance and speeds up training. The other advantage of this approach is that it avoids the need to train a very large network on a huge dataset. Often there is not enough training data for a particular task; this can be overcome by using the knowledge learned from a similar task or a different task.
The two-stage architecture of our core sampling framework is given in Figure 1. We use the VGG-16 model to bootstrap our framework. This model has been trained on the ImageNet dataset, which consists of more than a million training images and 1000 classes. The intermediate maps that are generated during the testing phase when images are input to this model, when stacked together and resized to a uniform size, form hypercolumns. A hypercolumn for a pixel in an input image is a vector with $N$ components, where $N$ is the number of intermediate maps in the VGG-16 model, with each component of the vector being the value of one map at that pixel. A hypercolumn does not preserve any spatial correlation between the constituent maps. A core is a collection of hypercolumns, one per pixel in an input image. A random sample drawn from a core is called a core sample. Core samples generated from input images are fed to the second stage of our framework.
The second stage of our framework consists of a Deep Belief Network (DBN). Deep Belief Networks are unsupervised deep learning models. Since no spatial correlation among the maps is preserved in the hypercolumns comprising the input core samples, a CNN cannot be used in the second stage, as the filters in a CNN presume spatial correlation between adjacent maps. The DBN interprets the input core samples to provide an understanding of the original input image.
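The definitions of a core and a core sample given above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the maps are taken to be already resized to the image size, and the helper names `build_core` and `draw_core_sample` are hypothetical, not part of the framework's code.

```python
import numpy as np

def build_core(maps):
    """Stack N maps (each H x W) into a core with one hypercolumn per pixel."""
    stacked = np.stack(maps, axis=-1)      # shape (H, W, N)
    h, w, n = stacked.shape
    return stacked.reshape(h * w, n)       # row k is the hypercolumn of pixel k

def draw_core_sample(core, num_pixels, rng):
    """Draw a random subset of hypercolumns (a core sample), without replacement."""
    idx = rng.choice(core.shape[0], size=num_pixels, replace=False)
    return idx, core[idx]

# Toy example: 5 random "maps" of a 4 x 6 image (values are placeholders).
rng = np.random.default_rng(0)
maps = [rng.random((4, 6)) for _ in range(5)]
core = build_core(maps)                    # 24 pixels, 5 features each
idx, sample = draw_core_sample(core, 8, rng)
```

Flattening the spatial dimensions makes explicit that each hypercolumn is treated as an independent data point, which is why the second stage does not rely on spatial structure.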
This paper makes the following contributions.
We present a novel core sampling framework that is able to use activation maps from several layers of a CNN as features to another neural network using transfer learning to provide an understanding of an input image. It creates a representation that combines features from the test data and the contextual knowledge gained from the responses of a pretrained network. This model can be used to extract more information from an image at the pixel level, thereby gaining understanding of the whole image.
II Related Work
CNNs have been used to extract information from images; deep CNNs have enabled recognition of objects in images with high accuracy without any human intervention. There has been some research on using the information acquired from the intermediate layers of a CNN to solve tasks such as classification, recognition, segmentation, or a combination of these. Image segmentation has been studied for decades. Haralick and Shapiro describe classical image segmentation techniques such as thresholding, multidimensional space clustering, region growing, etc. Comaniciu and Meer use the mean shift algorithm to provide automatic segmentation, where human intervention is needed to choose the class of a segment. There are well-known graph-based algorithms for image segmentation, e.g., Normalized Cuts, where segmentation is achieved by measuring the similarity of graph partitions, and Graph Cuts, where a segmentation is defined as a set of regions that are repeatedly combined based on the similarity between neighboring regions. Rother et al. implement iterative estimation on top of the graph cuts algorithm to define a boundary for the segmentation of objects, where a user selects the broader region.
Recent approaches to segmentation also include various ways to use a CNN to segment images. Girshick et al. use Region-based CNNs (R-CNNs), where category-independent region proposals are defined during the pre-processing stage and input to a CNN to generate feature vectors. A linear Support Vector Machine (SVM) is then used to classify the regions. Unsupervised sparse auto-encoders have been used by Chen and Wang on Synthetic Aperture Radar (SAR) data to classify different types of vehicles. They only deal with classification of images already segmented into smaller regions containing the objects, whereas we deal with segmentation by classification at the pixel level. Ladicky et al. use an energy function for a Conditional Random Field (CRF) model, which they use to aggregate results from different recognizers. Zhang et al. recover dense depth map images and other information about the frames in video sequences, such as height above ground, global and local parity, surface normal, etc. They use graph cut based optimization and decision forests to evaluate their features. The need for video sequences limits their application. Both of the previous approaches need manual feature extraction. Another approach is to use deconvolution layers after the convolution layers as a way to reconstruct segmented images, as has been done in prior work. SegNet uses an encoder architecture similar to VGG-16. The decoder is constructed by removing the fully connected layers and adding deconvolution layers. It is used to transform low resolution maps to high resolution ones. The original paper that proposed the notion of hypercolumns also uses the maps from the intermediate layers of a CNN to segment and localize objects in images. They use a location specific approach; in particular, they use $K \times K$ classifiers across different positions of the image, where $K$ is a constant. A linear combination of these classifiers is used to classify each pixel. They also use bounding boxes of size 50 x 50 and try to predict the heat map to localize objects in an image.
Deep Belief Networks (DBNs) consist of multiple layers of stochastic, latent variables trained using an unsupervised learning algorithm, followed by a supervised learning phase using feedforward backpropagation-based Neural Networks. In the unsupervised pre-training stage, each layer is trained using a Restricted Boltzmann Machine (RBM). Once trained, the weights of the DBN are used to initialize the corresponding weights of a Neural Network. A Neural Network initialized in this manner converges much faster than an otherwise uninitialized one.
Du et al. use transfer learning for segmentation of hyperspectral images by training on an unlabeled dataset and then using transfer learning to improve separability based on the learned knowledge.
III Core Sampling Framework for Pixel Classification
We use hypercolumns, introduced by Hariharan et al., as a data structure for representing the layer outputs from a CNN. The first few layers are used for accurate localization of an object, and the layers close to the output layer help to distinguish between different objects. We use the pre-trained VGG-16 model for bootstrapping the core sampling framework. Fig. 2 shows the different layers of the VGG-16 network. The architecture includes pooling layers between the convolution layers and fully connected layers at the end. This network is trained on the ImageNet dataset, which contains a large variety of objects. This makes it well suited for constructing a framework that works for a variety of datasets. Our framework uses the intermediate maps that are acquired during the testing phase when images are input to this model. The individual pixels of the intermediate maps, when stacked together and resized to a uniform size, form hypercolumns. Each pixel's value, combined with the map values produced using the pretrained model, is used as a data point. The map values are thus the features for their respective pixels. Even when using the maps from just 5 convolutional layers of the pretrained model, the number of maps per image is already around 1500. Because the core sampling data (the processed output data from the pretrained network, described in III-C) gets large, we handle it in two ways: a) saving the entire data to the hard disk, and b) using the data from a randomly sampled subset of pixels to train the DBN in the second stage.
III-A Preprocessing and Data Augmentation
The input to the pre-trained VGG-16 model needs to be of the size 224 by 224. The BAERI dataset (see Section III-F) consists of raw images at inconsistent intensity levels and variable image sizes. Resizing the images would create images that are at different scales. So we added padding around smaller images to create 224 by 224 images. Because we did not resize them, the scale information remained intact. For the same reason, we created sub-images (tiles) from larger images before extracting the map responses. These image tiles are created by using a sliding window of 224 by 224 and a smaller stride size. As before, there is no resizing of images; hence there is no need for scale normalization. We also generated more data by varying contrast to improve robustness to images that the framework might not have seen and to create more training data for the next stage. The map responses, which are now used as features, are individually normalized and the same normalization parameters are used for the corresponding features during testing. All the images in the CAMVID dataset are of the same size (480 x 360) and are at the same scale, being a standard dataset; hence not much preprocessing or data augmentation needs to be done.
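The padding and tiling steps described above can be sketched as follows. This is a minimal sketch, not the authors' code: the stride of 112, zero padding at the bottom and right edges, and the function names `pad_to_tile` and `sliding_tiles` are all assumptions made for illustration. For simplicity, a trailing strip narrower than one stride is not covered.

```python
import numpy as np

def pad_to_tile(image, tile=224):
    """Zero-pad an image smaller than the tile size (no resizing, so scale is kept)."""
    h, w = image.shape[:2]
    pad = [(0, max(tile - h, 0)), (0, max(tile - w, 0))]
    pad += [(0, 0)] * (image.ndim - 2)      # leave any channel axis untouched
    return np.pad(image, pad, mode="constant")

def sliding_tiles(image, tile=224, stride=112):
    """Cut tile x tile sub-images with a sliding window; stride is a free parameter."""
    image = pad_to_tile(image, tile)
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, stride):
        for left in range(0, w - tile + 1, stride):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles

small = np.ones((100, 150))                 # smaller than the network input
padded = pad_to_tile(small)
big = np.zeros((448, 336))                  # larger than the network input
tiles = sliding_tiles(big)
```

Because neither function rescales pixel values or geometry, objects keep the same size in pixels across tiles, which is the property the text relies on.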
III-B Response Maps
The layers of a deep neural network learn different features at different layers or combinations of layers. The first layer of the network learns features that are similar to Gabor features or color blobs. The deeper layers help to discriminate objects and parts of objects while losing spatial and local information. Hence, the combination of maps at different layers helps to capture the spatial as well as the discriminative features. It has been pointed out that removing the fully connected layers as features results in very little increase in the error rate. Since the response from the fully connected layers is a vector (either of size 1024 or 1000), resizing it would drastically increase the size of the data without any significant increase in performance. We can see from Fig. 3 that the deeper maps extract more and more abstract features. For example, the 1183rd map identifies the eyes/ears of the cow, whereas the first few maps detect the edges of the image. The deeper maps, however, lose the detailed spatial information about the objects.
III-C Core Sample: Intermediate Data Representation
Since we use a pre-trained model to extract the maps, we normalize the images by subtracting constants from the R, G and B values (the same procedure that was used while training that model). From each image we then acquire the map responses from each layer. Most of the map responses are n x n shapes (n , i = positive integer). Each of these map responses, which have various sizes, is then resized to the size of the original image using bilinear interpolation. These map responses are then stacked along with the original input image. From this point onwards, each pixel is a distinct data point with the map response values as its features. Each data point is expected to have a single label value when we train on this data in the second stage of our framework. We define a core as a collection of hypercolumns, one per pixel of an input image. Core samples are random samples drawn from a core. We feed the core samples to the second stage of our framework.
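The resizing and stacking steps described above can be illustrated with a small NumPy sketch. It assumes a corner-aligned variant of bilinear interpolation and an H x W x C input image; the function names `bilinear_resize` and `hypercolumns` are hypothetical, and the framework's actual interpolation routine may differ in its alignment convention.

```python
import numpy as np

def bilinear_resize(m, out_h, out_w):
    """Resize a 2-D map to (out_h, out_w) with corner-aligned bilinear interpolation."""
    in_h, in_w = m.shape
    ys = np.linspace(0, in_h - 1, out_h)    # source row coordinate of each output row
    xs = np.linspace(0, in_w - 1, out_w)    # source column coordinate of each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                 # vertical interpolation weights
    wx = (xs - x0)[None, :]                 # horizontal interpolation weights
    top = m[y0][:, x0] * (1 - wx) + m[y0][:, x1] * wx
    bottom = m[y1][:, x0] * (1 - wx) + m[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def hypercolumns(image, maps):
    """Stack the image channels and the resized maps: result is (H, W, C + N)."""
    h, w = image.shape[:2]
    channels = [image[..., c] for c in range(image.shape[2])]
    resized = [bilinear_resize(m, h, w) for m in maps]
    return np.stack(channels + resized, axis=-1)

# Toy example: an 8 x 8 RGB image with two maps of different resolutions.
image = np.zeros((8, 8, 3))
maps = [np.ones((4, 4)), 2 * np.ones((2, 2))]
cols = hypercolumns(image, maps)            # cols[y, x] is the hypercolumn of pixel (y, x)
```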
III-D Stage 2: Pixel Prediction
We use unsupervised pretraining using RBMs followed by supervised learning using DBNs for the final pixel-wise prediction. Our framework is flexible; depending upon the task at hand, different types of output layers can be used. We have implemented two different types of layers: a regression layer that implements linear regression and uses the mean squared error as the loss function, and a logistic regression layer that classifies pixels using the negative log likelihood as the loss function. In regression, the distance between target values is meaningful: for example, if two pixels have red (R) values 0.5 and 0.6, they are more similar than if they were 0.1 and 0.6. The loss function that we typically use on regression problems is the mean squared error:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{Y}_i - Y_i\right)^2$$

where $\hat{Y}$ is the vector of predicted values and $Y$ is the vector of actual values for $n$ observations. On most neural networks that classify rather than perform regression, the distance between any two labels is the same. In such cases, the likelihood ($\mathcal{L}$) and loss ($\ell$) functions are given by:

$$\mathcal{L}(\theta = \{W, b\}, \mathcal{D}) = \sum_{i=1}^{|\mathcal{D}|} \log P\left(Y = y^{(i)} \mid x^{(i)}, W, b\right)$$

$$\ell(\theta = \{W, b\}, \mathcal{D}) = -\mathcal{L}(\theta = \{W, b\}, \mathcal{D})$$

where $W$ and $b$ are the weights and biases respectively and $\mathcal{D}$ is the dataset. Given an input $x$, the weight matrix $W$, and a bias vector $b$, the model outputs the likelihood that the input belongs to a certain class $y$. Since this equation is based on probability values rather than a distance measure, it is more suitable for classification.
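To make the difference between the two output layers concrete, the following sketch computes the mean squared error and the negative log likelihood in NumPy. It assumes a softmax output for the logistic regression layer, and the function names and toy inputs are illustrative only.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error between predicted and actual values."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def negative_log_likelihood(x, y, W, b):
    """Sum over the dataset of -log P(Y = y_i | x_i, W, b)."""
    p = softmax(x @ W + b)
    return -np.sum(np.log(p[np.arange(len(y)), y]))

# Regression: the error grows with the distance between values.
near = mse([0.5], [0.6])
far = mse([0.1], [0.6])

# Classification: with all-zero parameters every class is equally likely,
# so each of the 4 samples contributes log(3) for 3 classes.
x = np.zeros((4, 2))
y = np.array([0, 1, 2, 0])
nll = negative_log_likelihood(x, y, np.zeros((2, 3)), np.zeros(3))
```

The regression loss depends on how far apart the values are, whereas the classification loss depends only on the probability assigned to the correct label.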
III-E Core Sampling Algorithm
Algorithm 1 implements the core sampling framework. There are two modes of running the framework: training (Line 2) and testing (Line 6). In both cases, we generate core samples from images in a folder and normalize them before training or testing begins. The function CoreSample() loads the pretrained model and the images and iterates through the images. The aforementioned preprocessing/normalizing of the images is done and then each image is fed into the model. GetFeature() extracts the feature maps, and each map is then upscaled to a uniform size of $w \times h$ using bilinear interpolation, where $w$ and $h$ are the width and height of the input image. Then we concatenate all the cores into a single array and normalize them. The normalization parameters for each feature map are maintained separately. During training, an array of target vectors is created from the labeled images. Normalized samples from the cores and the labels are used as training data for the Deep Belief Network, and the trained model is saved to hard disk. During testing, we load the trained interpretive network (DBN), feed the core data in, and finally make our prediction.
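The per-feature normalization used in both modes can be sketched as below. The sketch assumes zero-mean, unit-variance scaling as the normalization, which is one common choice; the helper names `fit_normalizer` and `apply_normalizer` are hypothetical. The key point it illustrates is that the parameters are computed once on the training cores and reused unchanged at test time.

```python
import numpy as np

def fit_normalizer(train_core):
    """Compute one mean and one standard deviation per feature (column)."""
    mean = train_core.mean(axis=0)
    std = train_core.std(axis=0)
    std = np.where(std == 0, 1.0, std)      # avoid division by zero for constant features
    return mean, std

def apply_normalizer(core, mean, std):
    """Scale a core with previously stored parameters (training or testing)."""
    return (core - mean) / std

rng = np.random.default_rng(0)
train_core = rng.normal(5.0, 3.0, size=(100, 4))   # 100 pixels, 4 features
test_core = rng.normal(5.0, 3.0, size=(10, 4))

mean, std = fit_normalizer(train_core)             # stored alongside the trained model
train_norm = apply_normalizer(train_core, mean, std)
test_norm = apply_normalizer(test_core, mean, std) # same parameters, no refitting
```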
The CAMVID dataset consists of 32 semantic classes of objects, out of which, like most other approaches, we evaluate our algorithm on the 11 major classes and 1 class that includes the rest. These classes are Building, Tree, Sky, Car, Sign-Symbol, Road, Pedestrian, Fence, Column-Pole, Side-walk and Bicyclist. The training set consists of input images, which are regular three-channel color images, and targets, which are segmented single-channel images. The images are from a few videos taken from inside a car driven on streets. The dataset consists of labeled images, with 367 training, 101 validation and 233 test images of consistent size at the same scale.
The BAERI dataset that we introduce here consists of imagery collected from a Synthetic Aperture Radar (SAR). Both the input and output are treated as single-channel images. The input single-channel image consists of SAR values and the labels are the ground truth values at each of the pixels. Labels were obtained by morphological image processing techniques for noise removal, i.e., opening and closing. The erosion and dilation on the images fail to remove all the noise in the ground truth images. Therefore, there are certain areas in the ground truth images that still contain some noise with incorrect labels. As we are classifying each pixel separately, the noise must be taken into account during training. The values in the training images were in the range between -40 and 25. In the BAERI dataset, the pixel classes are the ship class and the rest. There are only 55 images of variable sizes available in this dataset, with a total size of 68 megabytes.
IV Experimental Results and Discussion
Since we only deal with classification in this paper, we use negative log likelihood as our cost function. We trained the DBN in the second stage by first performing a pre-training step with persistent chain contrastive divergence (P-CD) and then fine-tuning the network using a deep feedforward neural network trained by backpropagation. Keeping the pretrained CNN and the DBN interpreting the core samples distinct makes it easier to preprocess the data and also to normalize the testing set with respect to the training data. We used L1 and L2 norms for regularization and implemented dropout on the hidden layers. The inputs to the DBN are the core samples that we described above. The map responses from each layer of the CNN are normalized using standard feature scaling. The unsupervised training helps to cluster the features together further and helps the training converge faster. We used the Theano deep learning library and a six-core Intel i7 server with a TITAN X GPU for our experiments.
Between the CAMVID dataset and the BAERI one, there are more objects in the CAMVID dataset that are also present in the ImageNet dataset. On the other hand, the BAERI dataset is quite different from ImageNet because it contains SAR data not present in ImageNet. As a result of transfer learning, the knowledge acquired from ImageNet, based on the wide variety of features abstracted at various levels by the pretrained VGG-16 network, prevented the sparsity of the BAERI dataset from creating any problem in training the DBN in the second stage. On the BAERI dataset, our framework was able to slightly outperform SegNet (see Table I). The output images of SegNet had fewer blobs that could be classified as noise, but SegNet also missed a few of the smaller ships and had a larger area of pixels incorrectly classified as ships around the main clusters compared to our algorithm, as can be seen in Fig. 5.
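For readers unfamiliar with P-CD, the following is a minimal NumPy sketch of a single RBM layer trained with persistent contrastive divergence. It is not the Theano implementation used in the experiments: it assumes binary visible and hidden units, a fixed batch size, one Gibbs step per update, and an arbitrary learning rate, and the class name `RBM` and its methods are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli RBM with a persistent hidden chain (assumes a fixed batch size)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)        # visible biases
        self.c = np.zeros(n_hidden)         # hidden biases
        self.chain = None                   # persistent hidden samples

    def hidden_prob(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_prob(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def sample(self, p):
        return (self.rng.random(p.shape) < p).astype(float)

    def pcd_step(self, v_data, lr=0.1):
        # Positive phase: hidden activations driven by the data.
        h_data = self.hidden_prob(v_data)
        # The chain is initialized once and then kept across updates.
        if self.chain is None:
            self.chain = self.sample(h_data)
        # Negative phase: one Gibbs step starting from the persistent chain.
        v_model = self.sample(self.visible_prob(self.chain))
        h_model = self.hidden_prob(v_model)
        self.chain = self.sample(h_model)
        n = v_data.shape[0]
        self.W += lr * (v_data.T @ h_data - v_model.T @ h_model) / n
        self.b += lr * (v_data - v_model).mean(axis=0)
        self.c += lr * (h_data - h_model).mean(axis=0)

rng = np.random.default_rng(0)
rbm = RBM(n_visible=6, n_hidden=4, rng=rng)
batch = (rng.random((10, 6)) < 0.5).astype(float)   # toy binary data
for _ in range(5):
    rbm.pcd_step(batch)
```

Unlike standard contrastive divergence, the negative-phase chain here is not restarted from the data at each update. Real-valued features such as normalized map responses would typically call for a different visible unit type than the binary one assumed in this sketch.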
| Mean Squared Error (MSE) | 0.0115 | 0.0142 |
On the CAMVID dataset, the core sampling framework outperformed both ,  on 10 of the 11 classes in terms of accuracy and had a better per class accuracy (see Table II). Both ,  and our framework were trained on 367 labeled training images. As can be seen from Table II, our framework could not match the performance of SegNet on the CAMVID dataset in terms of accuracy except for the Sky, Column-Pole, and Bicyclist classes where it outperformed SegNet. However, while our framework was trained on 367 labeled images, SegNet was trained on 3500 labeled images.
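As an illustration of how per-class accuracy can be computed from pixel labels, the sketch below uses one common definition: for each class, the fraction of pixels with that ground truth label that were predicted correctly. This definition and the function name `per_class_accuracy` are assumptions; the evaluation code behind Table II may differ.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """For each class, the fraction of its ground truth pixels predicted correctly."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    acc = np.full(num_classes, np.nan)      # NaN marks classes absent from y_true
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = np.mean(y_pred[mask] == c)
    return acc

# Toy example with 3 classes and 5 pixels.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
acc = per_class_accuracy(y_true, y_pred, num_classes=3)
mean_acc = np.nanmean(acc)                  # average over the classes that occur
```

Averaging over classes rather than over pixels keeps frequent classes such as Road or Sky from dominating the score.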
|367 Training Images||3.5 K Images|
We presented a core sampling framework that is able to use the activation maps from several layers of a CNN as features to another neural network using transfer learning to provide an understanding of an input image. We experimentally demonstrated the usefulness of our framework using a pretrained VGG-16 model to perform segmentation on the BAERI dataset of Synthetic Aperture Radar (SAR) imagery and the CAMVID dataset. In the future, we intend to use the core sampling framework to facilitate compression of images, texture synthesis, etc.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  S. Ganguly, “Baeri dataset.” [Online]. Available: https://drive.google.com/open?id=0B0gFcrqVCm9peTdMdndTV0pQMFE
-  G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in ECCV (1), 2008, pp. 44–57.
-  B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 447–456.
-  R. M. Haralick and L. G. Shapiro, “Image segmentation techniques,” Computer vision, graphics, and image processing, vol. 29, no. 1, pp. 100–132, 1985.
-  J. T. Barron, M. D. Biggin, P. Arbelaez, D. W. Knowles, S. V. Keranen, and J. Malik, “Volumetric semantic segmentation using pyramid context features,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3448–3455.
-  J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
-  V. Badrinarayanan, A. Handa, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.
-  Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., “Greedy layer-wise training of deep networks,” Advances in neural information processing systems, vol. 19, p. 153, 2007.
-  G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  N. Lotter, D. Kowal, M. Tuzun, P. Whittaker, and L. Kormos, “Sampling and flotation testing of sudbury basin drill core for process mineralogy modelling,” Minerals Engineering, vol. 16, no. 9, pp. 857–864, 2003.
-  V. Kotlyakov, “A 150,000-year climatic record from antarctic ice,” Nature, vol. 316, pp. 591–596, 1985.
-  E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, “Pyramid methods in image processing,” RCA engineer, vol. 29, no. 6, pp. 33–41, 1984.
-  P. Burt and E. Adelson, “The laplacian pyramid as a compact image code,” IEEE Transactions on communications, vol. 31, no. 4, pp. 532–540, 1983.
-  L. Torrey and J. Shavlik, “Transfer learning,” Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, vol. 1, p. 242, 2009.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in neural information processing systems, 2014, pp. 3320–3328.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard et al., “Learning algorithms for classification: A comparison on handwritten digit recognition,” Neural networks: the statistical mechanics perspective, vol. 261, p. 276, 1995.
-  Y. Bengio, “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  D. Comaniciu and P. Meer, “Robust analysis of feature spaces: color image segmentation,” in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on. IEEE, 1997, pp. 750–755.
-  P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
-  C. Rother, V. Kolmogorov, and A. Blake, “Grabcut: Interactive foreground extraction using iterated graph cuts,” in ACM transactions on graphics (TOG), vol. 23, no. 3. ACM, 2004, pp. 309–314.
-  Y. Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary & region segmentation of objects in nd images,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 1. IEEE, 2001, pp. 105–112.
-  S. Chen and H. Wang, “Sar target recognition based on deep learning,” in Data Science and Advanced Analytics (DSAA), 2014 International Conference on. IEEE, 2014, pp. 541–547.
-  L. Ladickỳ, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr, “What, where and how many? combining object detectors and crfs,” in European conference on computer vision. Springer, 2010, pp. 424–437.
-  C. Zhang, L. Wang, and R. Yang, “Semantic segmentation of urban scenes using dense depth maps,” in European Conference on Computer Vision. Springer, 2010, pp. 708–721.
-  H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
-  B. Du, L. Zhang, D. Tao, and D. Zhang, “Unsupervised transfer learning for target detection from hyperspectral images,” Neurocomputing, vol. 120, pp. 72–82, 2013.
-  M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
-  “Classifying mnist digits using logistic regression.” [Online]. Available: http://deeplearning.net/tutorial/logreg.html
-  G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. xx, no. x, pp. xx–xx, 2008.
-  Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688