# An efficient deep learning hashing neural network for mobile visual search

## Abstract

Mobile visual search applications are emerging that enable users to sense their surroundings with smartphones. However, because of the particular challenges of mobile visual search, achieving a high recognition bitrate has become a consistent target of previous related work. In this paper, we propose a few-parameter, low-latency, and high-accuracy deep hashing approach for constructing binary hash codes for mobile visual search. First, we exploit the architecture of the MobileNet model, which significantly decreases the latency of deep feature extraction by reducing the number of model parameters while maintaining accuracy. Second, we add a hash-like layer into MobileNet to train the model on labeled mobile visual data. Evaluations show that the proposed system can exceed state-of-the-art accuracy in terms of the mean average precision (MAP). More importantly, its memory consumption is much lower than that of other deep learning models. The proposed method requires only 13 MB of memory for the neural network and achieves a MAP of 97.80% with 64-bit codes on the mobile location recognition dataset used for testing.

## 1 Introduction

With the proliferation of mobile devices, it is becoming possible to use mobile perception functionalities (e.g., cameras, GPS, and Wi-Fi) to perceive the surrounding environment [1]. Among such techniques, mobile visual search plays a key role in mobile localization, mobile media search, and mobile social networking. However, rather than simply porting traditional visual search methods to mobile platforms, for mobile visual search, one must face the challenges of a large aural-visual variance of queries, stringent memory and computation constraints, network bandwidth limitations, and the desire for an instantaneous search experience.

Research on mobile visual search has predominantly focused on achieving high recognition bitrates [2]. Recently, an increasing number of researchers have attempted to exploit feature signatures produced through hashing in mobile visual search because of the good balance that can be achieved among computation and memory requirements, training efficiency, quantization complexity, and search performance. However, most of the existing hashing-based mobile visual search methods compress existing classical handcrafted features into binary codes and focus on decreasing the loss suffered during compression. Only a few of them attempt to automatically learn effective binary code features from a large-scale image dataset using a deep neural network. There are two main reasons for this: 1) the lack of effective deep learning hashing methods for mobile visual search and 2) the high computational complexity of existing deep neural networks.

For traditional visual search, convolutional neural networks have become ubiquitous [5][?]. Studies have shown that the deep features learned using such networks capture rich image representations and enable better performance than handcrafted features in visual classification [7][8][9] [10], object detection [11][12], semantic segmentation [13], and image retrieval [14][15]. The general trend in research on deep learning methods is to construct deeper and more complicated networks to achieve higher accuracy [16][17][18]. In the field of mobile visual search, however, a deeper neural network model consumes more memory and time, which cannot be easily supplied by mobile devices. To adapt deeper neural networks for mobile devices, a new network architecture called MobileNet has recently been proposed by Google [19]. In the MobileNet model, a standard convolutional layer is decomposed into a depthwise convolutional layer and a pointwise convolutional layer, thereby greatly reducing the amount of calculation necessary and the model size. This model works very well on image classification problems.

Inspired by the works presented in [1] and [19], to develop a more effective and efficient mobile visual search system, this paper proposes to combine a few-parameter, low-latency, and high-accuracy network architecture with a hash function. We incorporate the hash function as a latent layer between the image representations and the classification outputs in MobileNet, which allows us not only to maintain the accuracy of the visual search but also to adapt the model to the mobile environment. An overview of the system is illustrated in Figure 1. The final binary hash codes can be learned by minimizing an objective function defined over the classification error. Experimental results on a mobile location recognition dataset show that our method achieves superior performance compared with other hashing approaches.

## 2 Architecture

In this section, we first describe the MobileNet network structure and its convolutional layer structure, which is based on depthwise separable convolutions (a depthwise filter followed by a pointwise filter). Then, we describe the hash layer in our convolutional neural network and how we train a model that exploits semantic labels to automatically create binary codes.

Table 1: Network architecture: the MobileNet body followed by the hash layer and the classifier.

| Type / Stride | Filter Shape | Input Size |
|---|---|---|
| Conv / s2 | 3 × 3 × 3 × 32 | 224 × 224 × 3 |
| Conv dw / s1 | 3 × 3 × 32 dw | 112 × 112 × 32 |
| Conv / s1 | 1 × 1 × 32 × 64 | 112 × 112 × 32 |
| Conv dw / s2 | 3 × 3 × 64 dw | 112 × 112 × 64 |
| Conv / s1 | 1 × 1 × 64 × 128 | 56 × 56 × 64 |
| Conv dw / s1 | 3 × 3 × 128 dw | 56 × 56 × 128 |
| Conv / s1 | 1 × 1 × 128 × 128 | 56 × 56 × 128 |
| Conv dw / s2 | 3 × 3 × 128 dw | 56 × 56 × 128 |
| Conv / s1 | 1 × 1 × 128 × 256 | 28 × 28 × 128 |
| Conv dw / s1 | 3 × 3 × 256 dw | 28 × 28 × 256 |
| Conv / s1 | 1 × 1 × 256 × 256 | 28 × 28 × 256 |
| Conv dw / s2 | 3 × 3 × 256 dw | 28 × 28 × 256 |
| Conv / s1 | 1 × 1 × 256 × 512 | 14 × 14 × 256 |
| Conv dw / s1 | 3 × 3 × 512 dw | 14 × 14 × 512 |
| Conv / s1 | 1 × 1 × 512 × 512 | 14 × 14 × 512 |
| Conv dw / s2 | 3 × 3 × 512 dw | 14 × 14 × 512 |
| Conv / s1 | 1 × 1 × 512 × 1024 | 7 × 7 × 512 |
| Conv dw / s2 | 3 × 3 × 1024 dw | 7 × 7 × 1024 |
| Conv / s1 | 1 × 1 × 1024 × 1024 | 7 × 7 × 1024 |
| Avg Pool / s1 | Pool 7 × 7 | 7 × 7 × 1024 |
| FC / s1 | 1024 × 64 | 1 × 1 × 64 |
| Sigmoid | In place | 1 × 1 × 64 |
| FC / s1 | 1024 × 162 | 1 × 1 × 162 |
| Softmax | Classifier | 1 × 1 × 162 |

### 2.1 MobileNet

The MobileNet model is built on depthwise separable convolutions, which factorize a standard convolution into a depthwise convolution and a $1 \times 1$ convolution called a pointwise convolution. The MobileNet architecture is defined in Table 1. In MobileNet, the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a $1 \times 1$ convolution to combine the outputs of the depthwise convolution. A standard convolutional layer takes as input a $D_F \times D_F \times M$ feature map $\mathbf{F}$ and produces a $D_G \times D_G \times N$ feature map $\mathbf{G}$, where $D_F$ is the spatial width and height of the square input feature map, $M$ is the number of input channels (input depth), $D_G$ is the spatial width and height of the square output feature map, and $N$ is the number of output channels. Let $D_K$ denote the spatial width and height of the square convolution kernel.

A standard convolution has the following computational cost:

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$

The corresponding cost of depthwise separable convolutions is as follows:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$

which is the sum of the costs of the depthwise and pointwise convolutions. By expressing a standard convolution as a two-step process of filtering and combining, we achieve the following reduction in computational cost:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$

MobileNet uses $3 \times 3$ depthwise separable convolutions, which require 8 to 9 times less computation than standard convolutions with only a small reduction in accuracy.
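The cost comparison above can be checked numerically. The following is a minimal sketch, not the authors' code: it counts multiply-adds for one layer, writing `d_k` for the kernel width, `d_f` for the feature-map width, and `m` and `n` for the numbers of input and output channels. The function names and the example layer (a 3 × 3 kernel with 512 input and output channels on a 14 × 14 map) are illustrative assumptions.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution: D_K * D_K * M * N * D_F * D_F."""
    return d_k * d_k * m * n * d_f * d_f


def separable_conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a depthwise convolution plus a 1x1 pointwise convolution."""
    depthwise = d_k * d_k * m * d_f * d_f  # one D_K x D_K filter per input channel
    pointwise = m * n * d_f * d_f          # 1x1 convolution mixing M channels into N
    return depthwise + pointwise


def cost_ratio(d_k, m, n, d_f):
    """Separable cost divided by standard cost; algebraically 1/N + 1/D_K**2."""
    return separable_conv_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)


if __name__ == "__main__":
    # Hypothetical example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
    ratio = cost_ratio(d_k=3, m=512, n=512, d_f=14)
    print(f"separable / standard = {ratio:.4f} ({1 / ratio:.1f}x fewer multiply-adds)")
```

For a 3 × 3 kernel the ratio is dominated by the 1/9 term, which is why the saving falls between 8 and 9 times once the number of output channels is large.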

Table 2 compares MobileNet with other popular models in terms of accuracy on the ImageNet database, number of multiply-adds (in millions), and number of parameters (in millions). MobileNet is nearly as accurate as VGG-16 [16] while being 32 times smaller and 27 times less compute intensive. It is more accurate than GoogleNet [20] while being smaller and requiring more than 2.5 times less computation. MobileNet [19] is also more accurate than Squeezenet [21] and AlexNet [14] while requiring less computation than either; it has far fewer parameters than AlexNet, although more than Squeezenet.

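Table 2: Comparison of MobileNet with other popular models.

| Model | ImageNet Accuracy | Million Multi-Adds | Million Parameters |
|---|---|---|---|
| AlexNet | 57.2% | 720 | 60 |
| Squeezenet | 57.5% | 1700 | 1.25 |
| GoogleNet | 69.8% | 1550 | 6.8 |
| VGG-16 | 71.5% | 15300 | 138 |
| MobileNet | 70.6% | 569 | 4.2 |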
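The relative size and compute figures quoted for this comparison follow directly from the table rows. The sketch below recomputes them; the dictionary simply copies the table values (accuracy in percent, multiply-adds in millions, parameters in millions), and the helper name `relative_to_mobilenet` is a hypothetical one chosen for illustration.

```python
# (ImageNet accuracy in %, million multiply-adds, million parameters), copied from the table.
MODELS = {
    "AlexNet": (57.2, 720, 60.0),
    "Squeezenet": (57.5, 1700, 1.25),
    "GoogleNet": (69.8, 1550, 6.8),
    "VGG-16": (71.5, 15300, 138.0),
    "MobileNet": (70.6, 569, 4.2),
}


def relative_to_mobilenet(name):
    """Return (compute ratio, parameter ratio) of a model relative to MobileNet."""
    _, madds, params = MODELS[name]
    _, ref_madds, ref_params = MODELS["MobileNet"]
    return madds / ref_madds, params / ref_params


if __name__ == "__main__":
    for model in MODELS:
        compute, size = relative_to_mobilenet(model)
        print(f"{model}: {compute:.2f}x the multiply-adds, {size:.2f}x the parameters")
```

Ratios above 1 mean that the model is larger or more expensive than MobileNet; ratios below 1 mean the opposite.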

### 2.2 Hash Function

Hash mapping, which is required for learning from images, is based on the following principles. The hash codes should respect the semantic similarity between image labels. Images that share the same class labels should be mapped to similar binary codes. Let $I = \{I_n\}_{n=1}^{N}$ denote $N$ images, and let $Y \in \{0,1\}^{M \times N}$ be their associated label vectors, where $M$ is the total number of class labels. An entry in $Y$ has a value of 1 if the image $I_n$ belongs to the corresponding class and is 0 otherwise. Our goal is to learn a mapping $\mathcal{F}: I \rightarrow \{0,1\}^{K \times N}$ that maps the images to their $K$-bit binary codes $B = \{b_n\} \in \{0,1\}^{K \times N}$ while preserving the semantic similarity relationships among the image data [22].

Our network is built on MobileNet. Each layer is followed by a batchnorm layer and a nonlinear ReLU layer, with the exception of the fully connected layer for classification. A final average pooling layer reduces the spatial resolution to 1 before the fully connected layer. To incorporate the learned features into the binary codes, we add a latent layer $H$ with $K$ units to the top of the average pooling layer $P$, as illustrated in Figure 1. This latent layer is fully connected to $P$ and uses sigmoid units so that the activations are bounded between 0 and 1 [23]. Let $W^H \in \mathbb{R}^{d \times K}$ denote the weights (i.e., the projection matrix) in the latent layer. For a given image $I_n$ with the feature vector $a_n^P \in \mathbb{R}^{d}$ in layer $P$, the activations of the units in $H$ can be computed as $a_n^H = \sigma(a_n^P W^H + b^H)$. Here, $b^H$ is the bias term, and $\sigma(\cdot)$ is the logistic sigmoid function, which is defined as $\sigma(z) = 1/(1 + \exp(-z))$, where $z$ is a real value. The binary encoding function is given by

$$b_n = \frac{\operatorname{sgn}\left(\sigma(a_n^P W^H + b^H) - 0.5\right) + 1}{2} = \frac{\operatorname{sgn}\left(a_n^H - 0.5\right) + 1}{2},$$

where $\operatorname{sgn}(v) = 1$ if $v > 0$ and $-1$ otherwise. $\operatorname{sgn}(\cdot)$ performs element-wise operations on a matrix or a vector.
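The encoding step described above amounts to a fully connected layer, a sigmoid, and a threshold at 0.5. The following is a minimal pure-Python sketch under stated assumptions, not the authors' Caffe implementation: the function names are illustrative, the feature vector stands in for the pooled MobileNet output, and the toy weights are hypothetical values chosen only to make the example easy to follow.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def latent_activations(feature, weights, bias):
    """Sigmoid activations of the K latent units for one d-dimensional feature vector.

    `weights` is a d x K nested list (the projection matrix); `bias` has length K.
    """
    d, k = len(feature), len(bias)
    return [
        sigmoid(sum(feature[i] * weights[i][j] for i in range(d)) + bias[j])
        for j in range(k)
    ]


def binary_code(activations):
    """Threshold at 0.5: a bit is 1 if the activation exceeds 0.5 and 0 otherwise."""
    return [1 if a > 0.5 else 0 for a in activations]


if __name__ == "__main__":
    # Toy example with d = 2 features and K = 2 bits (hypothetical numbers).
    feature = [1.0, 2.0]
    weights = [[1.0, -1.0], [0.5, -0.5]]
    bias = [0.0, 0.0]
    print(binary_code(latent_activations(feature, weights, bias)))
```

Thresholding the activation at 0.5 is equivalent to the sign-based formulation, because a positive sign maps to bit 1 and a non-positive sign maps to bit 0.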

## 3 Experiments

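Table 3: MAP (%) of different hashing methods for different hash code lengths.

| Method | 16-bit | 32-bit | 64-bit |
|---|---|---|---|
| VHB [4] | - | - | 19.36 |
| SSFS [24] | - | - | 20.22 |
| DLBH [25] | 59.80 | 78.68 | 87.15 |
| SSDH [22] | 78.26 | 91.82 | 92.43 |
| Our method | 96.66 | 97.61 | 97.80 |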

To evaluate the proposed method, we conducted experiments on the mobile location recognition dataset presented in [24]. The dataset contains 8,062 images, which were captured from 162 locations. We used the same experimental parameters as in [24] and [1]. We implemented our approach using the open-source CAFFE [26] package. We initialized the network parameters by adopting the parameters of a MobileNet trained on the 14 million images of the ImageNet [14] dataset. The parameters in the hash layer were randomly initialized. The learning rate was initialized as 0.01 and was decreased to 1/10 of its previous value after every 10,000 iterations. The entire training procedure terminated after 30,000 iterations. For the learning of the network parameters, in conjunction with backpropagation, we exploited the mini-batch stochastic gradient descent algorithm with a mini-batch size of 32 images to minimize the classification error. Our model is a lightweight modification of MobileNet and thus is easy to implement.
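The learning-rate schedule described above is a step decay. As a minimal sketch, assuming the rate is multiplied by 0.1 at iterations 10,000 and 20,000 (the authors' solver configuration is not given, so the function and parameter names here are illustrative), it can be written as follows. This is the same formula that Caffe's `step` learning-rate policy computes.

```python
def step_learning_rate(iteration, base_lr=0.01, gamma=0.1, step_size=10_000):
    """Step decay: base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)


if __name__ == "__main__":
    # Training is described as running for 30,000 iterations in total.
    for it in (0, 9_999, 10_000, 20_000, 29_999):
        print(it, step_learning_rate(it))
```

Under this assumption, training uses three learning-rate levels: 0.01, 0.001, and 0.0001.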

In the evaluation, we used the Mean Average Precision (MAP) as the evaluation criterion. We ranked all of the images according to their Hamming distances from the query image, selected the top k images from the ranked list as the retrieval results, and computed the MAP on these retrieved images. We set k to 100 in these experiments. We used the class labels as the ground truth and adopted the common settings for computing the MAP by examining whether the retrieved images and the query shared common class labels. We compared our method with Visual Hash Bits (VHB) [4] and Space-Saliency Fingerprint Selection-based hash codes (SSFS) [24], which are both traditional hashing methods for mobile location recognition. The other two methods considered for comparison, Deep Learning of Binary Hash Codes (DLBH) [25] and Supervised Semantics-Preserving Deep Hashing (SSDH) [22], are both deep-learning-based hashing methods.
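The evaluation protocol above (rank by Hamming distance, keep the top k, score by shared class label) can be sketched as follows. This is an illustrative implementation, not the authors' evaluation script: the function names are hypothetical, ties in Hamming distance are broken by database order, and the average precision is normalized by the number of relevant images found in the top k, which is one common convention that the text does not spell out.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def average_precision_at_k(query_code, query_label, db_codes, db_labels, k=100):
    """Average precision over the top-k database items ranked by Hamming distance."""
    ranked = sorted(
        range(len(db_codes)),
        key=lambda i: hamming_distance(query_code, db_codes[i]),
    )
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(ranked[:k], start=1):
        if db_labels[idx] == query_label:  # relevant: shares the query's class label
            hits += 1
            precision_sum += hits / rank   # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels, k=100):
    """Mean of the per-query average precision values."""
    scores = [
        average_precision_at_k(code, label, db_codes, db_labels, k)
        for code, label in zip(query_codes, query_labels)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy database of four 4-bit codes from two classes (hypothetical data).
    db_codes = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
    db_labels = [0, 1, 0, 1]
    print(mean_average_precision([[0, 0, 0, 0]], [0], db_codes, db_labels, k=4))
```

A query whose nearest codes all share its label scores 1.0; relevant images that appear lower in the ranking reduce the score.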

Table 3 compares the performances on the dataset of the different hashing methods for different hash code lengths. From the results, we can see that the accuracies of the deep learning hashing methods greatly exceed those of the traditional hashing methods, which demonstrates the power of deep learning technology for binary code learning. Furthermore, because SSDH learns the feature representations and binary codes simultaneously and imposes more constraints for binary code learning, it achieves higher performance than DLBH. Our method achieves the best results on this dataset for all of the different hash code lengths. Especially when the hash code length is relatively short, our method can achieve extremely high accuracy. This can be attributed to the fact that MobileNet networks enable the joint learning of representations and hash functions from images. Moreover, the learned representations are more effective and stable than those of AlexNet-based models such as DLBH and SSDH. Finally, the sizes of the DLBH and SSDH models are greater than 230 MB, whereas for MobileNet, the model size is only 13 MB. Thus, the memory demand for a mobile device is reduced by a factor of approximately 18.

## 4 Conclusion

We present an effective deep learning framework based on MobileNet for creating hash-like binary codes for mobile visual search. In this framework, we incorporate the hash function as a latent layer between the feature layer and the output layer in the MobileNet network. By optimizing an objective function defined over the classification error, our method jointly learns the binary codes, features, and classification results. To evaluate the performance of the proposed network, we applied it to a mobile location recognition dataset. The experimental results demonstrate that it improves the MAP over state-of-the-art methods at every tested hash code length (Table 3). The model requires only 13 MB of memory.

### References

[1] W. Liu, H. Ma, H. Qi, D. Zhao, and Z. Chen, “Deep learning hashing for mobile visual search,” *EURASIP J. Image and Video Processing*, vol. 2017, pp. 17, 2017.

[2] W. Liu, T. Mei, and Y.D. Zhang, “Instant mobile video search with layered audio-video indexing and progressive transmission,” *IEEE Trans. on Multimedia*, vol. 16, no. 8, pp. 2242–2255, 2014.

[3] V. Chandrasekhar, G. Takacs, and D. Chen, “Compressed histogram of gradients: A low-bitrate descriptor,” *IJCV*, vol. 96, no. 3, pp. 384–399, 2012.

[4] J. He, J. Feng, and X. Liu, “Mobile product search with bag of hash bits and boundary reranking,” in *CVPR*, 2012, pp. 3005–3012.

[5] O. Russakovsky, J. Deng, H. Su, and J. Krause, “Imagenet large scale visual recognition challenge,” *IJCV*, vol. 115, no. 3, pp. 211–252, 2015.

[6] W. Liu, T. Mei, Y.D. Zhang, C. Che, and J.B. Luo, “Multi-task deep visual-semantic embedding for video thumbnail selection,” in *CVPR*, 2015, pp. 3707–3715.

[7] R. Xia, Y. Pan, and H. Lai, “Supervised hashing for image retrieval via image representation learning,” in *AAAI*, 2014, pp. 2156–2162.

[8] C. Gan, T. Yao, K.Y. Yang, Y. Yang, and T. Mei, “You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images,” in *IEEE Computer Vision and Pattern Recognition*, 2016, pp. 923–932.

[9] C. Gan, Y. Yang, L.C. Zhu, D.L. Zhao, and Y.T. Zhuang, “Recognizing an action using its name: A knowledge-based approach,” *International Journal of Computer Vision*, vol. 120, no. 1, pp. 61–77, 2016.

[10] C. Gan, M. Lin, Y. Yang, G. de Melo, and A.G. Hauptmann, “Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition,” in *AAAI*, 2016.

[11] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in *CVPR*, 2012, pp. 3642–3649.

[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in *CVPR*, 2014, pp. 580–587.

[13] P. Sermanet, D. Eigen, and X. Zhang, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” *CoRR*, vol. abs/1312.6229, 2013.

[14] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in *NIPS*, 2012, pp. 1106–1114.

[15] L.Y. Chu, S.Q. Jiang, S.H. Wang, Y.Y. Zhang, and Q.M. Huang, “Robust spatial consistency graph model for partial duplicate image retrieval,” *IEEE Trans. on Multimedia*, vol. 15, no. 8, pp. 1982–1996, 2013.

[16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” *CoRR*, vol. abs/1409.1556, 2014.

[17] C. Szegedy, V. Vanhoucke, and S. Ioffe, “Rethinking the inception architecture for computer vision,” *CoRR*, vol. abs/1512.00567, 2015.

[18] C. Gan, N.Y. Wang, Y. Yang, D.Y. Yeung, and A.G. Hauptmann, “Devnet: A deep event network for multimedia event detection and evidence recounting,” in *CVPR*, 2015, pp. 2568–2577.

[19] A. Howard, M. Zhu, B. Chen, and D. Kalenichenko, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” *CoRR*, vol. abs/1704.04861, 2017.

[20] C. Szegedy, W. Liu, and Y. Jia, “Going deeper with convolutions,” in *CVPR*, 2015, pp. 1–9.

[21] F.N. Iandola, M. Moskewicz, and K. Ashraf, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size,” *CoRR*, vol. abs/1602.07360, 2016.

[22] H. Yang, K. Lin, and C. Chen, “Supervised learning of semantics-preserving hashing via deep neural networks for large-scale image search,” *CoRR*, vol. abs/1507.00101, 2015.

[23] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in *AIS*, 2011, pp. 315–323.

[24] H. Wang, D. Zhao, H.D. Ma, and H. Xu, “SSFS: A space-saliency fingerprint selection framework for crowdsourcing based mobile location recognition,” in *AMIP*, 2016, pp. 650–659.

[25] K. Lin, H. Yang, J. Hsiao, and C. Chen, “Deep learning of binary hash codes for fast image retrieval,” in *CVPR Workshops*, 2015, pp. 27–35.

[26] Y. Jia, E. Shelhamer, and J. Donahue, “Caffe: Convolutional architecture for fast feature embedding,” in *ACM MM*, 2014, pp. 675–678.