Efficient Privacy Preserving Viola-Jones Type Object Detection
via Random Base Image Representation
A cloud server has spent considerable time, energy and money to train a Viola-Jones type object detector with high accuracy. Clients can upload their photos to the cloud server to find objects. However, a client does not want the content of his/her photos to be leaked. Meanwhile, the cloud server is reluctant to leak any parameters of the trained object detector. Ten years ago, Avidan & Butman introduced Blind Vision, a method for securely evaluating a Viola-Jones type object detector. Blind Vision uses standard cryptographic tools and is painfully slow to compute, taking a couple of hours to scan a single image. The purpose of this work is to explore an efficient method that speeds up the process. We propose the Random Base Image (RBI) representation. The original image is divided into random base images, and only the base images are submitted, in random order, to the cloud server, so the content of the image is not revealed. Meanwhile, a random vector and the secure Millionaire protocol are leveraged to protect the parameters of the trained object detector. The RBI representation makes the integral-image representation usable again, which greatly accelerates the computation. The experimental results show that our method retains the detection accuracy of the plain vision algorithm and is significantly faster than the traditional Blind Vision, with only a very low theoretical probability of information leakage.
Xin Jin, Peng Yuan, Xiaodong Li, Chenggen Song, Shiming Ge, Geng Zhao, Yingya Chen
This work is partially supported by the National Natural Science Foundation of China (Grant NO.61402021, 61402023, 61640216), the Science and Technology Project of the State Archives Administrator (Grant NO. 2015-B-10), the open funding project of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (Grant NO. BUAA-VR-16KF-09), and the Fundamental Research Funds for the Central Universities (NO. 2016LG03, 2016LG04).
Beijing Electronic Science and Technology Institute,
Beijing 100070, China
Xidian University, Xi’an 710071, China
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100095, China
*Corresponding Author: firstname.lastname@example.org
Blind Vision, Random Base Image, Privacy Preserving, Object Detection
Recently, the widespread adoption of smartphones with cameras has enabled people to shoot images and videos nearly anytime and anywhere. Millions of surveillance cameras, including driving recorders, capture images and videos every second. All these convenient devices are producing large-scale visual media data, which is considered the biggest big data.
Due to the limited storage space of these terminal devices, large-scale visual media data is being uploaded to and stored in cloud servers. Not only the storage but also the processing of large-scale visual media data is being outsourced to cloud servers.
The cloud servers offer strong algorithms such as face/object detection, face/object recognition and intelligent video surveillance. Nowadays, people can easily find all the faces in their photos stored in the cloud using the powerful face detection algorithms maintained by the cloud servers. However, the cloud servers are always third-party entities. Thus, the privacy of the users' visual media data may be leaked to the public or to unauthorized parties.
Meanwhile, the powerful cloud services for visual media analysis and processing cost their providers a lot of money, data and time. The cloud servers are therefore also reluctant to leak any parameters of the trained models or any protected details of their copyrighted algorithms.
Thus, the privacy of both the content of the clients' visual media and the parameters of the cloud servers' vision algorithms should be protected. Ten years ago, Avidan & Butman introduced Blind Vision, a method for securely evaluating a Viola-Jones type face detector. Blind Vision uses standard cryptographic tools and is painfully slow to compute, taking a couple of hours to scan a single image.
After that, a rich literature has emerged in this field. Cryptographic tools such as secret sharing (SS), secure multi-party computation (SMC), homomorphic encryption (HE), garbled circuits (GC) and chaotic systems (CS) are heavily used. Plenty of computer vision applications have been modified into privacy-preserving or secure versions, such as private face detection, face recognition, content-based image retrieval, visual media search on public datasets, and intelligent video surveillance [3, 5, 6, 7].
However, most of these works rely heavily on cryptographic tools, which are painfully slow to compute or need bit-by-bit interaction between the clients and the cloud servers. In this paper, we revisit Blind Vision and attempt to move blind vision towards being cryptography-free without losing its security properties. We use randomness and only a few cryptographic operations to protect the visual media data of the clients and the parameters of the trained models in the cloud servers.
A novel image representation called the Random Base Image (RBI) representation is proposed. In this work, we also investigate object detection in the cloud. We apply our RBI to the well-known Viola-Jones object detection method and propose a novel blind object detection method. We separate an image into random base images. The weight of each base image is known only to the client. The base images are sent in random order to the cloud server, which cannot recover anything from them. A random vector and the secure Millionaire protocol are leveraged to protect the parameters of the trained object detector. The RBI makes the integral-image representation usable again, which greatly accelerates the computation. The experimental results show that our method is significantly faster than the traditional Blind Vision, with only a very low theoretical probability of information leakage.
2 Secure Object Detection
In this section we develop a secure object detector with the random base image representation.
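Our scenario and notation are the same as those of the traditional Blind Vision, as shown in Figure 1. Denote by $F^L$ an $L$-dimensional space over a finite field $F$ that is large enough to represent all the intermediate results. Denote by $X$ the image that Alice owns. A particular detection window within the image $X$ will be denoted by $x \in F^L$ and will be treated in vector form. Bob owns a strong classifier of the form

$$H(x) = \mathrm{sign}\left(\sum_{n=1}^{N} h_n(x)\right), \qquad (1)$$

where $h_n(x)$ is a threshold function of the form

$$h_n(x) = \begin{cases} \alpha_n & \text{if } x^T y_n > \Theta_n \\ \beta_n & \text{otherwise} \end{cases}$$

and $y_n \in F^L$ is the hyperplane of the threshold function $h_n(x)$. The parameters $\alpha_n$, $\beta_n$ and $\Theta_n$ of $h_n(x)$ are determined during training; $N$ is the number of weak classifiers used.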
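As a plain (non-secure) illustration of the classifier form described above, the following minimal sketch evaluates a strong classifier as the sign of a sum of thresholded weak classifiers. The tuple layout, the function names and the choice to treat a zero sum as negative are assumptions made for this example only.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout of one weak classifier:
# (hyperplane y, threshold theta, output alpha, output beta).
WeakClassifier = Tuple[Sequence[int], int, int, int]


def dot(u: Sequence[int], v: Sequence[int]) -> int:
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def weak_response(x: Sequence[int], clf: WeakClassifier) -> int:
    """Return alpha if x^T y exceeds the threshold, beta otherwise."""
    y, theta, alpha, beta = clf
    return alpha if dot(x, y) > theta else beta


def strong_classify(x: Sequence[int], classifiers: List[WeakClassifier]) -> int:
    """Sign of the summed weak responses (+1 or -1).

    Assumption: a sum of exactly zero is reported as -1.
    """
    total = sum(weak_response(x, clf) for clf in classifiers)
    return 1 if total > 0 else -1
```

For instance, a window `[1, 2, 3]` with two weak classifiers whose responses are `2` and `-4` sums to `-2` and is rejected.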
2.2 The Random Base Image Representation
The core idea of our RBI is to separate the original image into several random base images with fixed weights, such that the original image can be recovered from all the base images. Sparse representation also has this ability. However, it needs another image dataset for learning the base images and, furthermore, it may introduce reconstruction error. Thus, we fix the weights and randomize the base images themselves.
The detection window $x$ can be represented as:

$$x = \sum_{i=1}^{k} w_i b_i,$$

where $b_i$ is the base image with weight $w_i$ and $k$ is the number of base images. As shown in Figure 2, each base image has a fixed weight, while the base image itself is randomly determined. The number of base images $k$ is set so that each base image can be a binary image, which is easy to transfer over the network and fast to compute. In addition, there are $k!$ permutations of the base images, which is not easy to guess. The process of RBI generation is described in Algorithm 1.
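The idea of fixed weights with randomized binary base images can be sketched as follows. This is not the paper's Algorithm 1, whose details are not reproduced here; it is one possible scheme under stated assumptions: pixel values lie in 0..255, the weight set `WEIGHTS` is a hypothetical redundant set summing to 255 (redundancy is what leaves room for random choices), and the window is a flat list of integers.

```python
import random

# Hypothetical redundant weight set (descending, sums to 255). Every weight is
# at most 1 + the sum of the weights after it, so any residual value that fits
# in the remaining weights can always be completed exactly.
WEIGHTS = [64, 64, 32, 32, 16, 16, 8, 8, 4, 4, 2, 2, 1, 1, 1]


def random_base_images(window, weights=WEIGHTS, rng=None):
    """Split a flat window into binary base images, one per weight."""
    rng = rng or random.Random()
    limit = sum(weights)
    if any(not 0 <= v <= limit for v in window):
        raise ValueError("pixel values must lie in 0..sum(weights)")
    residual = list(window)
    bases = []
    for i, w in enumerate(weights):
        rest = sum(weights[i + 1:])  # what the later weights can still cover
        base = []
        for p, v in enumerate(residual):
            if v > rest:
                bit = 1  # forced: later weights alone cannot reach v
            elif v < w:
                bit = 0  # forced: this weight would overshoot
            else:
                bit = rng.randint(0, 1)  # free choice, made at random
            base.append(bit)
            residual[p] = v - w * bit
        bases.append(base)
    return bases


def reconstruct(bases, weights=WEIGHTS):
    """Recover the window as the weighted sum of its base images."""
    n = len(bases[0])
    return [sum(w * b[p] for w, b in zip(weights, bases)) for p in range(n)]
```

Two calls on the same window generally yield different base images, yet `reconstruct` returns the original window in both cases.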
2.3 Secure Object Detection with RBI
2.3.1 Secure Object Classifier Protocol
The core of our method is the secure object classifier protocol described in Algorithm 2 and Figure 3. For secure object detection, Alice first divides the test image into detection windows. The detection windows are then sent to Bob one by one, in random order, as the inputs of the secure object classifier protocol. Using Algorithm 2, Alice and Bob learn which detection windows contain the target objects. Because the detection windows are sent to Bob in random order, only Alice learns the locations of the detected faces in the original image. Bob learns nothing about the contents of Alice's image, including where the faces are. Alice learns nothing about the parameters of Bob's face detector.
The body of Algorithm 2 is described as follows:
(1): Alice factorizes the detection window $x$ into $k$ random base images $B = \{b_1, \dots, b_k\}$ with weights $W = \{w_1, \dots, w_k\}$ through Algorithm 1.
(2): Alice randomly shuffles the weights $W$ to $W'$. The random base images $B$ are permuted in the same order as $W'$ to form $B'$, which is sent to Bob.
(3): In one cascade, Bob has $N$ weak classifiers with parameter vectors $y_1, \dots, y_N$. Bob randomly adds $M$ fake weak classifiers and sets their output parameters $\alpha$ and $\beta$ to zero. Bob randomly shuffles the true and fake weak classifiers to form $y'_1, \dots, y'_{N+M}$. Then, Bob generates random positive numbers $s_1, \dots, s_{N+M}$. For each parameter vector $y'_n$, Bob and Alice repeat the following 3 steps.
(3.1): Bob computes the feature responses of all the base images in $B'$ by $r_{n,i} = {b'_i}^T y'_n$. All the responses of the base images on each parameter vector are sent back to Alice.
(3.2): Alice computes the feature response of the detection window by $x^T y'_n = \sum_{i=1}^{k} w'_i r_{n,i}$.
(3.3): Alice and Bob use the secure Millionaire protocol to determine which number is larger: Alice's response $x^T y'_n$ or Bob's threshold $\Theta_n$. Bob sends $\alpha_n + s_n$ or $\beta_n + s_n$ to Alice accordingly. Alice stores it as $c_n$.
(4): Alice and Bob use the secure Millionaire protocol to determine which number is larger: $\sum_n c_n$ or $\sum_n s_n$. If Alice has the larger number then $x$ is positively classified; otherwise $x$ is negatively classified.
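The message flow of the steps above can be simulated in a single process. The sketch below is illustrative only: `greater_than` is a stand-in that compares in the clear where the protocol would run the secure Millionaire protocol, and the function names, the range of the random numbers and the shape of the fake classifiers are assumptions of this example.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def greater_than(a, b):
    """Stand-in for the secure Millionaire protocol (NOT secure: plain compare)."""
    return a > b


def secure_classify(bases, weights, classifiers, n_fake=2, rng=None):
    """Simulate one cascade stage on one window.

    bases, weights: Alice's base images (flat lists) and their weights.
    classifiers: Bob's list of (y, theta, alpha, beta) tuples.
    """
    rng = rng or random.Random()
    dim = len(bases[0])

    # Step 2 (Alice): shuffle weights and base images with the same permutation.
    order = list(range(len(bases)))
    rng.shuffle(order)
    shuffled_weights = [weights[i] for i in order]  # kept by Alice
    shuffled_bases = [bases[i] for i in order]      # sent to Bob

    # Step 3 (Bob): add fake classifiers whose outputs are zero, shuffle the
    # pool, and draw one random positive blinding number per classifier.
    fakes = [([rng.randint(-1, 1) for _ in range(dim)], 0, 0, 0)
             for _ in range(n_fake)]
    pool = list(classifiers) + fakes
    rng.shuffle(pool)
    blinds = [rng.randint(1, 1000) for _ in pool]

    alice_total = 0
    for (y, theta, alpha, beta), s in zip(pool, blinds):
        # Step 3.1 (Bob): one response per base image, sent back to Alice.
        responses = [dot(b, y) for b in shuffled_bases]
        # Step 3.2 (Alice): recombine the responses with her secret weights.
        feature = sum(w * r for w, r in zip(shuffled_weights, responses))
        # Step 3.3: compare feature and threshold; Bob returns a blinded output.
        alice_total += (alpha if greater_than(feature, theta) else beta) + s

    # Step 4: Alice's blinded sum against Bob's sum of blinding numbers.
    return greater_than(alice_total, sum(blinds))
```

Because the dot product is linear, the recombined value in step 3.2 equals the response of the original window, so the outcome matches the plain classifier regardless of the random choices.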
The protocol protects the security of both parties. The protocol protects the contents of the image from Alice and the parameters of the face detector from Bob. We analyse the security of Algorithm 2 in the following paragraph.
From Alice to Bob
In step 2, Alice sends randomly shuffled base images to Bob. Bob only sees the randomly generated base images and does not know the weight of each base image. With $k$ base images, the probability of guessing the right permutation is $1/k!$. Even if Bob guesses the right permutation, he does not know the weight of each base image. Thus, it is almost impossible for Bob to recover the detection window of Alice.
In the 3rd sub-step of step 3 and in step 4, Alice and Bob engage in the secure Millionaire protocol, so Bob learns nothing about Alice's data.
From Bob to Alice
In the 1st sub-step of step 3, Alice cannot learn the number of weak classifiers or the true filters from the received feature responses, because the true filters are obfuscated by the fake filters.
In the 3rd sub-step of step 3, Alice and Bob engage in the secure Millionaire protocol, so Alice only learns whether $x^T y'_n > \Theta_n$, where $\Theta_n$ is the threshold of the weak classifier. She cannot learn anything about the parameter $\Theta_n$ itself. Moreover, at the end of the Millionaire protocol Alice learns either $\alpha_n + s_n$ or $\beta_n + s_n$. In both cases, the real parameter ($\alpha_n$ or $\beta_n$) is obfuscated by the random number $s_n$.
In step 4, Alice and Bob use the secure Millionaire protocol to determine which number is larger: $\sum_n c_n$, the sum of the values Alice stored, or $\sum_n s_n$, the sum of Bob's random numbers. If Alice has the larger number then $x$ is positively classified; otherwise $x$ is negatively classified.
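The final comparison can be unpacked as a short worked identity. The symbols are illustrative assumptions: $h_n(x)$ is the output of the $n$-th weak classifier (zero for a fake one), $s_n$ is Bob's random positive number for it, and $c_n = h_n(x) + s_n$ is the value Alice stores.

```latex
\sum_{n} c_n > \sum_{n} s_n
\;\iff\; \sum_{n} \bigl(h_n(x) + s_n\bigr) - \sum_{n} s_n > 0
\;\iff\; \sum_{n} h_n(x) > 0
```

The random numbers cancel in the comparison, so the result agrees with the sign of the summed weak responses, while each individual stored value reveals a classifier output only up to an offset unknown to Alice.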
Multiple Cloud Servers
The random base images can also be distributed among multiple cloud servers that hold the same object detector, which further increases security.
2.3.3 Complexity and Efficiency
The complexity of the protocol is $O(k(N+M)L)$, where $k$ is the number of base images, $N$ and $M$ are the numbers of true and fake weak classifiers, respectively, and $L$ is the dimensionality of the detection window $x$.
Unlike the traditional Blind Vision, in which the oblivious transfer (OT) operation is used extensively, the proposed method only uses the OT operation to compare two numbers. In the secure dot-product protocol of Blind Vision, each pixel of each detection window requires an OT operation, which needs RSA encryption and RSA decryption with 128-bit encryption keys. We instead rely on our random base images, whose computation is much faster than the RSA encryption and decryption operations.
In addition, the traditional Blind Vision converts the integral-image representation into a regular dot-product operation, a step that clearly slows down the implementation because it no longer takes advantage of the integral-image representation. In our RBI based protocol, the integral-image representation can be used again, which noticeably accelerates the computation.
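Why base images are compatible with integral images can be illustrated with a small sketch: rectangle sums are linear, so the rectangle sum of a weighted combination of base images equals the same weighted combination of their rectangle sums. The helper names and the 2D list-of-lists image layout are assumptions of this example, not the paper's implementation.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] from four table lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def weighted_rect_sum(base_tables, weights, top, left, bottom, right):
    """Rectangle sum of the combined image, computed from per-base tables."""
    return sum(w * rect_sum(ii, top, left, bottom, right)
               for w, ii in zip(weights, base_tables))
```

In such a setup the server could build one integral image per base image and return rectangle sums, and the client would combine them with its secret weights as in `weighted_rect_sum`.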
We convert the Viola-Jones type object detector [1, 11] to our secure object detector. We implement our RBI based object detector using Microsoft Visual Studio 2012 and the OpenCV 2.4.3/2.4.10 package for computer vision (http://opencv.org/) on a 64-bit Windows 7 operating system. The hardware configuration is a 3.5GHz AMD A10 Pro-7800 R7 CPU with 12 compute cores and 8GB of memory.
The face detector is from the OpenCV 2.4.3 package and consists of a cascade of 22 rejectors, where each rejector is of the form presented in Eq. 1. The first rejector consists of 3 weak classifiers. The most complicated rejector consists of 213 weak classifiers. There is a total of 2135 weak classifiers. We also test the nose detector, the eye detector and the full body detector from OpenCV 2.4.10.
3.1 The Detection Accuracy
We randomly select 100 face images from each of the 3 datasets. The detection accuracy (88.46%) of our secure face detector is the same as that of the OpenCV 2.4.3 face detector (88.46%).
The nose and the eye detectors are tested on the FDDB dataset . The full body detector is tested on the INRIA Person dataset . We randomly select 100 images from each of the 2 datasets. The detection accuracy of our secure object detectors is the same as that of the OpenCV 2.4.10 nose, eye and full body detectors.
3.2 Comparison with Other Methods
We compare our method with the Viola-Jones method implemented by the OpenCV package and with the traditional Blind Vision method. 50 test images of the same size are randomly selected from each of the 3 datasets. The average running time is shown in Table 1. All the methods run in client-server mode. For Viola-Jones, Alice sends the original image to Bob; Bob then runs the Viola-Jones method and returns the detected windows to Alice. Our method is slower than the Viola-Jones method, which runs on plain images without protecting any privacy. In the traditional Blind Vision method, the time-consuming OT operation is heavily used and the integral-image representation is disabled, so it takes a couple of hours to scan a single image, which is painfully slow. In our method, the only information that Bob learns is how many faces are in Alice's image, and our nearly cryptography-free method is significantly faster than the previous work, which is a step towards practical usage of blind vision applications.
In addition, we make the same comparison for the other detectors. 50 test images of the same size are randomly selected from each of the FDDB and the INRIA Person datasets. The average running times are shown in the last 3 rows of Table 1. All the methods run in client-server mode.
| Dataset | Ours | Ours + Comm. Delays | VJ | VJ + Comm. Delays | Blind Vision |
|---|---|---|---|---|---|
| FDDB-face | 143.852 s | 380.992 s | 0.380 s | 0.843 s | A couple of hours |
| FDDB-nose | 113.398 s | 294.845 s | 0.372 s | 0.836 s | A couple of hours |
4 Conclusions and Discussions
We propose a novel random base image (RBI) representation for efficient secure object detection. The traditional Blind Vision method applies secure multi-party techniques to vision algorithms. It reveals no information to either party, at the expense of a heavy computational load. Our method is an attempt towards cryptography-free blind vision. Alice learns nothing about the parameters of Bob's face detector. Bob does not learn the contents of Alice's image. The only possible leakage is that Bob has some probability of guessing the right permutation of the base images, which is merely a theoretical event. Even if Bob guesses the right permutation, he does not know the weight of each base image. Thus, it is almost impossible for Bob to learn the content of Alice's detection window. Because the heaviest cost, the OT operation in the secure dot product of Blind Vision, is avoided by our RBI based dot product, our Millionaire-based protocol needs much less time than the traditional Blind Vision protocol.
There are several possible extensions to this work. The first is to accelerate secure blind vision towards practical use, i.e., to reduce the time cost to near that of the vision algorithm without security considerations. The second is to make both training and testing blind. This would allow client users to upload more visual data to the cloud without worrying about privacy leakage.
- [1] Paul A. Viola and Michael J. Jones, “Robust real-time face detection,” in IEEE 8th International Conference On Computer Vision, ICCV 2001, Vancouver, British Columbia, Canada, July 7-14, 2001, p. 747.
- [2] Shai Avidan and Moshe Butman, “Blind vision,” in Computer Vision - ECCV 2006, 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part III, 2006, pp. 1–13.
- [3] Maneesh Upmanyu, Anoop M. Namboodiri, Kannan Srinathan, and C. V. Jawahar, “Efficient privacy preserving video surveillance,” in IEEE 12th International Conference on Computer Vision, ICCV, Kyoto, Japan, September 27 - October 4, 2009, pp. 1639–1646.
- [4] Margarita Osadchy, Benny Pinkas, Ayman Jarrous, and Boaz Moskovich, “Scifi - A system for secure face identification,” in 31st IEEE Symposium on Security and Privacy, S&P 2010, 16-19 May 2010, Berkeley/Oakland, California, USA, 2010, pp. 239–254.
- [5] Hosik Sohn, Konstantinos N. Plataniotis, and Yong Man Ro, “Privacy-preserving watch list screening in video surveillance system,” in Advances in Multimedia Information Processing - PCM 2010 - 11th Pacific Rim Conference on Multimedia, Shanghai, China, September 21-24, 2010, Proceedings, Part I, 2010, pp. 622–632.
- [6] Chun-Te Chu, Jaeyeon Jung, Zhicheng Liu, and Ratul Mahajan, “strack: Secure tracking in community surveillance,” in Proceedings of the ACM International Conference on Multimedia, MM’14, Orlando, FL, USA, November 03 - 07, 2014, pp. 837–840.
- [7] Xin Jin, Kui Guo, Chenggen Song, Xiaodong Li, et al., “Private video foreground extraction through chaotic mapping based encryption in the cloud,” in MultiMedia Modeling - 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part I, 2016, pp. 562–573.
- [8] Shai Avidan and Moshe Butman, “Efficient methods for privacy preserving face detection,” in Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pp. 57–64.
- [9] Jagarlamudi Shashank, Palivela Kowshik, Kannan Srinathan, and C. V. Jawahar, “Private content based image retrieval,” in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA, 2008.
- [10] Giulia C. Fanti, Matthieu Finiasz, and Kannan Ramchandran, “One-way private media search on public databases: The role of signal processing,” IEEE Signal Process. Mag., vol. 30, no. 2, pp. 53–61, 2013.
- [11] Paul A. Viola and Michael J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
- [12] Vidit Jain and Erik Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” Tech. Rep. UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
- [13] Libor Spacek, “Collection of facial images: Faces96,” http://cswww.essex.ac.uk/mv/allfaces/faces96.html.
- [14] Carlos E. Thomaz and Gilson Antonio Giraldi, “A new ranking method for principal components analysis and its application to face image analysis,” Image Vision Comput., vol. 28, no. 6, pp. 902–913, 2010.
- [15] Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, 2005, pp. 886–893.