Concealing the identity of faces in oblique images with adaptive hopping Gaussian mixtures
Abstract
Cameras mounted on Micro Aerial Vehicles (MAVs) are increasingly used for recreational photography. However, aerial photographs of public places often contain faces of bystanders, thus leading to a perceived or actual violation of privacy. To address this issue, we propose to pseudorandomly modify the appearance of face regions in the images using a privacy filter that prevents a human or a face recogniser from inferring the identities of people. The filter, which is applied only when the resolution is high enough for a face to be recognisable, adaptively distorts the face appearance as a function of its resolution. Moreover, the proposed filter locally changes its parameters to discourage attacks that use parameter estimation. The filter exploits both global adaptiveness to reduce distortion and local hopping of the parameters to make their estimation difficult for an attacker. To evaluate the effectiveness of the proposed approach, we use a state-of-the-art face recognition algorithm and synthetically generated face data with 3D geometric image transformations that mimic faces captured from an MAV at different heights and pitch angles. Experimental results show that the proposed filter protects privacy while reducing distortion and exhibits resilience against attacks.
1 Introduction
MAVs are becoming common platforms for a number of civilian applications such as search and rescue (Waharte and Trigoni, 2010), disaster management (Quaritsch et al., 2010) and news reporting (Babiceanu et al., 2015). Moreover, individuals use MAVs equipped with high resolution cameras for recreational photography and videography in public places during sports activities and social gatherings (Hexo+, 2018; AirDog, 2018). Such use in public places raises privacy concerns as bystanders who happen to be within the field of view of the camera are captured as well. The identity of bystanders could be protected by locating and removing (or sufficiently distorting) key image regions, such as faces, using algorithms called privacy filters. However, in order to maintain the aesthetic value of an image, only a minimal distortion of the image content should be allowed.
A privacy filter for recreational aerial photography should satisfy the following properties: (a) introduce only a minimal distortion; (b) be robust against attacks; and (c) be computationally efficient. Minimal distortion is necessary to keep the quality of a protected image close to that of the unprotected one, so that the attention of a viewer is not diverted. Therefore blanking out a face (Schiff et al., 2007) is not a desirable option. Robustness is important to avoid privacy violations by various attacks, e.g. brute-force, naïve, parrot and reconstruction attacks (Kundur and Hatzinakos, 1996; Boult, 2005; Newton et al., 2005; Dufaux and Ebrahimi, 2008; Erdelyi et al., 2014; Korshunov and Ebrahimi, 2014; Dong et al., 2016). A brute-force attack tries to decipher the protected probe images by an exhaustive search (Boult, 2005; Dufaux and Ebrahimi, 2008). Other attacks use gallery images in addition to the protected probe images (Newton et al., 2005; Erdelyi et al., 2014; Korshunov and Ebrahimi, 2014; Dong et al., 2016). In a naïve attack, the protected probe images are compared against the unprotected gallery images (Newton et al., 2005; Erdelyi et al., 2014; Korshunov and Ebrahimi, 2014). In a parrot attack, the attacker has knowledge about the privacy filter and can transform the gallery images into the distorted domain (Newton et al., 2005). In a reconstruction attack, the attacker has some knowledge of how to (partially) reconstruct the probe image from the protected to the unprotected domain (Kundur and Hatzinakos, 1996). Examples of reconstruction methods include inverse filtering and super-resolution techniques (Kundur and Hatzinakos, 1996; Dong et al., 2016). Finally, computational efficiency is desirable when the filter operates using the limited computational and battery power of an MAV.
Privacy filters for aerial photography need to face challenges caused by the ego-motion of the camera, changing illumination conditions, and variable face orientation and resolution. Recent frameworks that support facial privacy preservation in airborne cameras are Generic Data Encryption (Kim et al., 2014), Unmanned Aircraft Systems-Visual Privacy Guard (Babiceanu et al., 2015) and Adaptive Gaussian Blur (Sarwar et al., 2016). Generic Data Encryption sends an encrypted face region to a privacy server that Gaussian blurs or mosaics the face and then forwards it to an end user. Unmanned Aircraft Systems-Visual Privacy Guard (Babiceanu et al., 2015) and Adaptive Gaussian Blur (Sarwar et al., 2016) are aimed instead at on-board implementation, with the objective of reducing latency and discouraging brute-force attacks on the server (Kim et al., 2014). Adaptive Gaussian Blur adaptively configures the Gaussian kernel depending upon the face resolution in order to minimise distortion, while Unmanned Aircraft Systems-Visual Privacy Guard blurs faces with a fixed filter. These methods are prone to parrot attacks (Newton et al., 2005) on the Gaussian blur.
In this paper, we present a novel privacy protection filter to be used on board an MAV. The proposed filter distorts a face region with secret parameters to be robust to naïve, parrot and reconstruction attacks. The distortion is minimal and adaptive to the resolution of the captured face: we select the smallest Gaussian kernel that reduces the face resolution below a certain threshold. The selected threshold protects the face against the naïve attack and maintains its resolution at a specified level. To prevent other attacks, we then insert supplementary Gaussian kernels in the selected Gaussian kernel and hop their parameters locally using a pseudorandom number generator (PRNG), so that their estimation from the filtered face image is difficult. The block diagram of the proposed filter is shown in Figure 1.
An updated work based on the proposed filter, which targets airborne videography rather than airborne photography, is presented in Sarwar et al. (2018). The main contributions of this paper are: (1) the basic idea of the Gaussian hopping kernels and their details, (2) a large-scale synthetic face image data set emulating faces captured from an MAV, and (3) extensive experiments to validate the proposed Gaussian hopping kernels, including the reconstruction attacks.
The paper is organised as follows. Sec. 2 covers the state of the art in visual privacy protection filters. Sec. 3 defines the problem. Sec. 4 describes the proposed algorithm, and discusses its computational complexity and security level. Sec. 5 presents our face data set generation and Sec. 6 discusses the experimental results. Finally, Sec. 7 concludes the paper.
2 Background
Visual privacy protection filters can be applied as preprocessing or postprocessing (Fig. 2).
Preprocessing privacy filters are irreversible and operate during image acquisition to prevent a camera from capturing sensitive regions. These filters disable the software or hardware of the camera or notify about photography prohibition (Safe Haven, 2003). Hardware-based filters prevent the camera from taking images, for example by bursting back an intense light for flash photography (Eagle Eye, 1997; Zhu et al., 2017) or by detecting human faces using an infrared sensor and then obfuscating them using a spatial light modulator sensor placed in front of the Charge Coupled Device (CCD) sensor (Zhang et al., 2014).
Postprocessing privacy filters protect sensitive regions after image acquisition and can be reversible or irreversible. Reversible filters conceal sensitive regions using a private key, which can later be used to recover the original sensitive region. Irreversible filters deform the features of a sensitive region permanently. Both reversible and irreversible filters can be nonadaptive or adaptive.
Reversible nonadaptive filters are based on generic encryption (Boult, 2005; Chattopadhyay and Boult, 2007; Rahman et al., 2010; Winkler and Rinner, 2011; Zhang et al., 2018). Reversible adaptive filters include scrambling (Dufaux and Ebrahimi, 2006, 2008; Baaziz et al., 2007; Sohn et al., 2011; Ruchaud and Dugelay, 2017), warping (Korshunov and Ebrahimi, 2013b) and morphing (Korshunov and Ebrahimi, 2013a). While reversible adaptive filters are robust against a parrot attack, their protected faces can be compromised by spatial-domain (Jiang et al., 2016a, b) or frequency-domain attacks (Rashwan et al., 2015).
Irreversible nonadaptive filters blank out (Schiff et al., 2007; Koelle et al., 2018) or replace a face with a de-identified representation (Newton et al., 2005). For example, to maintain k-anonymity, the k-Same algorithm (Newton et al., 2005) replaces k faces with their average face. Variants of this algorithm use additional specialised detectors to preserve attributes such as facial expression, pose, gender, race and age (Gross et al., 2006; Du et al., 2014; Lin et al., 2012; Letournel et al., 2015; Meden et al., 2018). Irreversible nonadaptive filters are robust to parrot attacks. Irreversible adaptive filters lower the resolution of a sensitive region so that humans or algorithms cannot recognise the identity. Examples include pixelation (Chinomi et al., 2008), Gaussian blur (Wickramasuriya et al., 2004) and cartooning (Erdelyi et al., 2014). The kernel size of these privacy filters can be manually selected (Korshunov and Ebrahimi, 2014; Erdelyi et al., 2014). Alternatively, only the centre kernel size is manually selected and the Space Variant Gaussian Blur (SVGB) filter (Saini et al., 2012) automatically decreases the kernel size from the centre to the boundary of the detected face. AGB (Sarwar et al., 2016) exploits the different horizontal and vertical resolutions that are typical in aerial photography, and automatically adapts an anisotropic kernel based on the resolution of the detected face. However, irreversible adaptive filters are vulnerable to parrot attacks.
Table 1. Filters compared (columns): DCTS, PICO, GARP, UASVPG, Cartooning, SVGB, ODBVP, AGB, Proposed.
Distortion  adaptive control  image based  ✓  ✓  ✓  ✓
navigation sensors  ✓  ✓
2D kernel  isotropic  ✓  ✓  ✓
anisotropic  ✓  ✓
Robustness  to brute-force attack  ✓  ✓  ✓  ✓  ✓  ✓  ✓
to naïve attack  ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓
to inverse filter attack  ✓  ✓  ✓
to super-resolution attack  ✓  ✓  ✓  ✓
to parrot attack  with detectors  ✓
without detectors  ✓  ✓  ✓
Computational simplicity  ✓  ✓  ✓  ✓
In summary, Table 1 compares representative filters for the following categories: reversible & adaptive (Dufaux and Ebrahimi, 2008), reversible & nonadaptive (Boult, 2005), and irreversible & nonadaptive filters (Du et al., 2014). The remaining filters (Babiceanu et al., 2015; Erdelyi et al., 2014; Saini et al., 2012; Korshunov and Ebrahimi, 2014; Sarwar et al., 2016) and the proposed filter are irreversible & adaptive.
3 Problem Definition
Let the set contain face data of subjects, where represents the identity (labels). Let each subject appear in at most images, i.e. . Let be the gallery and probe sets, respectively. Usually , where is the cardinality of a set, and .
Let a privacy filter distort image features in order to reduce the probability for an attacker to correctly predict labels. This operation produces a protected probe set , whose distortion depends on , where indicates the horizontal and vertical direction in an image. Let the distortion generated by be measured by the Peak Signal to Noise Ratio (PSNR):
(1) 
where is the dynamic range of the pixel values. The mean square error, MSE, between the pixel intensities of an unprotected, , and protected, , face is
(2) 
where and are width and height of , respectively.
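As an illustration of the distortion measure, the following minimal sketch computes the MSE and the PSNR for two greyscale faces stored as lists of rows. It assumes the standard definitions (PSNR as ten times the base-10 logarithm of the squared dynamic range over the MSE); the function names and the default dynamic range of 255 are illustrative choices, not taken from the paper.

```python
import math


def mse(original, protected):
    """Mean square error between two equally sized greyscale images."""
    height = len(original)
    width = len(original[0])
    total = 0.0
    for row_o, row_p in zip(original, protected):
        for a, b in zip(row_o, row_p):
            total += (a - b) ** 2
    return total / (width * height)


def psnr(original, protected, dynamic_range=255):
    """PSNR in dB (assumed standard form); infinite for identical images."""
    error = mse(original, protected)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(dynamic_range ** 2 / error)


# Toy 2x2 "face" and a distorted version of it (illustrative values).
face = [[100, 120], [140, 160]]
blurred = [[110, 110], [150, 150]]
print(mse(face, blurred), psnr(face, blurred))
```

A lower PSNR indicates a stronger distortion, so a filter that protects privacy at a higher PSNR is preferable under property (a).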
We express the privacy level of a face region as the accuracy of a face recogniser (Erdelyi et al., 2014; Korshunov and Ebrahimi, 2014). The value of is the cumulative rank-n accuracy in face identification or the Equal Error Rate (EER) in face verification. In this paper we consider face verification, thus
(3) 
where and are true positives and true negatives, respectively. Our target is to force the face recogniser of an attacker to have the accuracy of a random classifier, which for face verification is .
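The accuracy used as the privacy level can be sketched as follows, assuming that Eq. 3 normalises the correct decisions by the total number of evaluated pairs (the denominator is an assumption here). The function name and the counts are illustrative only.

```python
def verification_accuracy(true_pos, true_neg, false_pos, false_neg):
    """Fraction of face pairs whose same/different decision is correct."""
    total = true_pos + true_neg + false_pos + false_neg
    if total == 0:
        raise ValueError("no pairs were evaluated")
    return (true_pos + true_neg) / total


# A recogniser that guesses at random on a balanced set of matched and
# mismatched pairs (illustrative counts) is right half of the time.
print(verification_accuracy(true_pos=150, true_neg=150,
                            false_pos=150, false_neg=150))  # 0.5
```

On a balanced verification set, an accuracy close to 0.5 therefore means that the recogniser extracts no usable identity information from the protected faces.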
We therefore aim to design that irreversibly but minimally distorts the appearance of so that the identity is not recognisable with a probability higher than a random guess. If , the ideal distortion parameter, , should be derived as:
(4) 
where . The first term aims to introduce a minimal distortion, whereas the second term leads the classification results to be equivalent to that of a random classifier, irrespective of whether the filtered or reconstructed face is compared against the unprotected, filtered or reconstructed gallery data sets. The second term objective is dependent upon the recognition capability of a face recogniser and is heuristically addressed for a given face recogniser (Chriskos et al., 2016; Erdélyi et al., 2017; Pittaluga and Koppal, 2017).
The content of should be protected against naïveT, parrotT and reconstruction attacks. Let an attacker have access to , where is the filtered gallery data set and is the filtered and reconstructed gallery data set. An attacker can modify , , or both, to correctly predict of . In a naïve attack (here referred to as naïveT attack), a privacy filter is applied on to generate a protected probe data set , while the unaltered is used for training (Newton et al., 2005). A parrot attack (here referred to as parrotT attack) learns the privacy filter type and its parameters (e.g. Gaussian blur of a certain standard deviation used to generate ). Then, the learned filter is applied on to generate a privacy protected gallery data set . Finally, and are used for training and testing, respectively (Newton et al., 2005). In a reconstruction attack, the discriminating features of are first restored (e.g. using an inverse filter or a super-resolution algorithm) to generate a reconstructed probe data set and then compared against or a reconstructed gallery data set . An inverse filter first estimates the parameters of a privacy filter using and then performs an inverse operation to reconstruct the original faces (Kundur and Hatzinakos, 1996). Similarly, a super-resolution algorithm first learns embeddings between the high-resolution faces and their corresponding low-resolution faces and then reconstructs the high-resolution faces for (Dong et al., 2016).
4 Proposed Approach
In order to minimally distort as well as to achieve robustness against brute-force, naïveT, parrotT and reconstruction attacks, we propose the Adaptive Hopping Gaussian Mixture Model (AHGMM) algorithm. The AHGMM consists of a globally estimated optimal Gaussian Point Spread Function (PSF) and supplementary Gaussian PSFs added inside the optimal Gaussian PSF. For a single supplementary Gaussian PSF inside an optimal Gaussian PSF, the AHGMM is illustrated in Fig. 3, while the pseudocode is given in Algorithm 1. A list of important notations is presented in Appendix 7.
Figure 1 shows the processing diagram of our proposed framework; its blocks are explained in more detail in the following subsections.
4.1 Pixel Density Estimation
Let an MAV capture an image while flying at an altitude of meters. Let the principal axis of its onboard camera be tilted by from the nadir direction (see Figure 4). We assume that height and tilt angle of the camera can be estimated.
A value of generates an oblique image. Let be the height of the face above ground (footnote 1: while each image could contain faces, for simplicity we consider in this paper only the case ). We represent the face region in the image as , which is viewed at an angle .
Let represent the pixel density (px/cm) around the centre of . If and represent the physical dimensions of a pixel in the horizontal and vertical direction, respectively and is the focal length of the camera, the horizontal density for a pixel around (Sarwar et al., 2016) is
(5) 
and the vertical density , by exploiting the small angle approximation for a single pixel of the image sensor (Sarwar et al., 2016), is
(6) 
Let define whether is naturally protected () because of a low horizontal and vertical density, or not () (Sarwar et al., 2016):
(7) 
where and are the pixel densities at which a state-of-the-art machine algorithm starts recognising human faces, simply called thresholds. If , then the original frame can be transmitted without any modifications. Otherwise, should be protected by a privacy filter to reduce its pixel densities below and . When is not inherently protected, we assume that the corresponding bounding box is given.
4.2 Optimal Gaussian PSF
A 2D PSF , or impulse response, is the output of a filter when the input is a point source. In the discrete domain (Oppenheim et al., 1996), it is given as , where is the convolution operation and
(8) 
In the case of Gaussian blur, is an approximated Gaussian function of mean and standard deviation (Saini et al., 2012; Korshunov and Ebrahimi, 2014; Sarwar et al., 2016), and thus called a Gaussian PSF of parameter . More specifically, the parameter controls the distortion strength of and provides pixel density in , respectively.
As a higher results in a lower , we first find the minimum value, called the optimal parameter of , that makes . As a result, provides the minimum distortion in while making it robust against the naïveT attack (i.e. ). Increasing beyond increases the distortion without improving the privacy level, as the recogniser performance is already at the level of the random classifier. For a face captured from an MAV with pixel densities , we calculate of an optimal Gaussian PSF (lines 2-5 in Algorithm 1), where as in traditional Gaussian blur (Saini et al., 2012; Korshunov and Ebrahimi, 2014; Sarwar et al., 2016) and (Sarwar et al., 2016) is estimated as follows:
A Gaussian PSF of standard deviation in the spatial domain is another Gaussian PSF of standard deviation in the frequency domain and both the Gaussian PSFs are related as
(9) 
where is measured in cycles/cm, in px and in px/cm. Let represent the Nyquist frequency of . Let be the highest spatial frequency component that we want to completely remove using a low pass filter, i.e. Gaussian blur. In other words, is the Nyquist frequency of , i.e. the pixel density after filtering. Both and are related as
(10) 
As we are interested in removing frequency components beyond , we can select because the amplitude response of a Gaussian PSF at three times its standard deviation is very close to zero, and multiplication (convolution in the spatial domain) with such a Gaussian PSF will suppress frequencies larger than . Substituting in Eq. 10, in the resulting relation Eq. 9 and finally rearranging gives the optimal standard deviation of the Gaussian PSF as
(11) 
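The derivation above can be sketched numerically under three stated assumptions, since the equations themselves are not reproduced here: a spatial Gaussian of standard deviation s (px) sampled at density d (px/cm) has a frequency-domain standard deviation of d / (2 * pi * s) cycles/cm; the Nyquist frequency of a density is half that density; and the cut-off is placed at three frequency-domain standard deviations. The function name, the per-axis treatment and the zero return value for faces already below the threshold are illustrative choices.

```python
import math


def optimal_sigma(pixel_density, threshold_density):
    """Smallest spatial standard deviation (px) along one axis, as a sketch.

    pixel_density: density of the captured face along the axis (px/cm).
    threshold_density: density to be reached after filtering (px/cm).
    """
    if pixel_density <= threshold_density:
        return 0.0  # assumption: no blur is needed along this axis
    cutoff = threshold_density / 2.0  # Nyquist frequency of the target (cycles/cm)
    sigma_freq = cutoff / 3.0         # three-sigma rule in the frequency domain
    # Assumed spatial/frequency relation: sigma_freq = density / (2 * pi * sigma).
    return pixel_density / (2.0 * math.pi * sigma_freq)


# Illustrative densities: a face at 6 px/cm reduced to a 1 px/cm threshold.
print(optimal_sigma(6.0, 1.0))
```

Under these assumptions the result simplifies to three times the ratio of the two densities divided by pi, so the kernel grows linearly with the captured density and shrinks as the threshold is relaxed.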
4.3 Hopping GMM Kernels
Filtering with the optimal Gaussian PSF defined by would only protect from a naïveT attack but not from a parrotT attack and a reconstruction attack. To ensure that the probability of correctly predicting the label of is not increased in case of the parrotT attack (i.e. ) as well as the reconstruction attack (i.e. or ), we secretly modify to while generating so that an adversary is unable to accurately reconstruct face region , or even generate and . For this purpose, we generate a set which consists of subregions in such a way that each subregion covers a small area of :
(12) 
The size of (in pixels) affects the total number of subregions per face region , which could influence its privacy level. Smaller values of (larger subregions) result in a reduced distortion.
After finding , and generating , we make a hopping Gaussian mixture for each subregion, i.e. we pseudorandomly change to for each . Moreover, we select supplementary Gaussian PSFs inside this optimal Gaussian PSF and vary their parameters based on pseudorandom weights (lines 9-17 in Algorithm 1).
Let the set contain the parameters of the modified optimal and supplementary Gaussian PSFs for each subregion, represented as
(13) 
where is the number of the supplementary Gaussian PSFs. The element represents the modified optimal Gaussian PSF given by
(14) 
(15) 
while the remaining elements (i.e. ) belong to the supplementary Gaussian PSFs. These elements are calculated as
(16) 
(17) 
where and are normalised pseudorandomly generated numbers that control the local distortion in filtering. The variable controls the relative size of the supplementary Gaussian PSF w.r.t. the optimal Gaussian PSF.
After generating the parameters of the Gaussian PSFs, a set representing 2D anisotropic discretised Gaussian PSFs corresponding to is created as
(18) 
where each is calculated (line 19 in Algorithm 1) as (Popkin et al., 2010)
(19) 
[Figure: sample face images. Column labels: (6.21, 4.63), (6.21, 4.56), (3.11, 2.17), (3.11, 2.00), (1.55, 0.89), (1.55, 0.74), (0.78, 0.29), (0.78, 0.20). Row labels: (a), (b) px/cm, (c) px/cm, (d) px/cm.]
where
(20) 
and
(21) 
with . In order to develop a mixture model from the discretised Gaussian PSFs of each subregion, a set of weights is required. We again utilise a PRNG to generate such that
(22) 
Finally, a set of mixture models is generated for each subregion (line 22 in Algorithm 1) as
(23) 
where each element is calculated as
(24) 
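A minimal sketch of how one mixture kernel could be assembled for a single subregion is given below. It samples each anisotropic Gaussian on an integer grid and normalises it, which is a simple stand-in for the discretisation of Popkin et al. (2010), and it normalises the pseudorandom weights so that they sum to one. All names, the parameter values and the seed (standing in for the secret key) are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma, mean, radius):
    """Sampled anisotropic Gaussian, normalised to sum to one.

    sigma, mean and radius are (horizontal, vertical) pairs.
    """
    (sh, sv), (mh, mv), (rh, rv) = sigma, mean, radius
    rows = []
    for y in range(-rv, rv + 1):
        rows.append([
            math.exp(-0.5 * (((x - mh) / sh) ** 2 + ((y - mv) / sv) ** 2))
            for x in range(-rh, rh + 1)
        ])
    total = sum(map(sum, rows))
    return [[v / total for v in row] for row in rows]


def mixture_kernel(components, weights):
    """Weighted sum of equally sized kernels; weights are normalised."""
    norm = sum(weights)
    height, width = len(components[0]), len(components[0][0])
    mix = [[0.0] * width for _ in range(height)]
    for kernel, w in zip(components, weights):
        for y in range(height):
            for x in range(width):
                mix[y][x] += (w / norm) * kernel[y][x]
    return mix


rng = random.Random(42)  # hypothetical seed standing in for the secret key
radius = (6, 5)          # shared support so the kernels can be added
optimal = gaussian_kernel(sigma=(2.0, 1.5), mean=(0.0, 0.0), radius=radius)
supplementary = gaussian_kernel(sigma=(0.8, 0.6), mean=(1.0, -1.0), radius=radius)
weights = [rng.random() for _ in range(2)]
kernel = mixture_kernel([optimal, supplementary], weights)
```

Because every component sums to one and the weights are normalised, the mixture also sums to one, so filtering with it preserves the mean intensity of the face region.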
4.4 Local and Global Filtering
We have now discretised Gaussian mixture models in for subregions of . We locally convolve each subregion (Eq. 12) with their respective to make a protected subregion :
(25) 
where . Changing the convolutional kernel for each subregion generates blocking artefacts (see Fig. 5). To smooth these artefacts, we apply a global convolution filter (line 25 in Algorithm 1) with a Gaussian kernel of zero mean and standard deviation
(26) 
where represents the subregion size in pixels. As a result, a smoothed protected face is produced, which replaces the original face region in the captured image to generate a privacy protected image . Fig. 6 shows a few sample images filtered by AHGMM at different thresholds.
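The two filtering stages can be sketched as follows. This is a simplified illustration, not the AHGMM itself: each subregion is blurred with a single isotropic Gaussian whose standard deviation is pseudorandomly perturbed around the optimal one, instead of a full anisotropic mixture, and the standard deviation of the global smoothing is passed as a parameter because Eq. 26 is not reproduced here. The function names, the `spread` parameter and the key value are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Square sampled Gaussian, normalised to sum to one."""
    vals = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
             for x in range(-radius, radius + 1)]
            for y in range(-radius, radius + 1)]
    total = sum(map(sum, vals))
    return [[v / total for v in row] for row in vals]


def filter_pixel(image, y, x, kernel):
    """Kernel-weighted average around (y, x) with edge replication."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    acc = 0.0
    for dy in range(-r, r + 1):
        yy = min(max(y + dy, 0), h - 1)
        for dx in range(-r, r + 1):
            xx = min(max(x + dx, 0), w - 1)
            acc += kernel[dy + r][dx + r] * image[yy][xx]
    return acc


def hopping_blur(face, sigma_opt, block, key, spread=0.5):
    """Blur each block x block subregion with its own perturbed sigma."""
    rng = random.Random(key)  # the key plays the role of the secret seed
    h, w = len(face), len(face[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            sigma = sigma_opt * (1.0 + spread * (2.0 * rng.random() - 1.0))
            kernel = gaussian_kernel(sigma, math.ceil(3 * sigma))
            for y in range(by, min(by + block, h)):
                for x in range(bx, min(bx + block, w)):
                    out[y][x] = filter_pixel(face, y, x, kernel)
    return out


def protect_face(face, sigma_opt, block, key, sigma_global):
    """Local hopping blur followed by a global smoothing pass."""
    local = hopping_blur(face, sigma_opt, block, key)
    kernel = gaussian_kernel(sigma_global, math.ceil(3 * sigma_global))
    return [[filter_pixel(local, y, x, kernel) for x in range(len(face[0]))]
            for y in range(len(face))]


face = [[(3 * x + 5 * y) % 17 * 10.0 for x in range(8)] for y in range(8)]
protected = protect_face(face, sigma_opt=1.0, block=4, key=7, sigma_global=1.0)
```

The same key reproduces the same sequence of kernels, whereas an attacker without the key has to estimate a different kernel for every subregion.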
4.5 Computational Complexity
The generation of a convolutional kernel is more complex in AHGMM than in the adaptive Gaussian blur filter (Sarwar et al., 2016). In fact, the latter only needs to compute a single Gaussian function, while AHGMM requires the computation of Gaussian functions. Moreover, the adaptive Gaussian blur exploits the separability property of 2D convolutional kernels, i.e. , to reduce the number of multiplications and additions from to + ( and represent the width and height of in pixels, respectively). Instead, AHGMM dynamically reconfigures the convolutional kernel after processing each subregion and therefore requires exactly multiplications and additions.
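The cost difference can be made concrete with a small count of multiply-add operations, assuming the usual figures of one operation per kernel tap: the product of kernel width and height per pixel for a full 2D convolution, and their sum per pixel for a separable one. The function name and the face and kernel sizes are illustrative.

```python
def multiply_adds(face_width, face_height, kernel_width, kernel_height, separable):
    """Multiply-add count for filtering every pixel of a face region."""
    if separable:
        per_pixel = kernel_width + kernel_height  # one row pass plus one column pass
    else:
        per_pixel = kernel_width * kernel_height  # full 2D kernel at every pixel
    return face_width * face_height * per_pixel


# Illustrative case: a 96 x 96 face filtered with a 13 x 11 kernel.
print(multiply_adds(96, 96, 13, 11, separable=True))   # 221184
print(multiply_adds(96, 96, 13, 11, separable=False))  # 1317888
```

In this example the non-separable filtering costs roughly six times more, and the gap widens as the kernel grows.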
5 Dataset Generation
To the best of our knowledge, there is no large publicly available face dataset collected from an MAV. We therefore generate face images as if they were captured from an MAV via geometric transformation and downsampling of the LFW dataset (Huang et al., 2007). The LFW dataset was collected in an unconstrained environment with extreme illumination conditions and extreme poses. We use the standard verification benchmark test of the LFW dataset (12000 images of 4281 subjects), divided into 10 folds for cross-validation. Each fold contains 600 images of the same subject and 600 images of different subjects. We use the deep funnelled version of the LFW dataset.
Figure 8 shows sample images of the stages of the dataset generation pipeline. We detect 68 facial landmarks (Zhu and Ramanan, 2012) on an input image and then iteratively fit a 3D Morphable Model (3DMM) (Bas et al., 2016) to generate a 3D image representation (footnote 2: among the 12000 images, the landmark detector (Zhu and Ramanan, 2012) was unable to detect 68 facial landmarks on 74 images. Therefore, we were unable to fit a 3DMM and used the original 74 images in order to comply with the standard verification test script of the LFW data set). As the subject captured in an image may already have a pitch of a few degrees (e.g. a person looking slightly downward or upward), we rotate the 3D image at pitch by applying a geometric transformation computed from the estimated pose of the fitted 3DMM. This disturbs the image alignment of the original data set, so a realignment is required, which we perform after generating the pitch effect. The synthetic pitch angles range from to with a step size of , and we project each rotated 3D image back to generate a corresponding 2D image. In order to align this image so that the eyes and nose appear at the same place among the images belonging to the same pitch angle, we apply an affine transformation computed by detecting the eyes and nose tip using the Dlib library (King, 2009) such that the transformed face has a resolution of pixels. As the detection accuracy of the eyes and nose decreases with increasing pitch angle, we generate a ground truth (location of eyes and nose tip) of the pitch angle images and use it for the higher pitch angle images.
Finally, to introduce different height effects for the synthetically generated images, we downsample them with a factor of , , and generating images of , , , pixels, respectively. Thus, we increase the size of the original standard verification test of the LFW data set by times, i.e. from images to images. Fig. 7 shows the 40 sample images belonging to the same and different subjects.
We manually determined the values of and by
(27) 
(28) 
where is the cropped face size in pixels, is the pitch angle of the image and and are the average human face dimensions, i.e. the bitragion breadth of 15.45 cm and menton-crinion length of 20.75 cm, respectively (DoD, 2000).
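The density computation can be sketched as below. The form used here is an assumption, since Eqs. 27 and 28 are not reproduced: the horizontal density is taken as the face size in pixels over the bitragion breadth, and the vertical density as the face size scaled by the cosine of the pitch angle over the menton-crinion length. The function name and the example sizes and angles are illustrative.

```python
import math

FACE_BREADTH_CM = 15.45  # bitragion breadth quoted in the text (DoD, 2000)
FACE_LENGTH_CM = 20.75   # menton-crinion length quoted in the text (DoD, 2000)


def face_pixel_densities(face_size_px, pitch_deg):
    """Horizontal and vertical pixel densities (px/cm) of a cropped face.

    Assumed form: the pitch foreshortens only the vertical direction,
    by the cosine of the pitch angle.
    """
    horizontal = face_size_px / FACE_BREADTH_CM
    vertical = face_size_px * math.cos(math.radians(pitch_deg)) / FACE_LENGTH_CM
    return horizontal, vertical


for size, pitch in [(96, 0), (48, 30), (12, 70)]:
    h_density, v_density = face_pixel_densities(size, pitch)
    print(size, pitch, round(h_density, 2), round(v_density, 2))
```

Under this assumption, a 96-pixel face at zero pitch gives about (6.21, 4.63) px/cm, which agrees with the first pair of values listed with the sample face images in Sec. 4.3.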
6 Experimental Results
6.1 Experimental Set up
We compare AHGMM against Space Variant Gaussian Blur (SVGB) (Saini et al., 2012), Adaptive Gaussian Blur (AGB) (Sarwar et al., 2016) and Fixed Gaussian Blur (FGB), which uses a constant Gaussian kernel defined with respect to the highest resolution face. Thus, we estimate the kernel for FGB as in (Sarwar et al., 2016) for the face with pixels at pitch angle. For the SVGB filter, we divide the face into four concentric circles and reduce the kernel size by while radially moving out between two consecutive regions as in (Saini et al., 2012). Although the kernel for the innermost region was manually selected in the original work, we choose the anisotropic kernel as estimated by the AGB (Sarwar et al., 2016) and convert it into an isotropic kernel for a fair comparison. We use a block size of and for the AHGMM.
To compare privacy filters, we measure the face verification accuracy using OpenFace (Amos et al., 2016), an open source implementation of Google’s face recognition algorithm FaceNet (Schroff et al., 2015). OpenFace uses a deep Convolutional Neural Network (CNN) as a feature extractor, which is trained on a large face data set (500k images). This feature extractor is applied to the training and test images to obtain their representations (embeddings), which are used for classification (Schroff et al., 2015).
To measure distortion as in (Erdelyi et al., 2014; Nawaz and Ferryman, 2015), we apply the PSNR, the power ratio of the original image with respect to the filtered image.
We perform experiments with 480,000 images (consisting of 5 different resolutions and 8 different pitch angles) to determine the validity of the proposed AHGMM to protect the identity information of an individual. For this purpose, we analyse the effect of a naïveT attack, a parrotT attack, an inverse filter attack and a superresolution attack. Moreover, we quantify the corresponding fidelity degradation caused by the AHGMM.
As AGB and SVGB do not use any secret key, we evaluate them only using their accurate parameters in the parrotT, inverse filter and super-resolution attacks. In contrast, any of these attacks on AHGMM can be further divided into three subattacks: optimal kernel, pseudo AHGMM and accurate AHGMM. In the optimal kernel subattack, we assume that an attacker is able to estimate the parameters of the optimal kernel and applies the optimal kernel to the entire face. In the pseudo AHGMM subattack, we assume that the attacker knows the optimal kernel and randomly modifies the filter parameter for the subregions. In the accurate AHGMM subattack, we assume that the attacker has access to the secret key and can decipher all filter parameters for the subregions. As this prior knowledge can be exploited for both probe and gallery images, we evaluate AHGMM under the 13 different scenarios stated in Table 2.
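Table 2. Attack scenarios by probe images (rows) and gallery images (columns). IF: inverse filter; SR: super-resolution.
Probe images | Gallery: unprotected | Gallery: protected, unchanged | Gallery: protected, reconstructed (IF) | Gallery: protected, reconstructed (SR)
unprotected | naïveBL | N/A | N/A | N/A
protected, unchanged | naïveT |  | - | -
protected, reconstructed (IF) |  | - |  | -
protected, reconstructed (SR) |  | - | - | 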
We assume that an attacker is able to determine the pitch angle of a protected face using the background information of an image captured from an MAV and can apply a geometric transformation to transform the gallery images at that pitch angle. Therefore, in all the following attacks, both the gallery and the probe images are at the same pitch angle which can be protected or unprotected depending upon the attack type. Moreover, we use the same resolution for both the gallery images and the probe images.
6.2 NaïveT Attack
First, we perform a naïveBL attack, which shows the baseline face verification accuracy when both the probe data set and the gallery data set are unprotected. The results of the naïveBL attack are given in Fig. 9. We then perform a naïveT attack in which the gallery images are unprotected, while the probe images are protected using FGB, SVGB (Saini et al., 2012), AGB (Sarwar et al., 2016) and AHGMM. The results of this attack are given in Fig. 10 at different thresholds .
The naïveBL attack shows that the accuracy of our synthetically generated data set decreases with the decrease of the face resolution and with the increase in the face pitch angle. However, this trend vanishes at high pitch angles, i.e. and , where it shows slight randomness. Finally, for the low resolution faces ( pixels), the accuracy does not show any effect of the pitch angle and slightly oscillates. Therefore, we consider pixels inherently privacy protected and remove these images from the analysis of the privacy filters.
[Figure panels: NaïveT; ParrotT (optimal kernel); ParrotT (pseudo AHGMM); ParrotT (accurate AHGMM).]
From the naïveT attack, we are interested in finding the optimal threshold which defines the optimal kernel for AGB (Sarwar et al., 2016) (see Section 4 and Eq. 11). It is clear from Fig. 10 that the accuracy of the naïveT attack decreases as the threshold decreases. When the threshold reaches px/cm, the difference between the accuracy achieved by AGB (Sarwar et al., 2016) and a random classifier () becomes very small except, unexpectedly, at high pitch angles. This difference further decreases at px/cm and px/cm. Thus, the optimal threshold defining the optimal kernel can be px/cm, px/cm and px/cm. The latter two thresholds decrease the accuracy negligibly but distort the images severely. Therefore, we perform a trade-off analysis of the accuracy (under naïve, parrot and reconstruction attacks) and the distortion at these three thresholds.
At these three thresholds under the naïveT attack, the accuracy of the AHGMM is slightly higher than that of the AGB (Sarwar et al., 2016). The main reason is the under-blurred subregions of the AHGMM filtered face, as the filter hops its kernel below and above the optimal Gaussian kernel. In contrast, the accuracy of the Space Variant Gaussian Blur (Saini et al., 2012) is always lower than that of AGB and AHGMM. This is because SVGB uses an isotropic Gaussian kernel, which deteriorates a face more severely than the anisotropic kernel of the AGB and AHGMM filters. FGB has the lowest accuracy at any threshold due to over-blurring of all images except the pixel images at pitch angle.
6.3 ParrotT Attack
In the parrotT attack, we filter both gallery and probe images and then evaluate the achieved accuracy. We study the parrotT attack on AHGMM under three subattacks: optimal kernel parrotT subattack, pseudo AHGMM parrotT subattack and accurate AHGMM parrotT subattack. The accuracy results of these subattacks are given in Fig. 10 at different thresholds , while Receiver Operating Characteristic (ROC) curves for the accurate AHGMM parrotT subattack at px/cm are presented in Fig. 11.
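The difference between the naïve and the parrot protocol can be sketched as follows. This is a toy illustration under stated assumptions: faces are one-dimensional feature lists, a moving average (`box_blur`) stands in for the privacy filter, and nearest-neighbour identification replaces the face verification pipeline used in the paper. All function names are hypothetical.

```python
def box_blur(signal, radius):
    """Stand-in privacy filter: moving average with clamped borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def nearest_label(probe, gallery):
    """Label of the gallery entry closest to the probe (squared Euclidean distance)."""
    def distance(entry):
        return sum((a - b) ** 2 for a, b in zip(entry[1], probe))
    return min(gallery, key=distance)[0]


def attack_accuracy(gallery, probes, privacy_filter, parrot):
    """Fraction of protected probes that are matched to the correct label.

    Naive attack: the gallery stays unprotected.
    Parrot attack: the gallery is filtered in the same way as the probes.
    """
    protected_probes = [(label, privacy_filter(x)) for label, x in probes]
    if parrot:
        gallery = [(label, privacy_filter(x)) for label, x in gallery]
    hits = sum(nearest_label(x, gallery) == label for label, x in protected_probes)
    return hits / len(protected_probes)
```

A call such as `attack_accuracy(gallery, probes, lambda x: box_blur(x, 1), parrot=True)` evaluates the parrot protocol, while `parrot=False` gives the naïve one with the same filter.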
The parrotT attack on state-of-the-art privacy filters increases the accuracy as compared to the naïveT attack. Under the optimal kernel parrotT subattack, our AHGMM shows the least accuracy improvement at any of the three thresholds. This is because the optimal kernel Gaussian blur is a spatially invariant blur that is not helpful in recognising spatially varying Gaussian blurred images, e.g. the AHGMM filtered images. Thus, our AHGMM provides the lowest accuracy against the parrotT attack using the optimal kernel.
The pseudo AHGMM parrotT subattack slightly improves the accuracy further as compared to the optimal kernel parrotT subattack. The main reason is that both the gallery and the probe images are now filtered using spatially varying Gaussian blur. However, under the pseudo AHGMM subattack, the accuracy of AHGMM remains below that of the other three state-of-the-art privacy filters. Thus, our AHGMM provides the highest privacy protection even against the pseudo AHGMM parrotT subattack.
Finally, the accurate AHGMM subattack improves the accuracy as compared to the optimal kernel subattack and is almost equivalent to the pseudo AHGMM subattack. Comparatively, even under the accurate AHGMM subattack, AHGMM performs better than FGB, AGB (Sarwar et al., 2016) and SVGB (Saini et al., 2012) at these three thresholds with the least improvement at px/cm.
From the accurate AHGMM subattack, it is apparent that our AHGMM permanently removes the sensitive information from the face and an attacker cannot recognise it with a high accuracy even when he/she has access to the secret key. This is in contrast to reversible filters, e.g. encryption/scrambling based filters, which allow the original face to be reconstructed once the secret key is known. Thus, our AHGMM is robust against the brute-force attack.
6.4 Inverse Filter Attack
In the inversefilter (IF) attack, we reconstruct the probe images by deconvolving the protected face with an accurate or estimated kernel. We evaluate the IF attack under four subattacks: optimal kernel naïveIF subattack, pseudo AHGMM naïveIF subattack, accurate AHGMM naïveIF subattack and accurate AHGMM parrotIF subattack. Fig. 12 depicts the effect of inverse filtering on selected sample images protected with AGB, SVGB and AHGMM. Fig. 13 shows the achieved accuracies under the different subattacks at different values of , while Fig. 14 presents ROCs for the accurate AHGMM parrotIF subattack at px/cm.
As can be seen in Fig. 12, the face reconstruction quality decreases when the threshold decreases (increasing the filter kernel) even if the filter parameters are known. This is true for both space invariant Gaussian blur (AGB) and linear space variant Gaussian blur (SVGB). The main reason is that the boundaries of the face start propagating towards the centre of the face as the threshold is decreased. Thus, it becomes difficult to distinguish between reconstructed faces at the lower thresholds (see Fig. 13).
In the case of nonlinear space variant blur (AHGMM), the reconstruction becomes more challenging even when the same hopping kernels are used as for the protection. The main reason, in addition to the boundary propagation, is that while deconvolving a subregion, the IF incorrectly treats the adjacent subregions as if they were filtered with the same kernel, which prevents it from reconstructing the original face (see Fig. 12). Consequently, it becomes difficult to accurately predict the label of the reconstructed face.
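The sensitivity of inverse filtering to the assumed kernel can be illustrated with a small one-dimensional sketch. The deconvolution method is not specified in this section, so the regularised (Wiener-style) inverse, the circular convolution, the pure-Python DFT and all function names below are assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for short signals."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def gaussian_kernel(n, sigma):
    """Circular Gaussian kernel of length n, centred at index 0 and summing to one."""
    weights = [math.exp(-0.5 * (min(i, n - i) / sigma) ** 2) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def circular_blur(signal, kernel):
    """Circular convolution computed as a product in the frequency domain."""
    spectrum = [s * k for s, k in zip(dft(signal), dft(kernel))]
    return [v.real for v in dft(spectrum, inverse=True)]


def wiener_deconvolve(blurred, kernel, noise_to_signal=1e-3):
    """Regularised inverse filter; noise_to_signal avoids division by near-zero gains."""
    restored = [
        b * k.conjugate() / (abs(k) ** 2 + noise_to_signal)
        for b, k in zip(dft(blurred), dft(kernel))
    ]
    return [v.real for v in dft(restored, inverse=True)]


def squared_error(a, b):
    """Sum of squared differences between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Blurring a short box signal with `gaussian_kernel(32, 1.0)` and deconvolving with the same kernel recovers it far better than deconvolving with `gaussian_kernel(32, 2.0)`. A kernel mismatch of this kind is what the IF runs into wherever adjacent subregions were filtered with different kernels.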
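[Figure 12. Sample faces filtered and reconstructed by inverse filtering at three thresholds (px/cm). Columns: AGB (Sarwar et al., 2016), filtered and reconstructed; SVGB (Saini et al., 2012), filtered and reconstructed; AHGMM, filtered and reconstructed with the optimal, pseudo and accurate kernels.]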
[Figure 13. Panels: NaïveIF (optimal kernel); NaïveIF (pseudo AHGMM); NaïveIF (accurate AHGMM); ParrotIF (accurate AHGMM).]
In contrast to the naïveIF subattacks, the parrotIF subattack is more severe and significantly increases the accuracy, especially for AGB, FGB and SVGB. AHGMM also shows an accuracy improvement, but a smaller one than AGB, FGB and SVGB, and is therefore more robust to an inverse filter attack even when the accurate secret key is used.
6.5 Superresolution Attack
In this attack, we reconstruct the filtered probe images with SRCNN (Dong et al., 2016). SRCNN first learns a mapping between the highresolution images and their corresponding lowresolution version, and then applies this mapping to enhance the details of a lowresolution image. We learn the SRCNN mapping for iterations between the protected images (i.e. the low resolution) and their corresponding unprotected images (i.e. the high resolution) using the same data sets (91images and Set5) as used in (Dong et al., 2016). As learning of the mapping is a time consuming process, we investigate the superresolution attack for a single point of our synthetic data set: 12000 images each with pixels and pitch angle.
We evaluate the superresolution (SR) attack under four subattacks: optimal kernel naïveSR subattack, pseudo AHGMM naïveSR subattack, accurate AHGMM naïveSR subattack and accurate AHGMM parrotSR subattack. Tab. 3 summarises the achieved accuracies under the different subattacks, while Fig. 15 presents the ROC for the accurate AHGMM parrotSR subattack. Fig. 16 depicts a visual comparison of the superresolution reconstruction for three sample faces protected by AGB, SVGB and AHGMM filters.
Table 3: Accuracies achieved under the superresolution subattacks.

Attack type               AGB            SVGB           AHGMM
optimal naïveSR           0.592 (0.012)  0.566 (0.016)  0.515 (0.014)
pseudo AHGMM naïveSR      –              –              0.520 (0.006)
accurate AHGMM naïveSR    –              –              0.532 (0.018)
accurate AHGMM parrotSR   0.634 (0.015)  0.583 (0.034)  0.546 (0.018)
[Figure 16. Columns: Original; AGB, filtered and restored; SVGB, filtered and restored; AHGMM, filtered and restored with the optimal, pseudo and accurate kernels.]
For the space invariant Gaussian blur (AGB), it is apparent from Fig. 16 that the SR attack can reconstruct the faces more effectively, even when the kernel size is quite high (i.e. px/cm). Therefore, the faces protected by AGB achieve a higher accuracy (see Tab. 3). In contrast, faces protected by linear space variant Gaussian blur (SVGB) are difficult to reconstruct. The main reason is that the SR mapping becomes erroneous, especially for patches which contain parts processed by different kernels. However, SR can effectively reconstruct patches where the Gaussian blur is locally invariant (e.g. compare the areas around the eyes of the SVGB restored faces in Fig. 16). The overall reconstruction is worse than for AGB and thus the achieved accuracy is lower.
Reconstruction by superresolution is even more challenging for AHGMM protected faces. The main reason is that a single patch for learning the mapping contains several subregions each filtered with pseudorandomly correlated Gaussian mixture models. Thus, the error in the learned SR mapping increases resulting in the lowest accuracy as compared to AGB and SVGB.
Similarly to the parrotIF attack, the accuracy improves for the parrotSR attack, where SR reconstruction is also performed for the gallery images. Especially for AGB and SVGB, the similarity between the (protected and reconstructed) gallery images and the (reconstructed) probe images increases, and so does the accuracy. As for the other attacks, AHGMM is more robust to parrot attacks than AGB and SVGB, and achieves the lowest accuracy.
6.6 Distortion Analysis
We measure the distortion of the FGB, SVGB (Saini et al., 2012), AGB (Sarwar et al., 2016) and AHGMM using PSNR. For a tradeoff analysis between distortion and privacy, we plot the face verification accuracy against PSNR. The results of this tradeoff analysis are presented in Fig. 17.
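For reference, PSNR can be computed as in the following minimal sketch, which assumes 8-bit images (peak value 255) given as flat sequences of pixel values. The function name `psnr` and this input representation are our assumptions, not details taken from the paper's implementation.

```python
import math


def psnr(original, filtered, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(filtered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, filtered)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher PSNR means that the filtered face is closer to the original, i.e. less distorted.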
AGB (Sarwar et al., 2016) has the highest average PSNR values, followed by SVGB (Saini et al., 2012), AHGMM and FGB. The main reason is that AGB uses a single anisotropic kernel instead of the spatially linearly varying kernel used by SVGB (Saini et al., 2012). Although AHGMM also uses an anisotropic kernel like AGB, the spatial hopping of the Gaussian mixture model of the AHGMM results in higher distortion (lower PSNR values) than AGB and SVGB (see Fig. 6). FGB has the highest distortion as it does not adapt its parameters to the resolution of the face.
7 Conclusion
We presented an irreversible visual privacy protection filter which is robust against the parrot, inversefilter and superresolution attacks to which ad hoc blurring of sensitive regions is exposed. The proposed filter is based on an adaptive hopping Gaussian mixture model. Depending upon the captured resolution of a sensitive region, the filter globally adapts the parameters of the Gaussian mixture model to minimise the distortion, while locally hopping them pseudorandomly so that an attacker is unable to estimate these parameters. We evaluated the validity of the AHGMM using a state-of-the-art face recognition algorithm and a synthetic face data set with faces at different pitch angles and resolutions emulating faces as captured from an MAV. The proposed algorithm provides the highest privacy level under a parrot, an inversefilter and a superresolution attack and an almost equivalent level of privacy to state-of-the-art privacy filters under a naïve attack.
Unlike face de-identification approaches (Newton et al., 2005; Gross et al., 2006; Du et al., 2014; Lin et al., 2012; Letournel et al., 2015; Chriskos et al., 2015), we do not depend on an auxiliary visual detector (i.e. pose, facial expression, age, gender, race) to counter a parrot, an inversefilter or a superresolution attack. Moreover, unlike the encryption/scrambling filters (Dufaux and Ebrahimi, 2006, 2008; Baaziz et al., 2007; Sohn et al., 2011; Korshunov and Ebrahimi, 2013b, a; Boult, 2005; Chattopadhyay and Boult, 2007; Rahman et al., 2010; Winkler and Rinner, 2011), AHGMM prevents the recovery of the original face even with access to the seed of the PRNG.
We will make available to the research community the face dataset of 4281 subjects we generated to emulate faces captured from an MAV under varying poses and illumination conditions.
All the symbols used in the paper along with their meanings are summarised in Table 4.
Notation  Meaning 

unprotected, protected and reconstructed face region  
centre of  
width and height of  
A data set including both gallery and probe data sets  
unprotected, protected and reconstructed gallery data set  
unprotected, protected and reconstructed probe data set  
original and predicted identity labels  
a privacy filter of parameter  
a function that an attacker exploits  
distortion introduced by  
probability of predicting the label of a face  
face verification accuracy  
verification accuracy of a random classifier  
focal length of the camera  
physical dimension of a pixel in direction  
height of a camera and face from ground level  
vectors representing Nadir and principal axis of a camera  
angle between , and ,  
Number of subregions of  
Number of supplementary Gaussian functions  
pixel density (px/cm), where  
threshold pixel density for privacy filtering  
mean and standard deviation of a Gaussian PSF  
mean and standard deviation of an optimal Gaussian PSF  
randomly modified and for Gaussian PSF  
randomly generated numbers for and  
a tuple (, ), (, ) and ()  
Nyquist frequency of and  
frequency domain standard deviation corresponding to  
scaling factor for  
set of tuple containing parameters of Gaussian functions  
a set of Gaussian functions  
an element of  
a set of weights for Gaussian mixture model  
an element of  
Gaussian mixture model  
an element of  
subregion size in pixels  
standard deviation of global smoothing filter 
Acknowledgment
O. Sarwar was supported in part by Erasmus Mundus Joint Doctorate in Interactive and Cognitive Environment, which is funded by the Education, Audiovisual & Culture Executive Agency under the FPA no 20100015.
