# Robust Deep Appearance Models

###### Abstract

This paper presents novel Robust Deep Appearance Models (RDAMs) to learn the non-linear correlation between the shape and texture of face images. In this approach, the two crucial components of face images, i.e. shape and texture, are represented by Deep Boltzmann Machines (DBM) and Robust Deep Boltzmann Machines (RDBM), respectively. The RDBM, an alternative form of Robust Boltzmann Machines, can separate corrupted/occluded pixels during texture modeling to achieve better reconstruction results. The two models are connected by Restricted Boltzmann Machines at the top layer to jointly learn and capture the variations of both facial shapes and appearances. This paper also introduces new fitting algorithms with occlusion awareness through the mask obtained from the RDBM reconstruction. The proposed approach is evaluated in various applications on challenging face datasets, i.e. the Labeled Face Parts in the Wild (LFPW), Helen, EURECOM and AR databases, to demonstrate its robustness and capabilities.

## I Introduction

Active Appearance Models (AAMs) [1] have been used successfully in several areas of facial interpretation over the last two decades. Given a new face image, the method aims to "describe" that image by synthesizing a new image as similar to it as possible. Indeed, AAMs are statistical models of appearance, generated by combining a shape model that represents the facial structure, and a quasi-localized texture model that represents the pattern of pixel intensities, i.e. skin texture, across a facial image patch. However, their capability of generalization is limited by the nature of the Principal Component Analysis (PCA) used in both shape and texture models. Besides, since AAMs naively combine the shape and texture features to represent the facial appearance, also using PCA, they can only reveal the linear relationship between these features. There have been numerous improvements and adaptations using, for example, probabilistic PCA [2], nonlinear Deep Boltzmann Machines (DBM) [3], etc., to model large and non-linear variations in shapes and textures.

^1 These two authors contributed equally to the work.

Duong et al. [4] recently proposed the Deep Appearance Models (DAMs) approach to model face images using a DBM network. Their main ideas are first to learn the shape and the texture models of sample faces separately using the DBM approach. The relationships between these two modalities are then pursued to generate the final appearance model using Restricted Boltzmann Machines (RBM) at the top layer. Given an unseen face, DAMs find the optimal facial shape using the forward compositional algorithm. This algorithm minimizes the non-linear least squares error between the warped and the reconstructed textures from the models. This network architecture enables the non-linear modeling capability to overcome the limitations presented in the original AAMs method.

However, there are still some limitations of DAMs in both face modeling and shape fitting. Firstly, the DAMs fitting procedure still has to account for numerous appearance variations of face images, e.g. facial poses, occlusions, lighting, etc., resulting in undesirable fitting performance. Minimizing the squared error is adequate for constrained face images, but not for the problem of unconstrained face modeling with occlusions, poses and noise. Secondly, the DAMs texture model cannot distinguish between occluded and non-occluded areas, since it treats all regions in the same way during the model learning phase. DAMs will therefore capture both "good" and "bad" regions in the learned models, which gives undesirable reconstructed texture images (as shown in Fig. 1).

To overcome the above modeling and fitting issues, we propose novel Robust Deep Appearance Models (RDAMs) that learn an additional appearance variation mask which can be used in the fitting procedure to ignore those variations. This mask is modeled by the visible and hidden units of Robust Boltzmann Machines (RoBM) [5]. The proposed model not only learns compact representations for recognition/prediction tasks, but also reconstructs better shapes and textures.

The contributions of this work can be summarized as follows. Firstly, we propose a new texture modeling approach named Robust Deep Boltzmann Machines, described in Section III-B. It can model "good" and "bad" regions separately via a DBM and a binary RBM, respectively; for example, given a face with sunglasses, the RDBM can recover a "clean" face without sunglasses (as shown in Fig. 2). Secondly, the proposed RDAMs approach models shape using a DBM, which is non-linear and can be stacked into a deep model to give a more robust representation of shapes. Thirdly, we propose to use the learned binary RBM to generate a mask for shape model fitting using the inverse compositional algorithm described in Section III-C.

## II Related Work

This section reviews Restricted Boltzmann Machines [6] and their extensions, together with recent advances in AAMs-based facial modeling and fitting approaches.

### II-A RBM and Its Extensions

Restricted Boltzmann Machines (RBM) [6] are an undirected graphical model with two layers of stochastic units, i.e. visible and hidden units, which represent the observed data and the latent representation of that data, respectively. Visible and hidden units are connected by weighted undirected edges. The Gaussian RBM [7] models real-valued data by assuming the visible units have real values normally distributed with mean $\mu_i$ and variance $\sigma_i^2$. Moreover, a set of RBMs can be stacked on top of one another to capture more complicated correlations between features in the lower layers. This approach produces a deeper network called Deep Boltzmann Machines [3]. RoBM [5] were proposed to estimate noise and learn features simultaneously by distinguishing corrupted and uncorrupted pixels to find optimal latent representations.

### II-B AAMs-based Fitting Approaches

The fitting steps in AAMs can be formulated as an image alignment problem iteratively solved using Gauss-Newton (GN) optimization. Matthews et al. [8] presented the Project-Out Inverse Compositional (POIC) algorithm, which runs very fast thanks to the pre-computation of the Jacobian and Hessian matrices. Subsequently, many variants of the IC algorithm have been proposed [9]. Gross et al. [10] introduced the Simultaneous Inverse Compositional (SIC) algorithm, which simultaneously updates the warp and the texture parameters. Tzimiropoulos et al. [11] presented the Fast-SIC and Fast-forward algorithms to efficiently solve the AAMs fitting problem in both forward and inverse fashions. An alternative formulation treats model fitting as a classification problem (i.e. distinguishing correct from incorrect alignment) or a regression problem. Along this direction, Liu [12, 13] proposed extending the GentleBoost classifier to discriminate between correct and incorrect alignments, and to model the nonlinear relationship between texture and parameter updates.

Due to their holistic nature, AAMs methods are still far from achieving good performance on face images under in-the-wild conditions, e.g. partial occlusions, poses, illumination, etc. To handle these problems, Sung et al. [14] combined Active Shape Models (ASM) with AAMs in a unified objective function, since ASM can find correct landmark points based on local texture descriptors. Tzimiropoulos et al. [15] proposed a robust and efficient objective function aiming to detect points under occlusion and illumination changes. Martins et al. [16] presented two robust fitting methods based on the Lucas-Kanade forwards additive method [17] to handle partial and self-occlusions. Recently, Antonakos et al. [18] introduced a graph-based model, called Active Pictorial Structures (APS), which uses a Gaussian Markov Random Field (GMRF) to model the appearance of objects. Antonakos et al. [19] also proposed using higher-level features in face modeling and fitting instead of modeling raw pixels.

## III The Proposed Robust Deep Appearance Models

This section presents our proposed RDAMs method. The structure of RDAMs consists of three main components, i.e. the shape model, the texture model and the appearance representation layer. Section III-A presents the shape modeling steps using DBM. The robust texture modeling using RDBM is introduced in section III-B. Finally, our proposed robust fitting algorithms are presented in section III-C. The schematic diagram of our proposed method is given in Fig. 2.

### III-A Deep Boltzmann Machines for Shape Modeling

An $N$-point shape $\mathbf{s}$ is modeled using a DBM with a visible layer and two hidden layers. Given a shape $\mathbf{s}$, the energy of the configuration $\{\mathbf{s}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}\}$ of the corresponding layers in shape modeling is as follows,

$$E_s(\mathbf{s}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}; \theta_s) = \sum_i \frac{(s_i - b_i)^2}{2\sigma_i^2} - \sum_{i,j} \frac{s_i}{\sigma_i} W^{(1)}_{ij} h^{(1)}_j - \sum_{j,l} h^{(1)}_j W^{(2)}_{jl} h^{(2)}_l \tag{1}$$

where $\theta_s = \{\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}, \sigma\}$ are the shape model parameters. The bias terms of the hidden units in the two layers in Eqn. (1) are ignored to simplify the equation. The probability distribution of the configuration is computed as:

$$P(\mathbf{s}; \theta_s) = \frac{1}{Z(\theta_s)} \sum_{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}} \exp\left(-E_s(\mathbf{s}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}; \theta_s)\right) \tag{2}$$

where $Z(\theta_s)$ is the normalization constant. This shape model is pre-trained using one-step contrastive divergence (CD).
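To make the pre-training step concrete, the following minimal NumPy sketch implements one-step contrastive divergence (CD-1) for a single binary RBM layer with bias terms omitted, mirroring the simplification above. The variable names, dimensions and learning rate are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01, rng=None):
    """One CD-1 weight update for a bias-free binary RBM."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden activations driven by the data.
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again
    # (using probabilities rather than samples, a common CD variant).
    p_v1 = sigmoid(h0 @ W.T)
    p_h1 = sigmoid(p_v1 @ W)
    # Gradient: data correlation minus one-step reconstruction correlation.
    grad = v0.T @ p_h0 - p_v1.T @ p_h1
    return W + lr * grad / v0.shape[0]

rng = np.random.default_rng(42)
v = (rng.random((16, 8)) < 0.5).astype(float)  # 16 binary training vectors
W = 0.01 * rng.standard_normal((8, 4))         # 8 visible x 4 hidden units
W = cd1_update(W, v, rng=rng)
```

In practice each layer of the shape DBM would be pre-trained greedily with updates of this form before joint training.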

### III-B Robust Deep Boltzmann Machines for Texture Modeling

We propose a new texture modeling approach named Robust Deep Boltzmann Machines (RDBM). Unlike the texture model of DAMs, this model consists of a visible layer with three gating components, i.e. the observed texture $\tilde{\mathbf{g}}$, the reconstructed "clean" texture $\mathbf{g}$ and the binary mask $\mathbf{m}$; a binary RBM for the mask variable $\mathbf{m}$; and a Gaussian DBM for the real-valued input variable $\mathbf{g}$. The motivation for this gating term is to improve the modeling and fitting of DAMs by eliminating the effects of missing, occluded or corrupted pixels. Our approach uses a Gaussian DBM to model the "clean" data instead of a single Gaussian RBM, for two good reasons. Firstly, a DBM can efficiently capture variations and structures in the input data. Secondly, a DBM can deal with ambiguous inputs more robustly thanks to its top-down feedback.

#### III-B1 Texture Modeling

Given a shape-free image $\tilde{\mathbf{g}}$, the energy function of the configuration $\{\tilde{\mathbf{g}}, \mathbf{g}, \mathbf{m}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{k}\}$ in facial texture modeling is defined as follows:

$$E_t(\tilde{\mathbf{g}}, \mathbf{g}, \mathbf{m}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{k}) = \sum_i \frac{\gamma_i}{2} m_i (\tilde{g}_i - g_i)^2 + E_{\text{DBM}}(\mathbf{g}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}) + E_{\text{RBM}}(\mathbf{m}, \mathbf{k}) \tag{3}$$

where $E_{\text{DBM}}$ is the Gaussian DBM energy modeling the "clean" texture $\mathbf{g}$ in the same form as Eqn. (1), $E_{\text{RBM}}(\mathbf{m}, \mathbf{k}) = -\sum_{i,j} m_i U_{ij} k_j$ is the binary RBM energy of the mask $\mathbf{m}$ with hidden units $\mathbf{k}$, and $\theta_t = \{\gamma, \mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \sigma, \mathbf{U}\}$ are the texture model parameters. It is noted that all the bias terms in Eqn. (3) are ignored for simplicity. The probability distribution of the configuration is computed as follows:

$$P(\tilde{\mathbf{g}}, \mathbf{g}, \mathbf{m}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{k}) = \frac{1}{Z(\theta_t)} \exp\left(-E_t(\tilde{\mathbf{g}}, \mathbf{g}, \mathbf{m}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{k})\right) \tag{4}$$

Given the input variables $\tilde{\mathbf{g}}$, the states of all layers can be inferred by computing the posterior probability of the latent variables, i.e. $P(\mathbf{g}, \mathbf{m}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{k} \mid \tilde{\mathbf{g}})$. Therefore, the sampling can be divided into two folds, i.e. one for the visible units and one for the hidden units. For the visible variables $\mathbf{g}$ and $\mathbf{m}$, the conditional distributions can be sampled as,

$$\begin{aligned} P(m_i = 1 \mid \tilde{\mathbf{g}}, \mathbf{g}, \mathbf{k}) &= \operatorname{sigmoid}\Big(-\frac{\gamma_i}{2}(\tilde{g}_i - g_i)^2 + \sum_j U_{ij} k_j\Big) \\ P(g_i \mid \tilde{\mathbf{g}}, \mathbf{m}, \mathbf{h}^{(1)}) &= \mathcal{N}\Big(g_i;\; \frac{\gamma_i m_i \tilde{g}_i + \mu_i/\sigma_i^2}{\gamma_i m_i + 1/\sigma_i^2},\; \frac{1}{\gamma_i m_i + 1/\sigma_i^2}\Big) \end{aligned} \tag{5}$$

where $\mu_i = \sigma_i \sum_j W^{(1)}_{ij} h^{(1)}_j$ is the top-down prediction of the clean texture. For the hidden variables $\mathbf{h}^{(1)}$, $\mathbf{h}^{(2)}$ and $\mathbf{k}$, the conditional distributions can be sampled as follows,

$$\begin{aligned} P(h^{(1)}_j = 1 \mid \mathbf{g}, \mathbf{h}^{(2)}) &= \operatorname{sigmoid}\Big(\sum_i \frac{g_i}{\sigma_i} W^{(1)}_{ij} + \sum_l W^{(2)}_{jl} h^{(2)}_l\Big) \\ P(h^{(2)}_l = 1 \mid \mathbf{h}^{(1)}) &= \operatorname{sigmoid}\Big(\sum_j h^{(1)}_j W^{(2)}_{jl}\Big) \\ P(k_j = 1 \mid \mathbf{m}) &= \operatorname{sigmoid}\Big(\sum_i m_i U_{ij}\Big) \end{aligned} \tag{6}$$

The sampling process can be applied on each unit separately since the distribution is factorial. Section III-B2 will discuss the learning procedure of this texture model.
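As an illustration of one Gibbs sweep over the gated visible units, the sketch below samples the mask posterior and the clean-texture posterior in the RoBM-style form given above. All names (`g_tilde`, `mu`, `U`, `gamma`, `sigma`) and the toy dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_visible_step(g_tilde, g, k, U, mu, gamma=5.0, sigma=1.0, rng=None):
    """One Gibbs sweep over the gated visible units (mask m, clean texture g).

    g_tilde: observed texture; mu: top-down DBM prediction of the clean
    texture; U: mask-RBM weights; k: mask-RBM hidden states.
    """
    rng = rng or np.random.default_rng(0)
    # Mask posterior: a pixel is likely "clean" (m=1) when the observation
    # agrees with g and the mask RBM's hidden units support it.
    p_m = sigmoid(-0.5 * gamma * (g_tilde - g) ** 2 + k @ U.T)
    m = (rng.random(p_m.shape) < p_m).astype(float)
    # Clean-texture posterior: precision-weighted blend of the observation
    # (where m = 1) and the top-down DBM prediction.
    prec = gamma * m + 1.0 / sigma**2
    mean = (gamma * m * g_tilde + mu / sigma**2) / prec
    g_new = mean + rng.standard_normal(g.shape) / np.sqrt(prec)
    return m, g_new

rng = np.random.default_rng(1)
D, K = 6, 3
g_tilde = rng.standard_normal(D)
g = g_tilde.copy(); g[:2] += 5.0          # two pixels disagree: "corrupted"
m, g_new = gibbs_visible_step(g_tilde, g, k=np.zeros(K),
                              U=np.zeros((D, K)), mu=g, rng=rng)
```

On this toy input, the two pixels where the observed and clean textures disagree strongly receive mask probability near zero, so they are excluded from the reconstruction.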

#### III-B2 Model Learning for RDBM

To pre-train our RDBM model, the DBM, which models "clean" faces, is first trained with a set of "clean" images; the parameters $\theta_t$ of the RDBM model are then optimized to maximize the log-likelihood as follows,

$$\theta_t^* = \arg\max_{\theta_t} \sum_n \log P(\tilde{\mathbf{g}}^{(n)}; \theta_t) \tag{7}$$

The optimal parameter values can then be obtained using a gradient ascent procedure given by,

$$\Delta\theta_t \propto \mathbb{E}_{P_{\text{data}}}\left[-\frac{\partial E_t}{\partial \theta_t}\right] - \mathbb{E}_{P_{\text{model}}}\left[-\frac{\partial E_t}{\partial \theta_t}\right] \tag{8}$$

where $\mathbb{E}_{P_{\text{data}}}$ and $\mathbb{E}_{P_{\text{model}}}$ are the expectations with respect to the data distribution and the distribution estimated by the RDBM, respectively. The two terms can be approximated using mean-field inference and Markov Chain Monte Carlo (MCMC) based stochastic approximation, respectively.
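The mean-field approximation for the data-dependent term can be sketched as the standard fixed-point iteration for a two-hidden-layer DBM, where each layer's activation is refreshed from its neighbors until convergence. The weight matrices and shapes below are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=20):
    """Mean-field estimates of a DBM's two hidden layers given visible data v.

    W1: visible-to-hidden-1 weights; W2: hidden-1-to-hidden-2 weights.
    """
    mu1 = sigmoid(v @ W1)            # initialize from a bottom-up pass
    mu2 = sigmoid(mu1 @ W2)
    for _ in range(n_iters):
        # Layer 1 receives bottom-up input from v and top-down input from mu2.
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 10))     # 4 visible vectors of dimension 10
mu1, mu2 = mean_field(v, 0.1 * rng.standard_normal((10, 6)),
                      0.1 * rng.standard_normal((6, 3)))
```

The resulting `mu1`, `mu2` would play the role of the data-dependent expectations in Eqn. (8), while the model expectations are estimated by persistent MCMC chains.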

In our method, pre-training the parameters of the DBM on "clean" data first makes the process of learning the texture model faster and much easier. Similarly, we also propose to first learn the parameters of the binary RBM (representing the mask $\mathbf{m}$) on pre-defined, extracted training masks (as shown in Fig. 3) instead of initializing the parameters randomly. The next section presents an automatic technique to extract such training masks for learning the binary RBM.

#### III-B3 Learning Binary Mask RBM

This section aims to generate masks from the training images having poses and occlusions, e.g. sunglasses and scarves. We consider learning three types of binary mask, i.e. sunglasses, scarves and pose stretching. A binary RBM is learned to represent each type of mask. Binary masks for sunglasses or scarves can be extracted by applying a global threshold on shape-free images having sunglasses or scarves with a prior knowledge of their locations. We will focus on the last and hardest type, i.e. pose stretching.

In a 2D texture model, warping faces with a large pose will likely cause stretching effects on half of the face, since the same pixel values are copied over a large region (see Fig. 4). Therefore, we propose a technique that detects such stretching regions during the warping process. The main idea is to count the number of unique pixels in the source triangle that are mapped to the pixels in the target triangle; a source pixel can be mapped to multiple target pixels due to interpolation. The degree to which a target triangle is stretched is measured by the ratio $n_u/n$, where $n_u$ and $n$ are the number of unique pixels and the total number of pixels in the corresponding source triangle, respectively; a ratio of 1 means there is no stretching, and the stretching becomes visible when the ratio drops below a threshold chosen from our experiments. Finally, we can use the detected regions as a mask to pre-train the above robust texture model.
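The unique-pixel ratio above is simple to compute once the warp records, for each target pixel, which source pixel it was sampled from. The following sketch assumes that correspondence is given as a list of rounded `(row, col)` source coordinates; the threshold value is illustrative.

```python
def stretching_ratio(src_coords, threshold=0.5):
    """Estimate how stretched a target triangle is.

    src_coords: one rounded (row, col) source coordinate per target pixel.
    Returns (ratio of unique source pixels, whether stretching is visible).
    """
    n_total = len(src_coords)
    n_unique = len({tuple(c) for c in src_coords})
    ratio = n_unique / n_total
    return ratio, ratio < threshold

# A non-stretched triangle: every target pixel samples a distinct source pixel.
ok_ratio, ok_flag = stretching_ratio([(0, 0), (0, 1), (1, 0), (1, 1)])
# A stretched triangle: many target pixels sample the same few source pixels.
bad_ratio, bad_flag = stretching_ratio([(0, 0)] * 9 + [(0, 1)])
```

Triangles flagged as stretched would then be rasterized into the binary training masks for the mask RBM.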

### III-C Model Fitting Algorithms in RDAMs

With the trained shape and texture models, fitting a new image $\mathbf{x}$ can be formulated as finding the optimal shape $\mathbf{s}^*$ that maximizes the probability of the shape-free image, i.e. $\mathbf{s}^* = \arg\max_{\mathbf{s}} P(\mathbf{g}(\mathcal{W}(\mathbf{x}; \mathbf{s})); \theta)$.

During the fitting steps, the states of the hidden units are estimated by clamping both the current shape $\mathbf{s}$ and the warped texture to the model. The Gibbs sampling method is then applied to find the optimal estimated "clean" texture of the testing face given the current shape $\mathbf{s}$. Let $\bar{\mathbf{g}}$ be the mean of the Gaussian distribution over the reconstructed texture; we have $P(\mathbf{g} \mid \mathbf{h}^{(1)}) = \mathcal{N}(\mathbf{g}; \bar{\mathbf{g}}, \sigma^2 \mathbf{I})$, where $\mathbf{I}$ is the identity matrix. The maximum likelihood shape can then be estimated as $\mathbf{s}^* = \arg\min_{\mathbf{s}} \|\mathbf{g}(\mathcal{W}(\mathbf{x}; \mathbf{s})) - \bar{\mathbf{g}}\|^2$.

This brings us to the non-linear least squares problem solved in image alignment. Notice that $\bar{\mathbf{g}}$ is the reconstructed "clean" texture while $\mathbf{g}(\mathcal{W}(\mathbf{x}; \mathbf{s}))$ is the warped texture from the input image. If the input image contains occlusions or corruption, the above squared error clearly will not reflect the goodness of the current shape $\mathbf{s}$. Thus, solely using the $\ell_2$-norm may limit the performance of shape fitting and reconstruction. Since our proposed model can generate a mask $\mathbf{m}$ of corrupted pixels, we propose to incorporate the mask into the original objective function as:

$$\mathbf{s}^* = \arg\min_{\mathbf{s}} \left\| \mathbf{m} \odot \left( \mathbf{g}(\mathcal{W}(\mathbf{x}; \mathbf{s})) - \bar{\mathbf{g}} \right) \right\|^2 \tag{9}$$

where $\odot$ is the component-wise multiplication.

The inverse compositional algorithm instead minimizes over an incremental warp computed with respect to the model image $\bar{\mathbf{g}}$, i.e.

$$\arg\min_{\Delta\mathbf{s}} \left\| \mathbf{m} \odot \left( \bar{\mathbf{g}}(\mathcal{W}(\mathbf{x}; \Delta\mathbf{s})) - \mathbf{g}(\mathcal{W}(\mathbf{x}; \mathbf{s})) \right) \right\|^2 \tag{10}$$

with respect to $\Delta\mathbf{s}$, and then updates the parameters as $\mathcal{W}(\mathbf{x}; \mathbf{s}) \leftarrow \mathcal{W}(\mathbf{x}; \mathbf{s}) \circ \mathcal{W}(\mathbf{x}; \Delta\mathbf{s})^{-1}$, where $\circ$ denotes the composition of two warps. The solution of the least squares problem above is $\Delta\mathbf{s} = \mathbf{H}^{-1} \mathbf{J}^T \left[ \mathbf{m} \odot \left( \mathbf{g}(\mathcal{W}(\mathbf{x}; \mathbf{s})) - \bar{\mathbf{g}} \right) \right]$, where $\mathbf{J}$ is the Jacobian matrix of the model image $\bar{\mathbf{g}}$. The Hessian matrix $\mathbf{H}$ is then given by $\mathbf{H} = \mathbf{J}^T \mathbf{J}$.
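One way to realize a single masked Gauss-Newton step is to zero out the masked rows of both the residual and the Jacobian before forming the normal equations, as sketched below. The toy Jacobian, parameter step and occlusion pattern are illustrative, not from the paper.

```python
import numpy as np

def masked_gn_step(J, g_warp, g_bar, m):
    """One masked Gauss-Newton update: min_ds || m * (J ds - (g_warp - g_bar)) ||^2.

    J: (pixels x params) Jacobian of the model texture; m: binary mask
    with 1 for valid pixels and 0 for occluded/corrupted pixels.
    """
    r = m * (g_warp - g_bar)    # masked residual: occluded pixels drop out
    Jm = m[:, None] * J         # mask the Jacobian rows consistently
    H = Jm.T @ Jm               # masked Gauss-Newton Hessian
    return np.linalg.solve(H, Jm.T @ r)

# Toy check: recover a known parameter step despite two corrupted pixels.
rng = np.random.default_rng(0)
J = rng.standard_normal((12, 2))
ds_true = np.array([0.5, -0.2])
g_bar = np.zeros(12)
g_warp = J @ ds_true
g_warp[:2] += 100.0             # simulate occlusion on two pixels
m = np.ones(12); m[:2] = 0.0    # mask flags them as corrupted
ds = masked_gn_step(J, g_warp, g_bar, m)
```

Because the corrupted pixels are masked out of both the residual and the Hessian, the recovered step `ds` matches `ds_true` exactly on this linear toy problem, whereas an unmasked solve would be badly biased by the two outlier pixels.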

## IV Experiments

In this section, we evaluate the performance of our proposed framework in face modeling tasks using data "in the wild" (Sections IV-B and IV-C). Then we demonstrate its robustness in the model fitting steps (Section IV-D).

### IV-A Databases

The LFPW [20] database consists of 1400 images but only about 1000 images are available (811 for training and 224 for testing). For each image, we have 68 landmark points provided by 300-W competition [21].

The Helen [22] database contains about 2300 high-resolution images (2000 for training and 300 for testing). 68 landmark points are annotated for all faces. The facial images contain different poses, expressions and occlusions.

The AR database [23] contains 134 people (75 males and 59 females), and each subject has 26 frontal images (14 normal images with different lighting and expressions, six occluded with sunglasses and six with scarves).

The EURECOM database [24] consists of facial images of 52 people (38 males and 14 females). Each person has different expressions, lighting and occlusion conditions. We only use the images with sunglasses in our experiments.

### IV-B Facial Occlusion Removal

In this section, we demonstrate the ability of RDAMs to handle extreme cases of occlusions such as sunglasses or scarves. RDAMs are trained in two steps: pre-train each layer and then train the whole model. The training set includes 1000 "clean" and 200 posed images from LFPW and Helen; 534 "clean", 95 sunglasses and 95 scarf images from 95 subjects in AR; and 104 images from 52 subjects in EURECOM. For the pre-training steps, we first train the shape DBM using all shapes. We then train the RDBM by separately training the Gaussian RBM (GRBM) with clean images and learning the binary mask RBM with masks generated from occluded and posed images in AR, EURECOM or LFPW. After that, we can train the RDBM with the pre-initialized weights of the GRBM and the mask RBM. The joint layer is later trained with all training images. Each step above is trained using Contrastive Divergence learning for 600 epochs on a system with a Xeon 3.6GHz CPU and 32GB of RAM. The computational costs (without parallel processing) are as follows: training takes 14.2 hours; fitting takes 17.4s per image on average; reconstructing a face takes 1.53s on average.

As shown in Fig. 5, RDAMs can remove those occlusions successfully without leaving any severe artifacts, compared with the baseline AAMs method and the state-of-the-art DAMs method. We also compare with an RPCA-based method [25] (see Fig. 6). We measure the reconstruction quality in terms of Root Mean Square Error (RMSE) on the LFPW, Helen, AR and EURECOM databases in different ways.

**Table I.** Average masked-RMSE (unmasked-RMSE in brackets).

| Methods | AAMs [11] | DAMs [4] | RDAMs |
|---|---|---|---|
| LFPW & Helen | 12.91 (18.98) | 11.15 (14.98) | 8.58 (23.98) |
| AR - Sunglasses | 56.55 | 55.48 | 41.67 |
| AR - Scarf | 63.16 | 60.96 | 47.65 |

We choose from AR two subsets: 210 images with sunglasses and 210 images with scarves, from 38 subjects (30 males and eight females) not in the training set. The corresponding normal face images, i.e. frontal and without occlusions, of the same person are used as the references to compute the RMSE. We also select a subset of 23 images with sunglasses and 100 images with some occlusions from LFPW and Helen. A mask is used to ignore occluded/corrupted pixels in the testing images so that we have an unbiased metric. The average masked-RMSEs of AAMs, DAMs and our RDAMs are shown in Table I. The average unmasked-RMSEs are also reported for reference (the numbers inside the brackets). Our RDAMs achieve the best reconstruction results compared with AAMs and DAMs. Note that the unmasked-RMSE is always higher than the masked-RMSE since some corrupted pixels are recovered during reconstruction; since our RDAMs recover more corrupted pixels, their unmasked-RMSE is higher than those of AAMs and DAMs.
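The masked evaluation metric can be sketched as follows: the RMSE is computed only over pixels whose mask value is 1, so recovered occluded regions do not bias the comparison. The arrays and values below are illustrative.

```python
import numpy as np

def masked_rmse(recon, reference, mask):
    """RMSE over non-occluded pixels only (mask = 1 where pixels are valid)."""
    diff = (recon - reference) * mask
    return np.sqrt((diff ** 2).sum() / mask.sum())

recon = np.array([1.0, 2.0, 3.0, 100.0])  # last pixel badly reconstructed...
ref = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1.0, 1.0, 1.0, 0.0])     # ...but flagged as occluded
err = masked_rmse(recon, ref, mask)
```

Here the masked error ignores the occluded pixel entirely, while the unmasked error (all-ones mask) is dominated by it, which is why the two columns of Table I can diverge.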

### IV-C Facial Pose Recovery

This section illustrates the capability of RDAMs to deal with facial poses. Using the same pre-trained model presented in Section IV-B, the texture model was trained using 280 images with different pose variations from the LFPW and Helen databases. The reconstruction results of facial images with different poses are presented in Fig. 7. In this experiment, our RDAMs again achieve the best reconstruction results compared with AAMs and DAMs, especially in the cases of extreme poses, which our method handles in a more natural way. From Fig. 7, RDAMs give reconstructed faces that look more similar to the original faces, while DAMs or AAMs make the face look younger or change its identity.

Another experiment is performed to demonstrate our RDAMs approach on the face frontalization problem. Given an input face with poses, the process of "frontalization" is to synthesize the frontal view of that face. Our RDAMs approach is compared with the state-of-the-art frontalization method [26] on the LFPW and Helen databases, as shown in Fig. 8. RDAMs only model certain facial areas, not including hair, forehead, neck and ears. For ease of comparison, the reconstructed texture of RDAMs (the last row) is put on top of the corresponding images in the middle row. Although we lose some color consistency with the background, RDAMs can produce more natural-looking faces than the images in [26].

### IV-D Model Fitting in RDAMs

We compared our results with Active Orientation Models [27] and Fast-SIC [11] in the following model fitting experiment. We evaluated model fitting using the LFPW and AR databases with about 300 images (23 images from the LFPW database and 268 images from the AR database). The average errors are reported in Table II. The initial shape is the mean shape placed inside the face's bounding box. RDAMs achieve performance comparable to the other methods.

## V Conclusion

In this paper, novel Robust Deep Appearance Models have been proposed to deal with large variations in the wild such as occlusions and poses. Compared with the previous DAMs model, the proposed approach produces remarkable reconstruction results even when faces are occluded or in extreme poses. Moreover, the proposed fitting algorithms work well with the new texture model by making use of the occlusion mask generated by the model. Experimental results in occlusion removal, pose correction and model fitting have shown the robustness of the model against large occlusions and poses.

## Acknowledgment

This work is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

## References

- [1] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Interpreting Face Images using Active Appearance Models," in FG, 1998, pp. 300–305.
- [2] J. Alabort-i Medina and S. Zafeiriou, “Bayesian active appearance models,” in CVPR. IEEE, 2014, pp. 3438–3445.
- [3] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in AISTATS, 2009, pp. 448–455.
- [4] C. N. Duong, K. Luu, K. G. Quach, and T. D. Bui, “Beyond principal components: Deep boltzmann machines for face modeling,” in CVPR. IEEE, June 2015, pp. 4786–4794.
- [5] Y. Tang, R. Salakhutdinov, and G. Hinton, “Robust boltzmann machines for recognition and denoising,” in CVPR. IEEE, 2012, pp. 2264–2271.
- [6] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002.
- [7] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
- [8] I. Matthews and S. Baker, “Active appearance models revisited,” IJCV, vol. 60, no. 2, pp. 135–164, 2004.
- [9] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework: Part 2,” The Robotics Institute, CMU, 2003.
- [10] R. Gross, I. Matthews, and S. Baker, “Generic vs. person specific active appearance models,” IVC, vol. 23, no. 12, pp. 1080–1093, 2005.
- [11] G. Tzimiropoulos and M. Pantic, “Optimization problems for fast AAM fitting in-the-wild,” in ICCV. IEEE, 2013, pp. 593–600.
- [12] X. Liu, “Generic face alignment using boosted appearance model,” in CVPR. IEEE, 2007, pp. 1–8.
- [13] ——, “Discriminative face alignment,” PAMI, vol. 31, no. 11, pp. 1941–1954, 2009.
- [14] J. Sung, T. Kanade, and D. Kim, “A unified gradient-based approach for combining ASM into AAM,” IJCV, vol. 75, no. 2, pp. 297–309, 2007.
- [15] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “Robust and efficient parametric face alignment,” in ICCV. IEEE, 2011, pp. 1847–1854.
- [16] P. Martins, R. Caseiro, and J. Batista, “Generative face alignment through 2.5D active appearance models,” CVIU, vol. 117, no. 3, pp. 250–268, 2013.
- [17] B. D. Lucas, T. Kanade et al., “An iterative image registration technique with an application to stereo vision.” IJCAI, vol. 81, pp. 674–679, 1981.
- [18] E. Antonakos, J. Alabort-i Medina, and S. Zafeiriou, “Active pictorial structures,” in CVPR. IEEE, 2015, pp. 5435–5444.
- [19] E. Antonakos, J. Alabort-i Medina, G. Tzimiropoulos, and S. P. Zafeiriou, “Feature-based lucas–kanade and active appearance models,” TIP, vol. 24, no. 9, pp. 2617–2632, 2015.
- [20] P. N. Belhumeur, D. W. Jacobs, D. Kriegman, and N. Kumar, “Localizing parts of faces using a consensus of exemplars,” in CVPR. IEEE, 2011, pp. 545–552.
- [21] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “A semi-automatic methodology for facial landmark annotation,” in CVPRW. IEEE, 2013, pp. 896–903.
- [22] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, “Interactive facial feature localization,” in ECCV. Springer, 2012, pp. 679–692.
- [23] A. M. Martinez, "The AR face database," CVC Tech. Rep., vol. 24, 1998.
- [24] R. Min, N. Kose, and J.-L. Dugelay, “Kinectfacedb: A kinect database for face recognition,” TSMC, vol. 44, no. 11, pp. 1534–1548, Nov 2014.
- [25] K. G. Quach, C. N. Duong, and T. D. Bui, “Sparse representation and low-rank approximation for robust face recognition,” in ICPR. IEEE, 2014, pp. 1330–1335.
- [26] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in unconstrained images,” in CVPR. IEEE, June 2015.
- [27] G. Tzimiropoulos, J. Alabort-i Medina, S. Zafeiriou, and M. Pantic, “Active orientation models for face alignment in-the-wild,” TIFS, vol. 9, no. 12, pp. 2024–2034, 2014.