BioFaceNet: Deep Biophysical Face Interpretation
In this paper we present BioFaceNet, a deep CNN that learns to decompose a single face image into biophysical parameter maps and diffuse and specular shading maps, while also estimating the spectral power distribution of the scene illuminant and the spectral sensitivity of the camera. The network comprises a fully convolutional encoder for estimating the spatial maps, with a fully connected branch for estimating the vector quantities. The network is trained using a self-supervised appearance loss computed via a model-based decoder. The task is highly underconstrained, so we impose a number of model-based priors: skin spectral reflectance is restricted to a biophysical model, and we impose a statistical prior on camera spectral sensitivities, a physical constraint on illumination spectra, a sparsity prior on specular reflections and direct supervision on diffuse shading using a rough shape proxy. We show convincing qualitative results on in-the-wild data and introduce a benchmark for quantitative evaluation on this new task.
Sarah Alotaibi 1,2 https://www.cs.york.ac.uk/cvpr/member/sarah
William A. P. Smith 1 https://www-users.cs.york.ac.uk/wsmith
1 Department of Computer Science
University of York
2 Computer Science Department
King Saud University
Riyadh, KSA
Providing a physical explanation of the appearance of a face is a longstanding goal in computer vision. From 3D face capture in computer graphics to extracting identity specific information for face recognition, there are clear benefits to being able to separate intrinsic properties of the face from extrinsic scene conditions when the image was captured. It is therefore surprising that the vast majority of methods that study face appearance use generic models that are applicable to any object and do not take into account constraints provided by the specific appearance of a face. For example, even in state-of-the-art deep learning based methods it is often assumed that faces are Lambertian diffuse reflectors [kim17InverseFaceNet, Shu_2017_CVPR, SfSNet] and they ignore the specular component (resulting from oily skin or sweat) and subsurface effects. Where diffuse albedo (i.e. intrinsic colour of the skin) is explicitly modelled this is usually done with a statistical model [Tewari_2017_ICCV, tewari18FaceModel, nhan2015beyond, Li_2014, blanz1999morphable] or, in the case of intrinsic image decomposition approaches, with an unconstrained albedo map [Shu_2017_CVPR, SfSNet]. However, it is known that skin colour forms a curved manifold in RGB space [Claridge2003489, preece2003imaging, preece2004spectral] spanned by the main pigments in skin. Models that do not impose this biophysical constraint can generate implausible skin colours and linear models will require redundant dimensions to capture the nonlinear subspace. Besides providing a strong constraint on plausible skin colours, modelling in the biophysical domain also has advantages from an application point of view as it allows intuitive editing of parameter maps with physical meaning. These advantages motivate our decision to model face appearance using a biophysical model.
In this paper, we propose BioFaceNet: a deep convolutional neural network that learns to decompose a single RGB image into intrinsic components in the spectral domain. This is an ill-posed problem, so careful modelling and constraints are required to render it tractable. This knowledge is encapsulated in a model-based decoder that is used to train a CNN-based encoder. We combine a dichromatic reflectance model and a biophysical spectral skin colouration model in order to decompose face appearance into specular and diffuse shading maps and distribution maps for two biophysical parameters (melanin and haemoglobin). In addition, we estimate spectral illumination and camera sensitivity, constrained by physical and statistical models respectively. See Fig. 1 for some sample results.
1.1 Deep face appearance decomposition
In recent years, deep neural networks have been applied to estimate face parameters such as geometry and appearance properties, with encouraging results. Recent studies on face reconstruction [Tewari_2017_ICCV, Shu_2017_CVPR, SfSNet] rely on statistical face models to constrain geometry and appearance estimation. Tewari et al. [Tewari_2017_ICCV] introduce MoFA: a self-supervised learning approach to train a model-based autoencoder CNN architecture. The CNN is able to fit a 3D morphable model [blanz1999morphable] to single images by estimating shape and reflectance and regressing scene illumination. The encoder learns to extract face parameters and the decoder uses a differentiable image formation model to construct an image, which allows unsupervised training on real images. The self-supervised loss is the error between the constructed image and the input. Kim et al. [kim17InverseFaceNet] introduced InverseFaceNet, which estimates 3DMM parameters including colour reflectance and illumination. The CNN was trained on a synthetic dataset and used a breeding method to increase variability in the training data. Again, the skin reflectance is estimated using a statistical appearance model [blanz1999morphable]. Shu et al. [Shu_2017_CVPR] proposed unsupervised autoencoder networks to learn the components of facial appearance: albedo, normal and lighting. They combined constraints on each component with an adversarial loss on image reconstruction. However, the resulting decomposition is still not realistic. Sengupta et al. [SfSNet], on the other hand, start with supervised training on synthetic data and later fine-tune this network on real data to obtain albedo, normal and lighting estimates, which are then used in a later pseudo-supervision stage. The image formation relies on Lambertian reflectance. A photometric reconstruction loss is applied to validate the composition.
1.2 Biophysical skin modelling
Modelling the appearance of human skin is fundamentally challenging, due to the complexity of its layered structure and its optical properties. Tsumura et al. [Tsumura:2003:ISC] presented an image-based method to recover the concentrations of melanin and haemoglobin from a colour face image using Independent Component Analysis (ICA). Their method is restricted to specific light and camera combinations. Donner et al. [donner2008layered] use multispectral and polarised light to derive biophysical skin parameter maps from a 2D planar sample. Gitlina et al. [gitlina2018practical] combined polarised spherical gradient illumination with multispectral lighting to acquire spectral skin reflectance. Other studies focused on building models that accurately simulate face appearance by applying skin optics with biophysical components to reproduce spectral and spatial responses. Krishnaswamy and Baranoski [krishnaswamy2004biophysically] introduced the BioSpec model, with about twenty-four physically meaningful parameters to simulate the light interaction within five layers of human skin. BioSpec is computationally expensive and very difficult to invert. Claridge and co-authors [Claridge2003489, preece2003imaging, preece2004spectral] combined a calibrated camera with a two or three parameter model based on Kubelka-Munk theory to measure skin parameters. Jimenez et al. [JIMENEZ2010] presented a skin model to simulate dynamic effects caused by facial expressions.
Our model-based decoder simulates the spectral formation of an RGB image. This requires a number of basic components that we describe in this section. A number of assumptions underlie the choice of these components. We assume: 1. Images are captured by a camera that correctly white balances the scene and uses a fixed gamma, 2. Scene illumination is spectrally uniform, 3. Skin reflectance follows the dichromatic reflectance model. Assumption 2 is clearly violated in real images. For example, shadowed regions will be illuminated by different spectra to directly lit parts of the face. However, allowing spatially varying illumination spectra adds significant complexity and ambiguity to the problem and we leave this to future work.
2.1 Spectral image formation
A tristimulus RGB image arises from an integration over wavelength, $\lambda$, of the product of scene radiance (itself the product of illumination and reflectance spectra) and camera spectral sensitivity:
$$ i_c = \int E(\lambda)\, R(\lambda)\, S_c(\lambda)\, d\lambda, \qquad c \in \{\mathrm{R}, \mathrm{G}, \mathrm{B}\}, \tag{1} $$
where $E(\lambda)$ is the spectral power distribution (SPD) of the illuminant, $R(\lambda)$ the spectral reflectance of the surface and $S_c(\lambda)$ the spectral sensitivity of the camera in colour channel $c$.
2.2 Wavelength-discrete spectral image formation
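We approximate this continuous model by discretising wavelength at $N$ locations:
$$ i_c = \sum_{k=1}^{N} S_{k,c}\, e_k\, r_k = \mathbf{S}_c^\top \operatorname{diag}(\mathbf{e})\, \mathbf{r}, \tag{2} $$
where $\mathbf{S} \in \mathbb{R}^{N \times 3}$, $\mathbf{e} \in \mathbb{R}^{N}$ and $\mathbf{r} \in \mathbb{R}^{N}$ are the wavelength-discrete versions of the camera sensitivities, illuminant SPD and spectral reflectance respectively. We use $\mathbf{S}_c$ to refer to the column of $\mathbf{S}$ corresponding to colour channel $c$.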
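As a concrete illustration of the wavelength-discrete model, the following minimal sketch evaluates the sum over wavelength samples for toy data. The five-band spectra, the sensitivity values and the function name `render_raw_rgb` are assumptions made for illustration only; they are not taken from the paper.

```python
def render_raw_rgb(sensitivities, illuminant, reflectance):
    """Wavelength-discrete image formation.

    sensitivities: three lists (R, G, B), one value per wavelength sample.
    illuminant, reflectance: one value per wavelength sample.
    Returns the raw RGB response as a list of three floats.
    """
    # Scene radiance is the element-wise product of illumination and reflectance.
    radiance = [e * r for e, r in zip(illuminant, reflectance)]
    # Each channel integrates (here: sums) radiance weighted by its sensitivity.
    return [sum(s * l for s, l in zip(channel, radiance)) for channel in sensitivities]


# Toy five-band example (illustrative values only).
toy_sensitivities = [
    [0.0, 0.0, 0.1, 0.5, 0.4],  # red channel
    [0.1, 0.4, 0.4, 0.1, 0.0],  # green channel
    [0.6, 0.3, 0.1, 0.0, 0.0],  # blue channel
]
flat_illuminant = [1.0, 1.0, 1.0, 1.0, 1.0]
reddish_reflectance = [0.1, 0.2, 0.4, 0.6, 0.7]

raw = render_raw_rgb(toy_sensitivities, flat_illuminant, reddish_reflectance)
```

With these toy values the red response is the largest, as expected for a surface that reflects mostly long wavelengths.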
2.3 Colour transformation pipeline
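The raw colours measured by a sensor, $\mathbf{i}_{\text{raw}}$, are transformed by the camera in order to produce perceptually pleasing images. The purpose is to normalise for lighting and sensor-specific effects and to apply a nonlinear mapping that compresses intensities to a dynamic range that can be stored and displayed. The precise details of this pipeline are camera-specific; however, we assume the following generic model that is a good approximation for most cameras:
$$ \mathbf{i}_{\text{sRGB}} = f_{\gamma}\big( \mathbf{T}_{\text{xyz2rgb}}\, \mathbf{T}_{\text{raw2xyz}}(\mathbf{S})\, \mathbf{T}_{\text{WB}}(\mathbf{S}, \mathbf{e})\, \mathbf{i}_{\text{raw}} \big), \tag{3} $$
where $\mathbf{S}$ and $\mathbf{e}$ are the wavelength-discrete camera sensitivities and illuminant SPD.
The first transformation, $\mathbf{T}_{\text{WB}}$, performs white balancing for a given illuminant and camera. Specifically, it divides each channel by the colour of the light source as recorded by the sensor:
$$ \mathbf{T}_{\text{WB}}(\mathbf{S}, \mathbf{e}) = \operatorname{diag}(\mathbf{S}^\top \mathbf{e})^{-1}. \tag{4} $$
The second transformation, $\mathbf{T}_{\text{raw2xyz}}$, converts from the camera-specific colour space to the standardised XYZ space:
$$ \mathbf{T}_{\text{raw2xyz}}(\mathbf{S}) = \mathbf{C}^\top (\mathbf{S}^{+})^\top, \tag{5} $$
where $\mathbf{C} \in \mathbb{R}^{N \times 3}$ contains the wavelength-discrete CIE-1931 2-degree colour matching functions and $\mathbf{S}^{+}$ is the pseudoinverse of $\mathbf{S}$ [jiang2013space]. This is a least squares solution to transform the camera's spectral sensitivities to the CIE standard. We additionally rescale each row such that its sum is unity to preserve white balance, such that $\mathbf{T}_{\text{raw2xyz}}\, \mathbf{1} = \mathbf{1}$. The final transformation is a fixed matrix, $\mathbf{T}_{\text{xyz2rgb}}$, to convert to sRGB space:
$$ \mathbf{i}_{\text{linRGB}} = \mathbf{T}_{\text{xyz2rgb}}\, \mathbf{i}_{\text{xyz}}, \tag{6} $$
after which a final nonlinear gamma transformation is applied:
$$ \mathbf{i}_{\text{sRGB}} = f_{\gamma}(\mathbf{i}_{\text{linRGB}}), \tag{7} $$
where $f_{\gamma}$ acts element-wise and we assume that its parameters are fixed.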
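The white balancing step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the five-band spectra and the helper names `raw_rgb` and `white_balance` are hypothetical. It shows the property that motivates the step: a perfectly white surface maps to equal channel values whatever the colour of the light.

```python
def raw_rgb(sensitivities, illuminant, reflectance):
    """Wavelength-discrete image formation: one value per colour channel."""
    return [
        sum(s * e * r for s, e, r in zip(channel, illuminant, reflectance))
        for channel in sensitivities
    ]


def white_balance(rgb, sensitivities, illuminant):
    """Divide each channel by the light-source colour as recorded by the sensor."""
    ones = [1.0] * len(illuminant)
    # The sensor response to the bare illuminant is the response to a
    # surface with unit reflectance at every wavelength.
    light_colour = raw_rgb(sensitivities, illuminant, ones)
    return [value / light for value, light in zip(rgb, light_colour)]


# Toy five-band data (illustrative values only).
sensitivities = [
    [0.0, 0.0, 0.1, 0.5, 0.4],  # red channel
    [0.1, 0.4, 0.4, 0.1, 0.0],  # green channel
    [0.6, 0.3, 0.1, 0.0, 0.0],  # blue channel
]
reddish_light = [0.2, 0.4, 0.6, 0.9, 1.0]
white_surface = [1.0] * 5
grey_surface = [0.5] * 5

balanced_white = white_balance(
    raw_rgb(sensitivities, reddish_light, white_surface), sensitivities, reddish_light
)
balanced_grey = white_balance(
    raw_rgb(sensitivities, reddish_light, grey_surface), sensitivities, reddish_light
)
```

Here the white surface maps to equal values in all three channels and the 50% grey surface to half of that, even though the raw responses under the reddish light are strongly unbalanced.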
2.4 Multispectral dichromatic model
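The dichromatic model [shafer1985using] assumes that scene radiance, $L(\lambda)$, is a sum of body (diffuse) and surface (specular) reflected components. Further, it divides each source of radiance into a part that depends on geometry (informally "shading") and a wavelength-dependent part (informally "colour"). The body reflection arises from subsurface scattering and modifies the SPD of the light through absorption, whereas the surface reflection happens at the interface and does not, meaning the model can be written as:
$$ L(\lambda) = f_d\, E(\lambda)\, R(\lambda) + f_s\, E(\lambda), \tag{8} $$
where $f_d$ and $f_s$ are the diffuse and specular shading respectively, $E(\lambda)$ is the illuminant SPD and $R(\lambda)$ is the body reflectance. In wavelength-discrete terms, with $\mathbf{S}$, $\mathbf{e}$ and $\mathbf{r}$ denoting the discretised camera sensitivities, illuminant SPD and reflectance, this becomes:
$$ \mathbf{i}_{\text{raw}} = \mathbf{S}^\top \big( f_d \operatorname{diag}(\mathbf{e})\, \mathbf{r} + f_s\, \mathbf{e} \big). \tag{9} $$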
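To make the separation into shading and colour concrete, the sketch below renders a single pixel under the dichromatic model. It is a toy illustration: the spectra, shading values and the name `dichromatic_raw_rgb` are assumptions, not quantities from the paper.

```python
def dichromatic_raw_rgb(sensitivities, illuminant, reflectance,
                        diffuse_shading, specular_shading):
    """Raw RGB of one pixel under the dichromatic model.

    The body term is filtered by the surface reflectance; the surface
    (specular) term keeps the spectrum of the light source.
    """
    rgb = []
    for channel in sensitivities:
        body = sum(s * e * r for s, e, r in zip(channel, illuminant, reflectance))
        surface = sum(s * e for s, e in zip(channel, illuminant))
        rgb.append(diffuse_shading * body + specular_shading * surface)
    return rgb


# Toy five-band data (illustrative values only).
sensitivities = [
    [0.0, 0.0, 0.1, 0.5, 0.4],  # red channel
    [0.1, 0.4, 0.4, 0.1, 0.0],  # green channel
    [0.6, 0.3, 0.1, 0.0, 0.0],  # blue channel
]
illuminant = [1.0, 1.0, 1.0, 1.0, 1.0]
skin_like_reflectance = [0.1, 0.2, 0.4, 0.6, 0.7]

pixel = dichromatic_raw_rgb(sensitivities, illuminant, skin_like_reflectance, 0.8, 0.3)
highlight_only = dichromatic_raw_rgb(sensitivities, illuminant, skin_like_reflectance, 0.0, 0.3)
```

Because the specular term carries the colour of the light, `highlight_only` is neutral under this flat toy illuminant, whereas the diffuse term tints `pixel` towards the reflectance colour.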
2.5 Statistical camera model
The space of camera spectral sensitivities has been shown to be low dimensional. Using PCA to build a statistical model, Jiang et al. [jiang2013space] showed that two dimensions were sufficient to capture 97% of the variance of a data set of 28 empirically measured sensitivities. Accordingly, any spectral sensitivity can be approximated as:
$$ \mathbf{S}(\mathbf{b}) = \bar{\mathbf{S}} + \mathbf{P} \operatorname{diag}(\sigma_1, \dots, \sigma_K)\, \mathbf{b}, \tag{10} $$
where $\mathbf{P}$ contains the first $K$ principal components, $\sigma_1^2, \dots, \sigma_K^2$ are the corresponding eigenvalues, $\bar{\mathbf{S}}$ is the mean sensitivity and $\mathbf{b} \in \mathbb{R}^{K}$ is the parametric representation of $\mathbf{S}$. We use $K = 2$ dimensions. Under the assumption that the original data is Gaussian distributed, the parameters are normally distributed: $\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K)$.
2.6 Physical lighting model
Our spectral illumination model is physically-based. We assume that the scene illumination can be approximated by a linear combination of CIE standard illuminants A, D and F respectively representing incandescent light, phases of daylight and fluorescent lights of various composition. Illuminant D requires an additional parameter representing the colour temperature ranging from 4,000 to 25,000K. Illuminant F is itself a linear combination of 12 measured fluorescent sources. Hence, our illumination model is given by:
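$$ \mathbf{e} = w_A\, \mathbf{e}_A + w_D\, \mathbf{e}_D(t) + \sum_{j=1}^{12} w_{F_j}\, \mathbf{e}_{F_j}, \tag{11} $$
where $w_A$, $w_D$ and $w_{F_1}, \dots, w_{F_{12}}$ are the weights for each illuminant type, $t$ is the correlated colour temperature and $\mathbf{e}_A$, $\mathbf{e}_D(t)$ and $\mathbf{e}_{F_j}$ are the spectra of the standard illuminants.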
3 Biophysical spectral skin model
We now constrain the multispectral dichromatic model in (9) using a biophysical human skin model. This has only two free biophysical parameters, such that the resulting biophysical dichromatic model has four unknowns per pixel in total. Our biophysical spectral reflectance model is closely related to a number of existing models [Claridge2003489, JIMENEZ2010, krishnaswamy2004biophysically, preece2004spectral, donner2008layered]. However, for the challenging task we seek to solve, we focus on simplicity and the minimum number of free parameters. Specifically, only the melanin and haemoglobin concentrations vary spatially, while all other parameters are based on validated approximation functions or measured data for healthy skin [jacques1998skin, Alotaibi_2017_ICCV, ANDERSON198113, THODY1991340, JIMENEZ2010, prahl1999optical, Flewelling1999, krishnaswamy2004biophysically]. We use the simplified two-layer skin structure model of [alotaibi2019decomposing]. The epidermis is the outer layer and contains the melanin pigment, which originates from melanosomes and absorbs the blue wavelengths; the rest of the light is mainly forward scattered. The deeper layer is the dermis, which contains blood vessels carrying the haemoglobin pigment. Haemoglobin absorbs light in the blue and green wavelengths, while the rest of the light is reflected back and reaches the epidermis, where absorption and forward scattering occur again before the light exits the skin. This simplified model is written as:
$$ R(f_{\text{mel}}, f_{\text{blood}}, \lambda) = T_{\text{epidermis}}(f_{\text{mel}}, \lambda)^2\, R_{\text{dermis}}(f_{\text{blood}}, \lambda), \tag{12} $$
where $f_{\text{mel}}$ is the epidermal melanosome volume fraction and $f_{\text{blood}}$ is the dermal blood volume fraction, each restricted to a physically plausible range. $T_{\text{epidermis}}$ is the proportion of light transmitted through the epidermis, modelled using the Lambert-Beer law and squared because light crosses the epidermis twice, and $R_{\text{dermis}}$ is the proportion of light reflected from the dermis, modelled by Kubelka-Munk theory. In wavelength-discrete terms, we write $\mathbf{r}(f_{\text{mel}}, f_{\text{blood}})$ for the vector of diffuse spectral reflectance, which can be substituted into (9).
Our overall architecture is shown in Fig. 2. At the most abstract level, this consists of a trainable convolutional encoder that estimates semantically meaningful parameters and a fixed, differentiable, model-based decoder that implements spectral image formation to transform these parameters back into an image. The semantic representation consists of four image quantities (the two biophysical parameter maps and diffuse and specular shading maps) and two vector quantities (parameters for the physical lighting and statistical camera models).
4.1 Trainable encoder
The encoder is a CNN that itself has an encoder/decoder architecture, all of which is trainable. We invert the nonlinear gamma on the input image such that the input is in linear space, and appearance losses are calculated without applying gamma, i.e. also in linear space. We found that this gave more stable convergence than using nonlinear input images and applying gamma to our rendered output. The maps are predicted by a fully convolutional network with skip connections, following a U-net [ronneberger9351u] style architecture but with separate decoders for each map. The encoder/decoder consists of three convolutions per resolution, with a different filter set at each resolution. Each convolution is followed by batch normalisation and a ReLU nonlinearity, and finally by max-pooling. From the lowest spatial resolution, a fully connected branch predicts the vector quantities (the parameters of the lighting and camera models). Since the encoder used to predict all 6 quantities is shared, this helps the encoder learn to disentangle the interaction of the different quantities.
Since the estimated quantities have physical meaning, they are bounded or subject to positivity constraints. The diffuse and specular maps must be positive, so the raw estimates are exponentiated. The haemoglobin and melanin maps are bounded by the physically plausible ranges in Sec. 3, which we map to the range $[-1, 1]$. Hence, the raw estimates are passed through a sigmoid function, scaled by 2 and shifted by $-1$. Similarly, the camera parameters, $\mathbf{b}$, are transformed to a bounded range covering a fixed number of standard deviations either side of the mean (assuming this captures sufficient variation), and the correlated colour temperature, $t$, is transformed to the range $[4000, 25000]$ K. There is an intrinsic scale ambiguity between the overall intensity of the light source and the diffuse/specular shading (i.e. the same image can be obtained by multiplying the illumination by 2 and dividing the shading maps by 2). We resolve this by rescaling all standard illuminants to have unit sum, $\mathbf{1}^\top \mathbf{e}_A = \mathbf{1}^\top \mathbf{e}_D(t) = \mathbf{1}^\top \mathbf{e}_{F_j} = 1$, and then taking only convex combinations in (11), i.e. we enforce that the illuminant weights satisfy $w_A + w_D + \sum_{j} w_{F_j} = 1$. This is achieved by passing the weights predicted by the encoder through a softmax layer, which also ensures their positivity. This guarantees $\mathbf{1}^\top \mathbf{e} = 1$ and so fixes the scale of illumination.
4.2 Model-based decoder
The model-based decoder implements the components described in Sections 2 and 3, as shown in Fig. 3. All components are implemented in a differentiable manner, such that the gradients of the subsequent loss functions can be backpropagated through the decoder and into the trainable encoder. For efficiency, we precompute skin spectral reflectance at discrete values of the biophysical parameters within their plausible ranges and store the result in a 2D lookup table. We then use differentiable bilinear interpolation to compute reflectance for continuous parameter values. In the colour transformation pipeline, computing the raw-to-XYZ transformation requires taking a pseudoinverse of the camera spectral sensitivity. While this can be done in-network, for efficiency and stability we precompute the transformation as a lookup table indexed by the camera parameters $\mathbf{b}$ and again use bilinear interpolation.
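The lookup-table idea can be illustrated with a scalar-valued table. In the paper each table entry would be a whole reflectance spectrum (or a transformation matrix); the sketch below stores a single number per entry to stay short, and the function name `bilinear_lookup` and the convention that both coordinates are normalised to $[0, 1]$ are assumptions.

```python
def bilinear_lookup(table, u, v):
    """Bilinearly interpolate a 2D table at normalised coordinates u, v in [0, 1].

    table[i][j] holds the value precomputed at grid position (i, j);
    the table must have at least two rows and two columns.
    """
    rows, cols = len(table), len(table[0])
    # Map normalised coordinates onto continuous grid positions.
    x = min(max(u, 0.0), 1.0) * (rows - 1)
    y = min(max(v, 0.0), 1.0) * (cols - 1)
    # Index of the upper-left corner of the enclosing cell.
    i0 = min(int(x), rows - 2)
    j0 = min(int(y), cols - 2)
    tx, ty = x - i0, y - j0
    # Interpolate along the second axis, then along the first.
    top = (1.0 - ty) * table[i0][j0] + ty * table[i0][j0 + 1]
    bottom = (1.0 - ty) * table[i0 + 1][j0] + ty * table[i0 + 1][j0 + 1]
    return (1.0 - tx) * top + tx * bottom


# Toy 3x3 table sampling the function f(i, j) = 10 * i + j on the grid.
toy_table = [
    [0.0, 1.0, 2.0],
    [10.0, 11.0, 12.0],
    [20.0, 21.0, 22.0],
]
value = bilinear_lookup(toy_table, 0.25, 0.75)  # grid position (0.5, 1.5)
```

Within each grid cell the result is linear in each coordinate separately, so gradients with respect to the continuous parameters are well defined almost everywhere, which is what allows the lookup to sit inside a differentiable decoder.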
We train our network to minimise a weighted sum of four losses:
$$ \mathcal{L} = w_1 \mathcal{L}_{\text{appearance}} + w_2 \mathcal{L}_{\text{prior}} + w_3 \mathcal{L}_{\text{sparsity}} + w_4 \mathcal{L}_{\text{shading}}. $$
The first, $\mathcal{L}_{\text{appearance}}$, is a self-supervised appearance loss measuring the difference between the input and reconstructed images (see Fig. 2). Using this loss alone allows the network to converge to trivial solutions with a physically meaningless decomposition of appearance. To constrain the problem we introduce three additional priors. We enforce a statistical prior loss, $\mathcal{L}_{\text{prior}}$, on the camera sensitivity parameters, which penalises parameters that are unlikely under the Gaussian model of Sec. 2.5. Assuming that lighting is sparse and the face surface smooth, we can assume that specular reflections are sparse, and so we impose an L1 sparsity prior on the specular shading map $f_s$: $\mathcal{L}_{\text{sparsity}} = \| f_s \|_1$. Finally, we provide some weak direct supervision of the diffuse shading. Following [Shu_2017_CVPR], we use an approximate normal map and spherical harmonic parameters obtained by a rough fit of a 3D morphable model and use these to compute pseudo ground truth diffuse shading, $\tilde{f}_d$. There is an unknown scale ambiguity between this shading and the one estimated by our network, $f_d$. We therefore compute the optimal scale, $\alpha$, using simple linear regression without the intercept term and apply it to our estimate before computing an L2 shading loss: $\mathcal{L}_{\text{shading}} = \| \alpha f_d - \tilde{f}_d \|_2^2$. The three pixel-wise losses are summed over pixels and normalised by the number of foreground masked pixels.
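The scale-invariant shading loss and the sparsity prior can be sketched as follows. Linear regression without an intercept has the closed-form solution $\alpha = \sum_p f_p \tilde{f}_p / \sum_p f_p^2$ over foreground pixels $p$, which is what `optimal_scale` computes. The flattened-list representation, the function names and the toy values are assumptions for illustration, not the authors' implementation.

```python
def optimal_scale(estimate, target, mask):
    """Least-squares scale alpha minimising sum(mask * (alpha * estimate - target)^2)."""
    numerator = sum(m * a * b for m, a, b in zip(mask, estimate, target))
    denominator = sum(m * a * a for m, a in zip(mask, estimate))
    # Assumption: fall back to a unit scale when the estimate is all zero.
    return numerator / denominator if denominator > 0.0 else 1.0


def shading_loss(estimate, target, mask):
    """Squared error after optimal rescaling, averaged over foreground pixels.

    Assumes the mask contains at least one foreground pixel.
    """
    alpha = optimal_scale(estimate, target, mask)
    squared = sum(m * (alpha * a - b) ** 2 for m, a, b in zip(mask, estimate, target))
    return squared / sum(mask)


def sparsity_loss(specular, mask):
    """L1 penalty on specular shading, averaged over foreground pixels."""
    return sum(m * abs(v) for m, v in zip(mask, specular)) / sum(mask)


# Toy flattened maps with four pixels; the last pixel is background.
mask = [1, 1, 1, 0]
estimated_shading = [1.0, 2.0, 3.0, 9.0]
proxy_shading = [2.0, 4.0, 6.0, 0.0]
specular_shading = [0.0, 0.3, 0.0, 0.9]

alpha = optimal_scale(estimated_shading, proxy_shading, mask)
loss_shading = shading_loss(estimated_shading, proxy_shading, mask)
loss_sparsity = sparsity_loss(specular_shading, mask)
```

In this toy case the estimate differs from the proxy only by a global factor of two on the foreground, so the rescaled shading loss vanishes; the background pixel is ignored by both losses.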
We implement our network using the autonn wrapper for MatConvNet. We train on a 50k subset of the CelebA dataset [liu2015deep], as in [Shu_2017_CVPR], using SGD with manually chosen values for the learning rate and the four loss weights.
In Fig. 1, we present qualitative results on unseen test images from CelebA. The lips and flushed cheeks clearly appear in the haemoglobin maps with high concentrations, and the overall melanin maps reflect skin colour accurately. The specular maps detect the specular reflections, and the diffuse shading is blurred as a result of subsurface scattering. We compute the diffuse albedo directly from the biophysical spectral reflectance. The fourth row shows a failure case where shadowing is interpreted as high haemoglobin. In Fig. 4, we show results of an editing application. We edit an estimated map and then recompute the final image as in (7). In the first row, we remove specular reflections by setting the specular map to a constant. The apparent colour change between these images after removing the specular component is consistent with [alotaibi2019decomposing], where multispectral data is used. In the second row, we increase the melanin pigment by 0.6, which darkens the skin, as if the face had been sun-tanned. In the last row, we scale the haemoglobin by 0.5, which gives a flushed appearance, as if the face were overheated.
BioFaceNet is the first attempt to decompose real images into biophysical maps and diffuse and specular shading. Moreover, there is no ground truth available for this task, since no existing device or method can estimate these quantities from real images. For this reason, we propose a new benchmark based on pseudo ground truth computed from multispectral images, giving our network access only to RGB images rendered from the multispectral data. We use the decomposition method proposed by Alotaibi and Smith [alotaibi2019decomposing] and apply it to 25 multispectral face images from the ISET database [ImageVal]. This provides pseudo ground truth for the four maps. We then render the multispectral images to RGB using D65 illumination and the mean camera sensitivity and provide this image as input to our CNN. We measure the RMSE of each map against the pseudo ground truth. We compare against the diffuse shading and albedo obtained by a state-of-the-art method [SfSNet]. In Fig. 5 we show qualitative results and in Table 1 we show quantitative results. Our approach provides better performance than [SfSNet], though note that our sparse specularity prior is too severe and the albedo appears saturated compared to ground truth.
We have tackled a highly ambitious task: attempting to decompose a single, uncontrolled image into a biophysical and spectral explanation of the appearance. The main conclusion of our work is that the constraint afforded by restricting reflectance to the space of biophysically plausible skin colours enables a decomposition to be obtained that is qualitatively convincing and quantitatively better than a state-of-the-art inverse rendering method. An obvious extension is to combine this work with methods that estimate 3D face geometry. We currently do not constrain the two shading maps such that they are consistent with an underlying geometry and illumination environment. This additional constraint may improve performance and help disambiguate the task. We would also like to explore whether the intrinsic parameter maps can be used for recognition and whether a recognition loss could be used to help disambiguate the decomposition. Our biophysical colouration model could be made at least partially learnable and adversarial losses could help improve the realism of renderings of the model output (for example by applying transformations to the parameter maps or camera/illumination parameters) while still requiring that the output image is realistic.