How deep should be the depth of convolutional neural networks: a backyard dog case study
Abstract
We present a straightforward non-iterative method for shallowing deep Convolutional Neural Networks (CNNs) by combining several layers of a CNN with Advanced Supervised Principal Component Analysis (ASPCA) of their outputs. We tested this new method on a practically important case of ‘friend-or-foe’ face recognition. This is the backyard dog problem: the dog should (i) distinguish the members of the family from possible strangers and (ii) identify the members of the family. Our experiments revealed that the method is capable of drastically reducing the depth of deep learning CNNs, albeit at the cost of mild performance deterioration.
1 Introduction
IT giants have produced many software “semi-products” for image recognition. This new opportunity gave rise to many works in face recognition. These works, and popular criticism of their results, show that the performance of these systems is problem-dependent and that the devil is in the detail of testing and validation: systems that are almost perfect for one problem can be useless for another. In this paper we focus on a problem which, on the one hand, appears to be a close relative of the face recognition applications and yet, on the other hand, is somewhat more relaxed.
The ‘backyard dog’ problem. This dog should (i) separate friends from strangers and, at the same time, (ii) distinguish the members and friends of the family from each other (identity verification).
The ‘backyard dog’ should separate a relatively small set of friends from the huge set of potential strangers and identify the friends’ identities (for example, in security systems with differentiated access). Such a dog is needed, e.g., for smart house projects.
The main problem is the separation of a small, exactly specified set of clearly defined persons from many other persons, the potential strangers. As we can see from Table 1, the largest available face database contains about 8 million people. This is much more than the number of potential strangers for an individual house but much less than the total world population of 7.6 billion Population ().
The backyard dog problem can be considered as a relaxation of the overwhelmingly challenging problem of recognition of all individuals, both strangers and friends. In that general problem, the predicate ‘the two given images correspond to the same person’ should be implemented reliably for a large database: the system should distinguish (almost) any two different persons and recognise with high reliability whether two given images belong to the same person. This approach demands excessive resources to assess an image of a given person against all records in a global database. Such global assessment, however, is not necessary in relevant practical applications. For example, it may not be important whether two images of a stranger correspond to a single individual or to two different strangers. All we may want to know is whether these images are of a stranger or of a friend.
Two more important properties of the backyard dog problem specification should be mentioned: the problem must be solved quickly on a relatively weak CPU and with a relatively small amount of RAM. In this work, we have chosen Raspberry Pi 3 as a possible hardware platform. Moreover, we restricted ourselves to a single core of the processor, assuming that the remaining three cores are used for face framing and other smart house purposes. The total amount of Pi RAM is 1 GiB. This means that for the backyard dog problem we can use no more than 300 MiB.
Now we can specify the backyard dog problem. There is a list of Family Members and friends (FMs) with several images of each member. There is also a database of other persons’ images (potential strangers). The backyard dog must discriminate FMs from other persons and identify the specific FM if an image is recognised as that of an FM.
We impose a technical restriction: the execution of the CNN on the platform must use no more than 1 second of a single thread of the Pi CPU and no more than 300 MiB of memory.
2 Legacy systems for face recognition
Reports about the performance of neural networks in the face recognition problem are very optimistic. Accuracy of 98% and above is often claimed Parkin2015 (); Schroff2015 (); Taigman2014 (). On the other hand, well-described testing of human performance gives much lower accuracy: even specially trained staff make 20% mistakes on faces they have never seen before WhiteErrorRate2015 (). We can guess that the reported performance of neural networks may depend on the details of the testing protocol, which may not be available in full and may differ from real-life operational conditions.
Strictly speaking, the so-called face recognition systems do not solve a conventional multiclass classification problem. They are trained to answer the question of whether two given images belong to the same person. The common idea is to map the images into a ‘feature space’ with a metric (or similarity measure) $\rho$ in such a way that $\rho(x, y) < \varepsilon$ if the images $x$ and $y$ belong to the same person, and if $\rho(x, y) > \varepsilon$ then the corresponding persons are different, where $\varepsilon$ is a threshold. For such a general system, the test should demonstrate that the system works well for persons and images never seen before (both the persons and the images should be new, for obvious reasons). For specific problems the test should also be specific.
For our backyard dog problem, recognition time is important. For time estimation we used two platforms: the Pi and an HP notebook with a four-core Intel Core i7 CPU and 8 GiB of memory (further on we call it ‘Laptop’).
Most works use legacy systems for their experiments, either directly or with some additions. Let us present some of them.
2.1 VGG
VGG published their version of a Convolutional Neural Network (CNN) for face recognition Parkin2015 (). We call this network VGGCNN VGG2014 (). It was trained on a database of 2622 persons. A small modification of this network allows us to compare two images and decide whether they depict the same person or different persons.
VGGCNN contains about 144M weights. The recommended test procedure consists of the following steps Parkin2015 ():

Scale the detected face to three sizes: 256, 384, and 512.

Crop a 224x224 fragment from each corner and from the centre of the scaled image.

Apply a horizontal flip to each crop.
Therefore, to test one face (one input image) one has to process 30 preprocessed images: $3 \times 5 \times 2 = 30$ (three scales, five crops, and each crop with and without the flip).
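The count of 30 preprocessed images can be checked by enumerating the variants. The following Python sketch is only an illustration: it assumes that the scaled face is a square image of the given size, and the names `SCALES`, `CROPS` and `crop_box` are hypothetical, not taken from the VGG code.

```python
from itertools import product

SCALES = (256, 384, 512)  # sizes of the scaled face (assumed square)
CROPS = ("top-left", "top-right", "bottom-left", "bottom-right", "centre")
FLIPS = (False, True)     # each crop is used as is and horizontally flipped


def crop_box(scale, position, size=224):
    """Return (left, top, right, bottom) of a size x size crop of a scale x scale image."""
    far = scale - size    # offset of a crop touching the right or bottom border
    mid = far // 2        # offset of the central crop
    left, top = {
        "top-left": (0, 0),
        "top-right": (far, 0),
        "bottom-left": (0, far),
        "bottom-right": (far, far),
        "centre": (mid, mid),
    }[position]
    return (left, top, left + size, top + size)


# One entry per image that has to be passed through the network.
variants = [(s, crop_box(s, c), f) for s, c, f in product(SCALES, CROPS, FLIPS)]
print(len(variants))
```

Each entry of `variants` holds the scale, the crop rectangle and the flip flag of one network evaluation, so the length of the list is the number of forward passes needed for one face.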
Processing one image with the MatLab implementation MatConvNet () on the Laptop takes approximately 0.7 s, while the TensorFlow implementation VGGTF () takes 7.3 s.
2.2 FaceNet
There are several CNNs suggested by the Google team Schroff2015 ():

NN1 with images 220x220, 140M of weights and 1.6B FLOP,

NN2 with images 224x224, 7.5M of weights and 1.6B FLOP,

NN3 with images 160x160, 7.5M of weights and 0.744B FLOP,

NN4 with images 96x96, 7.5M of weights and 0.285B FLOP.
Here, FLOP stands for the number of Floating Point Operations needed to process one image.
The testing procedure for FaceNet processes each image only once.
2.3 DeepFace
The Facebook Taigman2014 () models are DeepFace and DeepID. The CNNs for these models were trained to produce close outputs for congruous pairs of faces and distant outputs for incongruous pairs. Both models use ensembles of CNNs. The input is a pair of images. The whole construction can be presented as a two-layer composition of networks: two replicas of the CNN in the first layer (one replica for each input image) and a specially trained network implementing the predicate ‘The same person/Different persons’ in the second layer of the composition.
The Facebook team presents the DeepID system, which includes up to 200 CNNs Taigman2014 ().
2.4 Databases used
The datasets used to train the described networks are listed in Table 1. We can see that the VGG dataset is the largest publicly available collection of face images; the larger industrial datasets of Google, Facebook, or Baidu are not publicly available.
Table 1. Datasets used to train the described networks.

| Dataset | Identities | Images | Link |
|---|---|---|---|
| LFW | 5,749 | 13,233 | http://viswww.cs.umass.edu/lfw/#download |
| WDRef Chen2012 () | 2,995 | 99,773 | N/A |
| CelebFaces Sun2014 () | 10,177 | 202,599 | N/A |
| VGG Parkin2015 () | 2,622 | 2.6M | http://www.robots.ox.ac.uk/~vgg/data/vgg_face/ |
| FaceBook Taigman2014 () | 4,030 | 4.4M | N/A |
| Google Schroff2015 () | 8M | 200M | N/A |
2.5 VGGCNN, FaceNet and DeepFace comparison
We compare the solutions of the three selected teams by the volume of weights (MiB) and the required computational resources (in FLOP). We do not consider parallel implementations because of the technical restriction to a single thread. The results of the comparison are presented in Table 2.
Distributions of weights, features and time consumption along the depth of each network are presented in Figs. 1 – 4.
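Table 2. Comparison of the networks: weights, features and FLOP (all in millions), and input image size.

| Developer | Family name | Name | Weights (M) | Features (M) | FLOP (M) | Image size |
|---|---|---|---|---|---|---|
| VGG group | VGGCNN Parkin2015 () | VGG16 | 144.0 | 6.4 | 15,475 | 224 |
| Google | FaceNet Schroff2015 () | NN1 | 140.0 | 1.2 | 1,606 | 220 |
| | | NN2 | 7.5 | 2.0 | 1,600 | 224 |
| | | NN3 | 7.5 | NA | 744 | 160 |
| | | NN4 | 7.5 | NA | 285 | 96 |
| Facebook | DeepFace Taigman2014 () | DeepFacealign2D | 118.0 | 0.8 | 805 | 152 |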
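Table 3. Time of processing one image (in seconds) for different platforms and implementations (ML: MatLab, TF: TensorFlow).

| Developer | Family name | Name | Laptop ML | Laptop TF | Pi TF | Pi 1 core C++ |
|---|---|---|---|---|---|---|
| VGG group | VGGCNN Parkin2015 () | VGG16 | 0.695 | 4.723 | 75.301 | 65.909 |
| Google | FaceNet Schroff2015 () | NN1 | 0.072 | 0.490 | 7.815 | 6.840 |
| | | NN2 | 0.072 | 0.488 | 7.786 | 6.815 |
| | | NN3 | 0.033 | 0.227 | 3.620 | 3.169 |
| | | NN4 | 0.013 | 0.087 | 1.387 | 1.214 |
| Facebook | DeepFace Taigman2014 () | DeepFacealign2D | 0.036 | 0.246 | 3.917 | 3.429 |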
According to Table 3, the special C++ implementation for the Pi shows roughly the same performance as the TensorFlow (TF) implementation. Nevertheless, we have to mention that the TF implementation used all four threads and all CPU-specific accelerations, including the GPU, whereas the considered C++ implementation used only one thread, as described in the specification. This means that a simple multi-threaded modification could reduce the C++ computation time by a factor of at least 3.
Fig. 1 – Fig. 4 show that the notion of ‘deep’ network is very different for different teams: from 6 layers with weights in DeepFace to 16 such layers in VGG16.
We can see that all the considered networks require at least 30 MiB for weights (7.5M weights $\times$ 4 bytes $\approx$ 30 MiB) and 3.2 MiB for features, and use all available computational resources. The small networks satisfy the memory restrictions. The large networks, such as VGG16, NN1 or DeepFace, have more than 100M weights, i.e. require about 400 MiB or more, and do not satisfy the memory restrictions.
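The memory estimate can be reproduced with a few lines of Python. This is a rough sketch under two assumptions that are not stated explicitly in the text: each weight is stored as a 4-byte floating point number, and only the weights (not the features or the code) are counted. The weight counts are those of Table 2; the helper name `weights_mib` is hypothetical.

```python
BYTES_PER_WEIGHT = 4   # assumption: weights stored as 32-bit floats
MIB = 2 ** 20          # bytes in one MiB
RAM_LIMIT_MIB = 300    # memory restriction of the backyard dog problem

# Number of weights in millions, as listed in Table 2.
NETWORKS = {
    "VGG16": 144.0,
    "NN1": 140.0,
    "NN2": 7.5,
    "NN3": 7.5,
    "NN4": 7.5,
    "DeepFacealign2D": 118.0,
}


def weights_mib(weights_millions):
    """Memory needed for the weights alone, in MiB."""
    return weights_millions * 1e6 * BYTES_PER_WEIGHT / MIB


for name, millions in NETWORKS.items():
    size = weights_mib(millions)
    verdict = "fits" if size <= RAM_LIMIT_MIB else "too large"
    print(f"{name}: {size:.1f} MiB ({verdict})")
```

With exact MiB arithmetic the small networks come out slightly below the rounded figure of 30 MiB, and all three large networks exceed the 300 MiB limit.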
Time of calculation for all the considered networks exceeds specified restrictions.
For our purposes we select VGG net as the prototype to develop the backyard dog. Our choice is motivated by uniformity of the network structure which allows us to extract subnetworks of different depth.
3 Backyard dog problem solver
There are several well-known approaches to using a pre-trained network. For example, Simonyan2015 () suggests converting the final fully connected layers into convolutional layers. This approach allows us to apply the pre-trained network to any image of the same or greater size. Papers Parkin2015 (), Schroff2015 (), Taigman2014 () suggest removing the final fully connected layer and considering the output of the truncated network as a feature space in which to solve the problem of identity verification. Moreover, the authors of Taigman2014 () wrote about the first three layers: “These layers merely expand the input into a set of simple local features.”
In this work we apply this approach more systematically. Let us consider the outputs of CNN layers as input features of different levels. The input of the first layer contains a simple RGB representation of the image. The output of the first layer contains the first-level features. These features were created by the CNN during training without involvement of human expertise.
The authors of several works (see, for example, Parkin2015 (); Schroff2015 (); Taigman2014 (); Simonyan2015 ()) suggest extracting the initial part of a CNN to form features and using these features as inputs for a subsequent network that forms the solution. This last network can be trained with ‘frozen’ weights of the initial network.
We suggest a slightly different approach. We remove all fully connected layers to obtain a scale-free network: this network can be used for input images of arbitrary size. The outputs of such a truncated network are considered as input features for solving the backyard dog problem. We can also consider smaller networks if necessary.
To solve the backyard dog problem we apply two widely used preprocessing procedures: we centre all data by subtracting the mean vector calculated for the training set, and then we project all data onto the unit sphere (normalise each data vector to unit length). The first of these operations can be presented as a network layer which subtracts a constant vector from the input vector, and the second operation is the well-known normalisation layer used, for example, in NN1 and NN2 Schroff2015 ().
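The two preprocessing operations can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used in the experiments: the function names `fit_mean` and `preprocess` and the small constant `eps` are assumptions introduced here, and the rows of the arrays stand for the output vectors of the truncated network.

```python
import numpy as np


def fit_mean(train_outputs):
    """Mean output vector over the training set (rows are output vectors)."""
    return np.asarray(train_outputs, dtype=float).mean(axis=0)


def preprocess(outputs, mean_vector, eps=1e-12):
    """Subtract the training mean, then scale every row to unit length."""
    centred = np.asarray(outputs, dtype=float) - mean_vector
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    # eps is an illustrative safeguard against division by zero
    # for a vector that coincides with the mean.
    return centred / np.maximum(norms, eps)


# Toy example with three two-dimensional "output vectors".
train = np.array([[1.0, 2.0], [3.0, 5.0], [0.0, -1.0]])
mean_vector = fit_mean(train)
unit = preprocess(train, mean_vector)
print(np.linalg.norm(unit, axis=1))
```

The mean vector is computed once on the training set and then reused unchanged for every new image, which is why it can be represented as a layer subtracting a constant vector.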
Preprocessed data are used as an input vector to a linear layer (fully connected layer). The output of this linear layer is an output feature vector which is compared to the prototype vectors for the final problem solving. The structure of the used network is presented in Figure 5. There are several possible approaches to the linear layer construction and output vector interpretation. The approach used in our study is described in the next section.
4 Interpretation of output vector
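Consider a set of persons $P = \{p_1, \ldots, p_n\}$, where $n$ is the number of persons in the database. A set of persons $F = \{f_1, \ldots, f_m\} \subset P$ forms a family ($m$ is the number of FMs in the family). All persons who are not FMs are called ‘other persons’ or ‘strangers’. For each person $p$, $I(p)$ is the set of images of this person and $|I(p)|$ is the number of these images.
Denote the network output for image $x$ as $o(x)$. We calculate the following value:
$$d(x) = \min_{f \in F} \min_{y \in I(f)} \| o(x) - o(y) \| \qquad (1)$$
If $d(x) > t$, where $t$ is a specified threshold, then the image $x$ is interpreted as another person. If $d(x) \le t$, then we interpret image $x$ as FM $f^*$, where $f^*$ is defined as
$$f^* = \arg\min_{f \in F} \min_{y \in I(f)} \| o(x) - o(y) \| \qquad (2)$$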
Three types of errors are considered:

MF: Misclassification of an FM is an error when an image belongs to an FM but is interpreted as ‘other person’ (a ‘stranger’).

MO: Misclassification of a ‘stranger’ is an error when an image does not belong to an FM but is interpreted as an FM.

MR: Misrecognition of an FM is an error when an image belongs to an FM but is interpreted as an image of another FM.

Error rates are the fractions of the specific types of errors in testing (measured in %). The sum MF+MO measures the error rate for the solution of the ‘friend-or-foe’ problem.
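The decision rule and the three error types can be illustrated by a short Python sketch. It rests on an assumption: the value compared with the threshold is taken to be the Euclidean distance from the output vector of the tested image to the nearest stored output vector of any FM. The function names `classify` and `error_type` and the data layout (a dictionary from FM name to a list of vectors) are hypothetical.

```python
import math


def classify(output, family_outputs, threshold):
    """Return the name of the recognised FM, or None for a stranger.

    family_outputs maps each FM name to the list of stored output vectors.
    Assumption: the nearest stored vector decides both questions.
    """
    best_name, best_dist = None, math.inf
    for name, vectors in family_outputs.items():
        for vector in vectors:
            dist = math.dist(output, vector)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


def error_type(true_name, predicted, family_outputs):
    """Return 'MF', 'MO', 'MR', or None if the decision is correct."""
    if true_name in family_outputs:
        if predicted is None:
            return "MF"   # FM taken for a stranger
        if predicted != true_name:
            return "MR"   # FM taken for another FM
        return None
    return "MO" if predicted is not None else None  # stranger taken for an FM


family = {"alice": [(0.0, 0.0)], "bob": [(1.0, 0.0)]}
print(classify((0.1, 0.0), family, threshold=0.5))
print(classify((5.0, 5.0), family, threshold=0.5))
```

Counting the values returned by `error_type` over all test images and dividing by the number of tested images of the corresponding group gives the error rates in the sense used here.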
5 Linear layer formation
The specified procedure of output interpretation determines specific requirements for the linear layer: we need to find a $k$-dimensional subspace $S$ in the space of outputs such that the distance between projections onto $S$ of outputs for images of the same person is small and the distance between projections onto $S$ of outputs for images of different persons is relatively large. This problem is considered, for example, in Zinovyev2000 (); KorenCarmel2004 (); Gorban2009 (). In the formulas below we follow Mirkes2016 (). The projection of a vector $z$ onto the subspace defined by orthonormal vectors $\{v_i\}$ is $Vz$, where $V$ is the matrix with $i$th row $v_i$ ($i = 1, \ldots, k$). Select the target functional in the form
$$D_C = D_B - \frac{\alpha}{n} \sum_{i=1}^{n} D_{W_i} \to \max \qquad (3)$$
where

$n$ is the number of persons,

$D_B$ is the mean squared distance between projections of output vectors of different persons:
$$D_B = \frac{1}{\sum_{i<j} |I(p_i)|\,|I(p_j)|} \sum_{i<j} \sum_{x \in I(p_i)} \sum_{y \in I(p_j)} \| V o(x) - V o(y) \|^2 \qquad (4)$$

$D_{W_i}$ is the mean squared distance between projections of output vectors of person $p_i$:
$$D_{W_i} = \frac{2}{|I(p_i)|\,(|I(p_i)| - 1)} \sum_{\{x, y\} \subset I(p_i)} \| V o(x) - V o(y) \|^2 \qquad (5)$$

the parameter $\alpha$ defines the relative cost of closeness of points of the same person in comparison to the distance between points of different persons.

Here $o(x)$ is the network output for image $x$ and $I(p_i)$ is the set of images of person $p_i$.
The space of the $k$-dimensional linear subspaces of a finite-dimensional space (the Grassmannian manifold) is compact; therefore, the solution of (3) exists. The orthonormal basis of this subspace (the rows of the matrix $V$) is, by definition, the set of the first $k$ Advanced Supervised Principal Components (ASPCs) Mirkes2016 (). They are the first $k$ principal axes of the quadratic form defined by (3) KorenCarmel2004 (); Mirkes2016 ().
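A compact NumPy sketch of this construction is given below. It is an illustration under stated assumptions rather than the code of Mirkes2016 (): the between-person term is averaged over all pairs of vectors of different persons, each within-person term is averaged over the unordered pairs of vectors of one person, and the components are taken as the eigenvectors of the resulting symmetric matrix with the largest eigenvalues. The names `pair_scatter`, `aspca` and `alpha` are hypothetical.

```python
import numpy as np


def pair_scatter(A):
    """Sum of (a - b)(a - b)^T over all unordered pairs of rows of A."""
    s = A.sum(axis=0)
    return len(A) * (A.T @ A) - np.outer(s, s)


def aspca(vectors, labels, k, alpha=1.0):
    """Return a k x dim matrix whose rows are the leading components.

    Assumes at least two different persons in `labels`.
    """
    X = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    persons = np.unique(labels)

    all_pairs = pair_scatter(X)               # pairs of any two vectors
    n_all = len(X) * (len(X) - 1) / 2
    within_sum = np.zeros_like(all_pairs)     # raw sum over same-person pairs
    within_means = np.zeros_like(all_pairs)   # sum of per-person mean scatters
    n_within = 0.0
    for person in persons:
        Xp = X[labels == person]
        n_pairs = len(Xp) * (len(Xp) - 1) / 2
        if n_pairs == 0:
            continue                          # a single image has no pairs
        S = pair_scatter(Xp)
        within_sum += S
        within_means += S / n_pairs
        n_within += n_pairs

    # Pairs of different persons = all pairs minus same-person pairs.
    between_mean = (all_pairs - within_sum) / (n_all - n_within)
    Q = between_mean - alpha / len(persons) * within_means

    eigvals, eigvecs = np.linalg.eigh(Q)      # Q is symmetric
    order = np.argsort(eigvals)[::-1][:k]     # largest eigenvalues first
    return eigvecs[:, order].T


# Two persons separated along the first axis, varying along the second.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
V = aspca(X, ["a", "a", "b", "b"], k=1)
print(V)
```

The sketch uses the fact that the functional is the trace of $V Q V^T$ for a symmetric matrix $Q$, so its maximum over matrices with $k$ orthonormal rows is attained on the eigenvectors of $Q$ with the $k$ largest eigenvalues. In the toy example the selected direction is the first coordinate axis (up to sign), along which the two persons differ.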
6 Testing protocol
We used a database with 25,402 images of 654 persons database2018 () (38.84 images per person on average). We randomly select 327 persons for the training set. The other 327 persons form the test set. All persons with fewer than 10 images are transferred from the test set to the training set. Then we randomly select 100 sets of 10 different persons as examples of families. All persons who are not used in the test sets are included in the training set. Finally we have 250 persons in the test set and 404 persons in the training set. Denote the set of persons in the training set as $T$.
For each truncated VGG16 and each image $x$ from the training set we calculate the output vector $o(x)$. Then we calculate the mean vector
$$\bar{o} = \frac{1}{\sum_{p \in T} |I(p)|} \sum_{p \in T} \sum_{x \in I(p)} o(x), \qquad (6)$$
where $I(p)$ is the set of images of person $p$. The output of the subtraction layer is
$$\tilde{o}(x) = o(x) - \bar{o}. \qquad (7)$$
After this subtraction, each vector is normalised to unit length.
We find the ASPCs for the set of vectors calculated for the training set by solving problem (3). Now we need to identify two parameters: the threshold of the stranger/FM separation and the number of ASPCs. For each test set (family) and any given number of ASPCs, we search for the threshold which minimises the sum of two errors, MF and MO. Then we calculate the mean values of MF (MMF), MO (MMO), and MR (MMR) over all test sets. The number of ASPCs which provides the minimum of MMF+MMO is selected as optimal.
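The threshold search for one family can be sketched in plain Python. The sketch makes two assumptions that go beyond the text: every tested image is summarised by a single distance to the family (smaller means more FM-like), and each error rate is normalised by the number of tested images of its own group (FM images for MF, stranger images for MO). The function name `best_threshold` is hypothetical.

```python
def best_threshold(fm_distances, stranger_distances):
    """Return (threshold, MF rate, MO rate) with the smallest MF + MO, in %.

    An image is accepted as an FM when its distance is <= threshold.
    """
    # The optimum is attained at an observed distance, or below all of them.
    candidates = [float("-inf")] + sorted(set(fm_distances) | set(stranger_distances))
    best = None
    for t in candidates:
        mf = 100.0 * sum(d > t for d in fm_distances) / len(fm_distances)
        mo = 100.0 * sum(d <= t for d in stranger_distances) / len(stranger_distances)
        if best is None or mf + mo < best[1] + best[2]:
            best = (t, mf, mo)
    return best


# Toy distances for one family: FM images and stranger images.
fm = [0.1, 0.2, 0.3, 0.9]
strangers = [0.5, 0.8, 1.0, 1.2]
print(best_threshold(fm, strangers))
```

Repeating this search for every family and every candidate number of ASPCs, and averaging the resulting rates over the families, gives the MMF and MMO values whose sum is minimised.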
7 Results
Several experiments with time measurement are presented in Table 4. In Tables 5 – 8 we present the testing results for different models. The best model for the network with 17 layers uses 70 ASPCs and the optimal network with 5 layers uses 60 ASPCs. This 5-layer network with 60 ASPCs meets the technical requirements (image processing time below 1 s on one core of the Pi) and demonstrates reasonable performance in the solution of the ‘friend-or-foe’ problem, with an MF+MO rate of less than 6%. The maximal MF+MO rate among the 100 randomly selected test sets is 1.8 times higher than the average for both the 17-layer and the 5-layer networks (with the optimal number of ASPCs).
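Table 4. Time measurements (in seconds) for networks of different depth and input image size (ML: MatLab, TF: TensorFlow; empty cells: not measured).

| Image size | Layers | ML T1 | ML T2 | TF Laptop T1 | TF Laptop T2 | TF Pi T1 | TF Pi T2 | C++ Laptop | C++ Pi |
|---|---|---|---|---|---|---|---|---|---|
| 224 | 37 | 0.67 | 0.72 | 7.35 | 7.05 | | | | |
| 224 | 35 | 0.73 | 0.67 | | | | | | |
| 224 | 31 | 0.62 | 0.66 | | | | | | |
| 128 | 31 | 0.25 | 0.24 | | | | | | |
| 96 | 31 | 0.19 | 0.21 | 0.96 | 0.95 | 17.08 | 17.31 | | |
| 64 | 31 | 0.07 | 0.07 | 0.61 | 0.64 | 11.32 | 11.28 | | |
| 96 | 24 | 0.12 | 0.13 | 0.59 | 0.43 | 7.44 | 8.91 | | |
| 64 | 24 | 0.06 | 0.06 | 0.35 | 0.35 | 7.20 | 7.27 | 1.21 | 5.69 |
| 64 | 17 | | | | | | | 0.81 | 3.66 |
| 64 | 10 | | | | | | | 0.39 | 1.61 |
| 64 | 05 | | | | | | | 0.17 | 0.70 |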
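Table 5. Error rates (%).

| Layers | MR | MF | MO | MF+MO |
|---|---|---|---|---|
| 24 | 11.00 | 11.00 | 0.01 | 11.01 |
| 17 | 14.39 | 14.39 | 2.82 | 17.22 |
| 10 | 16.71 | 16.71 | 5.86 | 22.57 |
| 5 | 12.58 | 12.58 | 2.57 | 15.14 |

Table 6. Error rates (%).

| Layers | MR | MF | MO | MF+MO |
|---|---|---|---|---|
| 24 | 4.16 | 4.13 | 1.09 | 5.22 |
| 17 | 7.69 | 7.65 | 1.75 | 9.39 |
| 10 | 10.94 | 10.82 | 3.64 | 14.46 |
| 5 | 6.58 | 6.52 | 2.01 | 8.53 |

Table 7. Error rates (%).

| Layers | MR | MF | MO | MF+MO |
|---|---|---|---|---|
| 17 | 4.80 | 4.80 | 1.22 | 6.02 |
| 5 | 9.69 | 8.16 | 2.06 | 10.22 |

Table 8. Error rates (%).

| Layers | MR | MF | MO | MF+MO |
|---|---|---|---|---|
| 17 | 2.50 | 2.46 | 0.81 | 3.27 |
| 5 | 4.39 | 4.30 | 1.48 | 5.78 |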
A heuristic definition of an efficient neural network was proposed in 1993: realisation of maximal performance (or skills) with a minimal number of connections (parameters) Gordienko1993 (). Various algorithms for neural network optimisation were proposed at the beginning of the 1990s Gorban1990 (); Hassibi1993 (). Recent developments in huge neural network systems bring us back to the problem of efficiency and network simplification. Among several solutions we have to mention SqueezeNet Iandola2016 () and DeepRebirth LiWangKong2017 (). Our system differs in that it combines cutting of layers with the construction of a new linear output ASPCA layer. We have demonstrated that this technology works and allows us to implement the neural network backyard dog on the Pi with a decision time of less than 1 s and a ‘friend-or-foe’ error rate of less than 6%.
References
 (1) Parkhi, O.M., Vedaldi, A., Zisserman, A. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015. http://www.robots.ox.ac.uk/~vgg/publications/2015/Parkhi15/parkhi15.pdf
 (2) Schroff, F., Kalenichenko, D., Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proc. CVPR, 2015.
 (3) Chen, D., Cao, X., Wang, L., Wen, F., Sun, J. Bayesian face revisited: A joint formulation. In Proc. ECCV, pages 566–579, 2012.
 (4) Sun, Y., Wang, X., Tang, X. Deep learning face representation from predicting 10,000 classes. In Proc. CVPR, 2014.
 (5) Taigman, Y., Yang, M., Ranzato, M., Wolf, L. DeepFace: Closing the gap to human-level performance in face verification. In Proc. CVPR, 2014.
 (6) White, D., Dunn, J.D., Schmid, A.C., Kemp, R.I. Error rates in users of automatic face recognition software. PLOS ONE 10(10): e0139827, 2015. https://doi.org/10.1371/journal.pone.0139827
 (7) Population of the Earth, http://www.worldometers.info/worldpopulation/
 (8) Published VGG CNN, http://www.vlfeat.org/matconvnet/models/vggface.mat
 (9) MatConvNet, http://www.vlfeat.org/matconvnet
 (10) VGG in TensorFlow, https://www.cs.toronto.edu/~frossard/post/vgg16/
 (11) Simonyan, K., Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
 (12) Zinovyev, A.Y. Visualisation of multidimensional data. Krasnoyarsk: Krasnoyarsk State Technical University Press, 2000 (in Russian).
 (13) Koren, Y., Carmel, L. Robust linear dimensionality reduction. IEEE Transactions on Visualization and Computer Graphics, 10(4), 459–470, 2004. https://doi.org/10.1109/TVCG.2004.17
 (14) Gorban, A.N., Zinovyev, A.Y. Principal Graphs and Manifolds, Chapter 2 in: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, Emilio Soria Olivas et al. (eds), IGI Global, Hershey, PA, USA, 2009, pp. 28–59.
 (15) Mirkes, E.M., Gorban, A.N., Zinoviev, A. Supervised PCA, 2016. https://github.com/Mirkes/SupervisedPCA.
 (16) Gorban, A.N., Mirkes, E.M., Tyukin, I.Y. Preprocessed database LITSO654 for face recognition testing, https://drive.google.com/drive/folders/10cu4u31I24pKTOTIErjmie8gUZ8biz?usp=sharing.
 (17) Gordienko, P. Construction of efficient neural networks: Algorithms and tests. In Proceedings of 1993 International Joint Conference on Neural Networks (IJCNN ’93-Nagoya), 1993 Oct 25 (Vol. 1, pp. 313–316). IEEE.
 (18) Gorban, A.N. Training Neural Networks, USSR-USA JV “ParaGraph”, 1990.
 (19) Hassibi, B., Stork, D.G., Wolff, G.J. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, 1993 (pp. 293–299). IEEE.
 (20) Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint, 2016 Feb 24, https://arxiv.org/abs/1602.07360.
 (21) Li, D., Wang, X., Kong, D. DeepRebirth: Accelerating deep neural network execution on mobile devices. arXiv preprint, 2017 Aug 16, https://arxiv.org/abs/1708.04728.