Camera Model Identification Using Convolutional Neural Networks
Source camera identification is the process of determining which camera or camera model was used to capture an image. In recent years, research interest in image forensics has grown rapidly. In this work, we describe our Deep Learning approach to identifying the source camera among 10 camera models as part of the Camera Model Identification Challenge hosted by Kaggle.com, where our team finished 2nd out of 582 teams with an accuracy of 98% on unseen data. We used aggressive data augmentations that made the model robust to image transformations. We carried out a number of experiments on datasets collected by the organizers and scraped from the web.
In this work, we describe our solution for the IEEE’s Signal Processing Society - Camera Model Identification Challenge hosted by Kaggle.com. Our team ranked second out of 582 teams in the task of creating an algorithm that identifies the camera model used to capture an image. In general, camera identification algorithms help solve many problems in forensics, such as identifying the owner of illegal or controversial materials (pedo-pornographic shots, terrorist act scenes, images that do not respect privacy laws, etc.), as well as supporting intellectual property claims.
There are two main ways to identify the camera model. The first uses metadata (e.g., EXIF tags), which stores the camera model and the parameters used when the image was captured. This approach is very unreliable, since metadata can be easily manipulated and may be lost when an image is re-saved or re-compressed.
The second approach is based on low-level features that are specific to a camera model and originate from the processing steps carried out within the camera. Every camera maker develops a set of sophisticated, non-linear algorithms that are applied to the raw image before it is saved to the memory card. Examples include demosaicing, noise filtering, and lens distortion correction.
Several camera identification algorithms have been proposed in the literature, each trying to extract features related to different post-processing techniques. Some of them exploit a priori knowledge about the imaging model, noise characteristics, demosaicing strategies, lens distortion, histogram strategies, etc. Others capture statistical image properties and feed them to machine learning classifiers. None of these approaches is scalable, in the sense that adding a new camera model takes a lot of manual effort to define feature extraction procedures that track its traces.
In recent years, solutions based on Deep Learning techniques have become mainstream in computer vision. Problems in satellite [1, 2, 3, 4], medical [5, 6, 7, 8], and other imagery are successfully tackled with these techniques, which routinely beat human performance in many tasks, including classification, segmentation, and detection. There are a few reasons why these techniques are popular. First, they are scalable: extending the classifier to new camera models is a straightforward task that does not require any special forensic domain knowledge or training, making this an engineering problem rather than a scientific one. Second, empirical evidence shows that model accuracy grows as more training data is provided, which lets us take advantage of the tremendous amounts of videos and images published on the internet.
There are many works in which authors applied Deep Learning techniques to forensics. For example, one paper used CNNs to detect double JPEG compression, while another used a CNN to extract features and SVM classifiers on top of the extracted features.
The approach that we propose follows a similar path, but with a few essential modifications:
We trained a network that performs predictions in an end-to-end manner.
We used the deep 161-layer DenseNet architecture, which allows constructing very abstract representations of the low-level features created by the processing algorithms.
We initialized the network with weights pre-trained on ImageNet.
We used aggressive data augmentations that allowed the model to stay robust against Gamma, Resize, and Contrast transformations.
This paper is organized as follows. Section II reviews camera model identification papers, and Section III describes the algorithm used in this work. Section IV summarizes the lessons learned and the experimental results.
II Related work
In the past decades, many methods have been proposed to determine which camera was used to take the image.
Most of the existing approaches that do not take metadata into account can be divided into two main categories: hardware and software source camera identification. The hardware category considers features of the camera hardware, such as the lens [12, 13] and Charge-Coupled Device (CCD) sensors. The software category, in contrast, works with color filter array (CFA) interpolation artifacts [15, 16, 17] and Sensor Pattern Noise (SPN).
According to Van Lanh et al., the current promising approaches are:
The first approach was proposed by Choi et al. in 2006. The method focuses on lens radial distortion and uses the parameters of this distortion as features for the classification algorithm. This optical deviation arises from the use of low-cost, low-quality wide-angle lenses. Manufacturers implement different lens systems to compensate for radial distortion, and these systems affect the distortion pattern. An important limitation of the method is that manual zooming or switching to custom lenses decreases the classification accuracy. In 2007, Van Lanh et al. extended this approach and applied it to mobile phone cameras. This is one of the few methods that relies on an early stage of the imaging pipeline, namely the lens.
The second approach was initially proposed by Lukas et al. in 2006. Silicon wafers used in the production of sensors have defects and are not perfectly homogeneous. As a result, pixels at different positions have different sensitivity to light, which leads to a noise pattern that is unique to each camera. This pixel non-uniformity is considered the main component of the Photo-Response Non-Uniformity (PRNU).
In [21, 22], the authors enhanced the prior algorithm by subtracting the average whitened sensor pattern noise. A limitation of this approach is that images with smooth content are recommended for extracting a reasonably reliable noise-based fingerprint.
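The idea behind sensor-noise fingerprints can be illustrated with a minimal sketch. It assumes a simplified setting: the denoiser is a plain 3x3 mean filter (swapped in here for the more sophisticated denoising filters used in the literature), the images are synthetic flat scenes, and all function names (`residual`, `fingerprint`, `make_camera`, `shoot`) are hypothetical. A camera fingerprint is estimated by averaging noise residuals, and a query image is attributed to the camera whose fingerprint correlates best with its residual.

```python
import random


def mean_filter(img):
    """3x3 mean filter with edge replication; a crude stand-in for a real denoiser."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    total += img[ii][jj]
            row.append(total / 9.0)
        out.append(row)
    return out


def residual(img):
    # Noise residual: image minus its denoised version, flattened to one list.
    smooth = mean_filter(img)
    return [img[i][j] - smooth[i][j]
            for i in range(len(img)) for j in range(len(img[0]))]


def fingerprint(images):
    # Average the residuals of several images taken with the same camera.
    residuals = [residual(img) for img in images]
    return [sum(vals) / len(vals) for vals in zip(*residuals)]


def correlation(a, b):
    # Pearson correlation between two equally long lists.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den


def make_camera(rng, size=16):
    # Hypothetical per-pixel sensitivity pattern of one synthetic camera.
    return [[rng.gauss(0.0, 0.05) for _ in range(size)] for _ in range(size)]


def shoot(pattern, rng):
    # Flat scene of random brightness, modulated by the pattern, plus random noise.
    brightness = rng.uniform(0.3, 0.9)
    return [[brightness * (1.0 + k) + rng.gauss(0.0, 0.005) for k in row]
            for row in pattern]


rng = random.Random(0)
cam_a, cam_b = make_camera(rng), make_camera(rng)
fp_a = fingerprint([shoot(cam_a, rng) for _ in range(10)])
fp_b = fingerprint([shoot(cam_b, rng) for _ in range(10)])
query = residual(shoot(cam_a, rng))  # an unseen image from camera A
print(correlation(query, fp_a) > correlation(query, fp_b))
```

The flat synthetic scenes mirror the recommendation above to use smooth content: with textured images, scene detail leaks into the residual and weakens the correlation.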
In another work, the authors use a feature extraction pipeline that applies Canny and Laplace edge operators, combines the results with the original images, and extracts Homogeneity, Contrast, Entropy, and Correlation features. They used SVM and other classifiers on top of these features to obtain high accuracy on the Dresden Dataset.
In the last few years, deep learning techniques have also been applied to the camera detection task [25, 26, 27]. The Deep Learning approach has the advantage of working with extremely high capacity models that have tens of millions of free parameters. The power of the method is that Neural Networks do not require manual feature extraction, as the model learns the appropriate features directly from the data. This makes the method scalable in two ways. First, the detection algorithm can be easily extended to a large set of cameras, adding new models if needed. Second, the quality of the extracted features grows with the amount of data used for training.
Typically, camera identification algorithms are evaluated on the Dresden Image Dataset. This dataset contains images from 74 cameras of 27 models with different scenes for each device (e.g., office, nature, etc.). However, it lacks augmented images and images from mobile phone cameras.
In the current work, we used two datasets to evaluate our model performance. The first was provided by the organizers of the IEEE’s Signal Processing Society Camera Model Identification Challenge and consisted of 2500 images: ten camera models with 250 pictures each. Lens aberration proved to be a powerful feature in previous work. To prevent the participants from using it, the organizers of the competition cropped the central 500x500 parts of the images in the test set. Furthermore, half of the photos were altered by one of the transformations Resize, Gamma, Contrast, or JPEG Compression, while the other half were left unaltered. For this dataset, the ground truth labels were unknown to the challenge participants, and evaluation was performed via the Leaderboard on the Kaggle.com website.
In the competition, the use of external data was allowed. Hence, we scraped more than 500 GB of images from Flickr, Yandex.Fotki, Wikimedia Commons, and mobile review websites to obtain images for the required ten classes. We then filtered the images based on their EXIF metadata, removing those that had been processed with Photoshop or LightRoom. Next, we excluded images with a JPEG compression quality below 95. Finally, we filtered out images whose sizes did not belong to the default list of image sizes that the corresponding cameras generate. After this filtering, we had 78807 non-manipulated images. We split them into two parts: the train set (Table I) and a validation set of 100 images per class, which was used to evaluate our model performance locally and to perform an ablation study.
Table I: Train and validation set sizes per camera model.
|Camera model|Train|Validation|
|Motorola Droid Maxx|11608|100|
|Samsung Galaxy S4|9351|100|
|LG Nexus 5X|5437|100|
|Motorola Nexus 6|10950|100|
|Samsung Galaxy Note 3|6025|100|
|Sony NEX 7|3075|100|
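The three filtering steps applied to the scraped images can be sketched as a single predicate. This is a minimal illustration, not the authors' pipeline: the record layout (`model`, `software`, `jpeg_quality`, `size`), the function name `keep_image`, and the entries of `ALLOWED_SIZES` are all assumptions made for the example.

```python
# Hypothetical lookup of the default image sizes each camera model produces.
ALLOWED_SIZES = {
    "Sony NEX 7": {(6000, 4000)},  # illustrative entry only
}


def keep_image(record, allowed_sizes, min_quality=95):
    """Return True if a scraped image passes all three filters.

    `record` is a hypothetical dict holding metadata already read from the file.
    """
    # 1. Drop images whose EXIF software tag mentions an editing tool.
    software = (record.get("software") or "").lower()
    if "photoshop" in software or "lightroom" in software:
        return False
    # 2. Drop images saved with a JPEG quality below the threshold.
    if record["jpeg_quality"] < min_quality:
        return False
    # 3. Drop images whose size is not a default size for the claimed model.
    return record["size"] in allowed_sizes.get(record["model"], set())


records = [
    {"model": "Sony NEX 7", "software": None, "jpeg_quality": 97, "size": (6000, 4000)},
    {"model": "Sony NEX 7", "software": "Adobe Photoshop CS6", "jpeg_quality": 99, "size": (6000, 4000)},
    {"model": "Sony NEX 7", "software": None, "jpeg_quality": 90, "size": (6000, 4000)},
    {"model": "Sony NEX 7", "software": None, "jpeg_quality": 97, "size": (3000, 2000)},
]
kept = [r for r in records if keep_image(r, ALLOWED_SIZES)]
print(len(kept))
```

Only the first record survives: the others fail the software, quality, and size checks respectively.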
Neural networks are universal approximators that can learn any function from data, provided that we have an appropriate network architecture, enough training data, and a proper training procedure. Unlike most computer vision challenges, the Camera Identification challenge at Kaggle did not restrict the use of external data. Because of this freedom, we chose not to focus on pre-processing steps but to invest time in selecting the proper network architecture and training procedure.
For the network, we chose DenseNet 161, which consists of repeating convolutional blocks followed by average pooling before the last Dense layer that performs the final classification. The name DenseNet comes from the fact that skip connections are added between each pair of layers within a block, which is believed to smooth the loss surface and to help the optimization procedure avoid getting stuck in local minima. The Global Average Pooling before the final Dense layer is agnostic to the input size in the XY dimensions, which allows using images of different sizes as input.
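The size-agnostic property of global average pooling can be seen in a small dependency-free sketch. The feature maps are toy nested lists rather than real network activations, and the helper names are hypothetical: whatever the spatial size, the pooled output has one value per channel, so the final Dense layer always receives a vector of the same length.

```python
def global_average_pool(feature_map):
    """Average each channel over its spatial grid.

    feature_map: list of channels, each a list of rows (H x W floats).
    Returns one float per channel, independent of H and W.
    """
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def make_feature_map(channels, height, width):
    # Toy activations: every value in channel c equals c + 1.
    return [[[float(c + 1)] * width for _ in range(height)]
            for c in range(channels)]


small = global_average_pool(make_feature_map(channels=3, height=4, width=4))
large = global_average_pool(make_feature_map(channels=3, height=9, width=7))
print(len(small), len(large))  # both equal the channel count
```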
During training, we randomly cropped patches of size 960x960, augmented them with the following transformations, and then took 480x480 crops.
Dihedral Group D4 transformations: Rotations by 90, 180, 270 degrees and flips.
Gamma transformation. We chose the gamma parameter uniformly from the [0.8, 1.2] range.
JPEG compression with the quality parameter from 70 to 90.
For all of the above transformations, we used the implementations from the albumentations library.
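To make the geometry of these augmentations concrete, here is a minimal dependency-free sketch. It deliberately does not use the albumentations API: images are nested lists with values in [0, 1], all function names are hypothetical, and JPEG compression is left out because it requires an image codec.

```python
import random


def rot90(img):
    # Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise.
    return [list(row) for row in zip(*img)][::-1]


def flip_horizontal(img):
    return [row[::-1] for row in img]


def d4_transforms(img):
    """All 8 elements of the dihedral group D4: 4 rotations, each with and without a flip."""
    out = []
    current = [list(row) for row in img]
    for _ in range(4):
        out.append(current)
        out.append(flip_horizontal(current))
        current = rot90(current)
    return out


def gamma_transform(img, gamma):
    # Pixel values are assumed to be floats in [0, 1].
    return [[v ** gamma for v in row] for row in img]


def random_crop(img, size, rng):
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def augment(img, crop_size, rng):
    # Random D4 element, then gamma drawn uniformly from [0.8, 1.2], then a crop.
    img = rng.choice(d4_transforms(img))
    img = gamma_transform(img, rng.uniform(0.8, 1.2))
    return random_crop(img, crop_size, rng)


rng = random.Random(0)
image = [[(r * 8 + c) / 64 for c in range(8)] for r in range(8)]
patch = augment(image, crop_size=4, rng=rng)
print(len(patch), len(patch[0]))
```

The toy 8x8 image and 4x4 crop stand in for the 960x960 patches and 480x480 crops described above.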
After these transformations, the 480x480 crops were collected into batches and used to train the network. As the loss function, we used the cross-entropy loss, which is standard for classification problems.
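For reference, the multi-class cross-entropy can be written in its standard textbook form as below. The notation is an assumption rather than the authors' own: $N$ is the number of crops in a batch, $C = 10$ is the number of camera models, $y_{i,c}$ is 1 if crop $i$ belongs to class $c$ and 0 otherwise, and $p_{i,c}$ is the predicted probability of class $c$ for crop $i$.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log p_{i,c}
```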
We trained the network for 100k epochs using the Adam optimizer with an initial learning rate of 1e-3. The loss curve is shown in Fig. 2.
At test time during the competition, we performed inference on the 480x480 corner and center crops of each image, applying the D4 transformations to each crop, and averaged all of the resulting predictions. Our result, with a score of 0.987976, placed second out of 582 teams. The competition was evaluated on weighted categorization accuracy.
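The test-time procedure described above (five crops, eight D4 views per crop, averaged predictions) can be sketched as follows. This is an illustration under stated assumptions: images are nested lists, `toy_model` is a hypothetical stand-in for the trained network, and the 6x6 image with 4x4 crops replaces the real 480x480 setting.

```python
def rot90(img):
    # Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise.
    return [list(row) for row in zip(*img)][::-1]


def d4_views(img):
    """All 8 dihedral-group views: 4 rotations, each with and without a flip."""
    views, current = [], [list(row) for row in img]
    for _ in range(4):
        views.append(current)
        views.append([row[::-1] for row in current])
        current = rot90(current)
    return views


def five_crops(img, size):
    """Four corner crops plus the center crop of a 2D grid."""
    h, w = len(img), len(img[0])
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    return [[row[left:left + size] for row in img[top:top + size]]
            for top, left in offsets]


def predict_with_tta(model, img, crop_size):
    # 5 crops x 8 views = 40 predictions, averaged class by class.
    views = [v for crop in five_crops(img, crop_size) for v in d4_views(crop)]
    predictions = [model(v) for v in views]
    n_classes = len(predictions[0])
    return [sum(p[k] for p in predictions) / len(predictions)
            for k in range(n_classes)]


def toy_model(view):
    # Hypothetical stand-in for the trained network: returns two class
    # "probabilities" derived from the top-left pixel of the view.
    p = view[0][0]
    return [p, 1.0 - p]


image = [[(r * 6 + c) / 36 for c in range(6)] for r in range(6)]
averaged = predict_with_tta(toy_model, image, crop_size=4)
print(len(averaged))
```

Averaging over all views makes the final prediction invariant to which rotation or flip the test image happens to arrive in, at the cost of 40 forward passes per image.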
IV Ablation study
In addition to participating in the challenge, we evaluated how the accuracy of our model is affected by the training set size, by the JPEG, gamma, and resize transformations, and by the crop size.
First, we evaluated how the model quality depends on the training set size. We used 25k, 50k, and 62.5k images and did not find a statistically significant difference between these sizes. We believe that this counterintuitive result is related to the fact that, for the chosen powerful architecture with the corresponding training schedule, the task was not challenging enough, which led to a robust model with a validation accuracy of 0.98 even at the smallest data point of 25k images. We believe that if we had performed classification not on ten classes but on a much larger number, say 100 or more, the positive correlation between the size of the training data and the model accuracy would have been more evident.
Secondly, we evaluated the effect of the JPEG Compression, Gamma, and Resize augmentations on the validation accuracy, as shown in Fig. 1. As expected, our model consistently shows excellent performance within the augmentation ranges that were used during training. We believe that this result gives additional evidence that Deep Learning models can be made robust to a broad range of transformations if the desired transformations are used as training-time augmentations.
Finally, we estimated the effect of the crop size on the model performance. It is commonly assumed in the literature that the algorithms used to process raw images leave low-level local features that camera detection algorithms can exploit. Under this assumption, we would expect the crop size not to affect model performance over a wide range of crop sizes. However, the curve in Fig. 1 may be interpreted as evidence that not just local but also long-range correlations between pixel values serve as a powerful feature.
In this work, we showed that deep learning techniques trained on large amounts of data scraped from the internet, in combination with a good training schedule, network architecture, and image augmentations, can lead to a model that shows excellent performance in the camera detection task. The proposed model can be straightforwardly applied in practice.
The authors would like to thank the Open Data Science community for many valuable discussions and educational help in the growing field of machine/deep learning.
- V. Iglovikov, S. Mushinskiy, and V. Osin, “Satellite imagery feature detection using deep convolutional neural network: A kaggle competition,” arXiv preprint arXiv:1706.06169, 2017.
- V. Iglovikov, S. Seferbekov, A. Buslaev, and A. Shvets, “Ternausnetv2: Fully convolutional network for instance segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
- A. Buslaev, S. Seferbekov, V. Iglovikov, and A. Shvets, “Fully convolutional network for automatic road extraction from satellite imagery,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
- S. Seferbekov, V. Iglovikov, A. Buslaev, and A. Shvets, “Feature pyramid network for multi-class land segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
- A. Shvets, A. Rakhlin, A. A. Kalinin, and V. Iglovikov, “Automatic instrument segmentation in robot-assisted surgery using deep learning,” arXiv preprint arXiv:1803.01207, 2018.
- A. Rakhlin, A. Shvets, V. Iglovikov, and A. A. Kalinin, “Deep convolutional neural networks for breast cancer histology image analysis,” in International Conference Image Analysis and Recognition. Springer, 2018, pp. 737–744.
- A. Shvets, V. Iglovikov, A. Rakhlin, and A. A. Kalinin, “Angiodysplasia detection and localization using deep convolutional neural networks,” arXiv preprint arXiv:1804.08024, 2018.
- V. Iglovikov, A. Rakhlin, A. Kalinin, and A. Shvets, “Pediatric bone age assessment using deep convolutional neural networks,” arXiv preprint arXiv:1712.05053, 2017.
- M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro, “Aligned and non-aligned double jpeg detection using convolutional neural networks,” Journal of Visual Communication and Image Representation, vol. 49, pp. 153–163, 2017.
- L. Bondi, L. Baroffio, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro, “First steps toward camera model identification with convolutional neural networks,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 259–263, 2017.
- G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks.” in CVPR, vol. 1, no. 2, 2017, p. 3.
- K. S. Choi, “Source camera identification using footprints from lens aberration,” Electronic …, vol. 6069, no. 852, pp. 1–8, 2006.
- A. E. Dirik, H. T. Sencar, and N. Memon, “Digital single lens reflex camera identification from traces of sensor dust,” IEEE Transactions on Information Forensics and Security, vol. 3, no. 3, pp. 539–552, 2008.
- Z. Geradts, J. Bijhold, M. Kieft, K. Kurosawa, K. Kuroki, and N. Saitoh, “Methods for identification of images acquired with digital cameras,” Proceedings of SPIE - The International Society for Optical Engineering, vol. 4232, 2001.
- S. Bayram, H. T. Sencar, N. Memon, and I. Avcibas, “Source camera identification based on CFA interpolation,” in Proceedings - International Conference on Image Processing, ICIP, vol. 3, 2005, pp. 69–72.
- L. Yangjing and H. Yizhen, “Image based source camera identification using demosaicking,” in 2006 IEEE 8th Workshop on Multimedia Signal Processing, MMSP 2006, 2007, pp. 419–424.
- O. Celiktutan, I. Avcibas, B. Sankur, and N. Memon, “Source Cell-Phone Identification,” IEEE Signal Processing and Communications Applications, pp. 1–3, 2006.
- J. Lukas, J. Fridrich, and M. Goljan, “Digital camera identification from sensor pattern noise,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
- T. Van Lanh, K.-S. Chong, S. Emmanuel, and M. S. Kankanhalli, “A Survey on Digital Camera Image Forensic Methods,” in Multimedia and Expo, 2007 IEEE International Conference on, 2007, pp. 16–19. [Online]. Available: http://ieeexplore.ieee.org/document/4284575/
- L. T. Van, S. Emmanuel, and M. S. Kankanhalli, “Identifying Source Cell Phone using Chromatic Aberration,” Multimedia and Expo, 2007 IEEE International Conference on, pp. 883–886, 2007. [Online]. Available: http://ieeexplore.ieee.org/document/4284792/
- X. Kang, Y. Li, Z. Qu, and J. Huang, “Enhancing source camera identification performance with a camera reference phase sensor pattern noise,” in IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, 2012, pp. 393–402.
- C. T. Li and Y. Li, “Color-decoupled photo response non-uniformity for digital image forensics,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 2, pp. 260–271, 2012.
- J. Fridrich, “Digital image forensics,” IEEE Signal Processing Magazine, vol. 26, no. 2, pp. 26–37, 2009.
- N. Kulkarni and V. Mane, “Source camera identification using GLCM,” in Souvenir of the 2015 IEEE International Advance Computing Conference, IACC 2015, 2015, pp. 1242–1246.
- L. Bondi, L. Baroffio, D. Guera, P. Bestagini, E. J. Delp, and S. Tubaro, “First Steps Toward Camera Model Identification with Convolutional Neural Networks,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 259–263, 2017.
- A. Tuama, F. Comby, and M. Chaumont, “Camera Model Identification With The Use of Deep Convolutional Neural Networks,” in IEEE International Workshop on Information Forensics and Security, vol. 6, 2016, pp. 1–6. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01388975
- V. U. S. B, R. Naskar, and N. Musthyala, Digital Forensics and Watermarking, 2017, vol. 10431. [Online]. Available: http://link.springer.com/10.1007/978-3-319-64185-0
- T. Gloe and R. Böhme, “The ‘Dresden Image Database’ for benchmarking digital image forensics,” in Proceedings of the 25th Symposium On Applied Computing (ACM SAC 2010), vol. 2, 2010, pp. 1585–1591.
- A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and A. A. Kalinin, “Albumentations: fast and flexible image augmentations,” arXiv preprint arXiv:1809.06839, 2018.
- [Online]. Available: http://ods.ai/