Source Camera Attribution of Multi-Format Devices


Samet Taspinar, Manoranjan Mohanty, and Nasir Memon

Samet Taspinar (st89@nyu.edu) is with the Center for Cyber Security, New York University Abu Dhabi, UAE; Manoranjan Mohanty (m.mohanty@auckland.ac.nz) is with the Department of Computer Science, University of Auckland; and Nasir Memon (memon@nyu.edu) is with the Department of Computer Science and Engineering, New York University, New York, USA.
Abstract

Photo Response Non-Uniformity (PRNU) based source camera attribution is an effective method to determine the origin camera of visual media (an image or a video). However, given that modern devices, especially smartphones, capture images and videos at different resolutions using the same sensor array, PRNU attribution can become ineffective because the camera fingerprint and the query visual media can be misaligned. We examine different resizing techniques, such as binning, line-skipping, cropping, and scaling, that cameras use to downsize the raw sensor image to different media. Taking such techniques into account, this paper studies the problem of source camera attribution. We define the notion of Ratio of Alignment (RoA), a measure of the shared sensor elements among spatially corresponding pixels within two media objects resized with different techniques. We then compute the RoA between different combinations of three common resizing methods under simplified conditions and experimentally validate our analysis. Based on the insights drawn from the different techniques used by cameras and the RoA analysis, the paper proposes an algorithm for matching the source of a video with an image and vice versa. We also present an efficient search method resulting in significantly improved matching performance as well as computation time.

Keywords: PRNU, camera attribution, media forensics.

I Introduction

The emergence of "fake news", along with sophisticated techniques that use machine learning to create realistic-looking content such as deep-fakes, has led to an increased interest in digital media forensics [1, 2, 3, 4]. One well-studied problem in digital media forensics is to discover the source of an image or a video. Photo-Response Non-Uniformity (PRNU) based source camera attribution [5] is a well-known technique that can determine whether a particular device was used to capture a specific visual object. Here, a PRNU camera fingerprint (or, more precisely, a fingerprint estimate) is first computed from multiple still images (i.e., images or video frames) known to be taken by a specific camera. Then, the PRNU noise extracted from a query visual media is correlated with this fingerprint to determine whether it was taken with the given camera.

To perform PRNU-based source camera attribution, the query visual media has to be precisely aligned with the camera fingerprint. That is, the pixels of the fingerprint and query images should correspond to largely the same elements of the camera sensor array. When misalignment between the fingerprint and query image occurs due to simple geometric transformations such as resizing and cropping, attribution can still be made by exhaustively trying all the possible transformation parameters [6]. However, this can be a very time-consuming process. Efforts at speeding this up have been proposed achieving a speed-up factor of around ten by downsizing the media to be matched [7].

Although simple misalignment can be compensated for by exhaustive search techniques, some recent anonymization methods to prevent source camera attribution create complex misalignment using techniques like seam carving that make exhaustive search intractable  [8, 9]. In the case of seam carving, it was subsequently shown that when multiple seam carved images are available from the same camera, successful verification could still be done by increasing alignment between the camera fingerprint and the seam-carved images [10, 11], provided no additional operation such as scaling and cropping has been performed. In-painting [12], Patch-based desynchronization [13], and image stitching [14] are other examples of complex techniques for breaking alignment.

Complex misalignment between a camera fingerprint and a query object also occurs when they represent different types of media. For example, this happens when the camera fingerprint has been computed from images and the query object is a video captured at a different resolution. Given that modern devices such as smartphones can capture different types of media with different resolutions, and given that social networks often transform visual media objects in different ways, performing source camera attribution with different types of media, potentially from different social platforms, and taken from the same camera is a real and relevant problem. Recently, DARPA's MediFor program [15] issued a challenge for camera identification with a dataset that included same-type media (i.e., image to image or video to video) as well as mixed media (i.e., matching images to videos or vice versa).

In this paper, we study the problem of camera verification in the context of mixed media. The attribution scenarios we examine include matching a video against a single image, and matching a video against a set of images. In each case, one of the visual objects is from a known source, and the other is a query object whose source is in question.

The main contributions of this paper are summarized below.

  • We undertake a comprehensive study of source camera attribution with mixed media, taking various factors into account such as different aspect ratios, different techniques for capturing low-resolution content and different parts of the sensor used for media capture. These factors have not been taken into account by media forensics researchers before and partly explain the state-of-the-art results presented in the paper.

  • We define the notion of alignment and Ratio of Alignment (RoA) between two still images taken from the same camera but using different capture techniques. The notion of RoA provides a simple and intuitive framework to better explain and quantify why PRNU based attribution may or may not work when there is a mismatch between resizing techniques used in practice and in testing.

  • We provide an analytic determination of the ratio of alignment (albeit for a simplified case) between the three most common resizing techniques, namely, binning, line-skipping and bilinear scaling.

  • Based on our analysis and experimental validation, we explore why and how different resizing techniques, namely, bilinear scaling, line-skipping and binning perform for source camera matching with mixed media. We propose an algorithm that represents a good balance between performance and computing when matching a video and an image (and vice versa).

  • Given the importance of performing attribution in the presence of scaling and cropping, we propose an efficient search method that is significantly faster than naive exhaustive search.

  • We compile experimental results that lead to insights on parameters used by numerous camera brands and models with respect to the in-camera operations they use for capturing image and video content. This knowledge can be used to significantly speed up attribution when the camera model of the object is known.

  • We compile a mixed media dataset to be shared with the community, that contains images and videos of multiple resolutions from a variety of different cameras.

The rest of this paper is organized as follows. Section II provides an overview of PRNU-based source camera attribution and describes different approaches used for capturing images and video frames. Section III defines the notion of "Ratio of Alignment" and provides its analytic determination for three common techniques, namely bilinear scaling, binning, and line-skipping. In Section IV, we examine camera attribution for different scenarios of mixed media formats that a forensic analyst may encounter. In Section V we provide experimental results. Section VI concludes the paper.

II Background

In this section, an overview of how a camera captures a video and an image using a single sensor array is first provided. Then, we describe different resizing techniques that modern cameras use when capturing a video. Finally, we briefly summarize PRNU-based camera attribution.

II-A Image capturing pipeline

There is much processing that takes place within a camera after light from a scene is guided through the lens to the sensor array. Although different camera models may apply different processing steps, many of these steps are common to most cameras. Fig. 1 shows a simplified imaging pipeline.

Fig. 1: Imaging pipeline of a digital camera

Most single-chip cameras use a color filter array (CFA) which arranges RGB filters on a square grid. Hence, each sensor element receives only one of the red, green, or blue components of the light passing through the lens. There are many different patterns according to which a CFA can be configured. The most common pattern is known as the Bayer filter [16], and one particular variation of it is shown in Fig. 1, where every 2×2 block of pixels comprises two green pixels, one red pixel, and one blue pixel. The missing color values in each pixel are interpolated from the corresponding color values of neighboring pixels to get a full-color image.
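To make the CFA sampling concrete, the following NumPy sketch simulates an RGGB Bayer mosaic from a full-color image. This is a toy illustration under an assumed RGGB layout; a real sensor samples the scene directly rather than an already-rendered RGB image.

import numpy as np

def bayer_mosaic_rggb(rgb):
    """Simulate RGGB Bayer sampling: keep one color sample per sensor element.

    rgb: float array of shape (H, W, 3); H and W are assumed to be even.
    Returns a single-channel mosaic of shape (H, W).
    """
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # red at even rows, even cols
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # green at even rows, odd cols
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # green at odd rows, even cols
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # blue at odd rows, odd cols
    return mosaic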

After demosaicing, the remaining steps are mainly related to image quality. These include, for example, white balancing, gamma correction, and edge sharpening. Finally, JPEG compression is applied to the image which significantly decreases the disk storage needed with a negligible perceptible loss in quality. It should be noted that none of these steps cause any geometrical transformation of the image.

II-B In-camera resizing

Modern cameras typically contain over 10 million pixels, which help capture intricate scene details in an image. However, as the number of pixels increases, the computational cost of capturing a still image also increases. Thus, most cameras do not use the full sensor resolution when capturing a video and instead downsize the sensor output to a lower resolution by in-camera processing. Moreover, based on user settings, images can also be captured at a lower resolution. To downsize an image or video frame when the sensor resolution is higher than the desired resolution, a combination of cropping, line-skipping, binning, and other scaling methods can be applied to the media. We describe these techniques below.

Fig. 2: Four ways of resizing using line-skipping
Line-skipping

Line-skipping is a technique that omits all the pixels in selected rows and/or columns. After omitting the lines, the output image is still expected to maintain a valid Bayer pattern. This can be done in several different ways. Fig. 2 shows an example where an image is resized by skipping a fraction of the columns; this can be achieved in one of four possible ways while keeping a valid Bayer pattern, as shown in Fig. 2. If we extend this to both axes to downsize the image in both dimensions, the number of possible ways grows accordingly. For example, in Fig. 5(a), pairs of columns and rows are alternately kept and skipped.
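The following sketch illustrates one possible implementation of this scheme. It is a simplified assumption: pairs of rows and columns are alternately kept and skipped, with offsets selecting which pair is kept first.

import numpy as np

def line_skip_half(raw, row_offset=0, col_offset=0):
    """Downsize a Bayer mosaic to roughly half size by line-skipping.

    Pairs of rows/columns are alternately kept and skipped; offsets in
    {0, 1, 2, 3} select which pair is kept first, corresponding to the
    four per-axis variants of Fig. 2. The output is still a valid Bayer
    mosaic (possibly with a shifted CFA phase).
    """
    keep_rows = [r for r in range(raw.shape[0]) if (r - row_offset) % 4 in (0, 1)]
    keep_cols = [c for c in range(raw.shape[1]) if (c - col_offset) % 4 in (0, 1)]
    return raw[np.ix_(keep_rows, keep_cols)]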

(a) An example of line-skipping
(b) Binning scheme (obtained from [17])
Fig. 5: Two different in-camera resizing schemes
Pixel binning

Pixel-binning combines the values of multiple pixels of the same color in the raw image to create a composite pixel. For example, in Fig. 5(b), four red pixels on the left are combined to create each red pixel in the binned image shown on the right. The green and blue pixels are created in the same way. We illustrate one binning configuration here; binning with other factors, which downsizes an image by a correspondingly different ratio, is also possible.

To the best of our knowledge, weighted pixel-binning for resizing images to an arbitrary resolution is not used by cameras, as this works against one of the main goals of binning, i.e., decreasing the computational cost of video capture. If cameras choose to further downsize still images after binning, they use another scaling technique, such as bilinear scaling. Binning can be enabled in a camera only when it is needed [18].
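A minimal sketch of one common 2×2 binning scheme on an RGGB mosaic is shown below, assuming a simple intensity average of four same-color neighbors; actual cameras may bin charges before read-out and may use other pixel groupings.

import numpy as np

def bin2x2_block(plane):
    """Average non-overlapping 2x2 blocks of a single color sub-lattice."""
    h, w = plane.shape
    return plane[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def bayer_bin_half(raw):
    """2x2 pixel binning of an RGGB Bayer mosaic (half-size Bayer output).

    Four same-color raw samples are averaged into one composite sample,
    separately for the R, G (two phases), and B sub-lattices.
    Dimensions of `raw` are assumed to be multiples of 4.
    """
    out = np.zeros((raw.shape[0] // 2, raw.shape[1] // 2), dtype=float)
    out[0::2, 0::2] = bin2x2_block(raw[0::2, 0::2])  # red
    out[0::2, 1::2] = bin2x2_block(raw[0::2, 1::2])  # green (phase 1)
    out[1::2, 0::2] = bin2x2_block(raw[1::2, 0::2])  # green (phase 2)
    out[1::2, 1::2] = bin2x2_block(raw[1::2, 1::2])  # blue
    return out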

Scaling

As mentioned above, users can choose to capture images at a lower resolution, so resizing is not limited to video capture. Along with the techniques above, cameras use other scaling techniques such as bicubic or Lanczos (or their derivatives) to downsize still images (i.e., images or video frames). Some cameras with high computational power may not use binning or line-skipping for videos; they can simply process the entire sensor data and downsize using a scaling technique at the end of the imaging pipeline.

(a) Downsizing by half by cropping from the center.
(b) Downsizing by cropping from the center, followed by resizing to obtain the image on the right.
Fig. 8: The effect of cropping on field of view: the sensor captures the full-resolution image; the saturated regions (borders) are discarded and the centers are the final images.
Cropping and Scaling

Cropping is the simplest technique to decrease image resolution. One approach is to use only the central pixels of a sensor and discard the pixels surrounding the region of interest. However, one of the biggest drawbacks of cropping is that it changes the field of view of a still image if the cropped area is large. Hence, cropping is most often used along with resizing. Fig. 8 shows two examples where the original image is downsized to half resolution. In Fig. 8(a), the image is cropped from the center without any resizing operation, whereas in Fig. 8(b) the image is first cropped from the center and then resized to reach half resolution. As seen in the final outputs (the unsaturated parts of Fig. 8(a) and the rightmost image of Fig. 8(b)), both images are of the same resolution, but a significantly wider region of the scene is captured with the second approach. Note that the resizing step can use any of the three techniques described above.

Fig. 9: Different regions of camera sensor
Active image and active boundary

In some cameras (such as Nexus 6, Lenovo P1 [19]), the size of the sensor array is larger than listed in the camera specification as shown in Fig. 9. Here, the active image, the center of the sensor as per the resolution in the specification, is surrounded by an active boundary of pixels. The active boundary is further surrounded by dark pixels, which are used for monitoring the black color level. The active image region of the sensor is typically used for capturing an image whereas both the active image and the active boundary regions are used when capturing a video [20]. Therefore, a video may contain some of the boundary pixels which were not used while capturing an image. It is crucial to take boundary pixels into account while correlating a video and an image; otherwise, they may fail to match even if they are from the same source camera.

II-C PRNU-based source camera attribution

PRNU-based camera attribution is established on the fact that the output of the camera sensor, $I$, can be modeled as

$I = I^{(0)} + I^{(0)}K + \Theta$ (1)

where $I^{(0)}$ is the noise-free still image, $K$ is the PRNU noise, and $\Theta$ is a combination of additional noise, such as readout noise, dark current, shot noise, content-related noise, and quantization noise. Denoising is typically done in each color component separately, which results in three PRNU noise patterns of the same resolution: $K_R$, $K_G$, and $K_B$ [2]. These three noise components are then combined into a single noise pattern as

$K = w_R K_R + w_G K_G + w_B K_B$ (2)

where $w_R$, $w_G$, and $w_B$ are the standard RGB-to-grayscale conversion weights.
Since denoising filters (such as wavelet denoising [21] or BM3D [22]) are not perfect, they cannot totally eliminate the random noise $\Theta$. Hence, multiple still images are averaged to minimize $\Theta$ and improve the estimate of $K$, which is called the camera fingerprint. A given query image can then be attributed to a camera by matching the PRNU noise extracted from the query image with $K$ using the Pearson correlation coefficient or Peak-to-Correlation Energy (PCE). However, for this to work, the PRNU of the query image has to be aligned with the camera fingerprint. If the image or fingerprint has been resized, the correct resizing parameter must be found and the resizing operation reversed; a brute-force search can be used to find the resizing parameters [6]. When a still image is cropped, Normalized Cross Correlation (NCC) [23] can be used to find the cropping location [6].
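The sketch below outlines the basic fingerprint estimation and PCE matching steps for single-channel (grayscale) images. It is a simplified stand-in: a Gaussian filter replaces the wavelet denoiser used in the PRNU literature, and refinements such as zero-meaning and Wiener filtering are omitted.

import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(img, sigma=2.0):
    """Noise residual W = I - denoise(I); Gaussian blur stands in for a wavelet denoiser."""
    return img - gaussian_filter(img, sigma)

def estimate_fingerprint(images):
    """Fingerprint estimate K ~ sum(W_i * I_i) / sum(I_i^2) over many still images."""
    num = np.zeros_like(images[0], dtype=float)
    den = np.zeros_like(images[0], dtype=float)
    for img in images:
        img = img.astype(float)
        num += noise_residual(img) * img
        den += img ** 2
    return num / (den + 1e-8)

def pce(fingerprint, query):
    """Peak-to-correlation energy between a fingerprint and a query image's residual."""
    a = fingerprint - fingerprint.mean()
    b = noise_residual(query.astype(float))
    b -= b.mean()
    cc = np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))  # circular cross-correlation
    peak = cc.max()
    py, px = np.unravel_index(cc.argmax(), cc.shape)
    mask = np.ones_like(cc, dtype=bool)
    mask[max(py - 5, 0):py + 6, max(px - 5, 0):px + 6] = False  # exclude peak neighborhood
    return peak ** 2 / np.mean(cc[mask] ** 2)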

Although PRNU-based source camera attribution has been well studied, not enough attention has been given to attribution in the presence of mixed media datasets that contain both videos and images. So far, most of the work has focused on either images or videos (but not both). For images, much work has been done to improve PRNU-based attribution [24, 25, 26, 27, 28], as well as to use the scheme for purposes other than attribution [29, 30, 31]. Many researchers have also extended image-centric methods to video [32, 33, 34, 35, 36, 37, 38, 19]. Taspinar et al. [19] and Iuliani et al. [39] addressed attribution of multi-format devices with limited success using a brute-force search.

III Matching in the Presence of Desynchronization

As we have seen in the previous section, different methods can be employed for capturing different types of media at different resolutions. A lower resolution media/fingerprint could be matched with a higher resolution media/fingerprint if the exact techniques and parameters used to create the lower resolution media were known. But this is generally not feasible, as device manufacturers do not reveal such information. Besides, in many situations, such as media gathered from social networking sites, there may be no information about the camera model available at all.

However, as is known in the media forensics research community, attribution may still be possible. For example, as established earlier with seam carving [10, 11], even if an image has been downsized in a complex manner and is no longer synchronized with a camera fingerprint, attribution may still be possible with a suitably scaled-down fingerprint if there are common pixels from the original versions that have been used to compute each pixel in the downsized version. The same idea applies when matching different media taken from the same sensor. As an example, consider Fig. 5. The set of input pixels used for computing a particular output pixel with binning and the set used with line-skipping can have pixels in common. However, the contribution (weight) of a common pixel can differ between the two techniques, depending on the resizing parameters. Note that this is a simplified example where the demosaicing step is neglected for both images.

The presence of at least one common pixel in the input sets (such as in the above example) may lead to successful PRNU-based correlation between two media objects[10]. The more the pixels in common, the higher the correlation tends to be. To compare the degree and number of common pixels between two media objects taken from the same sensor but using different resizing strategies, we define the notion of Ratio of Alignment (RoA) in the subsections below. We then derive RoA for a few cases of different pairs of misaligned media arising from some common resizing approaches. Although the cases are simplified, the results provide insights that help formulate and understand better techniques for PRNU attribution. First we provide some notation below.

III-A Notation for Ratio of Alignment (RoA) derivations

  • $I$: a matrix representing the raw sensor output before demosaicing; $I$ has a resolution of $M \times N$.

  • For brevity, and where the context is clear, we use $I(i,j)$ to denote the value of $I$ at row $i$ and column $j$. Within each 2×2 block of the Bayer pattern there is one red sample, two green samples, and one blue sample, at fixed locations determined by the CFA configuration.

  • A superscript distinguishes the image obtained after $I$ is resized by bilinear interpolation, binning, or line-skipping; a subscript denotes the color component (one of red, green, or blue), and $(i,j)$ the pixel position within that component.

  • The RoA between two images consists of three separate components, one each for the red, green, and blue color planes.

  • A summation shorthand is used in the derivations to denote the sum of pixels from a starting index to an ending index with a fixed increment (the floor operation determines the number of terms). When the increment is omitted, it is taken to be one by default.

  • A two-dimensional version of the same shorthand denotes the sum of all pixels over a range of rows and a range of columns, with separate increments along the two axes.

  • The analysis is done for four cases based on whether the pixel position indices $i$ and $j$ are odd or even:

    • case 1: $i$ and $j$ are both odd

    • case 2: $i$ is odd, $j$ is even

    • case 3: $i$ is even, $j$ is odd

    • case 4: $i$ and $j$ are both even

III-B Ratio of Alignment (RoA) definition and example

Before defining Ratio of Alignment (RoA), we first define what we mean by alignment. Suppose the raw image $I$ is resized to half size using two different resizing techniques (e.g., one copy is resized by binning and the other with bilinear scaling); denote the first resized image as $A$ and the second as $B$. To evaluate the RoA between $A$ and $B$, we first determine the alignment of each pixel. If there is at least one common raw-sensor pixel in the computation of a pixel of $A$ and the spatially corresponding pixel of $B$, then the two pixels are partially aligned. If the two are computed from identical sets of raw pixels, they are fully aligned. If all the pixels in $A$ are fully aligned with their corresponding pixels in $B$, then $A$ and $B$ are said to be fully aligned (i.e., both were downsized with the same resizing technique). For brevity, we say that two images are aligned when they are either fully or partially aligned, with the context making the specific case clear.

Suppose pixel $(i,j)$ of color component $c$ of $A$, denoted $A_c(i,j)$, is computed from a set $P_A$ of raw-sensor pixels with weights $w^A_p$, $p \in P_A$ (here each $p$ is a pixel index in vector form), whereas the spatially corresponding pixel $B_c(i,j)$ is computed from the set $P_B$ with weights $w^B_p$, $p \in P_B$. Let $P = P_A \cap P_B$ denote the set of raw pixels common to both computations. Then the alignment between $A_c(i,j)$ and $B_c(i,j)$ is defined as

$a_c(i,j) = \sum_{p \in P} \min(w^A_p, w^B_p)$ (3)

To obtain the alignment between $A$ and $B$ over all pixels in the color plane $c$, we average $a_c(i,j)$ over all pixel positions:

$\bar{a}_c = \frac{1}{|c|} \sum_{(i,j)} a_c(i,j)$ (4)

where $|c|$ is the number of pixels in the color plane. Then, consistent with the conversion done for PRNU from individual color components to a single combined value in (2), we compute the RoA as the weighted average of the alignments of the color components:

$\mathrm{RoA} = w_R\,\bar{a}_R + w_G\,\bar{a}_G + w_B\,\bar{a}_B$ (5)

with the same weights $w_R$, $w_G$, and $w_B$ as in (2).
(a) Computation of a pixel for binning: the left image is the raw sensor output, the middle image is obtained after binning, and the right one is the red component of the final image after demosaicing. Eight raw pixels are used to compute the illustrated output pixel for binning.
(b) Computation of the same pixel for bilinear scaling: the left image is the sensor output, the middle image is obtained after in-camera demosaicing, and the right one is obtained after resizing via bilinear scaling. Four demosaiced pixels are used to compute the output pixel using bilinear scaling.
Fig. 12: Alignment of the same pixel index for binning vs. bilinear scaling

Example: To illustrate, let us take binning (with demosaicing) and bilinear scaling (with demosaicing) as two example resizing operations and find the RoA between the binned media and the bilinearly scaled media shown in Fig. 12. Assume that the binned media is a video frame, that the bilinearly scaled media is an image, and that both are captured using the same raw camera sensor shown in Fig. 12. During the in-camera video capturing process, the sensor output is first resized to half size using binning, and the binned output is then bilinearly interpolated during demosaicing. The demosaiced output is produced as a video frame. The red color channel of the illustrated output pixel is computed from eight input pixels of the raw sensor output (shown by the arrows in Fig. 12(a)), each input pixel contributing equally. Thus, it is the average of these eight pixels, each with weight 1/8.

In the second scenario, suppose there is an image obtained from the same camera sensor. The image was captured with no in-camera resizing, and demosaicing was applied to the sensor output. To match the image (or its PRNU noise) with the video (or its fingerprint), we resize the image (as an out-camera operation) to half resolution using bilinear scaling. To calculate the same output pixel as before, we first determine which pixels of the demosaiced image are used to compute it. As shown in Fig. 12(b), four pixels in the demosaiced image contribute to it, and each of these four pixels is itself a weighted combination of raw-sensor pixels determined by the demosaicing kernel. Therefore, by averaging these four demosaiced pixels, we can express the output pixel as a weighted combination of the underlying raw-sensor pixels and obtain the weight of each input pixel.

Using (3), the alignment of the red component at this pixel index between the binned and the bilinearly scaled output can then be found by summing the minimum weights of the raw-sensor pixels common to the two computations. A similar analysis can be done for the other pixel positions in all three color planes, giving a more precise characterization of the alignment between the two scaled images.
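The alignment computation of (3) can be expressed compactly as in the sketch below. The weight maps used here are illustrative placeholders rather than the exact pixel indices and weights of Fig. 12: the binned pixel uses eight raw pixels with weight 1/8 each, as stated above, while the bilinearly scaled pixel's weights are hypothetical.

def alignment(weights_a, weights_b):
    """Alignment of one output pixel, per Eq. (3): the sum of the minimum
    weights over the raw-sensor pixels shared by the two computations."""
    common = set(weights_a) & set(weights_b)
    return sum(min(weights_a[p], weights_b[p]) for p in common)

# Binned pixel (Fig. 12(a)): eight raw pixels, each with weight 1/8.
# The coordinates below are placeholders, not the exact indices.
w_bin = {(r, c): 1.0 / 8 for (r, c) in
         [(0, 0), (0, 2), (2, 0), (2, 2), (0, 4), (2, 4), (4, 0), (4, 2)]}
# Bilinearly scaled pixel (Fig. 12(b)): hypothetical weights over raw pixels;
# the true values follow from the demosaicing and scaling kernels.
w_bs = {(0, 0): 0.25, (0, 2): 0.25, (2, 0): 0.25, (2, 2): 0.25}

print(alignment(w_bin, w_bs))  # 4 shared pixels x min(1/8, 1/4) = 0.5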

Having established notation and the definition of RoA with the aid of examples, in the subsequent sections we derive the RoA of an image bilinearly scaled by half with images scaled to the same size using binning and line-skipping, respectively. For brevity of analysis, we assume in each case that scaling is done from a full resolution image representing the actual resolution of the underlying sensor, and that scaling is by exactly half of each dimension of the full resolution sensor. The analysis presented is valid for any Bayer pattern (i.e., RGGB, GRBG, GBRG, or BGGR), as each of them results in the same ratio of alignment. Also for brevity, the analysis is provided only for the red color; the analysis for the green and blue colors is similar.

Besides, many cameras perform binning before linearization; that is, they average the electrical charges in pixels rather than intensity values, which helps decrease readout and shot noise. However, these noise sources have limited effect on this analysis; hence we simply consider binning to be the average of the intensity values of the pixels being accumulated.

Finally, it is crucial to note that one of the main differences between binning and line-skipping (i.e., in-camera resizing) and bilinear scaling (i.e., out-camera resizing) is that binning and line-skipping are done before demosaicing (color filter array interpolation), whereas bilinear scaling is done after demosaicing. Both binning and bilinear scaling use bilinear interpolation, but the way they use it differs, as shown in Fig. 12. For binning, interpolation is applied to the Bayer filter output, where a composite pixel is produced from four same-color neighboring pixels. For bilinear scaling, interpolation is applied to the final RGB image from the sensor, downsizing each color channel separately.

III-C Sensor-pixel correspondence for bilinear scaling

We consider the case where the higher resolution media is downsized using bilinear scaling after the media has been captured (i.e., out-camera resizing). This is done when an image fingerprint is being matched to a video fingerprint or vice versa. We assume bilinear scaling to half size is implemented with a simple 2×2 averaging kernel. In the demosaicing step, bilinear interpolation is performed by convolution with the standard kernels $\frac{1}{4}\begin{bmatrix}1 & 2 & 1\\ 2 & 4 & 2\\ 1 & 2 & 1\end{bmatrix}$ for the red and blue components and $\frac{1}{4}\begin{bmatrix}0 & 1 & 0\\ 1 & 4 & 1\\ 0 & 1 & 0\end{bmatrix}$ for the green channel. Although these are basic filters and more complicated kernels may be used in real cameras, we use them for simplicity of analysis.
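A sketch of this simplified pipeline, using the bilinear demosaicing kernels above followed by a 2×2 averaging downscale, is given below. This is an approximation for analysis, not any camera's actual implementation.

import numpy as np
from scipy.ndimage import convolve

# Standard bilinear demosaicing kernels (an assumption consistent with the
# simplified analysis here; real cameras may use more elaborate kernels).
K_RB = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 4.0
K_G = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=float) / 4.0

def demosaic_bilinear(mosaic):
    """Bilinear demosaicing of an RGGB mosaic into an (H, W, 3) RGB image."""
    h, w = mosaic.shape
    rgb = np.zeros((h, w, 3))
    masks = {0: np.zeros((h, w)), 1: np.zeros((h, w)), 2: np.zeros((h, w))}
    masks[0][0::2, 0::2] = 1                              # red sample locations
    masks[1][0::2, 1::2] = 1; masks[1][1::2, 0::2] = 1    # green sample locations
    masks[2][1::2, 1::2] = 1                              # blue sample locations
    for c, kern in ((0, K_RB), (1, K_G), (2, K_RB)):
        rgb[:, :, c] = convolve(mosaic * masks[c], kern, mode='mirror')
    return rgb

def downscale_half(rgb):
    """Out-camera bilinear scaling by 1/2, approximated by 2x2 averaging."""
    h, w, _ = rgb.shape
    return rgb[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, 3).mean(axis=(1, 3))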

As discussed above, bilinear scaling is done on the demosaiced sensor output. When the raw image $I$ is demosaiced without any down-sampling operation, the red color component of the output image becomes

(6)

Now, suppose bilinear scaling is applied to the demosaiced sensor output to resize it to half resolution. Then the red color component of pixel $(i,j)$ of the scaled output image can be found as

(7)

By combining (6) and (7), we obtain the red component of the scaled output as

(8)

Using the same approach, the green component can be obtained (details can be found in Appendix A-1):

(9)

Equations (8) and (9) thus give the sensor-pixel correspondence for the red and green color planes when the raw image $I$ is resized by half via bilinear scaling. In the next subsections, we obtain the sensor-pixel correspondences for images resized to half size by binning and by line-skipping, and then find their RoA with the bilinearly scaled image.

III-D Sensor-pixel correspondences for binning

Similarly to Section III-C, we can obtain the red and green components of a still image resized with the binning approach described in Section II. Taking the differences between binning and bilinear scaling into account, the pixel values of the red component of an image resized with binning are:

(10)

and green component as:

(11)

details of which can be found in Appendix A-2.

III-E Sensor-pixel correspondences for line-skipping

Line-skipping can be implemented in numerous ways, and different implementations may use completely different sets of pixels. For our analysis, we consider that line-skipping is implemented by alternately keeping and skipping pairs of rows and columns of the sensor output (as shown in Fig. 5(a)); in other words, two consecutive rows (columns) are kept and the following two are skipped. Using the same approach as above, we can obtain the pixel values of the red component of a still image resized by line-skipping as

(12)

Similarly, the green component can be found as

(13)

(For more details, see Appendix A-3)

III-F RoAs of different combinations

In the preceding sections, we showed how the pixel-sensor correspondence can be computed for different resizing schemes. In this section, we compute the RoA between the different resizing schemes. As stated above, each pixel in the resized image falls into one of four cases based on its index (i.e., whether the row and column indices are odd or even). We assume the four cases occur equally often (the difference in occurrences is negligible for high-resolution images). Therefore, averaging the alignment over these four cases yields the RoA of the whole image. Also, as clarified before, the RoA computations of the red and blue color planes are identical, while the computation for the green color plane differs.

Consider the RoA of the red color component (over all pixels) between a binned image and a bilinearly scaled image. It can be found by averaging the alignments obtained for the cases listed in Section III-A (i.e., case 1, …, case 4).

We can obtain the alignment for the red color of the two output images from (8) and (10). For example, the red component of a case-1 pixel of the binned image is given by (10), whereas the corresponding pixel of the bilinearly scaled image is given by (8). Using (3), their alignment for case 1 is the sum of the minimum weights of the raw-sensor pixels common to both expressions.

Averaging the alignment of the red component over the four cases using (3), we obtain

(14)

Similarly, we can obtain the alignment for the green channel using (18) and (21):

(15)

Note that the alignment of the blue color plane is the same as that of the red plane. Thus, using (5), the RoA of the whole image can be computed as the weighted average of the per-color alignments; putting in the alignment values for each color component, we get an RoA of approximately 0.46 (see Table I).

The RoA of the other combinations (e.g., bilinear scaling vs. line-skipping, or binning vs. line-skipping) can be found using the same approach. Table I shows the possible combinations of these resizing approaches. If two images are resized with the same resizing technique, their RoA is 1.00. But when they are resized using different techniques, the RoA decreases according to the extent of the common sensor elements contributing to each pair of spatially corresponding pixels in the resized images.

Train\Test Bscale Bin Lskip
Bscale 1.00 0.46 0.17
Bin 0.46 1.00 0.21
Lskip 0.17 0.21 1.00
TABLE I: RoA for media resized differently

From the RoA calculation, we can infer that when a video is resized via binning and the image FE via bilinear scaling with the same factor, the video FE and image FE can still match, since there is significant alignment between the two. Line-skipping, however, results in a lower RoA in the case of a mismatch. This could be because line-skipping entirely discards pixels, whereas binning and bilinear scaling compute composite pixels, parts of which may still align between the resized image FE and video FE. For some edge cases (e.g., when the correlation value is slightly below the decision threshold), resizing with another (the correct) scheme might still yield a match decision. Hence, using a single resizing technique may not be the best option. This insight is used to develop a matching algorithm in the next section.

III-G Experimental validation

To evaluate the impact of RoA on correlation, we simulated an experiment that calculates the Pearson correlation coefficient and the True Positive Rate (TPR) when different resizing techniques are applied to the training and test images. We used the RAISE dataset of RAW images provided by Dang-Nguyen et al. [40], from which we obtained a set of images split into training and test subsets. We resized each image to half size via (i) bilinear scaling, (ii) binning, and (iii) line-skipping, implemented as described in Sections III-C, III-D, and III-E. After resizing the images with each of these methods, we obtain three copies of each image. From the training images we extract three camera fingerprints, and from the test images three sets of PRNU noise patterns. We then correlate each fingerprint with each set of PRNU noise patterns, resulting in nine different correlation settings.

Table II shows the average correlation for the different resizing cases. In this table, rows and columns indicate how the training and test images are resized, respectively.

Train\Test Bscale Bin Lskip
Bscale 0.0308 0.0060 0.0033
Bin 0.0064 0.0186 0.0127
Lskip 0.0038 0.0130 0.0344
TABLE II: Average correlation for media resized differently

Similarly, Table III shows the TPR of these combinations (TPR is obtained using PCE with a fixed threshold). As can be seen, RoA is aligned with both the correlation and the TPR. When the RoA is 1.00 (i.e., both training and test images are resized with the same technique), the correlation is high, which results in TPRs of 0.90 or above in all cases. When the RoA decreases, the correlation also decreases, which leads to a lower TPR. Interestingly, when either the training or the test images are resized with binning and the other with line-skipping, the TPR is better than for bilinear scaling vs. line-skipping.

Train\Test Bscale Bin Lskip
Bscale 0.95 0.66 0.50
Bin 0.69 0.90 0.82
Lskip 0.55 0.84 0.98
TABLE III: TPR for media resized differently

Note that although the correlation depends significantly on the RoA, the contribution of individual pixels to the PRNU noise is another factor. This contribution depends on image content and quality, and hence is difficult to model. Further, the analysis is for an idealized case where the down-sampled video is half the original sensor resolution. The analysis is still useful, as it gives insight into the relative performance achieved for attribution in the presence of mixed media when different in-camera capture techniques are used. Also, although the analysis was done for bilinear scaling, the same calculation can be carried out for bicubic or Lanczos scaling; the RoA between bilinear and bicubic scaling (or other scaling methods) is typically very high, as they perform similar processing.

Finally, it should also be noted that there are many other operations performed within the camera, such as JPEG compression, denoising, or gamma correction, each of which can also play a role in PRNU attribution performance. However, the focus of this work is PRNU attribution in the presence of misalignment. Since binning, line-skipping, and demosaicing are the only in-camera operations that could potentially cause misalignment between sensor elements and pixels, and we use bilinear scaling, only these operations were considered in the mathematical analysis presented. The remaining in-camera processing steps were not included as they do not directly contribute to misalignment.

IV Camera Attribution with Mixed Media

Now that we have a better understanding of the different ways resizing may be done within a camera and the RoA between them, we present a generic algorithm for source camera identification between images and videos.

The solution for source camera attribution is independent of whether it is the images or the videos that are from the known source camera. Hence, in the rest of this section, departing from convention and to avoid confusion among the different scenarios, we refer to all noise patterns and fingerprints obtained from images and videos as image FE (fingerprint estimate) and video FE, respectively. We do this even when the estimate is obtained from just a single image or from a very short video.

To focus better on our contribution, we assume that neither the video FE nor the image FE is obtained from media that are zoomed, stabilized, or obtained using a non-linear operation (such as HDR). Moreover, we assume they were not subjected to out-camera cropping and/or resizing operations. There has been considerable research lately on performing attribution in the presence of such operations. For example, recent research has led to techniques to obtain a camera fingerprint from stabilized video [19, 39, 41]. Similarly, Goljan [42] has proposed a way to obtain camera fingerprints from HDR images. Also, [43] and [44] use JPEG compression artifacts or periodic interpolation artifacts to find the zooming factor of images. So, given that these operations are somewhat "accurately" reversed in visual media, they can be applied to obtain the fingerprint estimates before the proposed algorithm is used. Of course, attribution performance may drop in some cases, and parameters of the proposed algorithm (such as the search range) will have to be re-adjusted in others.

Note that our assumptions do not restrict the applicability of the techniques presented. As will be shown in the next section, the proposed algorithm gave good performance even when the test set included data that were subjected to zooming, cropping, and stabilization.

Fig. 13: Correlation of an image FE with a video FE (flowchart of the proposed algorithm: crop boundary pixels from the video FE, choose a resizing technique, determine the search range, resize the image FE, correlate it with the video FE using NCC and a PCE test, and iterate over resizing factors and techniques until a match is found or the search is exhausted).

Fig. 13 gives a summary of the proposed algorithm for source camera attribution, which leverages the knowledge and insights obtained from the different in-camera resizing operations presented in Section II and the RoA results in Section III.

Step 1:

The fingerprint computed from the video is cropped by removing a small number of rows from the top and bottom borders and columns from the left and right borders. This step is required to overcome any misalignment that can arise from the use of boundary pixels for capturing a video that may not have been used while capturing an image, as described in Section II. Specifically, the boundary pixels can result in the computation of a different search range in Step 3, as explained later. The number of rows and columns to crop varies with camera model; they can, however, be set to a maximum value to make sure that the boundary pixels of all possible camera models are cropped out.

Step 2:

Now we select a candidate resizing technique to use. As described in Section II, binning, line-skipping, and bilinear scaling may use different sets of pixels in the raw image to obtain the same resolution video frame. As analyzed in Section III, the use of different resizing techniques causes a decrease in RoA, which may result in a failure to match in some cases even if the two media are resized with the same factor. Therefore, in contrast to the belief that scaling alone would be sufficient to match a video FE with an image FE, different resizing techniques may have to be tried. This could be very time consuming, as the techniques themselves have different parameters: scaling could use different kernels, binning can be done in different ways depending on the Bayer pattern being used, and line-skipping has numerous implementation possibilities. Based on the results in Section III (both analytical and experimental), bilinear scaling and binning provide better RoAs than line-skipping. Hence, in the interest of efficiency, we propose the following resizing technique selection strategy: when we have good estimates of both the camera and the video fingerprint, just bilinear scaling is employed; otherwise, first try bilinear scaling, and if there is no match, try four different combinations of binning.

Step 3:

In this step we determine the search range of possible resizing factors that need to be tried to perform the match. To accurately determine the search range, we have to take multiple factors into account, including the video and image resolutions, the possible boundary pixel issue (handled by the crop in Step 1), the in-camera resizing techniques, and any difference in aspect ratios. In the next subsection, we describe how to compute the correct search range.

Step 4:

Although the search range specifies the different resizing factors to try, not all are equally likely. So instead of just starting from the lowest and working one's way to the highest (as is current practice), based on the knowledge gained from Section II and our experiments, we propose an ordering that tries the more likely resizing factors first. Details are provided in the next subsection.

Step 5:

The scaled image FE is correlated with the cropped video FE using NCC. If the PCE at the NCC peak is above a threshold, it is concluded that the image(s) and video were taken by the same camera and the algorithm halts. Otherwise, the next resizing factor is tried. When all the resizing factors in the search range have been tried and no match is found, we go back to Step 2 to try a different resizing technique and repeat Steps 4 and 5. If no more resizing techniques are available, the media objects are considered to have been taken by different cameras.
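The overall procedure of Fig. 13 can be summarized schematically as follows. The resizing, cropping, and NCC/PCE helpers are placeholders for the operations described above, and the PCE threshold of 60 is a commonly used value in the PRNU literature rather than a prescribed one.

def match_image_to_video(image_fe, video_fe, techniques, factor_ranges,
                         resize_fe, ncc_peak_pce, crop_border, pce_threshold=60.0):
    """Schematic version of the matching procedure in Fig. 13 (sketch).

    resize_fe(fe, factor, technique) and ncc_peak_pce(a, b) stand for the
    resizing and NCC/PCE routines described in the text; crop_border removes
    possible boundary pixels from the video FE.
    """
    video_fe = crop_border(video_fe)                 # Step 1: drop boundary pixels
    for technique in techniques:                     # Step 2: e.g. bilinear, then binning
        for factor in factor_ranges[technique]:      # Steps 3-4: ordered search range
            candidate = resize_fe(image_fe, factor, technique)
            if ncc_peak_pce(video_fe, candidate) > pce_threshold:   # Step 5
                return True, technique, factor       # same-camera decision
    return False, None, None                         # no technique/factor matched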

IV-A Smart search

Since the algorithm above needs to explore different resizing techniques and resizing factors, it is important to find ways to speed up the search. In this subsection, we describe some simple heuristics that provide up to a five-fold speedup over a naive exhaustive search. The first narrows down the range of resizing factors that are searched, and the second tries them in decreasing order of likelihood, i.e., the most likely factors within the range are tried first.

Determining the search range

Suppose the resolution of the video is $V_r \times V_c$ and that of the image is $I_r \times I_c$. Suppose also that the image is resized within the camera by a factor $f_i$ and the video by a factor $f_v$. Our goal is to find $f_i$ and $f_v$ such that we can match the video FE to the image FE.

Let us first assume that we know $f_i$. If this is the case, we can determine the in-camera resizing factor of the video, $f_v$, by considering the following two cases:

  (i) The aspect ratio of the image is the same as that of the active image region of the sensor.

  (ii) The aspect ratio of the image is different from that of the active image region.

Case (i): For this case, the search range of factors applied to the image FE runs from the ratio of the video dimensions to the image dimensions up to $1/f_i$. Notice that this search range is similar to the one proposed by Goljan et al. [6], with the addition of $f_i$ in the equation. To clarify with an example, suppose the image is captured using the full active image region of the sensor and a Full HD video (1920×1080) is captured from the same region without using the active boundary. The resolution (and aspect ratio) of the image is the same as the active image region, hence $f_i = 1$ and the upper bound of the range is 1. We then iteratively resize the image FE with the resizing factors in this range and correlate with the video FE. Now suppose instead that the image was itself resized in-camera with some $f_i < 1$, so its resolution is correspondingly lower. In this case, the upper bound of the range grows to $1/f_i$. Without accounting for $f_i$, only the range with an upper bound of 1 would be explored, and this range is likely to fail.

Case (ii): This case is not as intuitive as case (i). Here, when the aspect ratio of the image is different from that of the sensor, computing the lower bound as in case (i) may be inaccurate. To understand this better, consider the following example. Suppose one image is captured from the full active image region with no resizing ($f_i = 1$), and the video is also captured from the active image region and then downsized; using the formula in case (i), the search range includes the correct resizing factor of the video. Now suppose another image is taken by the same camera using only a part of the active image region, with no scaling and only cropping from the sides ($f_i = 1$ but a narrower aspect ratio). If we use the formula in case (i), the resulting range does not contain the correct factor, so the search fails. To fix this, we need to lower the lower bound of the formula by taking the smaller of the two per-dimension ratios of video to image size; the resulting wider range then contains the correct factor and the search ends in a correct match. It is crucial to see that this is one of the main differences separating the proposed approach from [6] in terms of finding the search range.

Note that the search range calculation is done disregarding the boundary pixel issue. Recall that in Step 1 we crop a fixed number of rows from the top and bottom of the video FE and an appropriate number of columns from the left and right that maintains its aspect ratio. This way, we handle the boundary pixel issue without changing the search range.

We have so far examined the case where $f_i$ is known. This, for example, is true when the image FE is obtained from images captured at full resolution, giving $f_i = 1$, as is often the case. By full resolution, we mean the resolution of the active image region of the sensor, as shown in Fig. 9. However, when the image FE is obtained from images captured at a lower resolution due to in-camera resizing (as in Fig. 8(b)), we can find $f_i$ by comparing the low-resolution and full-resolution image dimensions for a camera of the same model. If a full resolution image from the same camera model is not available, choosing the minimum possible resizing factor (i.e., the number of rows and columns of the low-resolution image divided by the rows and columns of the full-resolution image, respectively) will work.
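One plausible way to compute this search range is sketched below; since the exact formula involves several camera-specific factors and is not reproduced here, the bounds and the step size are assumptions for illustration only.

def fe_resize_search_range(image_dims, video_dims, image_factor=1.0, step=0.005):
    """One plausible formulation of the range of factors by which the image FE
    is resized before correlating with the video FE (assumption, not the
    paper's exact formula).

    image_dims / video_dims are (rows, cols); image_factor is the in-camera
    resizing factor of the image. The lower bound takes the smaller of the two
    per-dimension ratios so that a cropped image with a different aspect ratio
    does not exclude the correct factor (case (ii)); the upper bound grows to
    1/image_factor when the image itself was downsized in-camera (case (i)).
    """
    low = min(video_dims[0] / image_dims[0], video_dims[1] / image_dims[1])
    high = 1.0 / image_factor
    n = max(int((high - low) / step) + 1, 1)
    return [round(low + i * step, 4) for i in range(n)]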

Finally, it should be noted that the search can also be sped up using the approach proposed in [6], where it was shown that even if a query image is resized by one factor and the camera fingerprint by a slightly different factor, there can still be effective correlation between them. This is because the RoA between the query image FE and the fingerprint is non-zero (as the resizing operation involves interpolation and demosaicing). In [6], the authors proposed searching over a discrete set of resizing factors chosen such that consecutive factors are close enough for the correlation to survive the residual misalignment. They then iteratively resize the camera fingerprint with these factors and correlate it with the query image FE.

Cropping ratio

Although finding the correct search range is important, choosing a wide range that includes improbable resizing factors increases time complexity. Therefore, we show a way to limit the upper bound of the search range. As described in Section II-B, excessive cropping during video capture is unlikely due to its severe impact on the field of view. This provides a way to decrease the upper bound of the search range, as explained below.

We define the cropping ratio to be the minimum, over the two dimensions, of the ratio between the number of rows (columns) in the resized video frame before cropping and the number of rows (columns) in the final video frame after cropping. For example, suppose a camera captures an HD video (1280×720). The camera might resize the raw sensor output by any factor between 1 (no resizing, only cropping) and the ratio of the video dimensions to the sensor dimensions (no cropping, only resizing). After resizing, the video frame is obtained by cropping the center of the scaled output if it is still larger than the target resolution. The resizing factor must therefore lie between these two bounds, and the cropping ratio between 1 and its corresponding maximum. If, for instance, the in-camera resizing factor is larger than this minimum, the resized output is larger than 1280×720 and is cropped from the center to obtain the video frame, yielding a cropping ratio greater than one.

Since the order of cropping and resizing has no impact on the search range, for the sake of consistency in computation, we assume that the image is first resized and then cropped.
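Under this convention, the cropping ratio can be computed as in the following sketch; the resolution values in the usage comment are illustrative.

def cropping_ratio(sensor_dims, resize_factor, video_dims):
    """Cropping ratio: resized-frame rows (cols) over final video rows (cols),
    taking the minimum over the two dimensions (Section IV-A definition).

    Assumes the sensor output is first resized by resize_factor and then
    center-cropped to the video resolution.
    """
    resized_rows = sensor_dims[0] * resize_factor
    resized_cols = sensor_dims[1] * resize_factor
    return min(resized_rows / video_dims[0], resized_cols / video_dims[1])

# Example (illustrative numbers): a 3000x4000 sensor resized by 0.36 and
# center-cropped to 720x1280 gives min(1080/720, 1440/1280) = 1.125.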

To better understand typical cropping ratios employed in cameras, we studied the distribution of cropping ratios of the videos in NYUAD-mmd, shown in Fig. 14. The figure shows that cameras tend to avoid high cropping ratios, and most of them use a ratio close to one (i.e., very little or no cropping). A value less than one indicates that the video is captured with boundary pixels, as shown in Fig. 14.

Fig. 14: Distribution of cropping ratios in NYUAD-mmd

From these observations, we derive two heuristics to speed up the search process. The first heuristic narrows the search space by stopping when the cropping ratio exceeds the maximum value observed in practice. The second heuristic starts from the most likely resizing factors and progresses to less likely ones.

Suppose the same camera sensor in the above example captures an HD video (which may or may not have used boundary pixels). Using the algorithm in Fig. 13 with the search range of [6], a wide range of rows has to be searched to find the correct resizing factor, corresponding to a wide range of possible cropping ratios. Using the knowledge obtained from Fig. 14, we can narrow the cropping ratio range, because the maximum cropping ratio observed in the dataset is well below the theoretical upper bound. Hence the row search range, and with it the number of NCC correlations, can be decreased substantially, speeding up the search by close to a factor of five.

The second heuristic derived from Fig. 14 is that the majority of cameras capture videos with almost no cropping (i.e., a cropping ratio close to one). When the search proceeds from higher cropping ratios to lower ones (i.e., from lower resizing factors to higher ones in bilinear scaling), a match occurs within a short time in most matching cases. However, this heuristic does not help for non-matching cases, as there is no match to find and the search continues until all possible cropping/resizing factors in the range have been tried.

video	method	max cf	#ncc	time (s)
HD (1280×720)	smart	1.600	273	102
HD (1280×720)	[6]	3.125	426	482
FHD (1920×1080)	smart	1.600	173	171
FHD (1920×1080)	[6]	2.083	218	247
QHD (2560×1440)	smart	1.563	123	198
QHD (2560×1440)	[6]	1.563	114	192
TABLE IV: Required time for search in sample cases

To better understand the effect of narrowing the cropping ratio range, we compared the running time of smart search with the standard method in [6]. Table IV shows the running times (using one resizing technique) for cases with a fixed image resolution and a video that is HD (1280×720), FHD (1920×1080), or QHD (2560×1440). For these cases, we report the estimated maximum cropping ratio (max cf), the number of NCC operations (#ncc), and the total time in seconds for smart and exhaustive search.

The results show that when the resolution ratio between the image and the video is high, the speedup achieved by smart search is significant. When the difference is low, smart search can be slightly slower, as the number of NCC computations increases because we crop the boundary pixels.

It should be noted that once the resizing factor for a camera model is found, it does not need to be calculated again. We can create a lookup table that contains the resizing factor (or the matching resolution) corresponding to each possible pair of media resolutions from a particular camera model. Table V shows this information for the Xiaomi Redmi Note 3 and Nexus 5 smartphone cameras. For example, an image can be matched with a Full HD video when it is resized with the tabulated factor so that its final resolution equals the listed matching resolution. The true peak can then be found using NCC, and the video and the image can be matched. It is crucial to note that because in-camera resizing is software dependent, cameras of a particular model with different software may have different resizing factors; however, we have not observed such a case in NYUAD-mmd.
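Such a lookup table can be as simple as the following sketch; the entries shown are illustrative placeholders, not measured values from NYUAD-mmd.

# Hypothetical look-up table: once the resizing factor for a given
# (camera model, image resolution, video resolution) triple has been found,
# it can be cached and reused. All values below are illustrative placeholders.
RESIZE_LUT = {
    ("ExampleCam A", (4000, 3000), (1920, 1080)): 0.48,
    ("ExampleCam B", (3264, 2448), (1280, 720)): 0.39,
}

def lookup_resize_factor(model, image_res, video_res):
    """Return the cached resizing factor, or None to trigger a full search."""
    return RESIZE_LUT.get((model, image_res, video_res))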

camera image video match resol rf
Redmi N3 0.3036
0.4563
Nexus 5 0.5882
0.3922
0.5884
TABLE V: Parameter look-up table for sample cameras. "Match resol." stands for the matching resolution to which the image is resized, and "rf" indicates the corresponding resizing factor.

V Experimental Analysis

For evaluation, we created a dataset, the NYUAD mixed media dataset (NYUAD-mmd), which contains visual media from smartphone cameras spanning multiple brands and models. From these cameras, images and non-stabilized videos (most of them 40+ seconds long) of different resolutions, as allowed by the camera settings, were collected. Most cameras allow images or video to be captured at more than one resolution. For each camera, images of the same resolution were grouped together to calculate a fingerprint corresponding to that resolution. Next, we used the first 40 seconds of each video to create a video FE. NCC was used to determine potential alignments between two fingerprints, and PCE to test whether they indeed match. The performance (i.e., true positive rate and false positive rate) of the proposed method was compared with [6] and our previous work [19]. All the experiments were implemented in Matlab 2016a on a Windows 7 PC with 32 GB memory and an Intel Xeon E5-2687W v2 @ 3.40 GHz CPU.

Experiment: Train on images, test on videos

For this experiment, multiple image FEs were computed per camera, as the dataset contains more than one image resolution for most cameras. A video FE was extracted from each video.

Since both image FEs and video FEs are reliable (from many still images), based on the analysis in Section III and the algorithm suggestion in Section IV, bilinear scaling was the only resizing technique tried.

We correlated all image FEs with the video FEs taken by the same camera (i.e., the alternative hypothesis, H1). We then compared each image FE with randomly chosen video FEs from other cameras to evaluate the FPR (i.e., the null hypothesis, H0). In this experiment, we compared our results with [6] and [19].

The results show that, at the chosen PCE threshold, the TPR is 65.52%, 86.11%, and 96.91% for [6], [19], and the proposed method, respectively (Table VI). The main reason [6] performs significantly worse is that it does not consider that images taken at lower resolution might themselves be scaled and cropped in-camera. Since smartphones were not commonly used at the time of publication of [6], the problems this paper addresses may not have been relevant then. The lower TPR for [19] is due to the boundary pixel issue, as well as its failure to account for lower resolution images already being cropped and resized within the camera, which happens for some cameras. Further details for the cameras (use of boundary pixels, etc.) can be found in Appendix B.

method	#matches	#comparisons	TPR	FPR	Time (sec)
[6]	4	563	N/A	0.71%	469
[19]	6	563	N/A	1.07%	567
smart	6	563	N/A	1.07%	148
[6]	382	583	65.52%	N/A	171
[19]	502	583	86.11%	N/A	93
smart	565	583	96.91%	N/A	47
TABLE VI: Performance of train on images, test on videos (top three rows: H0 comparisons; bottom three rows: H1 comparisons)

Note that we used Matlab's bilinear scaling rather than the bilinear scaling filter used for the analysis in Section III. When the same experiment is done with the filter of Section III, both the TPR and the average PCE drop noticeably.

Experiment: Train on videos, test on images

In this experiment, each image FE (here, the PRNU noise of a single image) taken by a specific camera was correlated with all video FEs of the same camera. Since a single-image FE is not highly accurate, based on the analysis in Section III and the algorithm of Section IV, both bilinear scaling and binning were tried for resizing. As we know from the RoA derivation in Section III, bilinear scaling may not be the best way to resize the images when the video is resized with binning, especially when the FE quality is low. To resize using binning, we reverted the RGB images to raw form and, using the binning scheme described in Section II, downsized the raw image to half resolution. We then followed the basic imaging pipeline with demosaicing and bilinear scaling, and obtained an image FE at the matching resolution that we had already learned from the previous experiment.
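This binning path can be summarized as in the sketch below, where the helper callables stand for the operations sketched in earlier sections (RGGB sampling, 2×2 binning, bilinear demosaicing, and bilinear scaling); they are placeholders, not the paper's exact code.

def image_fe_path_binning(rgb_image, to_bayer, bin_half, demosaic, scale_to, target_res):
    """Prepare an image along the binning path before FE extraction (sketch).

    to_bayer, bin_half, demosaic, and scale_to are placeholders for the
    operations sketched earlier; target_res is the matching resolution
    learned from the look-up step of the previous experiment.
    """
    raw = to_bayer(rgb_image)              # revert the RGB image to a Bayer mosaic
    binned = bin_half(raw)                 # downsize to half resolution via 2x2 binning
    rgb_half = demosaic(binned)            # follow the basic imaging pipeline
    return scale_to(rgb_half, target_res)  # bilinearly scale to the matching resolution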

In our experiment, we first compared all video FEs with image FEs using bilinear scaling for both [19] and the smart search. The TPRs for these two cases were 74.55% and 82.85%, respectively, as shown in Table VII. We then resized using only binning and achieved a TPR of 79.84% (labeled "bin" in the table). Finally, we took the maximum of the binning and bilinear-scaling PCE values, which resulted in a TPR of 85.72%. These results show that in 692 of the 24,077 cases (about 2.9%) an image does not match the corresponding video FE when it is downsized with bilinear scaling but does match when it is downsized with binning. Thus, even if the correct resizing factor is known, matching may fail because of differences between the video and image resizing techniques (i.e., a lower RoA). However, resizing with a technique that potentially has a higher RoA can recover the match.

Note that since [6] was proposed as a general solution for cropping and resizing, it does not cover some of the cases in this experiment. Hence, we compare the proposed approach only with [19].

Type           Matches   Total   TPR
[19]           17949     24077   74.55%
bilinear       19948     24077   82.85%
bin            19224     24077   79.84%
bin+bilinear   20640     24077   85.72%
TABLE VII: Performance of train on videos, test on images
Fig. 15: PCE differences of video-image comparisons for a Xiaomi Note 4 camera when images are resized with bilinear scaling versus binning. The first 80 correlations are with HD videos and the rest with Full HD videos.

Analysis of our experimental results also revealed that the same camera might use different resizing techniques when capturing media at different resolutions. For example, Figure 15 shows the correlations of video FEs with image FEs for a Xiaomi Note 4 camera, which captured two HD videos, four Full HD videos, and images at several resolutions. We first resized the images with bilinear scaling and with binning (as explained in Section III). We then found the PCE for both cases (using NCC and taking the peak position after resizing) and computed their differences. As seen in the figure, when the images are resized with binning, the PCE is higher for the HD videos, whereas for the Full HD videos resizing with binning significantly lowers the PCE. Our inference is that the HD videos may be resized with binning in-camera, so they match better when the images are also resized with binning; for the Full HD videos, binning is likely not used, so resizing the images with binning performs poorly. Similar behavior was observed in the other two Xiaomi Note 4 cameras in the dataset.
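One simple way to turn this observation into an automated guess about the in-camera resizing technique (our own sketch, not a procedure from the paper; the PCE values are assumed to come from routines such as those sketched earlier in this section) is to compare the two hypotheses over several videos of the same resolution and attribute the technique to whichever consistently scores higher:

def infer_resizing_technique(pce_bilinear, pce_binning):
    # Each argument is a list of PCE values for the same image-video pairs,
    # obtained by resizing the images with bilinear scaling vs. binning emulation.
    diffs = [b - a for a, b in zip(pce_bilinear, pce_binning)]
    votes_for_binning = sum(d > 0 for d in diffs)
    return "binning" if votes_for_binning > len(diffs) / 2 else "bilinear/other"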

DARPA MediFor Camera ID Challenge

In July 2018, DARPA's Media Forensics (MediFor) program conducted a PRNU-based camera attribution challenge consisting of six sub-challenges defined by the types of the training and test sets. For example, when videos and images are used for training and only videos for testing, the sub-challenge is called "train on multimedia, test on video".

Participants had two options for each verification task: (i) submit an answer by providing a "confidence score" indicating how likely it is that the specified camera captured the probe media, or (ii) opt out of submitting a solution (i.e., when the participant is not comfortable with the confidence score). Submissions were evaluated using three metrics: Area Under Curve (AUC), Correct Detection at a fixed False Acceptance Rate (CD at FAR), and Trial Response Rate (TRR). AUC is the area under the ROC curve obtained from the confidence scores; it takes a value between zero and one, where one indicates a perfect result. CD at FAR is the fraction of true cases accepted when only a small, fixed fraction of the false cases is accepted as true. Finally, TRR is the fraction of tasks for which an answer was submitted (i.e., not opted out).
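As a concrete reading of these metrics, the sketch below (Python with scikit-learn; a hypothetical helper, not DARPA's official scorer, and the 1% FAR target is an assumed example value) computes AUC, correct detection at a fixed FAR, and TRR from per-trial confidence scores and opt-in flags:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def score_submission(labels, scores, opted_in, far_target=0.01):
    # labels: 1 if the probe truly came from the claimed camera, else 0.
    # scores: submitted confidence scores; opted_in: True where an answer was given.
    opted_in = np.asarray(opted_in, dtype=bool)
    y = np.asarray(labels)[opted_in]
    s = np.asarray(scores)[opted_in]
    auc = roc_auc_score(y, s)                           # area under the ROC curve
    fpr, tpr, _ = roc_curve(y, s)
    cd_at_far = float(np.interp(far_target, fpr, tpr))  # TPR at the target FAR
    trr = float(opted_in.mean())                        # fraction of trials answered
    return auc, cd_at_far, trr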

Table VIII shows the results for each sub-challenge when all tasks were opted in. For brevity, the names of the sub-challenges are shortened in the table; for example, "image-mm" indicates "test on image, train on multimedia". "#P" shows the number of participants in a sub-challenge, "Rank" is our ranking in terms of AUC, and "AUC+" and "CD+" show our difference from the highest-performing team in terms of AUC and CD, respectively. For sub-challenges in which no other group participated, AUC+ and CD+ are marked N/A. As shown in Table VIII, no other group submitted to the four challenges that contained mixed media. Our submissions for these challenges were based on the methods proposed in this paper.

Challenge     #P   Rank   AUC    CD     AUC+    CD+
image-image   9    1      0.87   0.77   0.07    0.20
image-mm      1    1      0.84   0.69   N/A     N/A
image-video   1    1      0.62   0.40   N/A     N/A
video-image   1    1      0.76   0.50   N/A     N/A
video-mm      1    1      0.68   0.47   N/A     N/A
video-video   4    2      0.60   0.36   -0.10   0.02
TABLE VIII: Performance comparison in the fullset (all tasks opted in)

Table IX shows the results for three sub-challenges when opting out of submitting a solution was allowed.

Challenge     #P   Rank   AUC    CD     AUC+    CD+     TRR
image-image   6    1      0.99   0.99   0.05    0.15    0.58
image-video   1    1      0.90   0.81   N/A     N/A     0.40
video-video   4    2      0.76   0.58   -0.10   -0.10   0.62
TABLE IX: Performance comparison in the subset (opting out allowed)

Note that since the DARPA dataset contained lower-quality images and videos (e.g., stabilized, low-intensity, scaled and/or cropped, tampered, and so on), the error rates on this dataset are higher than on NYUAD-mmd, where the images and videos were not processed in any way after capture.

VI Conclusion and Future Work

PRNU-based source camera attribution may become ineffective when the reference and query media are of different types (i.e., one a video and the other an image). This is due to the misalignment caused by the differences between the in- and out-camera operations applied to the two media. In this paper, we examined these differences and proposed the notion of "Ratio of Alignment" (RoA), which provides insight into how the correlation of two media will be affected by the desynchronization caused by different resizing approaches. We validated this analytical RoA estimation experimentally.

We then presented an approach for source attribution of mixed media based on the knowledge obtained about in-camera processing and on the RoA analysis. The approach was validated through experiments on a dataset consisting of mixed media (i.e., the reference is a set of images and the query is a video, or the reference is a video and the query is a single image). It was shown that the proposed approach gives state-of-the-art results. Although the experiments on our dataset involved pristine media (i.e., not modified outside the camera), experiments with a DARPA dataset that included modified images and videos also resulted in good overall performance. Our experiments further revealed insights about in-camera processing for different camera models, as listed in Appendix B.

One of the biggest challenges in performing source camera attribution is the development of efficient and effective techniques to determine resizing parameters. Since exhaustively trying all possible parameters is often infeasible, it is crucial to develop techniques that estimate the resizing technique and factor efficiently. Although there has been some work on resizing-factor estimation for images, there is significant room for improvement, and in our experience these methods typically do not perform well on videos. To our knowledge, there has been no work on determining the resizing technique itself. In fact, media forensics research has generally not accounted for the possibility that different resizing techniques may be deployed, as this work does.

Another avenue for future research stems from the fact that the RoA analysis presented here shows a significant performance drop when different resizing techniques are used. Perhaps an out-camera resizing technique can be developed that has a higher RoA, and hence higher correlation, with videos resized by commonly used in-camera techniques such as binning and line-skipping.

Furthermore, when the reference is a video and the query is an image, the attribution performance drops to 85.72% in the best case (Table VII). Clearly, there is a need for improvement; for example, a technique that obtains a better-quality PRNU noise estimate from videos may help achieve better performance.

Finally, we have assumed that neither the videos nor the images are zoomed, stabilized, or obtained using a non-linear operation (such as HDR), and that they were not subjected to out-camera cropping and/or resizing. Although there has been recent research on performing attribution in the presence of such operations, much more work is needed to achieve high accuracy. Such techniques would complement the work presented in this paper, but the drop in attribution performance when a cascade of such operations is applied remains to be determined.

References

  • [1] H. T. Sencar and N. Memon, Digital image forensics: There is more to a picture than meets the eye.   New York, USA: Springer, 2013.
  • [2] J. Fridrich, “Digital image forensics,” IEEE Signal Processing Magazine, vol. 26, no. 2, 2009.
  • [3] E. Delp, N. Memon, and M. Wu, “Digital forensics,” IEEE Signal Processing Magazine, vol. 26, no. 2, pp. 14–15, 2009.
  • [4] S. Milani, M. Fontani, P. Bestagini, M. Barni, A. Piva, M. Tagliasacchi, and S. Tubaro, “An overview on video forensics,” APSIPA Transactions on Signal and Information Processing, vol. 1, 2012.
  • [5] J. Lukas, J. Fridrich, and M. Goljan, “Digital camera identification from sensor pattern noise,” IEEE TIFS, vol. 1, no. 2, pp. 205–214, 2006.
  • [6] M. Goljan and J. Fridrich, “Camera identification from scaled and cropped images,” Proc. SPIE, Electronic Imaging, Forensics, Security, Steganography, and Watermarking of Multimedia Contents X, vol. 6819, pp. 68190E–68190E-13, 2008.
  • [7] W. Yaqub, M. Mohanty, and N. Memon, “Towards camera identification from cropped query images,” in 25th ICIP.   IEEE, 2018, pp. 3798–3802.
  • [8] S. Bayram, H. Sencar, and N. Memon, “Seam-carving based anonymization against image & video source attribution,” in MMSP, 2013 IEEE 15th International Workshop on.   IEEE, 2013, pp. 272–277.
  • [9] A. E. Dirik, H. T. Sencar, and N. Memon, “Analysis of seam-carving-based anonymization of images against prnu noise pattern-based source attribution,” IEEE TIFS, vol. 9, no. 12, pp. 2277–2290, 2014.
  • [10] S. Taspinar, M. Mohanty, and N. Memon, “Prnu-based camera attribution from multiple seam-carved images,” IEEE TIFS, vol. 12, no. 12, pp. 3065–3080, 2017.
  • [11] ——, “Prnu based source attribution with a collection of seam-carved images,” in 2016 IEEE ICIP.   IEEE, 2016, pp. 156–160.
  • [12] S. Mandelli, L. Bondi, S. Lameri, V. Lipari, P. Bestagini, and S. Tubaro, “Inpainting-based camera anonymization,” in Image Processing (ICIP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 1522–1526.
  • [13] J. Entrieri and M. Kirchner, “Patch-based desynchronization of digital camera sensor fingerprints,” in IS&T Media Watermarking, Security, and Forensics, 2016.
  • [14] A. Karaküçük, A. E. Dirik, H. T. Sencar, and N. Memon, “Recent advances in counter PRNU based source attribution and beyond,” IS&T Media Watermarking, Security, and Forensics, vol. 9409, April 2015.
  • [15] “DARPA Medifor Program,” https://www.darpa.mil/program/media-forensics/, 2018, [Online; accessed 7-August-2018].
  • [16] B. E. Bayer, “Color imaging array,” US Patent 3,971,065, Jul. 20, 1976.
  • [17] X. Jin and K. Hirakawa, “Analysis and processing of pixel binning for color image sensor,” EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, p. 125, 2012.
  • [18] J. Zhang, J. Jia, A. Sheng, and K. Hirakawa, “Pixel binning for high dynamic range color image sensor using square sampling lattice,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2229–2241, 2018.
  • [19] S. Taspinar, M. Mohanty, and N. Memon, “Source camera attribution using stabilized video,” in 2016 IEEE International Workshop on Information Forensics and Security (WIFS).   IEEE, 2016, pp. 1–6.
  • [20] A. Imaging, “Mt9m001: 1/2-inch megapixel cmos digital image sensor,” MT9M001 DS Rev, vol. 1, pp. 1–27, 2004.
  • [21] M. K. Mihcak, I. Kozintsev, K. Ramchandran, and P. Moulin, “Low-complexity image denoising based on statistical modeling of wavelet coefficients,” IEEE Signal Processing Letters, vol. 6, no. 12, pp. 300–303, 1999.
  • [22] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Bm3d image denoising with shape-adaptive principal component analysis,” in SPARS’09-Signal Processing with Adaptive Sparse Structured Representations, 2009.
  • [23] J. P. Lewis, “Fast normalized cross-correlation,” in Vision interface, vol. 10, no. 1, 1995, pp. 120–123.
  • [24] J. Lukáš, J. Fridrich, and M. Goljan, “Digital camera identification from sensor pattern noise,” IEEE TIFS, vol. 1, no. 2, pp. 205–214, 2006.
  • [25] Y. Sutcu, S. Bayram, H. T. Sencar, and N. Memon, “Improvements on sensor noise based source camera identification,” in IEEE International Conference on Multimedia and Expo, 2007, pp. 24–27.
  • [26] C. T. Li and Y. Li, “Color-decoupled photo response non-uniformity for digital image forensics,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 2, pp. 260–271, 2012.
  • [27] G. Chierchia, S. Parrilli, G. Poggi, C. Sansone, and L. Verdoliva, “On the influence of denoising in PRNU based forgery detection,” in ACM Multimedia in Forensics, Security and Intelligence, 2010, pp. 117–122.
  • [28] C. T. Li, “Source camera identification using enhanced sensor pattern noise,” IEEE TIFS, vol. 5, no. 2, pp. 280–287, 2010.
  • [29] R. Caldelli, I. Amerini, F. Picchioni, and M. Innocenti, “Fast image clustering of unknown source images,” in IEEE International WIFS, 2010, pp. 1–5.
  • [30] J. Lukáš, J. Fridrich, and M. Goljan, “Detecting digital image forgeries using sensor pattern noise,” in Electronic Imaging 2006.   International Society for Optics and Photonics, 2006, pp. 60720Y–60720Y.
  • [31] G. Chierchia, G. Poggi, C. Sansone, and L. Verdoliva, “A Bayesian-MRF approach for PRNU-based image forgery detection,” IEEE TIFS, vol. 9, no. 4, pp. 554–567, 2014.
  • [32] S. Milani, M. Fontani, and P. B. et. al., “An overview on video forensics,” Signal Processing Systems, vol. 1, pp. 1–18, June 2012.
  • [33] M. Chen, J. Fridrich, M. Goljan, and J. Lukas, “Source digital camcorder identification using sensor photo response non-uniformity,” in SPIE Electronic Imaging, 2007, pp. 1G–1H.
  • [34] S. McCloskey, “Confidence weighting for sensor fingerprinting,” in IEEE CVPR Workshops, 2008, pp. 1–6.
  • [35] W.-H. Chuang, H. Su, and M. Wu, “Exploring compression effects for improved source camera identification using strongly compressed video,” in IEEE ICIP, 2011, pp. 1953–1956.
  • [36] S. Chen, A. Pande, K. Zeng, and P. Mohapatra, “Video source identification in lossy wireless networks,” in IEEE INFOCOM, 2013, pp. 215–219.
  • [37] W. van Houten and Z. Geradts, “Using sensor noise to identify low resolution compressed videos from Youtube,” in IAPR International Workshop on Computational Forensics, 2009, pp. 104–115.
  • [38] D.-K. Hyun, C.-H. Choi, and H.-K. Lee, “Camcorder identification for heavily compressed low resolution videos,” in Computer Science and Convergence.   Springer, 2012, pp. 695–701.
  • [39] M. Iuliani, M. Fontani, D. Shullani, and A. Piva, “A hybrid approach to video source identification,” arXiv preprint arXiv:1705.01854, 2017.
  • [40] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, “Raise: a raw images dataset for digital image forensics,” in Proceedings of the 6th ACM Multimedia Systems Conference.   ACM, 2015, pp. 219–224.
  • [41] S. Mandelli, P. Bestagini, L. Verdoliva, and S. Tubaro, “Facing device attribution problem for stabilized video sequences,” arXiv preprint arXiv:1811.01820, 2018.
  • [42] M. Goljan, “Camera identification from HDR images,” private communication.
  • [43] A. C. Gallagher, “Detection of linear and cubic interpolation in jpeg compressed images,” IEEE, 2005, pp. 65–72.
  • [44] B. Mahdian and S. Saic, “Blind authentication using periodic properties of interpolation,” IEEE Transactions on Information Forensics and Security, vol. 3, no. 3, pp. 529–538, 2008.

Appendix A Mathematical Analysis

A-1 Sensor-Pixel Correspondence for Bilinear Scaling for the Green Component

For the green color plane, the value of a pixel of the demosaiced image can be obtained as

(16)

After bilinear scaling, the green value of a pixel of the scaled image is given as

(17)

Substituting (16) into (17), we obtain

(18)

A-2 Sensor-Pixel Correspondences for Binning

Assume the green color component of a pixel of the binned sensor output can be found as