End-to-End Trained CNN Encoder-Decoder Networks For Image Steganography
In the field of information security, steganography and steganalysis are two important techniques. Steganography is used to conceal secret information (e.g., a message, a picture, or a sound), known as the payload, inside another non-secret object (an image, a sound, or a text message), known as the cover object, such that the very existence of the secret message remains hidden. Thus, the overall goal of steganography is to conceal a payload inside a cover object without compromising the fidelity of either. Media files (such as sound and images) are preferred cover objects because of their size; images in particular have recently gained popularity as cover objects in the research community. Steganalysis is the adversarial counterpart of steganography and aims to detect cover objects and their payloads.
In image steganography, most prior work hides a text message inside a cover image. The focus of existing techniques has therefore been to find either noisy regions or low-level image features (edges, textures, etc.) in the cover image for embedding the maximum amount of secret information without distorting the original image.
For instance, Least Significant Bit (LSB) substitution methods have been extremely popular for image steganography due to their simplicity. These methods replace the least significant bit of a chosen pixel of the cover object with a single bit of the secret message. However, since the pixels of the cover object are updated in a disjoint manner, the distribution of LSBs across image pixels changes; this distorts the image and makes it easy to detect both the cover object and its payload. Recently, researchers have proposed a range of improvements on the basic LSB approach. For instance, Yang et al. use image content (texture, edges, and pixel brightness) to estimate the number of LSBs to use for robust hiding of the payload.
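As an illustration of the basic idea (this is a generic sketch, not part of our method; the 8×8 array size and single-bit-plane payload are arbitrary choices for the example), LSB substitution can be written as:

```python
import numpy as np

def lsb_embed(cover, payload_bits):
    """Replace the least significant bit of each cover pixel
    with one bit of the payload."""
    return (cover & np.uint8(0xFE)) | payload_bits

def lsb_extract(stego):
    """Read the payload back out of the LSB plane."""
    return stego & np.uint8(1)

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
bits = rng.integers(0, 2, size=(8, 8), dtype=np.uint8)

stego = lsb_embed(cover, bits)
# The payload is recovered exactly, and each pixel changes by at most one level.
assert np.array_equal(lsb_extract(stego), bits)
assert int(np.max(np.abs(stego.astype(int) - cover.astype(int)))) <= 1
```

The second assertion shows why plain LSB distorts the image so little visually, while the systematic change to the LSB plane's distribution is exactly what steganalysis exploits.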
Alternatively, other researchers have used domain knowledge to extract other primitive features for information hiding. For example, Holub et al. use weights obtained from wavelets in their WOW (Wavelet Obtained Weights) algorithm to find noisy and textured regions of the image for embedding the secret message. Zhao et al. and Tai et al. propose modifications of image histogram representations as a data hiding technique. Islam et al. use cover image edges to extend the payload embedding capacity while avoiding detection during steganalysis.
In short, all these existing works use domain knowledge to identify basic image features for hiding a binary payload while remaining robust to distortion of the cover image's appearance.
In comparison to earlier work, here we address the more generic problem of hiding a real-world image (payload) inside another real-world image (cover). To the best of our knowledge, no prior work takes both the cover and the payload directly as images. Note that although we propose using images for both cover and payload, other information such as text can be treated as a special case of this solution: the text can be converted to a bitmap representation and then used as the payload with our method.
To this end, we propose a novel and fully automatic steganography method for hiding one image inside another. We design a deep learning network that automatically identifies the best features from both cover and payload images for merging their information. The biggest advantage of this approach is that it is generic and can be used with any type of image; to validate this, we test our approach on a variety of publicly available datasets, including ImageNet, MNIST, CIFAR10, LFW and PASCAL-VOC12.
Overall, our main contributions are as follows: (i) we propose a deep learning based generic encoder-decoder architecture for image steganography; (ii) we design a new loss function that ensures joint end-to-end training of the encoder-decoder networks; (iii) we perform an extensive empirical evaluation of the proposed architecture on a range of challenging publicly available datasets and report state-of-the-art payload capacity at high PSNR and SSIM values. Specifically, using our proposed algorithm we can reliably embed a single-channel image into a color image of the same spatial size. Our experiments show that we can achieve a payload of 33% (on average 8 bpp) with average PSNR values of 32.9 dB (SSIM=0.96) for the cover and 36.6 dB (SSIM=0.96) for the recovered payload image – c.f. Section 3 for further details.
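The correspondence between the reported relative payload and the bpp figure follows directly from the image formats involved: hiding one 8-bit channel inside a 24-bit RGB cover of the same size gives

```python
# Capacity of hiding one 8-bit grayscale channel inside a 24-bit RGB cover
payload_bits_per_pixel = 8        # single channel, 8 bits per pixel
cover_bits_per_pixel = 3 * 8      # RGB, 8 bits per channel

relative_payload = payload_bits_per_pixel / cover_bits_per_pixel
assert abs(relative_payload - 1 / 3) < 1e-12   # ~33.3% payload at 8 bpp
```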
The rest of the paper is organized as follows. Section 2 gives details of the encoder-decoder architecture. Section 3 discusses the datasets, parameter settings and results in detail. Finally, Section 4 concludes the paper with relevant discussion.
We train end-to-end a pair of encoder and decoder Convolutional Neural Networks (CNNs): the encoder creates a hybrid image from a pair of input images, and the decoder recovers the payload image from that hybrid image – c.f. Figure 1 for architecture details. Here, we exploit the observation that CNN layers learn a hierarchy of image features, from low-level generic features to high-level domain-specific ones. Our encoder thus identifies specific features of the cover image in which to hide the details of the payload image, while the decoder learns to separate those hidden features from the "hybrid" image.
Specifically, the encoder network takes two images (a "host" cover image and a "guest" payload image) as input and produces a single hybrid output image. The goal of the encoder is to produce a hybrid image that remains visually identical to the host image while also containing the guest image content. The decoder network takes the encoder-produced hybrid image as input and recovers the guest image from it; its goal is a recovered image that remains visually similar to the guest image given to the encoder.
Let $h$ and $g$ denote the input host and guest images to the encoder, and let $h'$ and $g'$ denote the output hybrid image and the decoder output image, respectively. The complete loss function for the encoder and decoder networks can then be written as:

$$\mathcal{L}(\theta_E, \theta_D) \;=\; \alpha \,\| h - h' \|_2^2 \;+\; \beta \,\| g - g' \|_2^2 \qquad (1)$$

Here $\theta_E$ and $\theta_D$ represent the learned weights of the encoder and decoder networks, while $\alpha$ and $\beta$ are controlling parameters for the encoder and decoder. The first term in the loss function defines the encoder loss and the second the decoder loss.
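A plausible numerical form of this joint loss – assuming squared-error terms for both networks, since the exact norm is not stated here – can be sketched as:

```python
import numpy as np

def joint_loss(host, hybrid, guest, decoded, alpha=1.0, beta=1.0):
    """Joint encoder-decoder loss. The first term penalises the hybrid
    image's deviation from the host (encoder loss); the second penalises
    the decoded image's deviation from the guest (decoder loss).
    Squared-L2 terms and the alpha/beta weighting are assumptions."""
    enc_loss = np.mean((hybrid - host) ** 2)
    dec_loss = np.mean((decoded - guest) ** 2)
    return alpha * enc_loss + beta * dec_loss

rng = np.random.default_rng(1)
host = rng.random((16, 16, 3))   # dummy RGB host image
guest = rng.random((16, 16))     # dummy single-channel guest image

# A perfect encoder/decoder pair would drive the loss to zero.
assert joint_loss(host, host, guest, guest) == 0.0
assert joint_loss(host, host + 0.1, guest, guest) > 0.0
```

Because both terms are differentiable in the network outputs, gradients flow from the decoder loss back through the hybrid image into the encoder, which is what makes joint end-to-end training possible.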
The encoder network has two parallel branches at its input, named the guest branch and the host branch. The guest branch receives the input guest image and uses a sequence of convolution and ReLU layers to decompose it into low-level (edges, colors, textures, etc.) and high-level features. The host branch receives the input host image and uses a sequence of convolution and ReLU layers (except the last layer, which has no ReLU) to decompose the input image into a hierarchy of feature representations and to merge the extracted representations of the guest image into the host image.
Precisely, to merge the information from the guest image, the encoder concatenates the feature maps extracted at each alternating layer of the guest branch (starting from the input) onto the corresponding output feature maps of the host branch. This procedure is repeated up to a certain layer depth (chosen empirically as the best setting), at which point the guest branch features are completely merged into the host branch and the guest branch ceases to exist. After merging, a further sequence of convolution and ReLU layers is applied before the final convolution layer, which outputs the hybrid image.
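The merging step can be illustrated on dummy feature maps; the channel counts, the channels-last layout, and the use of channel-wise concatenation are assumptions made for this sketch:

```python
import numpy as np

def merge_branches(host_feats, guest_feats):
    """Concatenate guest-branch feature maps onto the host-branch
    feature maps along the channel axis (H x W x C layout assumed).
    The next host-branch convolution then mixes both sources."""
    assert host_feats.shape[:2] == guest_feats.shape[:2], \
        "spatial sizes must match for channel-wise concatenation"
    return np.concatenate([host_feats, guest_feats], axis=-1)

host_feats = np.zeros((32, 32, 64))    # dummy host feature maps
guest_feats = np.zeros((32, 32, 32))   # dummy guest feature maps

merged = merge_branches(host_feats, guest_feats)
assert merged.shape == (32, 32, 96)    # channels add up, spatial size kept
```

Concatenation (rather than, say, element-wise addition) lets the subsequent convolutions learn how strongly each guest feature should be injected into each host feature, instead of fixing that mixing in advance.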
Our decoder network receives the encoder-produced hybrid image as input and runs it through a sequence of convolution and ReLU layers (except the last layer, which has no ReLU) to recover the concealed guest image.
We also experimented with other design choices; however, in our initial experiments this architecture emerged as the best. During training, the encoder and decoder are trained end-to-end using the joint loss function – c.f. Eq. (1). During testing, however, the encoder and decoder are used separately.
3 Experiments and Results
In this section, we report our experimental settings, along with quantitative and qualitative results of our algorithm on a diverse set of publicly available datasets: ImageNet, CIFAR10, MNIST, LFW and PASCAL-VOC12.
We randomly divided the images of each dataset into three sets: training, validation and testing. All configuration was done using the validation set, and we report final performance on the test set.
For the payload, we randomly select an image from the corresponding dataset and either convert it to gray-scale or choose a single channel from its RGB channels. For the cover, we randomly select an RGB image from the corresponding dataset.
For all experiments we use the same encoder and decoder architecture as explained in Section 2. Each input image is zero-centered. Encoder and decoder weights are randomly initialized using Xavier initialization. For learning these weights we use the Adam optimizer with a fixed learning rate of 1E-4 and a batch size of 32, with fixed settings of the loss-controlling parameters. During each epoch, we disjointly sample images for cover and payload usage from the training set. All filters in the CNN layers are applied with a stride of one pixel and "same" padding.
We use Peak Signal-to-Noise Ratio (PSNR), the Structural SIMilarity (SSIM) index, and bits per pixel (bpp) to report the perceptual quality of the produced images and the embedding capacity of our algorithm.
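PSNR for 8-bit images follows the standard definition below (SSIM involves local windowed statistics and is typically computed with a library such as scikit-image, so it is omitted here):

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB for 8-bit images:
    10 * log10(peak^2 / MSE)."""
    mse = np.mean(
        (reference.astype(np.float64) - distorted.astype(np.float64)) ** 2
    )
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 100, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] += 16             # distort a single pixel

assert psnr(ref, ref) == float("inf")
assert 30.0 < psnr(ref, noisy) < 50.0   # small distortion, high PSNR
```

As a rough orientation, cover/stego PSNR values above roughly 30 dB, like those reported below, correspond to distortions that are hard to notice by eye.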
For our initial experiment, we used cover images from CIFAR10, while payload images were taken from the MNIST dataset. In this experiment, we were able to hide approximately 29.1% payload (i.e., 7 bpp) in our cover images, with average PSNR of 32.85 dB and 32.0 dB for the images produced by the encoder and decoder networks, respectively – c.f. Table ?. These results show that our algorithm can successfully hide a large payload at reasonably high PSNR and SSIM values. To the best of our knowledge, no comparable results have been reported on this dataset.
However, MNIST is a relatively simple dataset, as the majority of pixels in each image belong to a plain background. We therefore conducted another experiment, with identical settings, on the CIFAR10 dataset, which, being a dataset of natural object classes, contains much larger variation in image foreground and background regions.
In this experiment, both cover and payload images were randomly and disjointly sampled from the CIFAR10 training batch. We were able to hide a payload of 33.3% (i.e., 8 bpp) in our cover images, with average PSNR of 30.9 dB and 29.9 dB for the encoder- and decoder-produced images, respectively.
From these experiments, we conclude that our proposed algorithm is generic: with the same architecture, one can reliably obtain large payloads and acceptable PSNR values for complex images as well – c.f. Table ?. For both these experiments, we ran our algorithm for 50 epochs.
To further consolidate our findings and to evaluate our algorithm's embedding capacity and reconstruction performance on larger images, we designed another experiment using the ImageNet dataset. A subset of 8,000 images was randomly chosen from the one million ImageNet images. These were then divided into two disjoint sets: training (6,000 images) and testing (2,000 images); no validation set was used here, since we reuse the settings of the earlier experiments. To allow uniformly sized images as cover and payload, all images were resized to a common resolution. For the initial version of this experiment, and to ensure a fair comparison with our other results, we first ran our algorithm for 50 epochs.
For cover and guest images randomly sampled from our ImageNet test set, we were able to hide a payload of 33.3% (i.e., 8 bpp) in our cover images, with average PSNR of 29.6 dB and 31.3 dB for the encoder- and decoder-produced images, respectively. Since we achieved a high payload at PSNR values similar to the earlier experiments on this more complex dataset as well, we further explored different experimental settings.
Our final model on ImageNet was trained for 150 epochs, further improving the PSNR values for encoder and decoder from 29.6 dB and 31.3 dB to 32.92 dB (SSIM=0.96) and 36.58 dB (SSIM=0.96), respectively, while maintaining a similar payload capacity of 33.3% (on average 8 bpp) – c.f. Table ?.
To further evaluate the generalization capacity of our algorithm, we ran the ImageNet-trained model on samples of 1,000 unseen images from the PASCAL-VOC12 and Labeled Faces in the Wild (LFW) datasets. Table ? shows the results of this experiment. Even though our algorithm was trained on a different dataset, it still achieves high payload capacity at high PSNR and SSIM values, which demonstrates its generalization capability.
Figure ? shows sample result images from the LFW, PASCAL-VOC12 and ImageNet datasets. This qualitative analysis again verifies that our method is able to conceal and recover unseen, complex payload images.
Given this quantitative and qualitative analysis, we conclude that our algorithm is generic and robust to complex backgrounds and variations in object appearance, and can thus be reliably used for image steganography.
In this paper, we have presented a novel CNN-based encoder-decoder architecture for image steganography. In contrast to earlier methods, which consider only binary representations as payload, our algorithm directly takes an image as payload and uses a pair of encoder-decoder networks to embed it in, and robustly recover it from, the cover image. To the best of our knowledge, no such earlier work exists, and we are the first to introduce image-in-image hiding using deep neural networks. We have performed extensive experiments and empirically demonstrated the strength of the proposed method, showing excellent results with large payload capacity on a wide range of in-the-wild image datasets.
- Bin Li, Junhui He, Jiwu Huang, and Yun Qing Shi, "A survey on image steganography and steganalysis," Journal of Information Hiding and Multimedia Signal Processing, vol. 2, no. 2, pp. 142–172, 2011.
- Mehdi Hussain and Mureed Hussain, "A survey of image steganography techniques," 2013.
- Mansi S. Subhedar and Vijay H. Mankar, "Current status and key issues in image steganography: A survey," Computer Science Review, vol. 13, pp. 95–113, 2014.
- Jishen Zeng, Shunquan Tan, Bin Li, and Jiwu Huang, "Large-scale JPEG steganalysis using hybrid deep-learning framework," arXiv preprint arXiv:1611.03233, 2016.
- Hengfu Yang, Xingming Sun, and Guang Sun, "A high-capacity image data hiding scheme using adaptive LSB substitution," Radioengineering, vol. 18, no. 4, pp. 509–516, 2009.
- Vojtech Holub and Jessica Fridrich, "Designing steganographic distortion using directional filters," in IEEE International Workshop on Information Forensics and Security (WIFS), 2012, pp. 234–239.
- Zhenfei Zhao, Hao Luo, Zhe-Ming Lu, and Jeng-Shyang Pan, "Reversible data hiding based on multilevel histogram modification and sequential recovery," AEU – International Journal of Electronics and Communications, vol. 65, no. 10, pp. 814–826, 2011.
- Wei-Liang Tai, Chia-Ming Yeh, and Chin-Chen Chang, "Reversible data hiding based on histogram modification of pixel differences," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 6, pp. 906–910, 2009.
- Saiful Islam, Mangat R. Modi, and Phalguni Gupta, "Edge-based image steganography," EURASIP Journal on Information Security, vol. 2014, no. 1, p. 8, 2014.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- Yann LeCun, Corinna Cortes, and Christopher J. C. Burges, "The MNIST database of handwritten digits," 1998.
- Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, "The CIFAR-10 dataset," 2014.
- Gary B. Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
- Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
- Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, 2010, vol. 9, pp. 249–256.