Denoising Dictionary Learning Against Adversarial Perturbations
Abstract
We propose denoising dictionary learning (DDL), a simple yet effective defense against adversarial perturbations. We examine DDL on MNIST and CIFAR10 perturbed by two different perturbation techniques, the fast gradient sign method (FGSM) and Jacobian saliency maps (JSMA). We evaluate it on five different deep neural networks (DNNs) that represent the building blocks of most recent architectures and form a progression of increasing model complexity. We show that each model tends to capture different representations depending on its architecture. For each model we record its accuracy both on the perturbed test data, which it previously misclassified with high confidence, and on the denoised data after reconstruction using dictionary learning. The reconstruction quality of each data point is assessed by means of the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). We show that after applying DDL, the reconstruction of the original data point from a noisy sample results in a correct prediction with high confidence.
John Mitro, Derek Bridge, Steven Prestwich {j.mitro, d.bridge, s.prestwich}@insightcentre.org Insight Centre for Data Analytics Department of Computer Science University College Cork, Ireland
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. To appear in AAAI18 workshop.
Introduction
The observation of adversarial perturbations in the seminal work of (?) is formulated as an attack against DNN models (?). It describes a mechanism for devising noise that takes the model's output into account. The resulting misbehavior is a high-confidence, involuntary misclassification on the model's part, which could potentially undermine the security of environments where these models are deployed. We examined five DNN models, (i) multilayer perceptron (MLP), (ii) convolutional neural network (CNN), (iii) autoencoder (AE), (iv) residual network (RNet), and (v) hierarchical recurrent neural network (HRNN), of varying complexity and topology, in order to identify how they respond under adversarial noise.
We investigate and visualize which parts of the data maximize the output of the model and verify whether those areas are contaminated under these perturbation attacks. Furthermore, we propose denoising dictionary learning as a measure of protection against adversarial perturbations. It exhibits desirable properties such as robustness (?), (?) and flexibility (?), (?), since it can be incorporated into any supervised learning algorithm (?). Furthermore, it can operate without the presence of noisy samples, which matches the common adversarial use case where only the test data are perturbed and presented to the model. In comparison to (?), denoising dictionary learning has the ability to recover signals from heavily noised samples; moreover, it does not render the data unreadable, unlike (?), which performs a non-invertible transformation. After that transformation the data are no longer interpretable, so both datasets, prior to and after the transformation, must be stored for interpretability of the results. This not only increases the storage space but is also computationally inefficient in real-world scenarios. Current DNN models can be easily exploited by a number of different attacks (?), (?), (?), (?), (?), (?), (?); some of them require knowledge of the intrinsic structure of the model (?) while others can operate without any prior knowledge of the model's loss function (?). These perturbations can be grouped into three categories: 1. model-specific perturbations, 2. whitebox perturbations, 3. blackbox perturbations. It is important to note that different perturbation techniques may overlap. For the experiments conducted in this study, two adversarial perturbation attacks were utilized. The first, the fast gradient sign method (?), is the gradient of the model's loss multiplied by a scalar. The second, the Jacobian saliency map (?), exploits the forward derivative rather than the cost function, emitting information about the learned behavior of the model. The differentiation is applied with respect to the input features rather than the network parameters. Instead of propagating gradients backwards, it propagates them forward, which permits finding those pixels in the input image that lead to significant changes in the network output. Our contributions in this study are twofold. First, we provide an intuitive explanation of the main building operations and components of DNNs, how they operate, and how they can be combined to build more complex models; in addition, we describe which components remain unchanged across architectures and how they might cause adversarial perturbations. Second, we provide a defense mechanism against adversarial perturbations based on sparse dictionary learning to alleviate the problem.
Theoretical Background
The Deep Neural Network Components section provides the information required to understand how DNNs operate given their basic building blocks. Based on this information, we show how to construct more complex models and identify the main problematic components that lead to adversarial perturbations. The Adversarial Perturbations section describes the two well-known adversarial perturbations utilized during the experiments, indicating which components are harnessed in order to devise them. Finally, the Dictionary Learning section describes in detail how the proposed defense mechanism is devised and constructed, and how it operates to reverse the effect of the perturbations.
Deep Neural Network Components
Each of the models illustrated in Figure 1 is color coded in order to denote the different building components. Each block describes a different operation. Here we provide a formal representation of each model's structure, starting from a model as simple as a multilayer perceptron (MLP) up to an LSTM unit. An MLP with one hidden layer can be described as a function $f: \mathbb{R}^d \rightarrow \mathbb{R}^m$, where $d$ is the size of the input vector $x$ and $m$ is the size of the output vector $f(x)$. Using matrix notation we formulate it as:

$f(x) = \sigma(W_2\, \sigma(W_1 x + b_1) + b_2)$   (1)

where $\sigma$ is the activation function. Setting $h$ to denote a generic hidden layer, we can rewrite the above equation as:

$h(x) = \sigma(W x + b)$   (2)
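As a minimal sketch (the dimensions and random weights here are arbitrary placeholders, not the trained architectures of Table 3), the forward pass of Equations 1 and 2 in NumPy:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)  # hidden layer weights
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)  # output layer weights

x = rng.standard_normal(4)
h = relu(W1 @ x + b1)        # hidden layer, Equation 2
y = softmax(W2 @ h + b2)     # full forward pass, Equation 1
```

The nested affine-plus-nonlinearity pattern is the only structural element; deeper MLPs simply repeat Equation 2.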
On the same notion we can extend Equation 1 to accommodate convolutional operations. Recall that a convolution is defined as:

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$   (3)

For the discrete case of a 1D signal the formulation is:

$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$   (4)

This can be extended to 2D as follows:

$(I * K)[i, j] = \sum_{m} \sum_{n} I[m, n]\, K[i - m, j - n]$   (5)

Substituting the convolution of Equation 5 for the matrix product in Equation 2, we get the representation for the convolutional hidden layer:

$h(x) = \sigma(W * x + b)$   (6)
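A direct (unvectorized) NumPy implementation of the discrete 2D convolution of Equation 5, restricted to "valid" output positions for brevity:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Discrete 2D convolution (Equation 5), valid mode: the kernel is
    flipped, then slid over every fully-overlapping window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]   # true convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out
```

Deep learning frameworks typically skip the kernel flip (cross-correlation), which makes no difference once the kernel is learned.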
We can easily extend the formulation of a convolutional layer to define autoencoders, which can be described as a succession of hidden layers that first map the input into a latent space instead of the original space and finally restore it to the original space with an inverse transformation:

$z = \sigma(W x + b), \qquad \hat{x} = \sigma'(W' z + b')$   (7)
Even residual networks (?) and their residual building block, shown in Figure 2, can be reconstructed from Equation 2 as follows:

$h(x) = \sigma(W_2\, \sigma(W_1 x + b_1) + b_2 + x)$   (8)

Notice that the bias term is optional and in most recent residual network architectures it is omitted.
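A sketch of the residual block of Equation 8 in NumPy, with the bias terms omitted as noted above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Residual building block (Equation 8): the input x is added back
    through the skip connection before the final activation."""
    return relu(W2 @ relu(W1 @ x) + x)
```

With the weight matrices at zero the block reduces to the identity (up to the final ReLU), which is precisely what makes very deep stacks of such blocks trainable.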
Finally, recurrent neural networks as a building block introduce two new concepts: time and memory. Time can be described by a feed-forward network unfolded across the time axis. Memory permits the network to share states across time steps.

$h_t = \sigma(W x_t + U h_{t-1} + b)$   (9)

The hidden state $h_t$ at time step $t$ is a function of the input $x_t$ modified by a weight matrix $W$, added to the hidden state of the previous time step $h_{t-1}$ multiplied by its own hidden-to-hidden state matrix $U$, otherwise known as a transition matrix, similar to a Markov chain. The weight matrices are filters that determine the amount of importance to accord to both the present input and the past hidden state. LSTMs, first introduced by (?), can be described as a collection of gates with additional constraints. An LSTM layer is described as:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$h_t = o_t \odot \tanh(c_t)$   (10)
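A single LSTM step following Equation 10; the parameter dictionary P is a placeholder of ours for the gate weights, not an API of any library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM time step (Equation 10). P holds W*, U*, b* for the
    forget (f), input (i), output (o) and cell-candidate (c) gates."""
    f = sigmoid(P["Wf"] @ x + P["Uf"] @ h_prev + P["bf"])   # forget gate
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_prev + P["bi"])   # input gate
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_prev + P["bo"])   # output gate
    c = f * c_prev + i * np.tanh(P["Wc"] @ x + P["Uc"] @ h_prev + P["bc"])
    h = o * np.tanh(c)
    return h, c
```

The additive update of the cell state c is the constraint that lets gradients flow across many time steps.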
In this section we demonstrated how complex DNN models can be constructed, starting from MLPs up to LSTMs. Even though the different components might require some changes when transitioning from one DNN model to another, three of them usually remain consistent: (a) the loss function, the categorical cross entropy (the loss $J$ appearing in Equation 11), for all five models; (b) the activation function, ReLU; (c) the optimization process, a variant of gradient descent (?).
We believe that it is this combination that is responsible for the phenomenon of adversarial perturbations. Examining each building block closely, we realize that DNN models resemble linear models, since most of the components either result from, or describe, a linear transformation. This would explain the high confidence in misclassified examples, since linear models also tend to extrapolate to unseen data points with high confidence. Unfortunately, the loss function does not help either in this situation, since any differentiable loss has the potential to emit information about which parts of the input should be altered in order to maximize a particular output target. This information can be exploited by an adversary. Finally, any optimization process that takes small steps towards a local minimum, such as gradient descent or its variants, results in a process where small changes gradually lead to bigger overall effects. It is these small changes that adversaries exploit in order to devise the perturbations.
Adversarial Perturbations
In this section we provide a short description and formulation of the adversarial perturbations utilized during the experiments. First, we present FGSM, which is the gradient of the loss with respect to the input multiplied by a constant $\epsilon$, defined as:

$\eta = \epsilon\, \mathrm{sign}(\nabla_x J(\theta, x, y))$   (11)

where $J(\theta, x, y)$ describes the loss of the neural network given input $x$ and label $y$. The final perturbed sample for an input $x$ is $\tilde{x} = x + \eta$. Second, we present JSMA in the following three steps: (i) compute the forward derivative $\nabla F(x)$; (ii) construct the saliency map $S(x, t)$ based on the derivative; (iii) modify an input feature $x_i$ by a perturbation $\theta$. The adversary's objective is to craft an adversarial sample $x^* = x + \delta_x$ such that the final output of the network results in a misclassification $F(x^*) = t \neq y$. Reformulated as an optimization problem, the objective is $\arg\min_{\delta_x} \|\delta_x\|$ subject to $F(x + \delta_x) = t$. The forward derivative of a DNN for a given sample $x$ is essentially the Jacobian learned by the neural network, $J_F(x)[j, i] = \partial F_j(x) / \partial x_i$, where $x_i$ denotes an input feature and $F_j(x)$ denotes the output of neuron $j$.
Next we compute the saliency map based on the forward derivative. Saliency maps convey which pixel intensity values should be increased in order for a specific target $t \neq y$ to be produced by the neural network, where $y$ denotes the true label assigned to input $x$. The saliency map is defined as:

$S(x, t)[i] = \begin{cases} 0 & \text{if } \frac{\partial F_t(x)}{\partial x_i} < 0 \ \text{or} \ \sum_{j \neq t} \frac{\partial F_j(x)}{\partial x_i} > 0 \\ \frac{\partial F_t(x)}{\partial x_i}\, \big| \sum_{j \neq t} \frac{\partial F_j(x)}{\partial x_i} \big| & \text{otherwise} \end{cases}$   (12)
The first line rejects components with a negative target derivative or a positive derivative summed over all other classes. The second line considers all remaining forward derivatives. In summary, high saliency map values denote features that will either increase the target class or decrease the other classes significantly. Increasing those feature values causes the neural network to misclassify a sample into the target class.
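Both perturbations can be sketched in a few lines of NumPy. The model here is a hypothetical softmax classifier (weights W, bias b) standing in for a full DNN, so the gradient and the Jacobian are written in closed form; for a real network they would come from automatic differentiation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(x, y_onehot, W, b, eps):
    """FGSM (Equation 11): x + eps * sign(dJ/dx). For a softmax
    classifier with cross-entropy loss, dJ/dx = W.T @ (p - y)."""
    p = softmax(W @ x + b)
    grad_x = W.T @ (p - y_onehot)
    return x + eps * np.sign(grad_x)

def saliency_map(jac, t):
    """JSMA saliency map (Equation 12). jac[j, i] = dF_j/dx_i,
    t is the target class index."""
    dt = jac[t]                        # derivative of the target class
    others = jac.sum(axis=0) - dt      # summed derivative of other classes
    return np.where((dt < 0) | (others > 0), 0.0, dt * np.abs(others))
```

For JSMA, the feature with the highest saliency value is increased first and the map is recomputed after each modification, whereas FGSM perturbs every feature at once.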
Dictionary Learning
In summary, dictionary learning can be deconstructed into three principles: (i) linear decomposition, (ii) sparse approximation, and (iii) dictionary learning itself. Linear decomposition asks whether a signal can be described as a linear combination of some basis vectors and their coefficients. Given a signal $x \in \mathbb{R}^n$ and a matrix $D \in \mathbb{R}^{n \times p}$ of vectors, the linear decomposition of $x$ described by $D$ is $x = D\alpha + e$, where $\alpha$ represents the coefficients and $e$ is the error. Sparse approximation, on the other hand, refers to the ability of $D$ to reconstruct $x$ from a set of sparse basis vectors. When $D$ contains more vectors than dimensions, i.e., $p > n$, it is called a dictionary and its vectors are referred to as atoms. Since $p > n$, the system $x = D\alpha$ has multiple solutions for $\alpha$. It is customary to introduce constraints, such as sparsity, in order to regularize the solution. The decomposition of $x$ under sparse constraints is formulated as $\min_\alpha \|x - D\alpha\|_2^2$ s.t. $\|\alpha\|_0 \leq k$, where $k$ is a constant. There exists a plethora of algorithms to deal with this problem; some utilize greedy approaches such as orthogonal matching pursuit (OMP) (?), while $\ell_1$-norm-based optimizations such as basis pursuit (?), least-angle regression (?), and iterative shrinkage thresholding (?) guarantee convex properties.
OMP proceeds by iteratively selecting the atoms, i.e., the columns of the dictionary with corresponding nonzero coefficients computed via orthogonal projection of the signal on the selected atoms, that best explain the current residue. The overall optimization is a two-step approach alternating between OMP and dictionary learning. As mentioned, the goal of dictionary learning is to learn the dictionary that is most suitable to sparsely approximate a set of signals, as shown in Equation 15. This nonconvex problem is usually solved by alternating between the extraction of the main atoms, referred to as the sparse coding or sparse approximation step, and the actual learning process, referred to as the dictionary update step. This optimization scheme reduces the error criterion iteratively. There are several dictionary learning algorithms, such as maximum likelihood (ML), the method of optimal directions (MOD), and K-SVD for batch methods, and online dictionary learning and recursive least squares (RLS) for online methods, which are less expensive in computation time and memory than batch methods.
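A minimal sketch of this alternating scheme, using the MOD dictionary update and, for brevity, a crude hard-thresholding stand-in for the sparse approximation step (the function names are ours, not from any library):

```python
import numpy as np

def sparse_code(D, X, k):
    """Crude sparse approximation step: keep the k largest-magnitude
    correlations per sample, zero the rest (a stand-in for OMP)."""
    A = D.T @ X
    keep = np.argsort(-np.abs(A), axis=0)[:k]
    mask = np.zeros_like(A, dtype=bool)
    for col in range(A.shape[1]):
        mask[keep[:, col], col] = True
    return np.where(mask, A, 0.0)

def mod_update(X, A):
    """MOD dictionary update: D = X A^T (A A^T)^+, then renormalize."""
    D = X @ A.T @ np.linalg.pinv(A @ A.T)
    norms = np.linalg.norm(D, axis=0)
    return D / np.where(norms > 0, norms, 1.0)

def learn_dictionary(X, n_atoms, k, n_iter=20, seed=0):
    """Alternate sparse coding and dictionary update (batch scheme)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        A = sparse_code(D, X, k)   # sparse approximation step
        D = mod_update(X, A)       # dictionary update step
    return D
```

K-SVD differs only in the update step, which refines one atom (and its coefficients) at a time via a rank-one SVD.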
Next we will formulate the problem of dictionary learning from corrupted samples and demonstrate its ability as a defense mechanism against adversarial perturbations. Formally, the problem of image denoising is described as follows:
$y = x + n$   (13)

where $y$ is our measurement, $x$ is the original image, and $n$ is the noise. The objective is to recover $x$ from the noisy measurements $y$. We can reformulate the problem as energy minimization, also known as maximum a posteriori estimation:

$\hat{x} = \arg\min_x \|y - x\|_2^2 + \lambda R(x)$   (14)

where the first term relates the estimate to the measurements and $R(x)$ is the prior. There are a number of classical priors from which we can choose:
- Smoothness
- Total variation
- Wavelet sparsity

Utilizing sparse representations for image reconstruction, we instead replace the prior with a sparsity constraint on the coefficients, so that $x$ is approximated by $D\alpha$ with $\alpha$ sparse.
Learning a dictionary of atoms can be viewed as an optimization problem composed of two parts: the first refers to the reconstruction of the original signal and the second to sparsity.

$\min_{D, \alpha} \tfrac{1}{2} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_0$   (15)

where $\|\cdot\|_0$ is the $\ell_0$ pseudo-norm and $\|\cdot\|_2$ is the $\ell_2$ norm. The optimization proceeds as follows. After extracting all overlapping patches of $y$ into a matrix $X$, we formulate and solve a matrix factorization problem:

$\min_{D, A} \|X - DA\|_F^2$   (16)

Notice that different constraints apply to $D$ and $A$ depending on the matrix factorization approach. For instance, if PCA is selected as a solution to the factorization problem, then the atoms of $D$ should be orthonormal. If non-negative matrix factorization is selected, then $D$ and $A$ should be non-negative. The optimization for dictionary learning is formulated as follows:

$\min_{D, A} \sum_i \tfrac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \quad \text{s.t.} \ \|d_j\|_2 \leq 1$   (17)
In the following we describe the OMP and dictionary learning algorithms utilized in the experiments. Regarding Algorithm 1, at the current iteration OMP selects the atom that produces the strongest decrease in the residue, which is equivalent to selecting the atom most correlated with the residue. An active set $\Gamma$, containing all of the selected (nonzero) atoms, is formed. In the following step the residue is updated via an orthogonal projection of $x$ onto the span of the atoms in $\Gamma$. Finally, the sparse coefficients $\alpha$ are also updated according to the active set $\Gamma$.
As for Algorithm 2, the sparse coding step is usually carried out by the orthogonal matching pursuit described in Algorithm 1.
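A compact NumPy sketch of the OMP procedure as described above (our own minimal implementation, not the exact pseudocode of Algorithm 1):

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit: greedily select up to k atoms of
    dictionary D (columns) to approximate x."""
    residue = x.copy()
    active = []                      # active set Gamma
    coeffs = np.zeros(D.shape[1])
    for _ in range(k):
        # atom most correlated with the current residue
        j = int(np.argmax(np.abs(D.T @ residue)))
        if j not in active:
            active.append(j)
        # orthogonal projection of x onto the span of the active atoms
        sub = D[:, active]
        sol, *_ = np.linalg.lstsq(sub, x, rcond=None)
        coeffs[:] = 0.0
        coeffs[active] = sol
        residue = x - sub @ sol
    return coeffs
```

Because the residue is re-orthogonalized against all active atoms at every step, each atom is selected at most once in practice.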
Methodology
In this study we trained five different models, from multilayer perceptrons to hierarchical LSTMs, for 100 epochs with a batch size of 32, on two different datasets, MNIST and CIFAR10, whose distributions are presented below.
Each model has been trained, and its accuracy has been recorded on the clean test data as well as on the adversarial data. Moreover, we visualize the distributions for each dataset and how they change before and after the adversarial attacks in Figure 4. Notice that in real-world datasets such as CIFAR10 the shift of the distribution is almost unnoticeable, which demonstrates the severity of adversarial attacks in undermining the security of neural networks. The first row contains the Q-Q plot along with its density plot for MNIST before (a) and after (b) the adversarial perturbation, while the second row contains the same information for CIFAR10.
We also demonstrate visually which parts of the input are most vulnerable with respect to the top predictions each neural network has learned in order to differentiate two similar data points belonging to the same category, in Figure 7 for MNIST and Figures 9, 5, and 6 for CIFAR10.
Finally, we propose dictionary learning as a defense mechanism against adversarial attacks. We demonstrate its ability on two different adversarial attacks and record the results in Table 1 and Table 2, which clearly show that, overall, each model achieves higher classification accuracy on the perturbed datasets after utilizing dictionary learning to reconstruct the original data from the perturbed samples. One of the advantages of dictionary learning is that it can operate regardless of the presence of perturbed or noisy samples. This implies that the dictionary can be learned either from the extracted patches of the perturbed samples or from the clean patches of the unperturbed training data. This is useful in environments where we do not know a priori the attack or the method used to generate the perturbed data. Another advantage of dictionary learning and sparse coding is that it can be embedded in any supervised learning algorithm without severe restrictions. In the particular case of DNNs, the dictionary can easily be learned from the weights of an autoencoder model, for instance.
Furthermore, the computational complexity of dictionary learning is much lower compared to (?), since the dictionary can be learned at training time, avoiding the overhead of invoking a non-invertible transformation at test time. Otherwise, whenever a prediction is required from the model, a non-invertible and computationally expensive transformation has to be performed in advance.
Experiments
Evaluation
The experiments in this study utilized five DNN models that resemble, as closely as possible, real-world application architectures composed of multiple layer types, such as dropout and batch normalization, to avoid overfitting. We deliberately avoided DNN models composed only of convolutional layers, which seem to be more prone to adversarial attacks. The hyperparameters for each model are provided in Table 3. Each model has been evaluated on two different datasets perturbed using two different perturbation techniques, FGSM and JSMA. After evaluating dictionary learning as a defense mechanism against adversarial perturbations, we found that it is able to withstand the attacks and provide good accuracy for each model on the reconstructed datasets. During training, for consistency, we used Adam (?) as the optimizer for all models. We proceeded by perturbing the test set once with FGSM and once with JSMA, tested each model on both perturbed sets, and recorded the accuracies in Table 1 and Table 2. For each image we extracted a set of overlapping patches, which were used to learn the dictionary and its coefficients for each dataset, as shown below.
Figure 4a: OMP dictionary, MNIST (a) & CIFAR10 (b).
Next we extracted patches from the noisy samples and applied the orthogonal matching pursuit described in Algorithm 1 to reconstruct the original samples. For each sample we evaluated its reconstruction error using two metrics. The first is the peak signal-to-noise ratio (PSNR), formulated as $\mathrm{PSNR} = 20 \log_{10}(\mathrm{MAX}_I) - 10 \log_{10}(\mathrm{MSE})$, where $\mathrm{MAX}_I$ refers to the maximum pixel value of the image and MSE refers to the mean square error. The second is the structural similarity index (SSIM), formulated as $\mathrm{SSIM}(a, b) = \frac{(2 \mu_a \mu_b + c_1)(2 \sigma_{ab} + c_2)}{(\mu_a^2 + \mu_b^2 + c_1)(\sigma_a^2 + \sigma_b^2 + c_2)}$, where $a$ and $b$ represent windows of size $N \times N$, $\mu$ represents the average of a window depending on the subscript, $\sigma^2$ is the variance of each window, and $\sigma_{ab}$ is the covariance between $a$ and $b$. Finally, $c_1$ and $c_2$ are constants that stabilize the division with weak denominators.
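The two metrics can be sketched directly in NumPy (the SSIM here is computed over a single window for brevity; the full index averages this quantity over local windows):

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio: 20*log10(MAX) - 10*log10(MSE)."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

def ssim_global(a, b, max_val=255.0):
    """SSIM over a single window, with the usual stabilizing constants
    c1 = (0.01*MAX)^2 and c2 = (0.03*MAX)^2."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

Identical images give SSIM = 1 and an unbounded PSNR; higher values of both indicate a better reconstruction.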
Results
The results are summarized in Table 1 and Table 2 as well as in Figure 5 through Figure 9. Table 1 reports the results for all five models by evaluating their accuracy on MNIST and CIFAR10: on the actual test data, on the data perturbed by the FGSM attack, and finally on the denoised samples recovered through denoising dictionary learning. The number of atoms was chosen heuristically, resulting in 38 atoms for MNIST and 2 atoms for CIFAR10; we suspect that the overall results could be improved by Bayesian hyperparameter selection for the number of atoms. Table 2 equivalently reports the results on the actual test data, the data perturbed by the JSMA attack, and the denoised data recovered through dictionary learning.
Table 1: Classification accuracy on the clean, FGSM-perturbed, and denoised test data.

Classifier  MNIST Clean  MNIST Perturbed  MNIST Denoised  CIFAR10 Clean  CIFAR10 Perturbed  CIFAR10 Denoised
MLP  98.39%  12.80%  82.00%  60.00%  11.00%  55.60%
ConvNet  99.35%  79.90%  90.46%  77.90%  15.06%  68.29%
AutoEnc  99.34%  77.60%  89.37%  70.56%  13.63%  67.76%
ResNet  93.79%  1.95%  74.27%  76.11%  0.089%  70.06%
HRNN  98.90%  23.02%  82.52%  64.00%  15.75%  58.41%
Table 2: Classification accuracy on the clean, JSMA-perturbed, and denoised test data.

Classifier  MNIST Clean  MNIST Perturbed  MNIST Denoised  CIFAR10 Clean  CIFAR10 Perturbed  CIFAR10 Denoised
MLP  98.39%  53.02%  60.00%  60.00%  52.77%  57.63%
ConvNet  99.35%  79.90%  93.85%  77.90%  14.59%  67.23%
AutoEnc  99.34%  62.02%  91.47%  70.56%  56.83%  64.76%
ResNet  93.79%  38.16%  61.12%  76.11%  56.87%  60.16%
HRNN  98.90%  52.65%  63.21%  64.00%  56.43%  61.21%
Table 3: Model architectures (layers listed in order).

MLP: Dropout 0.5; BatchNorm; Dense 784; ReLU; Dropout 0.2; BatchNorm; Dense 256; ReLU; Dense 10; Softmax

ConvNet: Dropout 0.5; 2x Conv2D (filters=32, kernel=3); MaxPool (size=2); Dropout 0.25; 2x Conv2D (filters=64, kernel=3); MaxPool (size=2); Conv2D (filters=128, kernel=3); Conv2D (filters=256, kernel=3); Dropout 0.25; Dense 512; Dropout 0.5; Dense 10; Softmax

AutoEncoder: Conv2D (filters=16, kernel=3); MaxPool (size=2); Conv2D (filters=32, kernel=3); MaxPool (size=2); Conv2D (filters=32, kernel=3); MaxPool (size=2); Conv2D (filters=64, kernel=3); Conv2D (filters=128, kernel=3); UpSample (size=2); Conv2D (filters=32, kernel=3); UpSample (size=2); Conv2D (filters=3, kernel=3); Conv2D (filters=1, kernel=5); Dense 10; Softmax

ResNet: Conv2D; BatchNorm; ReLU; Basic Block; 3x Residual Block; Dropout 0.25; BatchNorm; ReLU; GlobalAveragePooling; Dense 10; Softmax

HRNN: LSTM (units=256); TimeDistributed; LSTM (units=256); Reshape (16x16); LSTM (units=256); Dropout 0.25; Dense 10; Softmax
As is evident, all five models achieve higher accuracy on the recovered samples compared to the perturbed versions. We present the top classifications for the convolutional and residual network models, along with their misclassifications on the data perturbed under the FGSM attack on MNIST, as well as their class activation maps, which describe the sensitivity of the classifier to different parts of the input. In Figure 9 and Figure 5 we present the top misclassifications for the convolutional and residual networks along with the activation maps for CIFAR10 perturbed under the JSMA attack. In Figure 7 we show the top classifications for the autoencoder model along with their activation maps under the FGSM attack on MNIST. Equivalently, in Figure 6 we show the top misclassifications for the autoencoder on CIFAR10 under the JSMA attack. Note that the multilayer perceptron does not have feature maps, unlike a convolutional network, so it is impossible to derive its class activation maps; the same holds for the hierarchical recurrent model. We noticed that the results of the residual network were not as resistant as we would have expected given its skip connections. What we can infer from Figure 5 is that models that have the ability to focus on very small details of the overall image seem to be more susceptible to adversarial perturbations. In Figure 8 we show an example of a reconstructed image for each dataset and perturbation attack.
Conclusion
In this article, a defense mechanism against adversarial perturbations is proposed, based on dictionary learning sparse representations for grayscale (MNIST) and color (CIFAR10) images. The method has been evaluated against five modern deep neural network architectures which compose the building blocks for the majority of recent neural network architectures. The choice of dictionary learning is based solely on its properties. The resulting dictionary is a redundant, overcomplete basis, and it provides a more efficient representation than a normal basis. It is robust against noise, it has more flexibility for matching patterns in the data, and it allows a more compact representation. Future directions include the extension and comparison of the current work with deep denoising models such as gated Markov random fields and deep Boltzmann machines on the ImageNet dataset.