Energy-Based Spherical Sparse Coding
In this paper, we explore an efficient variant of convolutional sparse coding with unit norm code vectors where reconstruction quality is evaluated using an inner product (cosine distance). To use these codes for discriminative classification, we describe a model we term Energy-Based Spherical Sparse Coding (EB-SSC) in which the hypothesized class label introduces a learned linear bias into the coding step. We evaluate and visualize performance of stacking this encoder to make a deep layered model for image classification.
This work was supported by NSF grants DBI-1262574, IIS-1253538, and a hardware donation from NVIDIA.
Bailey Kong and Charless C. Fowlkes
Department of Computer Science
University of California, Irvine
Irvine, CA 92697 USA
1 Introduction
Sparse coding has been widely studied as a representation for images, audio and other vectorial data. This highly successful method has found its way into many applications, from signal compression and denoising (Donoho, 2006; Elad & Aharon, 2006) to image classification (Wright et al., 2009), to modeling neuronal receptive fields in visual cortex (Olshausen & Field, 1997). Since its introduction, subsequent works have brought sparse coding into the supervised learning setting by adding classification loss terms to the original formulation to encourage features that are not only able to reconstruct the original signal but are also discriminative (Jiang et al., 2011; Yang et al., 2010; Zeiler et al., 2010; Ji et al., 2011; Zhou et al., 2012; Zhang et al., 2013).
While supervised sparse coding methods have been shown to find more discriminative features leading to improved classification performance over their unsupervised counterparts, they have received much less attention in recent years and have been eclipsed by simpler feed-forward architectures.
This is in part because sparse coding is computationally expensive. Convex formulations of sparse coding typically consist of a minimization problem over an objective that includes a least-squares (LSQ) reconstruction error term plus a sparsity inducing regularizer.
Because there is no closed-form solution to this formulation, various iterative optimization techniques are generally used to find a solution (Zeiler et al., 2010; Bristow et al., 2013; Yang et al., 2013; Heide et al., 2015). In applications where an approximate solution suffices, there is work that learns non-linear predictors to estimate sparse codes rather than solve the objective more directly (Gregor & LeCun, 2010). The computational overhead for iterative schemes becomes quite significant when training discriminative models due to the demand of processing many training examples necessary for good performance, and so sparse coding has fallen out of favor by not being able to keep up with simpler non-iterative coding methods.
In this paper we introduce an alternate formulation of sparse coding using unit length codes and a reconstruction loss based on the cosine similarity. Optimal sparse codes in this model can be computed in a non-iterative fashion and the coding objective lends itself naturally to embedding in a discriminative, energy-based classifier which we term energy-based spherical sparse coding (EB-SSC). This bi-directional coding method incorporates both top-down and bottom-up information, where the feature representation depends on both a hypothesized class label and the input signal. Like Cao et al. (2015), our motivation for bi-directional coding comes from the “Biased Competition Theory”, which suggests that visual processing can be biased by other mental processes (e.g., top-down influence) to prioritize the features that are most relevant to the current task. Fig. 1 illustrates the flow of computation used by our SSC and EB-SSC building blocks compared to a standard feed-forward layer.
Our energy based approach for combining top-down and bottom-up information is closely tied to the ideas of Larochelle & Bengio (2008); Ji et al. (2011); Zhang et al. (2013); Li & Guo (2014)—although the model details are substantially different (e.g., Ji et al. (2011) and Zhang et al. (2013) use sigmoid non-linearities while Li & Guo (2014) use separate representations for top-down and bottom-up information). The energy function of Larochelle & Bengio (2008) is also similar but includes an extra classification term and is trained as a restricted Boltzmann machine.
Matrices are denoted as uppercase bold (e.g., $\mathbf{A}$), vectors are lowercase bold (e.g., $\mathbf{a}$), and scalars are lowercase (e.g., $a$). We denote the transpose operator with $^\top$, the element-wise multiplication operator with $\odot$, the convolution operator with $*$, and the cross-correlation operator with $\star$. For vectors where we dropped the subscript $k$ (e.g., $\mathbf{d}$ and $\mathbf{z}$), we refer to a super vector with the $K$ components stacked together (e.g., $\mathbf{z} = [\mathbf{z}_1^\top, \ldots, \mathbf{z}_K^\top]^\top$).
2 Energy-Based Spherical Sparse Coding
Energy-based models capture dependencies between variables using an energy function that measures the compatibility of a configuration of variables (LeCun et al., 2006). To measure the compatibility between the top-down and bottom-up information, we define the energy function of EB-SSC to be the sum of a bottom-up coding term and a top-down classification term:
$$E(\mathbf{x}, y, \mathbf{z}) = E_{\text{code}}(\mathbf{x}, \mathbf{z}) + E_{\text{class}}(y, \mathbf{z}). \tag{1}$$
The bottom-up information (input signal $\mathbf{x}$) and the top-down information (class label $y$) are tied together by a latent feature map $\mathbf{z}$.
2.1 Bottom-Up Reconstruction
To measure the compatibility between the input signal $\mathbf{x}$ and the latent feature maps $\mathbf{z}$, we introduce a novel variant of sparse coding that is amenable to efficient feed-forward optimization. While the idea behind this variant can be applied to either patch-based or convolutional sparse coding, we specifically use the convolutional variant that shares the burden of coding an image among nearby overlapping dictionary elements. Using such a shift-invariant approach avoids the need to learn dictionary elements which are simply translated copies of each other, freeing up resources to discover more diverse and specific filters (see Kavukcuoglu et al. (2010)).
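Convolutional sparse coding (CSC) attempts to find a set of dictionary elements $\{\mathbf{d}_1, \ldots, \mathbf{d}_K\}$ and corresponding sparse codes $\{\mathbf{z}_1, \ldots, \mathbf{z}_K\}$ so that the resulting reconstruction, $\mathbf{r} = \sum_{k=1}^{K} \mathbf{d}_k * \mathbf{z}_k$, accurately represents the input signal $\mathbf{x}$. This is traditionally framed as a least-squares minimization with a sparsity inducing prior on $\mathbf{z}$:
$$\arg\min_{\mathbf{z}} \; \Big\| \mathbf{x} - \sum_{k=1}^{K} \mathbf{d}_k * \mathbf{z}_k \Big\|_2^2 + \beta \|\mathbf{z}\|_1. \tag{2}$$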
Unlike standard feed-forward CNN models that convolve the input signal with the filters, this energy function corresponds to a generative model where the latent feature maps are convolved with the filters and compared to the input signal (Bristow et al., 2013; Heide et al., 2015; Zeiler et al., 2010).
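The two directions of computation can be made concrete in one dimension. The sketch below is only an illustration in plain Python: the helper names `convolve_full` and `correlate_valid` are hypothetical, and it assumes a code of length n − m + 1 paired with a filter of length m so that the reconstruction has the same length n as the signal. It builds the generative reconstruction (code convolved with the filter) and checks that its inner product with the signal equals the inner product of the code with the feed-forward filter response (signal cross-correlated with the filter).

```python
def convolve_full(d, z):
    """Full 1-D convolution: out[i] = sum_j d[j] * z[i - j]."""
    out = [0.0] * (len(d) + len(z) - 1)
    for j, dj in enumerate(d):
        for t, zt in enumerate(z):
            out[j + t] += dj * zt
    return out


def correlate_valid(d, x):
    """Valid 1-D cross-correlation: out[t] = sum_j d[j] * x[t + j]."""
    m = len(d)
    return [sum(d[j] * x[t + j] for j in range(m))
            for t in range(len(x) - m + 1)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x = [1.0, -2.0, 0.5, 3.0, -1.0]   # input signal (illustrative values)
d = [0.5, -1.0, 2.0]              # one dictionary filter
z = [0.0, 1.5, -0.5]              # one latent feature map

r = convolve_full(d, z)                     # generative direction: code -> signal space
top_down = dot(x, r)                        # x^T (d * z)
bottom_up = dot(correlate_valid(d, x), z)   # (d cross-correlated with x)^T z
```

Because the two inner products agree, an objective written in terms of the reconstruction can be evaluated from ordinary feed-forward filter responses.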
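To motivate our novel variant of CSC, consider expanding the squared reconstruction error $\|\mathbf{x} - \mathbf{r}\|_2^2 = \|\mathbf{x}\|_2^2 - 2\mathbf{x}^\top\mathbf{r} + \|\mathbf{r}\|_2^2$. If we constrain the reconstruction $\mathbf{r}$ to have unit norm, the reconstruction error depends entirely on the inner product between $\mathbf{x}$ and $\mathbf{r}$ and is equivalent to the cosine similarity (up to additive and multiplicative constants). This suggests the closely related unit-length reconstruction problem:
$$\arg\max_{\mathbf{z}} \; \mathbf{x}^\top \Big(\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big) - \beta\|\mathbf{z}\|_1 \quad \text{s.t.} \quad \Big\|\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big\|_2 \le 1.$$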
In Appendix A we show that, given an optimal unit length reconstruction $\mathbf{r}^*$ with corresponding codes $\mathbf{z}^*$, the solution to the least squares reconstruction problem (Eq. 2) can be computed by a simple scaling of $\mathbf{r}^*$.
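The unit-length reconstruction problem is no easier than the original least-squares optimization due to the constraint on the reconstruction which couples the codes for different filters. Instead consider a simplified constraint on $\mathbf{z}$ which we refer to as spherical sparse coding (SSC):
$$\arg\max_{\mathbf{z}} \; \mathbf{x}^\top \Big(\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big) - \beta\|\mathbf{z}\|_1 \quad \text{s.t.} \quad \|\mathbf{z}\|_2 \le 1.$$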
In Sec. 2.3 below, we show that the solution to this problem can be found very efficiently without requiring iterative optimization.
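As a small numerical illustration of this claim, the sketch below works on already-computed filter responses. It assumes the closed-form solution is "soft-threshold the responses, then rescale to unit length", and checks that no randomly drawn unit-norm code scores higher on the objective (responses times code, minus the sparsity penalty). The function names and the numbers are hypothetical and chosen only for illustration.

```python
import math
import random


def soft_threshold(v, beta):
    """Standard symmetric shrinkage of a scalar."""
    if v > beta:
        return v - beta
    if v < -beta:
        return v + beta
    return 0.0


def ssc_code(v, beta):
    """Shrink the responses, then project onto the unit sphere."""
    shrunk = [soft_threshold(vi, beta) for vi in v]
    norm = math.sqrt(sum(s * s for s in shrunk))
    if norm == 0.0:
        return shrunk            # all responses below threshold: zero code
    return [s / norm for s in shrunk]


def objective(v, z, beta):
    """v^T z - beta * ||z||_1 for a candidate code z."""
    return sum(vi * zi for vi, zi in zip(v, z)) - beta * sum(abs(zi) for zi in z)


def random_unit_vector(n):
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]


random.seed(0)
v = [0.9, -0.2, 0.4, -1.3]       # stacked filter responses (illustrative)
beta = 0.3

z_star = ssc_code(v, beta)
best = objective(v, z_star, beta)
shrunk_norm = math.sqrt(sum(soft_threshold(vi, beta) ** 2 for vi in v))
trials = [objective(v, random_unit_vector(len(v)), beta) for _ in range(2000)]
```

In this setting the optimal objective value equals the Euclidean norm of the shrunk responses, which is what the comparison against `shrunk_norm` checks.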
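This problem is a relaxation of convolutional sparse coding since it ignores non-orthogonal interactions between the dictionary elements. (We note that our formulation is also closely related to the dynamical model suggested by Rozell et al. (2008), but without the dictionary-dependent lateral inhibition between feature maps. Lateral inhibition can solve the unit-length reconstruction formulation of standard sparse coding but requires iterative optimization.) Alternately, assuming unit norm dictionary elements, the code norm constraint can be used to upper-bound the reconstruction length. We have by the triangle and Young’s inequality that:
$$\Big\|\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big\|_2 \le \sum_{k} \|\mathbf{d}_k * \mathbf{z}_k\|_2 \le \sum_{k} \|\mathbf{d}_k\|_2 \, \|\mathbf{z}_k\|_1 = \|\mathbf{z}\|_1 \le \sqrt{D}\, \|\mathbf{z}\|_2,$$
where $D$ is the dimension of $\mathbf{z}$ and the factor $\sqrt{D}$ arises from switching from the $\ell_1$-norm to the $\ell_2$-norm. Since $\sqrt{D}\,\|\mathbf{z}\|_2 \le 1$ is a tighter constraint we have
$$\max_{\|\sum_k \mathbf{d}_k * \mathbf{z}_k\|_2 \le 1} \; \mathbf{x}^\top \Big(\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big) - \beta\|\mathbf{z}\|_1 \;\ge\; \max_{\|\mathbf{z}\|_2 \le 1/\sqrt{D}} \; \mathbf{x}^\top \Big(\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big) - \beta\|\mathbf{z}\|_1.$$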
However, this relaxation is very loose, primarily due to the triangle inequality. Except in special cases (e.g., if the dictionary elements have disjoint spectra) the SSC codes will be quite different from the standard least-squares reconstruction.
2.2 Top-Down Classification
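To measure the compatibility between the class label $y$ and the latent feature maps $\mathbf{z}$, we use a set of one-vs-all linear classifiers. To provide more flexibility, we generalize this by splitting the code vector into positive and negative components:
$$\mathbf{z}_k = \mathbf{z}_k^+ + \mathbf{z}_k^-, \qquad \mathbf{z}_k^+ = \max(\mathbf{0}, \mathbf{z}_k), \qquad \mathbf{z}_k^- = \min(\mathbf{0}, \mathbf{z}_k),$$
and allow the linear classifier to operate on each component separately. We express the classifier score for a hypothesized class label $y$ by:
$$E_{\text{class}}(y, \mathbf{z}) = \sum_{k=1}^{K} \mathbf{w}_{yk}^{+\top} \mathbf{z}_k^+ + \mathbf{w}_{yk}^{-\top} \mathbf{z}_k^-.$$
The classifier is thus parameterized by a pair of weight vectors ($\mathbf{w}_{yk}^+$ and $\mathbf{w}_{yk}^-$) for each class label $y$ and $k$-th channel of the latent feature map.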
This splitting, sometimes referred to as full-wave rectification, is useful since a dictionary element and its negative do not necessarily have opposite visual semantics. This splitting also allows the classifier the flexibility to assign distinct meanings or alternately be completely invariant to contrast reversal depending on the problem domain. For example, Shang et al. (2016) found CNN models with ReLU non-linearities which discard the negative activations tend to learn pairs of filters which are related by negation. Keeping both positive and negative responses allowed them to halve the number of dictionary elements.
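The flexibility gained from the split can be seen in a small sketch. It assumes the positive part is the element-wise maximum with zero, the negative part is the element-wise minimum with zero, and the score is the sum of two inner products; `split` and `class_score` are hypothetical names and the weights are arbitrary. Tying the negative weights to the negated positive weights gives a score that ignores contrast reversal, while setting both weight vectors equal recovers an ordinary linear classifier.

```python
def split(z):
    """Split a code into its positive and negative parts (z = z_pos + z_neg)."""
    z_pos = [max(0.0, zi) for zi in z]
    z_neg = [min(0.0, zi) for zi in z]
    return z_pos, z_neg


def class_score(z, w_pos, w_neg):
    """Linear score applied separately to the positive and negative parts."""
    z_pos, z_neg = split(z)
    return (sum(w * zp for w, zp in zip(w_pos, z_pos))
            + sum(w * zn for w, zn in zip(w_neg, z_neg)))


z = [0.5, -0.25, 0.0, 1.0]
flipped = [-zi for zi in z]          # contrast-reversed code
w = [1.0, 2.0, 3.0, 4.0]
neg_w = [-wi for wi in w]

# w_neg = -w_pos: the score depends only on |z|, so it is contrast invariant.
invariant = class_score(z, w, neg_w)
invariant_flipped = class_score(flipped, w, neg_w)

# w_neg = w_pos: an ordinary linear classifier, whose score flips sign.
linear = class_score(z, w, w)
linear_flipped = class_score(flipped, w, w)
```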
We note that it is also straightforward to introduce spatial average pooling prior to classification by introducing a fixed linear operator $\mathbf{P}$ used to pool the codes (e.g., $\mathbf{w}_{yk}^{+\top} \mathbf{P} \mathbf{z}_k^+$). This is motivated by a variety of hand-engineered feature extractors and sparse coding models, such as Ren & Ramanan (2013), which use spatially pooled histograms of sparse codes for classification. This fixed pooling can be viewed as a form of regularization on the linear classifier which enforces shared weights over spatial blocks of the latent feature map. Splitting is also quite important to prevent information loss when performing additive pooling since positive and negative components of $\mathbf{z}$ can cancel each other out.
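The cancellation effect is easy to reproduce. The sketch below uses a hypothetical non-overlapping `average_pool` helper on a 1-D code with illustrative values: pooling the raw code maps the first block to zero even though it contains strong responses, whereas pooling the positive and negative parts separately keeps that information.

```python
def average_pool(values, width):
    """Non-overlapping average pooling over blocks of the given width."""
    return [sum(values[i:i + width]) / width
            for i in range(0, len(values), width)]


z = [0.5, -0.5, 0.25, -0.25, 1.0, 0.0, 0.0, 0.5]

pooled_raw = average_pool(z, 4)          # responses of opposite sign cancel

z_pos = [max(0.0, zi) for zi in z]
z_neg = [min(0.0, zi) for zi in z]
pooled_pos = average_pool(z_pos, 4)      # evidence from positive responses
pooled_neg = average_pool(z_neg, 4)      # evidence from negative responses
```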
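2.3 Coding
Bottom-up reconstruction and top-down classification each provide half of the story, coupled by the latent feature maps. For a given input $\mathbf{x}$ and hypothesized class $y$, we would like to find the optimal activations $\mathbf{z}$ that maximize the joint energy function $E(\mathbf{x}, y, \mathbf{z})$. This requires solving the following optimization:
$$\arg\max_{\|\mathbf{z}\|_2 \le 1} \; \mathbf{x}^\top \Big(\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big) - \beta\|\mathbf{z}\|_1 + \sum_{k} \Big( \mathbf{w}_{yk}^{+\top} \mathbf{z}_k^+ + \mathbf{w}_{yk}^{-\top} \mathbf{z}_k^- \Big),$$
where $\mathbf{x}$ is an image and $y$ is a class hypothesis. $\mathbf{z}_k$ is the $k$-th component latent variable being inferred; $\mathbf{z}_k^+$ and $\mathbf{z}_k^-$ are the positive and negative coefficients of $\mathbf{z}_k$, such that $\mathbf{z}_k = \mathbf{z}_k^+ + \mathbf{z}_k^-$. The parameters $\mathbf{d}_k$, $\mathbf{w}_{yk}^+$, and $\mathbf{w}_{yk}^-$ are the dictionary filter, positive coefficient classifier, and negative coefficient classifier for the $k$-th component respectively. A key aspect of our formulation is that the optimal codes can be found very efficiently in closed form, in a feed-forward manner (see Appendix B for a detailed argument).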
2.3.1 Asymmetric Shrinkage
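To describe the coding process, let us first define a generalized version of the shrinkage function commonly used in sparse coding. Our asymmetric shrinkage is parameterized by upper and lower thresholds $-\beta^- \le \beta^+$:
$$\operatorname{shrink}_{(\beta^+, \beta^-)}(v) = \begin{cases} v - \beta^+ & \text{if } v > \beta^+, \\ 0 & \text{if } -\beta^- \le v \le \beta^+, \\ v + \beta^- & \text{if } v < -\beta^-. \end{cases} \tag{9}$$
Fig. 2 shows a visualization of this function, which generalizes the standard shrinkage proximal operator by allowing separate positive and negative thresholds. In particular, it corresponds to the proximal operator for a version of the $\ell_1$-norm that penalizes the positive and negative components with different weights, $\beta^+ \|\mathbf{v}^+\|_1 + \beta^- \|\mathbf{v}^-\|_1$. The standard shrink operator corresponds to $\beta^+ = \beta^- = \beta$, while the rectified linear unit common in CNNs is given by the limiting case $\beta^+ = 0$, $\beta^- \to \infty$. We note that $-\beta^- \le \beta^+$ is required for the shrinkage to be a proper function (see Fig. 2).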
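A scalar version of the asymmetric shrinkage can be sketched as follows. The sketch assumes the convention that values above the upper threshold are reduced by it, values below the negated lower threshold are increased by it, and everything in between maps to zero; the name `asym_shrink` is hypothetical. The example calls cover the standard symmetric case, an asymmetric case, and the ReLU-like limit obtained with an infinite lower threshold.

```python
import math


def asym_shrink(v, beta_pos, beta_neg):
    """Asymmetric shrinkage with upper threshold beta_pos and lower threshold -beta_neg."""
    if -beta_neg > beta_pos:
        # The dead zone [-beta_neg, beta_pos] would be empty: not a proper function.
        raise ValueError("thresholds must satisfy -beta_neg <= beta_pos")
    if v > beta_pos:
        return v - beta_pos
    if v < -beta_neg:
        return v + beta_neg
    return 0.0


symmetric = asym_shrink(2.0, 0.5, 0.5)            # standard shrinkage
asymmetric = asym_shrink(-2.0, 0.5, 1.0)          # stronger penalty on negatives
relu_neg = asym_shrink(-3.0, 0.0, math.inf)       # ReLU limit: negatives removed
relu_pos = asym_shrink(3.0, 0.0, math.inf)        # ReLU limit: positives kept
```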
2.3.2 Feed-Forward Coding
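We now describe how codes can be computed in a simple feed-forward pass. Let
$$\boldsymbol{\beta}_y^+ = \big[\beta\mathbf{1} - \mathbf{w}_{y1}^+, \ldots, \beta\mathbf{1} - \mathbf{w}_{yK}^+\big], \qquad \boldsymbol{\beta}_y^- = \big[\beta\mathbf{1} + \mathbf{w}_{y1}^-, \ldots, \beta\mathbf{1} + \mathbf{w}_{yK}^-\big] \tag{10}$$
be vectors of positive and negative biases whose entries are associated with a spatial location in the feature map for class $y$. The optimal code $\mathbf{z}^*$ can be computed in three sequential steps:
1. Cross-correlate the data with the filterbank:
$$\mathbf{v}_k = \mathbf{d}_k \star \mathbf{x}, \qquad k = 1, \ldots, K.$$
2. Apply an asymmetric version of the standard shrinkage operator:
$$\tilde{\mathbf{z}} = \operatorname{shrink}_{(\boldsymbol{\beta}_y^+,\, \boldsymbol{\beta}_y^-)}(\mathbf{v}),$$
where, with abuse of notation, we allow the shrinkage function (Eq. 9) to apply entries in the vectors of threshold parameter pairs to the corresponding elements of the argument.
3. Project onto the feasible set of unit length codes:
$$\mathbf{z}^* = \frac{\tilde{\mathbf{z}}}{\|\tilde{\mathbf{z}}\|_2}.$$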
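The three steps above can be sketched for a 1-D signal in plain Python. This is only an illustration under stated assumptions: `encode`, `correlate_valid` and `asym_shrink` are hypothetical names, the cross-correlation uses "valid" boundaries, and the per-location thresholds are passed in directly rather than derived from learned classifier weights.

```python
import math


def correlate_valid(d, x):
    """Valid 1-D cross-correlation: out[t] = sum_j d[j] * x[t + j]."""
    m = len(d)
    return [sum(d[j] * x[t + j] for j in range(m))
            for t in range(len(x) - m + 1)]


def asym_shrink(v, beta_pos, beta_neg):
    """Asymmetric shrinkage with upper threshold beta_pos and lower threshold -beta_neg."""
    if v > beta_pos:
        return v - beta_pos
    if v < -beta_neg:
        return v + beta_neg
    return 0.0


def encode(x, filters, beta_pos, beta_neg):
    """Feed-forward coding: correlate, shrink, then normalize to unit length."""
    # Step 1: one response map per filter.
    v = [correlate_valid(d, x) for d in filters]
    # Step 2: element-wise shrinkage with per-channel, per-location thresholds.
    shrunk = [[asym_shrink(vi, bp, bn) for vi, bp, bn in zip(vk, bpk, bnk)]
              for vk, bpk, bnk in zip(v, beta_pos, beta_neg)]
    # Step 3: project the stacked code onto the unit sphere.
    norm = math.sqrt(sum(s * s for channel in shrunk for s in channel))
    if norm == 0.0:
        return shrunk
    return [[s / norm for s in channel] for channel in shrunk]


x = [0.0, 1.0, 0.0, 0.0]
filters = [[1.0, 0.0], [0.0, 1.0]]
thresholds = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]   # same shape as the response maps

code = encode(x, filters, thresholds, thresholds)
code_norm = math.sqrt(sum(s * s for channel in code for s in channel))
```

In the full model the thresholds would differ per class hypothesis, so the same input generally yields a different code for each hypothesized label.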
2.3.3 Relationship to CNNs
We note that this formulation of coding has a close connection to a single-layer convolutional neural network (CNN). A typical CNN layer consists of convolution with a filterbank followed by a non-linear activation such as a rectified linear unit (ReLU). ReLUs can be viewed as another way of inducing sparsity, but rather than coring the values around zero like the shrink function, ReLU truncates negative values. On the other hand, the asymmetric shrink function can be viewed as the sum of two ReLUs applied to appropriately biased inputs:
$$\operatorname{shrink}_{(\beta^+, \beta^-)}(v) = \operatorname{ReLU}(v - \beta^+) - \operatorname{ReLU}(-v - \beta^-).$$
SSC coding can thus be seen as a CNN in which the ReLU activation has been replaced with shrinkage followed by a global normalization.
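The two-ReLU view can be checked numerically. The sketch below assumes the shrinkage convention with an upper threshold and a negated lower threshold, and compares a direct piecewise implementation against the difference of two biased ReLUs on a grid of inputs; all names and threshold values are illustrative.

```python
def relu(v):
    return max(0.0, v)


def asym_shrink(v, beta_pos, beta_neg):
    """Piecewise definition: dead zone on [-beta_neg, beta_pos]."""
    if v > beta_pos:
        return v - beta_pos
    if v < -beta_neg:
        return v + beta_neg
    return 0.0


def shrink_via_relu(v, beta_pos, beta_neg):
    """Same function written as a difference of two biased ReLUs."""
    return relu(v - beta_pos) - relu(-v - beta_neg)


grid = [i / 10.0 for i in range(-30, 31)]
beta_pos, beta_neg = 0.7, 0.2

max_gap = max(abs(asym_shrink(v, beta_pos, beta_neg)
                  - shrink_via_relu(v, beta_pos, beta_neg))
              for v in grid)
```

The identity relies on the thresholds being properly ordered; if the dead zone were empty, both ReLUs could be active at once and the two expressions would no longer agree.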
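We formulate supervised learning using the softmax log-loss that maximizes the energy for the true class label $y_i$ while minimizing the energy of incorrect labels $\bar{y}$:
$$\arg\min_{\mathbf{d},\, \mathbf{w}^+,\, \mathbf{w}^-,\, \beta \ge 0} \; \frac{\alpha}{2}\Big(\|\mathbf{w}^+\|_2^2 + \|\mathbf{w}^-\|_2^2 + \|\mathbf{d}\|_2^2\Big) + \frac{1}{N}\sum_{i=1}^{N} \Big[ -\max_{\|\mathbf{z}\|_2 \le 1} E(\mathbf{x}_i, y_i, \mathbf{z}) + \log \sum_{\bar{y}} \exp\Big( \max_{\|\hat{\mathbf{z}}\|_2 \le 1} E(\mathbf{x}_i, \bar{y}, \hat{\mathbf{z}}) \Big) \Big] \quad \text{s.t.} \quad -(\beta\mathbf{1} + \mathbf{w}_{yk}^-) \preceq \beta\mathbf{1} - \mathbf{w}_{yk}^+ \;\; \forall y, k, \tag{13}$$
where $\alpha$ is the hyperparameter regularizing $\mathbf{w}^+$, $\mathbf{w}^-$, and $\mathbf{d}$. We constrain the relationship between $\beta$ and the entries of $\mathbf{w}^+$ and $\mathbf{w}^-$ in order for the asymmetric shrinkage to be a proper function (see Sec. 2.3.1 and Appendix B for details).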
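For a single training example, the data term of this loss can be sketched as below. The sketch assumes the per-class energies (each already maximized over the code) are given as a plain list, that higher energy means better compatibility, and that the loss is the negative true-class energy plus the log-sum-exp over all classes; `energy_log_loss` is a hypothetical name and the regularizer is omitted.

```python
import math


def energy_log_loss(energies, true_label):
    """Softmax log-loss on per-class energies (higher energy = more compatible)."""
    m = max(energies)                      # subtract the max for numerical stability
    log_partition = m + math.log(sum(math.exp(e - m) for e in energies))
    return -energies[true_label] + log_partition


energies = [2.0, 1.0, 0.0]                 # illustrative energies for three class hypotheses
loss_correct = energy_log_loss(energies, 0)
loss_wrong = energy_log_loss(energies, 2)

# Raising the energy of the true class lowers the loss.
loss_after = energy_log_loss([3.0, 1.0, 0.0], 0)
```

Note that every class hypothesis requires its own coding pass, since the optimal code depends on the hypothesized label.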
In classical sparse coding, it is typical to constrain the $\ell_2$-norm of each dictionary filter to unit length. Our spherical coding objective behaves similarly. For any optimal code $\mathbf{z}^*$, there is a 1-dimensional subspace of parameters for which $\mathbf{z}^*$ is optimal, obtained by jointly rescaling $\mathbf{d}$, $\mathbf{w}$, and $\beta$. For simplicity of the implementation, we opt to regularize $\mathbf{d}$ to assure a unique solution. However, as Tygert et al. (2015) point out, it may be advantageous from the perspective of optimization to explicitly constrain the norm of the filterbank.
Note that unlike classical sparse coding, where $\beta$ is a hyperparameter that is usually set using cross-validation, we treat it as a parameter of the model that is learned to maximize performance.
In order to solve Eq. 13, we explicitly formulate our model as a directed-acyclic-graph (DAG) neural network with shared weights, where the forward-pass computes the sparse code vectors and the backward-pass updates the parameter weights. We optimize the objective using stochastic gradient descent (SGD).
As mentioned in Sec. 2.3, the shrinkage function is asymmetric with thresholds $\boldsymbol{\beta}_y^+$ and $\boldsymbol{\beta}_y^-$ as defined in Eq. 10. However, the inequality constraint on their relationship that keeps the shrinkage function a proper function is difficult to enforce when optimizing with SGD. Instead, we introduce a central offset parameter $b_k$ and reduce the ordering constraint to a pair of positivity constraints. Let
$$\tilde{\mathbf{w}}_{yk}^+ = \beta\mathbf{1} - \mathbf{w}_{yk}^+ - b_k\mathbf{1}, \qquad \tilde{\mathbf{w}}_{yk}^- = \beta\mathbf{1} + \mathbf{w}_{yk}^- + b_k\mathbf{1}$$
be the modified linear “classifiers” relative to the central offset $b_k$. It is straightforward to see that if $\mathbf{w}_{yk}^+$ and $\mathbf{w}_{yk}^-$ satisfy the constraint in Eq. 13, then adding the same value to both sides of the inequality will not change that. Moreover, taking $b_k$ to be a midpoint between the two thresholds, both $\tilde{\mathbf{w}}_{yk}^+$ and $\tilde{\mathbf{w}}_{yk}^-$ will be non-negative.
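Using this variable substitution, we rewrite the energy function (Eq. 1) as
$$E'(\mathbf{x}, y, \mathbf{z}) = \mathbf{x}^\top \Big(\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big) - \sum_{k} \Big( \tilde{\mathbf{w}}_{yk}^{+\top} \mathbf{z}_k^+ - \tilde{\mathbf{w}}_{yk}^{-\top} \mathbf{z}_k^- + b_k \mathbf{1}^\top \mathbf{z}_k \Big),$$
where $\tilde{\mathbf{w}}_{yk}^+$ and $\tilde{\mathbf{w}}_{yk}^-$ are the modified classifiers and $b_k$ is a constant offset for each code channel. The modified linear “classification” terms now take on a dual role of inducing sparsity and measuring the compatibility between $\mathbf{z}$ and $y$.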
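The effect of the central offset can be illustrated on the scalar shrinkage function. Under the assumed convention (upper threshold, negated lower threshold), shifting the input by an offset while moving both thresholds by the same amount leaves the output unchanged, and choosing the offset as the midpoint of the dead zone makes both shifted thresholds non-negative. The names and numbers below are hypothetical.

```python
def asym_shrink(v, beta_pos, beta_neg):
    """Asymmetric shrinkage with upper threshold beta_pos and lower threshold -beta_neg."""
    if v > beta_pos:
        return v - beta_pos
    if v < -beta_neg:
        return v + beta_neg
    return 0.0


def recentered_shrink(v, w_pos, w_neg, b):
    """Shrinkage expressed relative to an offset b (acting like a bias on the input)."""
    return asym_shrink(v - b, w_pos, w_neg)


# A dead zone [0.5, 1.0]: properly ordered, but the lower threshold beta_neg is negative.
beta_pos, beta_neg = 1.0, -0.5

b = (beta_pos + (-beta_neg)) / 2.0     # midpoint of the dead zone
w_pos = beta_pos - b                   # shifted upper threshold
w_neg = beta_neg + b                   # shifted lower threshold

grid = [i / 8.0 for i in range(-24, 25)]
max_gap = max(abs(asym_shrink(v, beta_pos, beta_neg)
                  - recentered_shrink(v, w_pos, w_neg, b))
              for v in grid)
```

With this parameterization the ordering requirement becomes two independent sign constraints, which are much simpler to maintain during gradient-based training.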
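This yields a modified learning objective that can easily be solved with existing implementations for learning convolutional neural networks:
$$\arg\min_{\mathbf{d},\, \tilde{\mathbf{w}}^+ \succeq 0,\, \tilde{\mathbf{w}}^- \succeq 0,\, \mathbf{b}} \; \frac{\alpha}{2}\Big(\|\mathbf{d}\|_2^2 + \|\tilde{\mathbf{w}}^+\|_2^2 + \|\tilde{\mathbf{w}}^-\|_2^2\Big) + \frac{1}{N}\sum_{i=1}^{N} \Big[ -\max_{\|\mathbf{z}\|_2 \le 1} E'(\mathbf{x}_i, y_i, \mathbf{z}) + \log \sum_{\bar{y}} \exp\Big( \max_{\|\hat{\mathbf{z}}\|_2 \le 1} E'(\mathbf{x}_i, \bar{y}, \hat{\mathbf{z}}) \Big) \Big],$$
where $\tilde{\mathbf{w}}^+$ and $\tilde{\mathbf{w}}^-$ are the new sparsity inducing classifiers, and $\mathbf{b}$ are the arbitrary origin points. In particular, adding the origin points allows us to enforce the constraint by simply projecting $\tilde{\mathbf{w}}^+$ and $\tilde{\mathbf{w}}^-$ onto the positive orthant during SGD.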
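Projection onto the positive orthant amounts to clipping at zero after each gradient update. The sketch below shows one such projected step for a small weight vector; `projected_sgd_step`, the learning rate and the gradient values are illustrative assumptions, not the training code used in the experiments.

```python
def projected_sgd_step(weights, grads, learning_rate):
    """Plain SGD update followed by projection onto the positive orthant."""
    return [max(0.0, w - learning_rate * g) for w, g in zip(weights, grads)]


w_tilde = [0.5, 0.25, 0.0]      # non-negative sparsity-inducing weights
grads = [1.0, -1.0, 2.0]        # illustrative loss gradients

w_next = projected_sgd_step(w_tilde, grads, learning_rate=0.5)
```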
3.1.1 Stacking Blocks
We also examine stacking multiple blocks of our energy function in order to build a hierarchical representation. As mentioned in Sec. 2.3.2, the optimal codes can be computed in a simple feed-forward pass—this applies to shallow versions of our model. When stacking multiple blocks of our energy-based model, solving for the optimal codes cannot be done in a feed-forward pass since the codes for different blocks are coupled (bilinearly) in the joint objective. Instead, we can proceed in an iterative manner, performing block-coordinate descent by repeatedly passing up and down the hierarchy updating the codes. In this section we investigate the trade-off between the number of passes used to find the optimal codes for the stacked model and classification performance.
For this purpose, we train multiple instances of a 2-block version of our energy-based model that differ in the number of iterations used when solving for the codes. For recurrent networks such as this, inference is commonly implemented by “unrolling” the network, where the parts of the network structure are repeated with parameters shared across these repeated parts to mimic an iterative algorithm that stops at a fixed number of iterations rather than at some convergence criteria.
In Fig. 3, we compare the performance between models that were unrolled zero to four times. We see that there is a difference in performance based on how many sweeps of the variables are made. In terms of the training objective, more unrolling produces models that have lower objective values with convergence after only a few passes. In terms of the testing error rate, however, we see that full code inference is not necessarily better, as unrolling once or twice has lower error rates than unrolling three or four times. The biggest difference was between not unrolling and unrolling once, where both the training objective and testing error rate go down. The testing error rate decreases from 0.0131 to 0.0074. While there is a clear benefit in terms of performance for unrolling at least once, there is also a trade-off between performance and computational resource, especially for deeper models.
We evaluate the benefits of combining top-down and bottom-up information to produce class-specific features on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset using a deep version of our EB-SSC. All experiments were performed using the MatConvNet (Vedaldi & Lenc, 2015) framework with the ADAM optimizer (Kingma & Ba, 2014). The data was preprocessed and augmented following the procedure in Goodfellow et al. (2013). Specifically, the data was made zero mean and whitened, augmented with horizontal flips (with a 0.5 probability) and random cropping. No weight decay was used, but we used a dropout rate of before every convolution layer except for the first. For these experiments we consider a single forward pass (no unrolling).
Table 1: Network architecture of the feature extractors. Columns: block; kernel, stride, padding; activation.
We compare our proposed EB-SSC model to that of Springenberg et al. (2015), which uses rectified linear units (ReLU) as its non-linearity. This model can be viewed as a basic feed-forward version of our proposed model which we take as a baseline. We also consider variants of the baseline model that utilize a subset of architectural features of our proposed model (e.g., concatenated rectified linear units (CReLU) and spherical normalization (SN)) to understand how subtle design changes of the network architecture affect performance.
We describe the model architecture in terms of the feature extractor and classifier. Table 1 shows the overall network architecture of the feature extractors, which consist of seven convolution blocks and two pooling layers. We test two possible classifiers: a simple linear classifier (LC) and our energy-based classifier (EBC), and use softmax-loss for all models. For linear classifiers, a numerical subscript indicates which of the seven conv blocks of the feature extractor is used for classification (e.g., LC$_7$ indicates that the activations out of the last conv block are fed into the linear classifier). For energy-based classifiers, a numerical subscript indicates which conv blocks of the feature extractor are replaced with an energy-based classifier (e.g., EBC$_{6\text{-}7}$ indicates that the activations out of conv5 are fed into the energy-based classifier and that the energy-based classifier has a similar architecture to the conv blocks it replaces). The notation differs because for energy-based classifiers, the optimal activations are a function of the hypothesized class label, whereas for linear classifiers, they are not.
Table 2: Classification results. Columns: Model; Train Err. (%); Test Err. (%); # params.
The results shown in Table 2 compare our proposed model to the baselines ReLU+LC (Springenberg et al., 2015) and CReLU+LC (Shang et al., 2016), and to intermediate variants. The baseline models all perform very similarly with some small reductions in error rates over the baseline CReLU+LC. However, CReLU+LC reduces the error rate over ReLU+LC by more than one percent (from 11.40% to 10.17%), which confirms the claims by Shang et al. (2016) and demonstrates the benefits of splitting positive and negative activations. Likewise, we see further decrease in the error rate (to 9.74%) from using spherical normalization. Though normalizing the activations doesn’t add any capacity to the model, this improved performance is likely because scale-invariant activations makes training easier. On the other hand, further sparsifying the activations yielded no benefit. We tested values and found to perform better. Replacing the linear classifier with our energy-based classifier further decreases the error rate by another half percent (to 9.23%).
4.2 Decoding Class-Specific Codes
A unique aspect of our model is that it is generative in the sense that each layer is explicitly trying to encode the activation pattern in the prior layer. Similar to the work on deconvolutional networks built on least-squares sparse coding (Zeiler et al., 2010), we can synthesize input images from activations in our spherical coding network by performing repeated deconvolutions (transposed convolutions) back through the network. Since our model is energy based, we can further examine how the top-down information of a hypothesized class affects the intermediate activations.
The first column in Fig. 4 visualizes reconstructions of a given input image based on activations from different layers of the model by convolution transpose. In this case we put in zeros for class biases (i.e., no top-down) and are able to recover high fidelity reconstructions of the input. In the remaining columns, we use the same deconvolution pass to construct input space representations of the learned classifier biases. At low levels of the feature hierarchy, these biases are spatially smooth since the receptive fields are small and there is little spatial invariance captured in the activations. At higher levels these class-conditional bias fields become more tightly localized.
Finally, in Fig. 5 we show decodings from the conv2 and conv5 layers of the EB-SSC model for a given input under different class hypotheses. Here we subtract out the contribution of the top-down bias term in order to isolate the effect of the class conditioning on the encoding of input features. As visible in the figure, the modulation of the activations is focused around particular regions of the image, and the differences across class hypotheses become more pronounced at higher layers of the network.
5 Conclusion
We presented an energy-based sparse coding method that efficiently combines cosine similarity, convolutional sparse coding, and linear classification. Our model shows a clear mathematical connection between the activation functions used in CNNs to introduce sparsity and our cosine similarity convolutional sparse coding formulation. Our proposed model outperforms the baseline model and we show which attributes of our model contribute most to the increase in performance. We also demonstrate that our proposed model provides an interesting framework to probe the effects of class-specific coding.
References
- Bristow et al. (2013) Hilton Bristow, Anders Eriksson, and Simon Lucey. Fast convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2013.
- Cao et al. (2015) Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In International Conference on Computer Vision (ICCV), 2015.
- Donoho (2006) David L Donoho. Compressed sensing. IEEE Transactions on information theory, 2006.
- Elad & Aharon (2006) Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing, 2006.
- Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C Courville, and Yoshua Bengio. Maxout networks. In International conference on Machine learning (ICML), 2013.
- Gregor & LeCun (2010) Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In International Conference on Machine Learning (ICML), 2010.
- Heide et al. (2015) Felix Heide, Wolfgang Heidrich, and Gordon Wetzstein. Fast and flexible convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2015.
- Ji et al. (2011) Zhengping Ji, Wentao Huang, G. Kenyon, and L.M.A. Bettencourt. Hierarchical discriminative sparse coding via bidirectional connections. In International Joint Conference on Neural Networks (IJCNN), 2011.
- Jiang et al. (2011) Zhuolin Jiang, Zhe Lin, and Larry S Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In Computer Vision and Pattern Recognition (CVPR), 2011.
- Kavukcuoglu et al. (2010) Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michaël Mathieu, and Yann LeCun. Learning convolutional feature hierarchies for visual recognition. In Advances in neural information processing systems (NIPS), 2010.
- Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- Larochelle & Bengio (2008) Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted boltzmann machines. In International conference on Machine learning (ICML), 2008.
- LeCun et al. (2006) Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 2006.
- Li & Guo (2014) Xin Li and Yuhong Guo. Bi-directional representation learning for multi-label classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML KDD), 2014.
- Olshausen & Field (1997) Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research, 1997.
- Ren & Ramanan (2013) Xiaofeng Ren and Deva Ramanan. Histograms of sparse codes for object detection. In Computer Vision and Pattern Recognition (CVPR), 2013.
- Rozell et al. (2008) Christopher J Rozell, Don H Johnson, Richard G Baraniuk, and Bruno A Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural computation, 2008.
- Shang et al. (2016) Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In International conference on Machine learning (ICML), 2016.
- Springenberg et al. (2015) J Springenberg, Alexey Dosovitskiy, Thomas Brox, and M Riedmiller. Striving for simplicity: The all convolutional net. In International conference on Learning Representations (ICLR) (workshop track), 2015.
- Tygert et al. (2015) Mark Tygert, Arthur Szlam, Soumith Chintala, Marc’Aurelio Ranzato, Yuandong Tian, and Wojciech Zaremba. Convolutional networks and learning invariant to homogeneous multiplicative scalings. arXiv preprint arXiv:1506.08230, 2015.
- Vedaldi & Lenc (2015) A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In ACM International Conference on Multimedia, 2015.
- Wright et al. (2009) John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 2009.
- Yang et al. (2013) Allen Y Yang, Zihan Zhou, Arvind Ganesh Balasubramanian, S Shankar Sastry, and Yi Ma. Fast-minimization algorithms for robust face recognition. IEEE Transactions on Image Processing, 2013.
- Yang et al. (2010) Jianchao Yang, Kai Yu, and Thomas Huang. Supervised translation-invariant sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2010.
- Zeiler et al. (2010) Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Robert Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010.
- Zhang et al. (2013) Yangmuzi Zhang, Zhuolin Jiang, and Larry S Davis. Discriminative tensor sparse coding for image classification. In British Machine Vision Conference (BMVC), 2013.
- Zhou et al. (2012) Ning Zhou, Yi Shen, Jinye Peng, and Jianping Fan. Learning inter-related visual dictionary for object recognition. In Computer Vision and Pattern Recognition (CVPR), 2012.
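Appendix A
Here we show that spherical sparse coding (SSC) with a norm constraint on the reconstruction is equivalent to standard convolutional sparse coding (CSC). Expanding the least squares reconstruction error and dropping the constant term $\|\mathbf{x}\|_2^2$ gives the CSC problem:
$$\max_{\mathbf{z}} \; 2\mathbf{x}^\top \Big(\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big) - \Big\|\sum_{k} \mathbf{d}_k * \mathbf{z}_k\Big\|_2^2 - \beta\|\mathbf{z}\|_1.$$
Let $\gamma = \|\sum_{k} \mathbf{d}_k * \mathbf{z}_k\|_2$ be the norm of the reconstruction for some code $\mathbf{z}$ and let $\mathbf{u}$ be the reconstruction scaled to have unit norm so that:
$$\mathbf{u} = \frac{1}{\gamma}\sum_{k} \mathbf{d}_k * \mathbf{z}_k = \sum_{k} \mathbf{d}_k * \tilde{\mathbf{z}}_k, \qquad \tilde{\mathbf{z}} = \frac{\mathbf{z}}{\gamma}.$$
We rewrite the least-squares objective in terms of these new variables:
$$g(\tilde{\mathbf{z}}, \gamma) = 2\gamma\, \mathbf{x}^\top \mathbf{u} - \gamma^2 - \beta\gamma \|\tilde{\mathbf{z}}\|_1 = 2\gamma \Big( \mathbf{x}^\top \mathbf{u} - \frac{\beta}{2}\|\tilde{\mathbf{z}}\|_1 \Big) - \gamma^2.$$
Taking the derivative of $g$ w.r.t. $\gamma$ yields the optimal scaling as a function of $\tilde{\mathbf{z}}$:
$$\gamma^*(\tilde{\mathbf{z}}) = \mathbf{x}^\top \mathbf{u} - \frac{\beta}{2}\|\tilde{\mathbf{z}}\|_1.$$
Plugging $\gamma^*$ back into $g$ yields:
$$\max_{\tilde{\mathbf{z}}} \; \Big( \mathbf{x}^\top \sum_{k} \mathbf{d}_k * \tilde{\mathbf{z}}_k - \frac{\beta}{2}\|\tilde{\mathbf{z}}\|_1 \Big)^2 \quad \text{s.t.} \quad \Big\|\sum_{k} \mathbf{d}_k * \tilde{\mathbf{z}}_k\Big\|_2 = 1.$$
Discarding solutions with $\gamma^* < 0$ can be achieved by simply dropping the square which results in the final constrained problem:
$$\arg\max_{\tilde{\mathbf{z}}} \; \mathbf{x}^\top \sum_{k} \mathbf{d}_k * \tilde{\mathbf{z}}_k - \frac{\beta}{2}\|\tilde{\mathbf{z}}\|_1 \quad \text{s.t.} \quad \Big\|\sum_{k} \mathbf{d}_k * \tilde{\mathbf{z}}_k\Big\|_2 = 1.$$
This matches the unit-length reconstruction problem of Sec. 2.1 up to a rescaling of $\beta$: the objective is positively homogeneous in $\tilde{\mathbf{z}}$, so whenever its optimum is positive the constraint can equivalently be written as an inequality.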
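The optimal scaling can be sanity-checked numerically for fixed values of the inner product between the signal and the unit-norm reconstruction and of the code's 1-norm. The sketch below assumes the rescaled least-squares objective has the form "twice the scale times the inner product, minus the squared scale, minus the sparsity weight times the scale times the 1-norm", and verifies that the closed-form scale is a maximizer; the function names and numbers are hypothetical.

```python
def scaled_objective(scale, corr, l1_norm, beta):
    """Least-squares objective (constant dropped) for a unit-norm reconstruction times scale."""
    return 2.0 * scale * corr - scale ** 2 - beta * scale * l1_norm


def optimal_scale(corr, l1_norm, beta):
    """Stationary point of the objective with respect to the scale."""
    return corr - 0.5 * beta * l1_norm


corr = 0.8        # inner product of the signal with the unit-norm reconstruction
l1_norm = 1.5     # 1-norm of the rescaled code
beta = 0.2

best_scale = optimal_scale(corr, l1_norm, beta)
best_value = scaled_objective(best_scale, corr, l1_norm, beta)
nearby = [scaled_objective(best_scale + step, corr, l1_norm, beta)
          for step in (-0.1, -0.01, 0.01, 0.1)]
```

At the optimum the objective equals the square of the optimal scale, which is why the remaining problem over the code only involves the bracketed term.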
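Appendix B
We show in this section that coding in the EB-SSC model can be solved efficiently by a combination of convolution, shrinkage and projection, steps which can be implemented with standard libraries on a GPU. For convenience, we first rewrite the objective in terms of cross-correlation rather than convolution (i.e., $\mathbf{x}^\top(\mathbf{d}_k * \mathbf{z}_k) = (\mathbf{d}_k \star \mathbf{x})^\top \mathbf{z}_k$). For ease of understanding, we first consider the coding problem when there is no classification term:
$$\mathbf{z}^* = \arg\max_{\|\mathbf{z}\|_2^2 \le 1} \; \mathbf{v}^\top \mathbf{z} - \beta\|\mathbf{z}\|_1,$$
where $\mathbf{v} = [(\mathbf{d}_1 \star \mathbf{x})^\top, \ldots, (\mathbf{d}_K \star \mathbf{x})^\top]^\top$. Pulling the constraint into the objective, we get its Lagrangian function:
$$\mathcal{L}(\mathbf{z}, \lambda) = \mathbf{v}^\top \mathbf{z} - \beta\|\mathbf{z}\|_1 + \lambda\big(1 - \|\mathbf{z}\|_2^2\big).$$
From the partial subderivative of the Lagrangian w.r.t. $z_i$ we derive the optimal solution as a function of $\lambda$; and from that find the conditions in which the solutions hold, giving us:
$$z_i^*(\lambda) = \frac{1}{2\lambda} \begin{cases} v_i - \beta & \text{if } v_i > \beta, \\ v_i + \beta & \text{if } v_i < -\beta, \\ 0 & \text{otherwise.} \end{cases}$$
This can also be compactly written as:
$$\mathbf{z}^*(\lambda) = \frac{1}{2\lambda}\tilde{\mathbf{z}},$$
where $\tilde{\mathbf{z}} = \mathbf{s} \odot \mathbf{s} \odot \mathbf{v} - \beta\mathbf{s}$ and $\mathbf{s} = \operatorname{sign}(\mathbf{z}^*)$. The sign vector of $\mathbf{z}^*$ can be determined without knowing $\lambda$: as $\lambda$ is a Lagrangian multiplier for an inequality it must be non-negative and therefore does not change the sign of the optimal solution. Lastly, we define the squared $\ell_2$-norm of $\tilde{\mathbf{z}}$, a result that will be used later:
$$\|\tilde{\mathbf{z}}\|_2^2 = (\mathbf{s} \odot \mathbf{s})^\top (\mathbf{v} \odot \mathbf{v}) - 2\beta\, \mathbf{s}^\top \mathbf{v} + \beta^2\, \mathbf{s}^\top \mathbf{s}.$$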