AgileNet: Lightweight Dictionary-based
Few-shot Learning
Abstract
The success of deep learning models is heavily tied to the use of massive amounts of labeled data and excessively long training times. With the emergence of intelligent edge applications that use these models, the critical challenge is to obtain the same inference capability on a resource-constrained device while providing adaptability to cope with dynamic changes in the data. We propose AgileNet, a novel lightweight dictionary-based few-shot learning methodology which provides a reduced-complexity deep neural network for efficient execution at the edge while enabling low-cost updates to capture the dynamics of the new data. Evaluations on state-of-the-art few-shot learning benchmarks demonstrate the superior accuracy of AgileNet compared to prior art. Additionally, AgileNet is the first few-shot learning approach whose model updates do not eliminate the knowledge obtained from the primary training. This property is ensured through the dictionaries learned by our novel end-to-end structured decomposition, which also reduces the memory footprint and computation complexity to match the edge device constraints.
Mohammad Ghasemzadeh
UC San Diego
mghasemzadeh@ucsd.edu
Fang Lin
UC San Diego & SDSU
fanglin@ucsd.edu
Bita Darvish Rouhani
UC San Diego
bita@ucsd.edu
Farinaz Koushanfar
UC San Diego
farinaz@ucsd.edu
Ke Huang
SDSU
khuang@sdsu.edu
1 Introduction
Deep Neural Networks (DNNs) have achieved remarkable success in several critical application domains including computer vision, speech recognition, and natural language processing. The trend of making networks deeper and wider to achieve higher model accuracy counters the goal of providing networks with higher efficiency in terms of model size and the speed of training/inference. Efficiency and compactness are of growing concern since many of the applications relying on deep learning models are ultimately aimed at providing intelligence on resource-constrained devices at the edge. The conventional cloud outsourcing approach fails to address latency, privacy, and availability concerns Howard et al. (2017); Abadi et al. (2016). This has been the catalyst for a large number of works building efficient DNN inference accelerators such as Lane et al. (2015); Sharma et al. (2016). The training phase of DNNs incurs a larger memory footprint and computation complexity compared with inference. Assuming training is a one-time task, after which the model can be deployed on the inference accelerator platform at the edge, the major trend has been to train on the cloud Lane et al. (2016). However, providing adaptability at the edge is necessary to maintain the desired accuracy in dynamic environment settings.
To address the above requirements, two key challenges need to be tackled so that DNN models can effectively fit within edge devices: (i) how to reduce the memory and computation cost of the DNN model on the cloud server without compromising the application performance and accuracy; (ii) how to extend the space of model parameters to learn new tasks on-device without forgetting the knowledge learned originally. Learning new tasks should be performed using few data instances over few iterations to comply with the stringent physical performance requirements at the edge.
Conventional supervised deep learning depends on the availability of a massive amount of labeled data; the trained models generally perform poorly when labeled data is limited. The problem of rapidly learning new tasks with a limited amount of labeled data is referred to as “few-shot learning”, which has received considerable attention from the research community in recent years Li et al. (2006); Lake et al. (2015); Hariharan and Girshick (2017). However, many of the recent approaches solely consider the model’s performance on the new task and thus discard the primary knowledge of the older tasks. This is in contrast with the goal of providing adaptable intelligence at the edge, where adding to the capabilities of the model is desired without forgetting the previous knowledge. Neglecting the physical constraints of the edge device in terms of memory, compute power, and energy consumption is another drawback of many of the state-of-the-art few-shot learning approaches. A practical few-shot learning methodology should extend the capabilities of the model not only using the few available new data instances but also through lightweight updates to the model.
This work proposes AgileNet, the first lightweight few-shot learning scheme that enables efficient and adaptable edge-device realization of DNNs. To enable AgileNet, we create a novel end-to-end structured decomposition methodology for DNNs which allows low-cost model updates to capture the dynamics of the new data. AgileNet not only performs lightweight and effective few-shot learning but also shrinks the storage requirement and computational cost of the model to match the edge device constraints. In summary, the contributions of this work are as follows:

Proposing AgileNet, a novel dictionary-based few-shot learning approach to enable adaptability at the edge while complying with the stringent resource constraints.

Developing a new end-to-end structured decomposition methodology which reduces the memory footprint and computational complexity of the model to match edge constraints.

Innovating a lightweight model updating mechanism that captures the dynamics of the new data with only a few instances, leveraging the properties of the learned dictionaries.

Demonstrating the superior accuracy of AgileNet compared with state-of-the-art approaches on standard few-shot learning benchmarks. AgileNet is shown to preserve the accuracy on old and new classes while reducing the amount of storage and computation.
The rest of the paper is structured as follows. Section 2 provides a review of related literature and discusses drawbacks of the prior art. The global flow of AgileNet is described in Section 3. Section 4 presents the details of the structured decomposition methodology. The few-shot learning technique is explained in Section 5. Section 6 provides the experiment setting and benchmark evaluations, followed by conclusions in Section 7.
2 Related Work
The key challenge of few-shot learning is to use primary knowledge obtained through the original training data to make predictions about unseen classes of data with a limited number of available samples. Following the long history of research on few-shot learning approaches, the first work to leverage modern machine learning for one-shot learning was proposed by Li et al. (2006). In recent years, the works of Lake et al. (2015) and Vinyals et al. (2016) established two standard benchmarks, Omniglot and Mini-ImageNet respectively, to compare few-shot learning approaches in terms of accuracy. Lake et al. (2015) leverages a Bayesian model, while the authors of Koch et al. (2015) utilized a Siamese network which learns pairwise similarity metrics to generalize the predictive power of the model to new classes. These works were followed by other pairwise-similarity-based few-shot learning approaches in Vinyals et al. (2016); Snell et al. (2017); Mehrotra and Dukkipati (2017).
From a different perspective, few-shot learning through combining graph-based analytics with deep learning has been proposed in Garcia and Bruna (2017). In a separate line of work, meta-learners Ravi and Larochelle (2016); Munkhdalai and Yu (2017); Mishra et al. (2017b) are developed to generalize the DNN model to new related tasks. The aforementioned works have incrementally increased the accuracy on few-shot learning benchmarks. However, all these works neglect the model accuracy on old classes. Therefore, their proposals can degrade the predictive power of the model on old data. Additionally, many of the aforementioned approaches incur a high computation cost to adapt the model and thus are not amenable to resource-constrained settings. AgileNet preserves the prior knowledge of the model on old data while outperforming the state-of-the-art approaches in terms of few-shot learning accuracy. Additionally, the lightweight model updates of AgileNet comply with the stringent limitations of edge devices.
3 Global Flow of AgileNet
Figure 1 presents the global flow of AgileNet, which involves three stages: the primary training stage, the dictionary learning stage, and the few-shot learning stage. The first two stages are performed on the cloud, and the last stage is executed on the edge device with limited resources.
Primary Training Stage: At this stage, the original model with a mainstream architecture is trained using conventional training methodologies.
Dictionary Learning Stage: The trained model and the edge constraints in terms of memory and computation resources are taken into account for transforming the model using the end-to-end structured decomposition discussed in Section 4. At this stage, the trade-off between memory/computation cost and the final accuracy of the model is leveraged to match the edge constraints.
Few-shot Learning Stage: Finally, the AgileNet model is deployed on the edge device. Beyond its memory and computational benefits, the structured decomposition enables adaptability in dynamic settings. The AgileNet model provides the expected inference accuracy on the desired task under tight resource constraints. At the same time, when encountering new classes of data, low-cost updates on the edge device are sufficient to learn new capabilities.
4 End-to-End Structured Decomposition
AgileNet performs structured decomposition on all layers using an adaptive subspace projection method built on the foundation of column subset selection proposed in Tropp (2009); Boutsidis et al. (2009). We emphasize that AgileNet is the first to leverage this technique to perform an end-to-end transformation of a DNN model; however, the work in Rouhani et al. (2016) used a similar approach to project the input data of a DNN model into lower dimensions.
4.1 Adaptive Subspace Projection
Assume an arbitrary matrix $A \in \mathbb{R}^{m \times n}$. The goal of the subspace projection technique is to represent $A$ with a coefficient matrix $C \in \mathbb{R}^{l \times n}$ and a basis dictionary matrix $D \in \mathbb{R}^{m \times l}$ such that $\|A - DC\| \leq \epsilon$ and $l \leq n$, where $l$ is the dimensionality of the ambient space after projection and $\epsilon$ is the absolute tolerable error threshold for the projection. This decomposition allows us to represent a matrix $A$ with correlated columns using the coefficients $C$ and the dictionary $D$ with negligible error. To build the coefficient and dictionary matrices, adaptive subspace projection adds, at each iteration, the particular column of $A$ that minimizes the projection error to the dictionary matrix. According to the desired error threshold, this technique creates the dictionary by increasing $l$, the number of columns of the dictionary, until it finds a suitable lower-dimensional subspace for the data projection. This dictionary can be adaptively updated as the dynamics of the original matrix change by appending new columns to it.
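A minimal numpy sketch of this greedy procedure (our own illustration, not the authors' implementation): the dictionary grows one column of $A$ at a time until the reconstruction error falls below $\epsilon$. We use the column with the largest residual norm as a stand-in for the exact error-minimizing selection.

```python
import numpy as np

def adaptive_subspace_projection(A, eps):
    """Greedily select columns of A as a dictionary D until ||A - D C|| <= eps.

    Sketch only: the column with the largest residual norm is chosen at each
    step, a common stand-in for the exact error-minimizing selection.
    """
    m, n = A.shape
    selected = []                      # indices of columns chosen so far
    residual = A.copy()
    while np.linalg.norm(residual) > eps and len(selected) < n:
        norms = np.linalg.norm(residual, axis=0)
        norms[selected] = -1.0         # never re-pick an already chosen column
        selected.append(int(np.argmax(norms)))
        D = A[:, selected]             # dictionary: a column subset of A
        # least-squares coefficients that best reconstruct A from D
        C, *_ = np.linalg.lstsq(D, A, rcond=None)
        residual = A - D @ C
    D = A[:, selected]
    C, *_ = np.linalg.lstsq(D, A, rcond=None)
    return D, C
```

On a matrix with highly correlated columns, the loop terminates with far fewer dictionary columns than the ambient dimension, which is exactly the regime the decomposition exploits for DNN weights.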
4.2 Layer-wise Dictionary Learning
Neural network computations are dominated by matrix multiplications. At the dictionary learning stage, the trained DNN weight matrix of each layer is decomposed into a dictionary matrix and a coefficient matrix according to an error threshold $\epsilon$ which can be adjusted per layer. Next, we explain this structured decomposition for fully-connected (fc) and convolution (conv) layers.
Fully-connected layer: In a conventional fully-connected layer, the following matrix-vector multiplication is performed:

$y = Wx$ (1)

where $x \in \mathbb{R}^{n}$ and $y \in \mathbb{R}^{m}$ are the input and output vectors, respectively, and $W \in \mathbb{R}^{m \times n}$ is the weight matrix. In our scheme, the weight matrix $W$ is transformed into a dictionary matrix $D \in \mathbb{R}^{m \times l}$ and a coefficient matrix $C \in \mathbb{R}^{l \times n}$ such that $W \approx DC$. Substituting this into Equation 1 results in:

$y = D(Cx)$ (2)

In AgileNet, the above equation is performed by two subsequent layers. In particular, a conventional fully-connected layer is replaced by a tiny fully-connected layer (with weight matrix $C$) followed by a transformation layer (with weight matrix $D$) as shown in Figure 2.
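To make the substitution concrete, a small numpy sketch (dimensions are illustrative, not taken from the paper) verifies that the two-layer form reproduces the dense layer while storing far fewer parameters:

```python
import numpy as np

# Replacing y = Wx with y = D(Cx): a tiny layer C (n -> l) followed by a
# transformation layer D (l -> m). All dimensions are illustrative.
m, n, l = 512, 1024, 64
rng = np.random.default_rng(0)
D = rng.standard_normal((m, l))    # dictionary (kept frozen at the edge)
C = rng.standard_normal((l, n))    # coefficients (fine-tuned in few-shot stage)
x = rng.standard_normal(n)

y_two_layer = D @ (C @ x)          # two small matrix-vector products
y_dense = (D @ C) @ x              # equivalent single dense layer W = D C

dense_params = m * n               # parameters of the original layer
decomposed_params = m * l + l * n  # parameters after the decomposition
```

The factorized form stores $ml + ln$ values instead of $mn$; with $l \ll \min(m, n)$ this is the memory saving that the dictionary learning stage trades against accuracy.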
Convolution Layer: For a convolution layer, we first matricize the weight tensor. After the subspace projection, the dictionary $D$ remains intact as a matrix while the coefficient matrix is reshaped into a three-dimensional tensor. The reason for this decision is to comply with a universal format for the dictionaries in all layers. Similar to a fully-connected layer, substituting the weight tensor of a convolution layer with the dictionary matrix $D$ and the coefficients tensor transforms a conventional convolution (with $m$ output channels) into a tiny convolution layer (with $l$ output channels) followed by a transformation layer, as shown in Figure 3. For any row of $D$, each element is multiplied by all elements of the corresponding channel of the tiny conv layer output, and the resulting channels are summed up element-wise to generate one output channel. As such, the transformation layer takes an $l$-channel input and transforms it into an $m$-channel output using a linear combination approach.
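Since each output channel is a weighted sum of the tiny-conv output channels, the transformation layer behaves like a $1 \times 1$ convolution. A hedged numpy sketch (all shapes are our own, purely illustrative):

```python
import numpy as np

# The transformation layer forms each of its m output channels as a weighted
# sum of the l tiny-conv output channels (weights = one row of D), which is
# equivalent to a 1x1 convolution over the channel dimension.
l, m, H, W = 8, 32, 14, 14
rng = np.random.default_rng(1)
tiny_out = rng.standard_normal((l, H, W))  # output of the tiny conv layer
D = rng.standard_normal((m, l))            # dictionary matrix

# out[c, h, w] = sum_k D[c, k] * tiny_out[k, h, w]
out = np.einsum('ck,khw->chw', D, tiny_out)
```

This keeps the heavy spatial filtering in the tiny conv (l filters instead of m), while the channel mixing is a cheap linear combination.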
4.3 End-to-End Dictionary Learning
At the dictionary learning stage, the weights of the trained model are initially decomposed into a dictionary matrix and a coefficient matrix to comply with the edge constraints. The transformed model has the same architecture of layers as the original model, but the fully-connected and convolution layers are replaced by their corresponding transformation layer and tiny layer as discussed above. To compensate for the possible loss of accuracy as the result of the structured decomposition, the transformed model is fine-tuned. Note that a very tight memory/compute budget at the edge might result in a transformed model that is not inherently capable of achieving the desired accuracy. Additionally, we empirically found that the last few fully-connected layers in a DNN architecture contribute more to the final model accuracy and therefore require a smaller decomposition error threshold. These two important observations are explored in Section 6. At the end of this stage, the transformed model of AgileNet can be readily deployed on the edge device.
5 Few-shot Learning on the Edge Device
The stringent memory and energy constraints at the edge are the major challenges to on-device training of neural networks. This limitation is due to the power-hungry computations of the training phase as well as the excessive memory requirement of the large models used in real-world applications. The prohibitive memory cost of the primary training data hinders its storage on the edge device for model updating. The new data is also available in only a few instances. Moreover, adapting the model to new data should not degrade the performance on old classes.
Structured decomposition generates dictionaries that preserve the structure of the weights in each layer and are built such that they capture the space of the weight parameters. We leverage this property for updating the model in few-shot learning scenarios: AgileNet keeps the dictionary of every layer intact and only fine-tunes the coefficients. A minute update to the coefficients of AgileNet is enough to expand the capability of the model for inference on new data. This means that the model can be tuned for new data through only a small number of iterations. Additionally, since the coefficient matrix (tensor) is considerably smaller than the original weight matrix (tensor), a smaller number of parameters needs to be updated for AgileNet. In particular, the number of trainable parameters for few-shot learning tasks is reduced by approximately a factor of $m/l$ for both fully-connected and convolution layers, where $l$ is the number of rows (channels) in the coefficient matrix (tensor) and $m$ is the number of rows (channels) in the original weight matrix (tensor). Note that, as we show in Section 6, our approach for few-shot learning also preserves the predictive power of the model on the original classes. To enable on-device training under stricter compute/energy budgets, we introduce an ultra-light mode which reduces the parameter updates even further.
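The freeze-dictionary/update-coefficients scheme can be sketched in a few lines of numpy. The least-squares objective, step size, and sizes below are our own illustrative choices, not the paper's training setup:

```python
import numpy as np

# Few-shot update sketch: the dictionary D stays frozen; only the much smaller
# coefficient matrix C receives gradient steps. A toy least-squares objective
# stands in for the task loss.
rng = np.random.default_rng(0)
m, n, l = 128, 256, 16
D = rng.standard_normal((m, l))          # frozen dictionary
C = 0.01 * rng.standard_normal((l, n))   # trainable coefficients
X = rng.standard_normal((n, 32))         # a few new data instances
Y = rng.standard_normal((m, 32))         # their targets

loss_before = np.linalg.norm(D @ C @ X - Y)
lr = 5e-6
for _ in range(100):
    err = D @ C @ X - Y
    C -= lr * (D.T @ err @ X.T)          # gradient w.r.t. C only; D untouched
loss_after = np.linalg.norm(D @ C @ X - Y)

# Trainable parameters shrink from m*n to l*n, i.e. by a factor of m / l:
reduction = (m * n) / (l * n)            # 8x fewer parameters here
```

Only $C$ changes during adaptation, so the update touches $ln$ values instead of the $mn$ values of the original weight matrix.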
Ultra-Light Few-shot Learning: This mode is designed to further limit the cost of model adaptation at the edge, though it might also limit the maximum achievable accuracy on the new data. In ultra-light few-shot learning mode, all layers except the last are not trained; not only the dictionaries but also the coefficients of all layers except the last remain intact. Furthermore, the coefficients of the last fully-connected layer, as well as the rows of its dictionary matrix that correspond to the old data classes, are fixed. The only parameters that are updated belong to the few rows of the dictionary matrix that correspond to the new data categories. This mode, which is depicted in Figure 4, has significantly fewer parameters to fine-tune and thus converges in a much smaller number of iterations.
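One way to realize this row-restricted update is a boolean mask that zeroes the gradient of every frozen row. The 9-old/1-new class split below is chosen for illustration; all sizes are our own:

```python
import numpy as np

# Ultra-light mode sketch: in the last fully-connected layer, only the
# dictionary rows of the NEW classes are trainable; the coefficients and all
# old-class rows stay frozen. A boolean row mask realizes this.
old_classes, new_classes, l = 9, 1, 16
m = old_classes + new_classes
rng = np.random.default_rng(0)
D_last = rng.standard_normal((m, l))       # last-layer dictionary
trainable = np.zeros(m, dtype=bool)
trainable[old_classes:] = True             # only new-class rows may change

grad = rng.standard_normal((m, l))         # stand-in for a backprop gradient
grad[~trainable] = 0.0                     # zero out updates to frozen rows
D_before = D_last.copy()
D_last = D_last - 0.01 * grad              # masked gradient step
```

Per new class, only $l$ values are updated (here $1 \times 16 = 16$), which is why this mode converges in very few iterations.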
6 Experiments and Evaluation
Our evaluations are performed on three benchmark datasets: MNIST LeCun et al. (2010), Omniglot Lake et al. (2015), and Mini-ImageNet Vinyals et al. (2016). An N-way K-shot learning experiment is performed as follows: we randomly sample N classes from the test dataset. From each selected class, we choose K data instances randomly. We feed the corresponding N × K labeled examples to the model during the few-shot learning stage. The trained model is then tested on the data from the same N classes, excluding the examples used for few-shot learning. The top-1 average test accuracy is reported over different random choices of the new classes and of the data instances within each new class. Note that for all experiments, we followed all three steps of the global flow of AgileNet.
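The episode-sampling protocol above can be sketched as follows (function and variable names are ours; the per-class example pools are placeholders):

```python
import random

def sample_episode(test_set, n_way, k_shot):
    """Sample an N-way K-shot episode from a {class_label: [examples]} dict.

    Sketch of the evaluation protocol described above; names are ours.
    """
    classes = random.sample(sorted(test_set), n_way)     # N random classes
    support, query = [], []
    for c in classes:
        shuffled = random.sample(test_set[c], len(test_set[c]))
        support += [(x, c) for x in shuffled[:k_shot]]   # K labeled shots
        query += [(x, c) for x in shuffled[k_shot:]]     # held out for testing
    return support, query
```

The support set feeds the few-shot learning stage, while the query set (the remaining examples of the same N classes) measures the reported top-1 accuracy.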
MNIST: This dataset of handwritten digits 0 to 9 consists of 60,000 examples in the training set and 10,000 examples in the test set, each of size 28 × 28. For few-shot learning experiments, we randomly chose nine digits for primary training, and the remaining digit was used as the new class in the few-shot setting. We used the LeNet architecture, which has two convolution layers with 5 × 5 kernels, followed by a dropout layer and two fully-connected layers. To validate AgileNet in few-shot learning scenarios, we randomly chose five samples from each of the ten classes and created a new training set for the few-shot learning stage. Since this data contains five data instances from the new class, the few-shot task is 1-way 5-shot learning. We note that adding samples from the old classes to the training data for the few-shot learning stage serves to preserve model accuracy on the old classes and prevent overfitting on the new class. However, we only need to store 5 samples of each old class on the edge device for this purpose, which does not add significant memory overhead to this stage.
Figure 5 shows the classification accuracy on the test set after the few-shot learning stage. The green and red lines represent AgileNet accuracy on the new and old classes, respectively. Our approach achieves a reasonable accuracy on the new class after only 20 iterations while preserving accuracy on the original 9 classes. In contrast, conventional training sacrifices the accuracy on the old classes to obtain an accuracy comparable to AgileNet on the new class. There are two key factors for the success of AgileNet in preserving the knowledge of the old classes: (i) the learned dictionaries preserve the structure of the weights at each layer, and minute coefficient updates do not degrade the accuracy on the old classes; (ii) our training data covers samples from both old and new classes to prevent overfitting.
Omniglot: This benchmark dataset for few-shot learning tasks has 50 different alphabets comprising 1,623 character classes in total. Each character class has only 20 samples. The dataset is split into the first 1,200 classes for training and the remaining 423 classes for testing, as in Vinyals et al. (2016); Garcia and Bruna (2017). Images are resized to 28 × 28. To compensate for the low number of training examples per class, we appended three rotated images (by 90, 180, and 270 degrees) for each original image to the training data.
For this dataset, we used the CNN architecture proposed by Vinyals et al. (2016), in which each block has a convolution layer with 64 filters of size 3 × 3, a batch-normalization layer, a max-pooling layer, and a leaky ReLU. Four of these blocks are stacked and followed by a final fully-connected layer. We experimented with the 5-way 1-shot, 5-way 5-shot, 20-way 1-shot, and 20-way 5-shot scenarios. Table 1 compares the final accuracy of AgileNet with prior work on these tasks. For the 5-way experiments, AgileNet outperforms all prior work, and for the 20-way tasks it achieves a comparable accuracy. Note that for these experiments, we only compare the accuracy on the new classes, as all of the prior works in Table 1 have used this metric for comparison. As such, the training data for the few-shot learning stage consists of only samples from the new classes.
Table 1: Few-shot classification accuracy on the Omniglot benchmark.

| Model | 5-Way 1-shot | 5-Way 5-shot | 20-Way 1-shot | 20-Way 5-shot |
|---|---|---|---|---|
| Matching Networks Vinyals et al. (2016) | 98.1% | 98.9% | 93.8% | 98.5% |
| Statistic Networks Edwards and Storkey (2016) | 98.1% | 99.5% | 93.2% | 98.1% |
| Res. Pair-Wise Mehrotra and Dukkipati (2017) | – | – | 94.8% | – |
| Prototypical Networks Snell et al. (2017) | 97.4% | 99.3% | 95.4% | 98.8% |
| ConvNet with Memory Kaiser et al. (2017) | 98.4% | 99.6% | 95.0% | 98.6% |
| Model-Agnostic Meta-learner Finn et al. (2017) | 98.7% | 99.9% | 95.8% | 98.9% |
| Meta Networks Munkhdalai and Yu (2017) | 98.9% | – | 97.0% | – |
| TCML Mishra et al. (2017a) | 98.96% | 99.75% | 97.64% | 99.36% |
| GNN Garcia and Bruna (2017) | 99.2% | 99.7% | 97.4% | 99.0% |
| AgileNet (Ours) | 99.5% | 99.9% | 94.95% | 98.9% |
Mini-ImageNet: A more challenging benchmark for few-shot learning experiments was proposed by Vinyals et al. (2016), extracted from the original ImageNet dataset. Mini-ImageNet consists of 60,000 images of size 84 × 84 belonging to 100 classes. We used the first 64 classes for training, 16 classes for validation, and the last 20 for testing, similar to Ravi and Larochelle (2016). The CNN architecture used in this experiment consists of 4 convolution layers. Each convolution layer has a different number of filters (64, 96, 128, 256) with a kernel size of 3 × 3, followed by a batch-normalization layer, a max-pooling layer, and a leaky ReLU. The last two convolution layers are also followed by a dropout layer to avoid overfitting. This architecture has a fully-connected layer at the end.
In order to explore the space of the decomposition error threshold $\epsilon$ for different layers, which determines the dictionary size (and in turn, the memory footprint and computation cost) as well as the model accuracy, we conducted a comprehensive analysis of AgileNet on the Mini-ImageNet dataset. Figure 6 presents the trade-off between memory footprint, computation cost, and final accuracy after the few-shot learning stage corresponding to different decomposition error thresholds uniformly set for all layers. Memory and computation costs are compared with those of the original model in the primary training stage, and the few-shot learning accuracies denote the absolute test accuracy on the new classes. Notice that the memory footprint and computation cost decrease significantly as $\epsilon$ increases. The drop in accuracy of AgileNet is negligible until $\epsilon$ passes a critical value, beyond which the model accuracy drops sharply. These results demonstrate that the trade-off between model accuracy and memory/computation cost can be leveraged by adjusting the decomposition error threshold. This flexibility allows AgileNet to match the edge device physical constraints while enabling the desired degree of adaptability to new data.
To further understand the impact of the decomposition error threshold on convolution layers and fully-connected layers, we varied $\epsilon$ for the fully-connected layer in this DNN architecture over a range of values while keeping $\epsilon$ for all convolution layers fixed, as shown in Figure 7. Changing $\epsilon$ for the fully-connected layer mainly impacts the memory footprint of the model, while the computation cost is dominated by the convolution layers. Similar to the previous experiment, the memory footprint decreases as $\epsilon$ increases for the fully-connected layer. The drop in accuracy of AgileNet is negligible until $\epsilon$ passes a critical value, beyond which the model accuracy drops sharply. These results show that layer-wise exploration of the decomposition error threshold is necessary to maximize the memory/computation benefits of AgileNet while achieving the desired accuracy for few-shot learning tasks.
To compare AgileNet with prior work, we used two configurations of decomposition error thresholds for the different layers, as shown in Table 2. In both 1-shot and 5-shot scenarios, AgileNet achieves a higher accuracy than all prior art. Similar to the Omniglot benchmark, these results only consider the accuracy on the new classes. We emphasize that AgileNet outperforms prior works in terms of accuracy while reducing the memory footprint by 5.8× and the computation cost by 6×, as shown in Figure 7. Therefore, AgileNet not only makes large DNN models amenable to resource-constrained devices but also achieves a superior accuracy in few-shot learning scenarios compared to the state-of-the-art.
Table 2: Few-shot classification accuracy on the Mini-ImageNet benchmark.

| Model | 5-Way 1-shot | 5-Way 5-shot |
|---|---|---|
| Matching Networks Vinyals et al. (2016) | 43.6% | 55.3% |
| Prototypical Networks Snell et al. (2017) | 46.61% ± 0.78% | 65.77% ± 0.70% |
| Model-Agnostic Meta-learner Finn et al. (2017) | 48.70% ± 1.84% | 63.10% ± 0.92% |
| Meta Networks Munkhdalai and Yu (2017) | 49.21% ± 0.96% | – |
| M. Optimization Ravi and Larochelle (2016) | 43.40% ± 0.77% | 60.20% ± 0.71% |
| TCML Mishra et al. (2017a) | 55.71% ± 0.99% | 68.88% ± 0.92% |
| GNN Garcia and Bruna (2017) | 50.33% ± 0.36% | 66.41% ± 0.63% |
| AgileNet (configuration 1) | 48.38% ± 0.90% | 69.21% ± 0.25% |
| AgileNet (configuration 2) | 58.23% ± 0.10% | 71.39% ± 0.10% |
7 Conclusion
This work presents the first lightweight few-shot learning approach that beats the accuracy of state-of-the-art approaches on standard benchmarks through only a small number of parameter updates. The key enabler of AgileNet is our novel end-to-end structured decomposition methodology that replaces every convolution and fully-connected layer by its tiny counterpart such that the memory footprint and computational complexity of the transformed model match the edge constraints. Our experiments corroborate that the learned dictionaries of AgileNet preserve the structure of the model, enabling low-cost and effective few-shot learning without degrading the model accuracy on old data classes.
References
 Abadi et al. [2016] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
 Boutsidis et al. [2009] Christos Boutsidis, Michael W Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 968–977. Society for Industrial and Applied Mathematics, 2009.
 Edwards and Storkey [2016] Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Garcia and Bruna [2017] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
 Hariharan and Girshick [2017] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proc. of IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, 2017.
 Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Kaiser et al. [2017] Lukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. CoRR, abs/1703.03129, 2017. URL http://arxiv.org/abs/1703.03129.
 Koch et al. [2015] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
 Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Lane et al. [2015] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications, pages 7–12. ACM, 2015.
 Lane et al. [2016] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. DeepX: A software accelerator for low-power deep learning inference on mobile devices. In Information Processing in Sensor Networks (IPSN), 2016 15th ACM/IEEE International Conference on, pages 1–12. IEEE, 2016.
 LeCun et al. [2010] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
 Li et al. [2006] Fei-Fei Li, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
 Mehrotra and Dukkipati [2017] Akshay Mehrotra and Ambedkar Dukkipati. Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033, 2017.
 Mishra et al. [2017a] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017a.
 Mishra et al. [2017b] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In NIPS 2017 Workshop on Meta-Learning, 2017b.
 Munkhdalai and Yu [2017] Tsendsuren Munkhdalai and Hong Yu. Meta networks. arXiv preprint arXiv:1703.00837, 2017.
 Ravi and Larochelle [2016] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
 Rouhani et al. [2016] Bita Darvish Rouhani, Azalia Mirhoseini, and Farinaz Koushanfar. DeLight: Adding energy dimension to deep neural networks. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pages 112–117. ACM, 2016.
 Sharma et al. [2016] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. From high-level deep neural models to FPGAs. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–12. IEEE, 2016.
 Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
 Tropp [2009] Joel A Tropp. Column subset selection, matrix factorization, and eigenvalue optimization. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 978–986. Society for Industrial and Applied Mathematics, 2009.
 Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.