Abstract

The goal of meta-learning is to train a model on a variety of learning tasks, such that it can adapt to new problems within only a few iterations. Here we propose a principled information-theoretic model that optimally partitions the underlying problem space such that the resulting partitions are processed by specialized expert decision-makers. To drive this specialization we impose the same kind of information processing constraints both on the partitioning and the expert decision-makers. We argue that this specialization leads to efficient adaptation to new tasks. To demonstrate the generality of our approach we evaluate on three meta-learning domains: image classification, regression, and reinforcement learning.


Hierarchical Expert Networks for Meta-Learning


Heinke Hihn and Daniel A. Braun

Institute for Neural Information Processing
Ulm University
Ulm, Germany

1 Introduction

Recent machine learning research has shown impressive results on highly diverse tasks from problem classes such as pattern recognition, reinforcement learning, and generative model learning [Devlin et al., 2018, Mnih et al., 2015, Schmidhuber, 2015]. These success stories typically have two computational luxuries in common: a large database with thousands or even millions of training samples and a very long and extensive training period. However, naïvely applying these pre-trained models to new tasks usually leads to very poor performance, because each new incoming batch of data requires expensive and slow re-learning. In contrast, humans are able to learn from very few examples and excel at adapting quickly [Jankowski et al., 2011], for example in motor tasks [Braun et al., 2009] or when learning new visual concepts [Lake et al., 2015].

Sample-efficient adaptation to new tasks can be regarded as a form of meta-learning or “learning to learn” [Thrun and Pratt, 2012, Schmidhuber et al., 1997, Caruana, 1997] and is an ongoing and active field of research; see e.g. [Koch et al., 2015, Vinyals et al., 2016, Finn et al., 2017, Ravi and Larochelle, 2017, Ortega et al., 2019, Botvinick et al., 2019, Yao et al., 2019]. Meta-learning can be defined in different ways, but a common point is that the system learns on two levels, each with a different time scale: slow learning across different tasks on a meta-level, and fast learning to adapt to each task individually.

Here, we propose a novel learning paradigm for hierarchical meta-learning systems. Our method finds an optimal soft partitioning of the problem space by imposing information-theoretic constraints on both the process of expert selection and on the expert specialization. We argue that these constraints drive an efficient division of labor in systems that are bounded in their respective information processing power, where we make use of information-theoretic bounded rationality [Ortega and Braun, 2013]. When the model is presented with previously unseen tasks, it assigns them to experts specialized on similar tasks (see Figure 1). Additionally, because each expert network specializes on only a subset of the problem space, smaller neural network architectures with only a few units per layer suffice. In order to split the problem space and to assign the partitions to experts, we learn to represent tasks through a common latent embedding that is then used by a selector network to distribute the tasks to the experts.

The outline of this paper is as follows: first we introduce bounded rationality and meta learning, next we introduce our novel approach and derive applications to classification, regression, and reinforcement learning. Finally, we conclude.

Figure 1: The selector assigns the new input encoding to one of the three experts, depending on the similarity of the input to previous inputs seen by the experts.

2 Background

2.1 Bounded Rational Decision Making

An important concept in decision making is the notion of utility [Von Neumann and Morgenstern, 2007], where an agent picks an action $a^*_w \in \mathcal{A}$ such that it maximizes its utility in some context $w \in \mathcal{W}$, i.e.

$$a^*_w = \arg\max_{a} U(w, a), \tag{1}$$

where the utility is given by a function $U(w, a)$ and the distribution over states $p(w)$ is known and fixed. Trying to solve this optimization problem naïvely leads to an exhaustive search over all possible $(a, w)$ pairs, which is in general a prohibitive strategy. Instead of finding an optimal strategy, a bounded-rational decision-maker optimally trades off expected utility and the processing costs required to adapt. In this study we consider the information-theoretic free-energy principle [Ortega and Braun, 2013] of bounded rationality, where the decision-maker’s resources are modeled by an upper bound $B$ on the Kullback-Leibler divergence $D_{\mathrm{KL}}$ between the agent’s prior distribution $p(a)$ and the posterior policy $p(a|w)$, resulting in the following constrained optimization problem:

$$\max_{p(a|w)} \sum_{w} p(w) \sum_{a} p(a|w)\, U(w, a) \quad \text{s.t.} \quad \mathbb{E}_{p(w)}\big[ D_{\mathrm{KL}}\big(p(a|w)\,\|\,p(a)\big) \big] \le B. \tag{2}$$

This constraint can also be interpreted as a regularization on $p(a|w)$. We can transform this into an unconstrained variational problem by introducing a Lagrange multiplier $\beta$:

$$\max_{p(a|w)} \; \mathbb{E}_{p(w)}\Big[ \mathbb{E}_{p(a|w)}[U(w, a)] - \frac{1}{\beta} D_{\mathrm{KL}}\big(p(a|w)\,\|\,p(a)\big) \Big]. \tag{3}$$

For $\beta \to \infty$ we recover the maximum utility solution and for $\beta \to 0$ the agent can only act according to the prior. The optimal prior in this case is given by the marginal $p(a) = \sum_{w} p(w)\, p(a|w)$ [Ortega and Braun, 2013].
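To make this trade-off concrete, the following is a minimal numerical sketch for a small discrete problem. It assumes the well-known closed-form solution of the free-energy problem, in which the posterior is proportional to the prior times the exponentiated utility scaled by the resource parameter (here called `beta`), and alternates it with the marginal prior update in a Blahut-Arimoto-style loop. The function names, the toy utility matrix, and the fixed iteration count are illustrative assumptions and not part of the original method.

```python
import numpy as np

def bounded_rational_policy(U, p_w, beta, n_iter=200):
    """Alternate posterior and prior updates for a discrete problem.

    U    : array (n_states, n_actions) with utilities U(w, a)
    p_w  : array (n_states,) with the state distribution p(w)
    beta : resource parameter (Lagrange multiplier), beta >= 0
    """
    n_states, n_actions = U.shape
    prior = np.full(n_actions, 1.0 / n_actions)  # initial guess for p(a)
    for _ in range(n_iter):
        # Posterior p(a|w) is proportional to p(a) * exp(beta * U(w, a)).
        logits = np.log(prior)[None, :] + beta * U
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        posterior = np.exp(logits)
        posterior /= posterior.sum(axis=1, keepdims=True)
        # The optimal prior is the marginal p(a) = sum_w p(w) p(a|w).
        prior = np.clip(p_w @ posterior, 1e-12, None)
        prior /= prior.sum()
    return posterior, prior

def expected_kl(posterior, prior, p_w):
    """Average information cost E_w[KL(p(a|w) || p(a))] in nats."""
    log_ratio = np.log(np.clip(posterior, 1e-300, None)) - np.log(prior)[None, :]
    return float(np.sum(p_w[:, None] * posterior * log_ratio))

# Toy example (hypothetical): two states, each preferring a different action.
U_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
p_w_toy = np.array([0.5, 0.5])
for beta in (0.0, 1.0, 50.0):
    post, prior = bounded_rational_policy(U_toy, p_w_toy, beta)
    print(beta, post.round(3).tolist(), round(expected_kl(post, prior, p_w_toy), 3))
```

In this toy setting a small `beta` keeps the posterior at the prior (zero information cost), while a large `beta` lets each state pick its preferred action at the price of a higher Kullback-Leibler cost.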

2.1.1 Hierarchical Decision Making

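Aggregating several bounded-rational agents by a selection policy allows for solving optimization problems that exceed the capabilities of the individual decision-makers [Genewein et al., 2015]. To achieve this, the search space is split into partitions such that each partition can be solved by a decision-maker. A two-stage mechanism is introduced: the first stage is an expert selection policy $p(x|w)$ that chooses an expert $x$ given a state $w$, and the second stage chooses an action according to the expert’s posterior policy $p(a|w, x)$. The optimization problem given by (3) can be extended to incorporate a trade-off between computational costs and utility in both stages:

$$\max_{p(x|w),\, p(a|w,x)} \; \mathbb{E}[U(w, a)] - \frac{1}{\beta_1} I(W; X) - \frac{1}{\beta_2} I(W; A \mid X), \tag{4}$$

where $\beta_1$ is the resource parameter for the expert selection stage and $\beta_2$ for the experts. $I(\cdot\,;\cdot)$ is the mutual information between the two random variables. The solution can be found by iterating the following set of equations [Genewein et al., 2015]:

$$\begin{aligned}
p^*(x|w) &= \frac{1}{Z(w)}\, p(x) \exp\big(\beta_1\, \Delta F_{\text{par}}(w, x)\big), \\
p(x) &= \sum_{w} p(w)\, p^*(x|w), \\
p^*(a|w, x) &= \frac{1}{Z(w, x)}\, p(a|x) \exp\big(\beta_2\, U(w, a)\big), \\
p(a|x) &= \sum_{w} p(w|x)\, p^*(a|w, x), \\
\Delta F_{\text{par}}(w, x) &= \mathbb{E}_{p^*(a|w,x)}[U(w, a)] - \frac{1}{\beta_2} D_{\mathrm{KL}}\big(p^*(a|w, x)\,\|\,p(a|x)\big),
\end{aligned}$$

where $Z(w)$ and $Z(w, x)$ are normalization factors and $\Delta F_{\text{par}}(w, x)$ is the free energy of the action selection stage. Thus the marginal distribution $p(a|x)$ defines a mixture-of-experts policy given by the posterior distributions $p^*(a|w, x)$ weighted by the responsibilities determined by the Bayesian posterior $p(w|x)$. Note that $p(w|x)$ is not determined by a given likelihood model, but is the result of the optimization process (4).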
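The iteration described above can be sketched for small discrete state, expert, and action spaces as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it alternates a selection stage (prior over experts times the exponentiated free energy) with an action stage (expert prior times the exponentiated utility). The random initialization, the smoothing constant `EPS`, the fixed iteration count, and all function names are hypothetical choices, and the loop may settle in a local optimum.

```python
import numpy as np

EPS = 1e-12  # smoothing constant to avoid log(0) and division by zero (assumption)

def softmax(logits, axis):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def two_stage_iteration(U, p_w, n_experts, beta1, beta2, n_iter=200, seed=0):
    """U: (n_states, n_actions) utilities, p_w: (n_states,); beta1, beta2 > 0."""
    rng = np.random.default_rng(seed)
    n_w, n_a = U.shape
    p_x_w = rng.dirichlet(np.ones(n_experts), size=n_w)           # p(x|w): (w, x)
    p_a_wx = rng.dirichlet(np.ones(n_a), size=(n_w, n_experts))   # p(a|w,x): (w, x, a)
    for _ in range(n_iter):
        # Marginal over experts and Bayesian posterior p(w|x).
        p_x = p_w @ p_x_w
        joint = p_w[:, None] * p_x_w + EPS
        p_w_x = joint / joint.sum(axis=0, keepdims=True)
        # Expert priors p(a|x) as mixtures of the expert posteriors.
        p_a_x = np.einsum("wx,wxa->xa", p_w_x, p_a_wx)
        log_p_a_x = np.log(p_a_x + EPS)
        # Action stage: p(a|w,x) proportional to p(a|x) exp(beta2 U(w,a)).
        p_a_wx = softmax(log_p_a_x[None, :, :] + beta2 * U[:, None, :], axis=2)
        # Free energy of the action stage for every (w, x) pair.
        kl = np.sum(p_a_wx * (np.log(p_a_wx + EPS) - log_p_a_x[None, :, :]), axis=2)
        delta_f = np.einsum("wxa,wa->wx", p_a_wx, U) - kl / beta2
        # Selection stage: p(x|w) proportional to p(x) exp(beta1 * delta_f).
        p_x_w = softmax(np.log(p_x + EPS)[None, :] + beta1 * delta_f, axis=1)
    return p_x_w, p_a_wx

# Toy problem (hypothetical): four states, each with its own best action.
U_toy = np.eye(4)
p_w_toy = np.full(4, 0.25)
selection, experts = two_stage_iteration(U_toy, p_w_toy, n_experts=2, beta1=5.0, beta2=5.0)
print(selection.round(2))
```

The rows of `selection` show how softly each state is assigned to the two experts; raising `beta1` sharpens this assignment, while raising `beta2` lets each expert deviate further from its prior.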

Figure 2: Our proposed method consists of three main stages. First, the training dataset is passed through a convolutional autoencoder to find a latent representation for each sample, which we get by flattening the preceding convolutional layer (labeled as flattening layer in the figure). This image embedding is then pooled and fed forward to the selection network.

2.2 Meta Learning

Meta-learning algorithms can be divided roughly into Metric-Learning [Koch et al., 2015, Vinyals et al., 2016, Snell et al., 2017], Optimizer Learning [Ravi and Larochelle, 2017, Finn et al., 2017, Zintgraf et al., 2018], and Task Decomposition Models [Lan et al., 2019, Vezhnevets et al., 2019]. Our approach depicted in Figure 2 can be seen as a member of the latter group.

2.2.1 Meta Supervised Learning

In a supervised learning task we are usually interested in a dataset $D$ consisting of multiple input and output pairs, and the learner is tasked with finding a function that maps from input to output, for example through a deep neural network. To do this, we split the dataset into training and test sets, $D_{\text{train}}$ and $D_{\text{test}}$, fit a set of parameters on the training data, and evaluate on the test data using the learned function. In meta-learning, we are instead working with meta-datasets $\mathcal{D}$, each containing regular datasets split into training and test sets. We thus have different meta-sets for meta-training, meta-validation, and meta-test ($\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-val}}$, and $\mathcal{D}_{\text{meta-test}}$, respectively). On $\mathcal{D}_{\text{meta-train}}$, we are interested in training a learning procedure (the meta-learner) that can take as input one of its training sets $D_{\text{train}}$ and produce a classifier (the learner) that achieves low prediction error on its corresponding test set $D_{\text{test}}$.

A special case of meta-learning for classification are $K$-shot $N$-way tasks. In this setting, we are given for each dataset a training set consisting of $K$ labeled examples of each of the $N$ classes ($K \cdot N$ examples per dataset) and corresponding test sets. In our study, we focus on the following variation of $K$-shot 2-way tasks: the meta-learner is presented with $2K$ samples ($K$ positive and $K$ negative examples) and must assign this dataset to an expert learner. Note that the negative examples may be drawn from any of the remaining classes.
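As a rough illustration of this 2-way variation, the sketch below builds one such dataset from an array of class labels: it draws an equal number of positive examples from a target class and negative examples from all remaining classes. The function name, the label layout, and the use of index arrays are assumptions made for the example.

```python
import numpy as np

def sample_k_shot_2way(labels, target_class, k, rng):
    """Return shuffled sample indices and binary targets (1 = target class)."""
    labels = np.asarray(labels)
    pos_pool = np.flatnonzero(labels == target_class)
    neg_pool = np.flatnonzero(labels != target_class)  # any remaining class
    pos = rng.choice(pos_pool, size=k, replace=False)
    neg = rng.choice(neg_pool, size=k, replace=False)
    idx = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(k, dtype=int), np.zeros(k, dtype=int)])
    perm = rng.permutation(2 * k)  # mix positives and negatives
    return idx[perm], y[perm]

# Hypothetical label array: 5 classes with 20 samples each.
rng = np.random.default_rng(0)
all_labels = np.repeat(np.arange(5), 20)
train_idx, train_y = sample_k_shot_2way(all_labels, target_class=3, k=5, rng=rng)
```

Calling the function twice with the same target class yields a training and a test dataset for the same binary problem.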

2.2.2 Meta Reinforcement Learning

We model sequential decision problems by defining a Markov Decision Process (MDP) as a tuple $(\mathcal{S}, \mathcal{A}, P, r)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability, and $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function. The aim is to find the parameter $\theta$ of a policy $\pi_\theta$ that maximizes the expected reward:

$$\theta^* = \arg\max_{\theta} \; \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{T} r(s_t, a_t) \Big].$$

We define $r(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$ as the cumulative reward of trajectory $\tau = \{(s_t, a_t)\}_{t=0}^{T}$, which is sampled by acting according to the policy $\pi_\theta$, i.e. $a_t \sim \pi_\theta(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. Learning in this environment can then be modeled by reinforcement learning [Sutton and Barto, 2018], where an agent interacts with an environment over a number of (discrete) time steps $t$. At each time step $t$, the agent finds itself in a state $s_t$ and selects an action $a_t$ according to the policy $\pi_\theta(a_t|s_t)$. In return, the environment transitions to the next state $s_{t+1}$ and generates a scalar reward $r_t$. This process continues until the agent reaches a terminal state, after which the process restarts. The goal of the agent is to maximize the expected return from each state $s_t$, which is typically defined as the infinite-horizon discounted sum of the rewards. A common choice for achieving this is Q-Learning [Watkins and Dayan, 1992], where we make use of an action-value function that is defined as the discounted sum of rewards $Q(s_t, a_t) = \mathbb{E}\big[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\big]$, where $\gamma$ is a discount factor. Learning the optimal policy can be achieved in many ways. Here, we consider policy gradient methods [Sutton et al., 2000], which are a popular choice to tackle continuous reinforcement learning problems. The main idea is to directly manipulate the parameters $\theta$ of the policy in order to maximize the objective $J(\theta)$ by taking steps in the direction of the gradient $\nabla_\theta J(\theta)$.

In meta reinforcement learning the problem is given by a set of tasks $\mathcal{T}$, where each task $t_i \in \mathcal{T}$ is defined by an MDP as described earlier. We are now interested in finding a set of policies that maximizes the average cumulative reward across all tasks in $\mathcal{T}$ and generalizes well to new tasks sampled from a different set of tasks $\mathcal{T}'$.

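Algorithm 1 Expert Networks for Supervised Meta-Learning
Input: data distribution $p(D)$, number of samples $K$, batch size $M$, training episodes $N$
Hyper-parameters: resource parameters $\beta_1$, $\beta_2$; learning rates for selector and experts
Initialize the selector parameters $\theta$ and the expert parameters $\vartheta$
for $i = 0, 1, 2, \dots, N$ do
    Sample a batch of $M$ datasets, each consisting of a training dataset $D_{\text{train}}$ and a meta-validation dataset $D_{\text{val}}$ with $2K$ samples each
    for each dataset in the batch do
        Find the latent embedding of $D_{\text{train}}$
        Select an expert $x$
        Compute the free energy of $x$ on $D_{\text{val}}$
    Update the selection parameters $\theta$ with the computed free energies
    Update the autoencoder with the positive samples in the batch
    Update the experts with their assigned training datasets
return $\theta$, $\vartheta$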

3 Expert Networks for Meta-Learning

Information-theoretic bounded rationality postulates that hierarchies and abstractions emerge when agents have only limited access to computational resources [Genewein et al., 2015, Gottwald and Braun, 2019b, Gottwald and Braun, 2019a], e.g. limited sampling complexity [Hihn et al., 2018] or limited representational power [Hihn et al., 2019]. We will show that forming such abstractions equips an agent with the ability to learn the underlying problem structure and thus enables learning of unseen but similar concepts. The method we propose comes out of a unified optimization principle and has the following important features:

  1. A regularization mechanism to enforce the emergence of expert policies.

  2. A task compression mechanism to extract relevant task information.

  3. A selection mechanism to find the most efficient expert for a given task.

  4. A regularization mechanism to improve generalization capabilities.

3.1 Latent Task Embeddings

Note that the selector assigns a complete dataset to an expert and that this can be seen as a meta-learning task, as described in [Ravi and Larochelle, 2017]. To do so, we must find a feature vector $z(D)$ of the dataset $D$. This feature vector must fulfill the following desiderata: 1) invariance against permutation of the data points in $D$, 2) high representational capacity, 3) efficient computability, and 4) constant dimensionality regardless of the sample size $K$. In the following we propose such features for image classification, regression, and reinforcement learning problems.

For image classification we propose to pass the positive images in the dataset through a convolutional autoencoder and use the respective outputs of the bottleneck layer. Convolutional autoencoders are generative models that learn to reconstruct their inputs by minimizing the mean squared error between the input and the reconstructed image (see e.g. [Chen et al., 2019]). In this way we get similar embeddings for similar inputs belonging to the same class. The latent representation is computed for each positive sample in the training dataset and then passed through a pooling function to find a single embedding for the complete dataset (see Figure 2 for an overview of our proposed model). While in principle functions such as mean, max, and min can be used, we found that max pooling yields the best results. The authors of [Yao et al., 2019] propose a similar feature set.
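The pooling step can be illustrated with a small sketch. Because no trained autoencoder is available here, a fixed random linear map followed by `tanh` stands in for the bottleneck layer; this stand-in encoder, the image size, and the latent dimension are assumptions for illustration only. The point of the example is that pooling over the sample axis gives an embedding whose size does not depend on the number of samples and that does not change when the samples are reordered.

```python
import numpy as np

def encode(images, weights):
    """Stand-in encoder (hypothetical): flatten each image, apply a fixed
    linear map and tanh. In the text this role is played by the bottleneck
    layer of a trained convolutional autoencoder."""
    flat = images.reshape(images.shape[0], -1)
    return np.tanh(flat @ weights)

def dataset_embedding(images, weights, pool="max"):
    """Pool per-sample latent codes into one vector for the whole dataset."""
    latents = encode(images, weights)                  # (n_samples, latent_dim)
    pools = {"max": np.max, "mean": np.mean, "min": np.min}
    return pools[pool](latents, axis=0)                # (latent_dim,)

rng = np.random.default_rng(0)
weights = rng.normal(scale=1.0 / 28.0, size=(28 * 28, 16))
positives = rng.random(size=(5, 28, 28))               # five hypothetical positive images
z = dataset_embedding(positives, weights)              # one embedding per dataset
```

Max, mean, and min are all symmetric functions of the samples, which is why any of them satisfies the permutation-invariance requirement.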

For regression we define a similar feature vector. The training data points are transformed into a feature vector by binning the points into a fixed number of bins according to their respective input value and collecting the respective output value. If more than one point falls into the same bin, the output values are averaged, thus providing invariance against the order of the data points in the training dataset. We use this feature vector to assign each data set to an expert according to the selection policy.
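A minimal sketch of such a binned feature vector is given below. The bin range, the number of bins, and the choice to fill empty bins with zero are assumptions made for the example; the text does not specify them.

```python
import numpy as np

def binned_feature(x, y, n_bins, x_min, x_max):
    """Average the outputs y of all points whose input x falls into each bin."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(x_min, x_max, n_bins + 1)
    # Bin index per point; points on or beyond the borders go to the outer bins.
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    feature = np.zeros(n_bins)           # empty bins stay at zero (assumption)
    filled = counts > 0
    feature[filled] = sums[filled] / counts[filled]
    return feature

# Three hypothetical points: two share the first bin, one lands in the second.
feat = binned_feature([0.1, 0.2, 0.9], [1.0, 3.0, 5.0], n_bins=2, x_min=0.0, x_max=1.0)
```

The resulting vector always has `n_bins` entries, so its dimensionality is independent of the number of data points.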

In the reinforcement learning setting we use a dynamic recurrent neural network (RNN) with LSTM units [Hochreiter and Schmidhuber, 1997] to classify trajectories. We feed the RNN with tuples of the agent's interaction with the environment to describe the underlying Markov Decision Process of the task. At the first time step we sample the expert according to the learned prior distribution $p(x)$, as there is no information available so far. The authors of [Lan et al., 2019] propose a similar feature set.

Omniglot Few-Shot Classification Results

      2 experts                      4 experts                      8 experts                      16 experts
K     % Acc          I(X;W)          % Acc          I(X;W)          % Acc          I(X;W)          % Acc          I(X;W)
1     76.2 (± 0.02)  0.99 (± 0.01)   86.7 (± 0.02)  1.96 (± 0.01)   90.1 (± 0.01)  2.5 (± 0.20)    92.9 (± 0.01)  3.2 (± 0.3)
5     67.3 (± 0.01)  0.93 (± 0.01)   75.5 (± 0.01)  1.95 (± 0.10)   78.4 (± 0.01)  2.7 (± 0.10)    81.2 (± 0.01)  3.3 (± 0.2)
10    66.4 (± 0.04)  0.95 (± 0.30)   75.8 (± 0.01)  1.90 (± 0.03)   77.3 (± 0.01)  2.8 (± 0.15)    77.8 (± 0.01)  3.1 (± 0.2)

Table 1: Classification results for the Omniglot dataset [Lake et al., 2011]. We evaluate our system by splitting the dataset into training and validation data (80% / 20%), train the system as described in Algorithm 1, and report the classification accuracy on the validation data, i.e. on classes and samples that are novel to the model. We trained for 50,000 episodes, each with a batch of 32 datasets.

3.2 Hierarchical On-line Meta-Learning

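As discussed in Section 2.1, the aim of the selection network is to find an optimal partition of the experts $x$, such that the selector’s expected utility is maximized under an information-theoretic constraint $I(W; X) \le B_1$, where $\theta$ are the selector’s parameters (e.g. weights in a neural network), $x$ the expert and $w$ is an input. Each expert follows a policy $p_\vartheta(a|w, x)$ that maximizes its expected utility. In the following, we introduce our gradient-based on-line learning algorithm to find the optimal partitioning and the expert parameters. We rewrite the optimization problem (4) as

$$\max_{\theta, \vartheta} \; \sum_{w, x, a} p(w)\, p_\theta(x|w)\, p_\vartheta(a|w, x)\, J(w, x, a),$$

where the objective $J(w, x, a)$ is given by

$$J(w, x, a) = U(w, a) - \frac{1}{\beta_1} \log \frac{p_\theta(x|w)}{p(x)} - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|w, x)}{p(a|x)},$$

and $\theta$, $\vartheta$ are the parameters of the selection policy and the expert policies, respectively. Here $p(x)$ and $p(a|x)$ denote the prior distributions over experts and actions, and $\beta_1$, $\beta_2$ the resource parameters of the two stages. Note that each expert policy has a distinct set of parameters $\vartheta_x$, i.e. $\vartheta = \{\vartheta_x\}_x$, but we drop the index $x$ for readability. In the following we will show how we can apply this formulation to classification, regression and reinforcement learning.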

3.2.1 Application to Supervised Learning

Combining multiple experts can often be beneficial [Kuncheva, 2004], e.g. in Mixture-of-Experts [Yuksel et al., 2012] or Multiple Classifier Systems [Bellmann et al., 2018]. Our method can be interpreted as a member of this family of algorithms.

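In accordance with Section 2.1 we define the utility as the negative prediction loss, i.e. $U(\hat{y}, y) = -\mathcal{L}(\hat{y}, y)$, where $\hat{y}$ is shorthand for the prediction of the expert given the input data point and $y$ is the ground truth. We use the cross-entropy loss as a performance measure for classification and the mean squared error for regression. The objective for expert selection is thus given by

$$\max_{\theta} \; \mathbb{E}_{p_\theta(x|w)}\Big[ \hat{f}(w, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|w)}{p(x)} \Big],$$

where $\hat{f}(w, x) = \mathbb{E}_{p_\vartheta(\hat{y}|w, x)}\big[ -\mathcal{L}(\hat{y}, y) - \frac{1}{\beta_2} \log \frac{p_\vartheta(\hat{y}|w, x)}{p(\hat{y}|x)} \big]$, i.e. the free energy of the expert, and $\theta$, $\vartheta$ are the parameters of the selection policy and the expert policies, respectively. Analogously, the action selection objective for each expert is defined by

$$\max_{\vartheta} \; \mathbb{E}_{p_\vartheta(\hat{y}|w, x)}\Big[ -\mathcal{L}(\hat{y}, y) - \frac{1}{\beta_2} \log \frac{p_\vartheta(\hat{y}|w, x)}{p(\hat{y}|x)} \Big].$$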

Figure 3: Here we show the soft partition found by the selection policy for the sine prediction problem, where the parameters of the sine function are chosen uniformly at each trial. To generate these plots we train a system for each setting, sample function parameters and points, and feed the resulting data set to the selection policy. Each color represents a different expert. We can see that the selection policy becomes increasingly precise as we provide more points per data set (denoted by $K$) to the system.

3.2.2 Application to Reinforcement Learning

In the reinforcement learning setup the utility is given by the reward function $r(w, a)$, where the world state $w$ is the state of the MDP. In maximum entropy RL the regularization penalizes deviation from a fixed uniform prior, but in a more general setting we can discourage deviation from an arbitrary prior policy by determining the optimal policy as

$$\max_{\pi} \; \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^t \Big( r(w_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t|w_t)}{p(a_t)} \Big) \Big].$$

As discussed in Section 2.1, the optimal prior is the marginal of the posterior policy given by $p(a) = \sum_{w} p(w)\, \pi(a|w)$. We approximate the prior distributions $p(x)$ and $p(a|x)$ by exponential running averages of the posterior policies.

To optimize the objective we define two separate value functions: one to estimate the discounted sum of rewards and one to estimate the free energy of the expert policies. The discounted reward for the experts is $R_t = \sum_{l=0}^{T} \gamma^l\, r(w_{t+l}, a_{t+l})$, which we learn by parameterizing the value function $V_\varphi$ with a neural network. Similar to the discounted reward we can now define the discounted free energy as $F_t = \sum_{l=0}^{T} \gamma^l\, f(w_{t+l}, x_{t+l}, a_{t+l})$, where $f(w, x, a) = r(w, a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|w, x)}{p(a|x)}$. The value function $F_\phi$ is learned by parameterizing it with a neural network and performing regression on the mean squared error.

3.2.3 Expert Selection

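The selector network learns a policy $p_\theta(x|w)$ that assigns states $w$ to expert policies optimally. The resource parameter $\beta_1$ constrains the information processing in this step. For $\beta_1 \to 0$ the selection assigns each state to an expert at random according to the prior, independently of the state, while for $\beta_1 \to \infty$ the selection becomes deterministic, always choosing the most promising expert $x$. The selector optimizes the following objective:

$$\max_{\theta} J(\theta) = \max_{\theta} \; \mathbb{E}_{p_\theta(x|w)}\Big[ \hat{f}(w, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|w)}{p(x)} \Big],$$

where $\hat{f}(w, x) = \mathbb{E}_{p_\vartheta(a|w, x)}\big[ r(w, a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|w, x)}{p(a|x)} \big]$, which is the free energy of the expert. The gradient of $J$ is then given (up to an additive constant) by

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(x|w)}\Big[ \nabla_\theta \log p_\theta(x|w) \Big( \hat{f}(w, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|w)}{p(x)} \Big) \Big].$$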

Figure 4: The single-expert system is not able to learn the underlying structure of the sine wave, whereas the two-expert system is already able to capture the periodic structure. Adding more experts improves adaptation further, as the results show. We trained for 10,000 episodes, each with a batch of 32 data sets.

Figure 5: Analogously to the rate-distortion curve in rate-distortion theory [Blahut, 1972, Arimoto, 1972], we can interpret this curve as a rate-utility curve showing the trade-off between information processing and expected utility (the transparent area represents the standard deviation). Increasing the processing power of the selection stage (i.e. adding more experts) improves adaptation.

The double expectation can be replaced by Monte Carlo estimates, where in practice we use a single tuple $(w, x, a)$ for $\hat{f}$. This formulation is known as the policy gradient method [Sutton et al., 2000] and is prone to producing high-variance gradients; the variance can be reduced by using an advantage function instead of the reward [Schulman et al., 2015]. The advantage function $A(w_t, a_t)$ is a measure of how well a certain action $a_t$ performs in a state $w_t$ compared to the average performance in that state, i.e. $A(w_t, a_t) = Q(w_t, a_t) - V(w_t)$. Here, $V(w)$ is called the value function and captures the expected cumulative reward when in state $w$, and $Q(w, a)$ is an estimate of the expected cumulative reward achieved in state $w$ when choosing a particular action $a$. Thus the advantage is an estimate of how advantageous it is to pick $a$ in state $w$ in relation to a baseline performance $V(w)$. Instead of learning $V$ and $Q$, we can approximate the advantage function by

$$A(w_t, a_t) \approx r(w_t, a_t) + \gamma V(w_{t+1}) - V(w_t),$$

such that we can get away with just learning a single value function $V$. Both the selection network and the selector value network are implemented as recurrent neural networks with LSTM cells [Hochreiter and Schmidhuber, 1997]. Both networks share the recurrent cell, followed by independent feed-forward layers.
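The idea of estimating the advantage from a single value function can be sketched with the standard one-step temporal-difference residual: reward plus discounted value of the next state minus the value of the current state. This is a generic illustration, not necessarily the exact estimator used in the experiments; the function name and the convention of bootstrapping the final step with `last_value` (zero for a terminal state) are assumptions.

```python
import numpy as np

def td_advantages(rewards, values, gamma, last_value=0.0):
    """One-step advantage estimates A_t = r_t + gamma * V_{t+1} - V_t.

    rewards    : rewards r_0 .. r_{T-1} of one trajectory
    values     : value estimates V_0 .. V_{T-1} for the visited states
    last_value : value estimate of the state after the final step
                 (0.0 if the trajectory ended in a terminal state)
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    return rewards + gamma * next_values - values

# Hypothetical three-step trajectory with a constant value estimate.
adv = td_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], gamma=0.9)
```

Positive entries mark steps that turned out better than the value function predicted, which is the signal used to weight the policy gradient.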

Figure 6: In each meta-update step we sample a batch of tasks from the training task set $\mathcal{T}$ and update the agents. After training is completed we evaluate their respective performance on tasks from the meta-test set $\mathcal{T}'$. Rewards are normalized and the episode horizon is 500 time steps. Results are averaged over 10 random seeds, and the agents were trained for 1000 episodes, each with a batch of 64 environments.

3.2.4 Action Selection

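The action is sampled from the posterior action distribution of the selected expert. Each expert maintains a policy $p_\vartheta(a|w, x)$ for each of the world states $w$ and updates it according to the utility/cost trade-off. The advantage function for each expert is given as

$$A(w_t, x_t, a_t) = f(w_t, x_t, a_t) + \gamma F_\phi(w_{t+1}) - F_\phi(w_t),$$

where $f(w, x, a)$ is the expert's one-step free energy (reward minus information cost) and $F_\phi$ is the learned free-energy value function. The objective of this stage is then to maximize the expected advantage $\mathbb{E}_{p_\vartheta(a|w, x)}[A(w, x, a)]$.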

4 Empirical Results

4.1 Sinusoid Regression

We adopt this task from [Finn et al., 2017]. In this $K$-shot problem, each task consists of learning to predict a sine function whose amplitude and phase are both chosen uniformly, and the goal of the learner is to find the output given the input based on only a few input-output pairs. Given that the underlying function changes in each iteration, it is impossible to solve this problem with a single learner. Our results show that by combining expert networks, we are able to reduce the generalization error iteratively as we add more experts to our system (see Figure 5). In Figure 4 we show how the system is able to capture the underlying problem structure as we add more experts, and in Figure 3 we visualize what the selector’s partition of the problem space looks like.

4.2 Few-Shot Classification

The Omniglot dataset [Lake et al., 2011] consists of over 1600 characters from 50 alphabets. As each character has merely 20 samples, each drawn by a different person, this forms a difficult learning task, and the dataset is thus often referred to as the “transposed MNIST” dataset. The Omniglot dataset is regarded as a standard meta-learning benchmark, see e.g. [Finn et al., 2017, Vinyals et al., 2016, Ravi and Larochelle, 2017].

We train the learner on a subset of the dataset (80%, i.e. about 1300 classes) and evaluate on the remaining classes, thus investigating the ability to generalize to new data. In each round we build the training and test datasets by selecting a target class and sampling $K$ positive and $K$ negative samples. To generate negative samples we draw images randomly out of the remaining classes. We present the selection network with the feature representation of the positive training samples (see Figure 2), but evaluate the experts’ performance on the test samples. In this way the free energy of the experts becomes a measure of how well the expert is able to generalize to new samples of the target class and distinguish them from negative examples. Using this optimization scheme, we train the expert networks to become experts in recognizing a subset of classes. After a suitable expert is selected we train that expert using the samples from the training dataset (see Figure 5 and Table 1 for results). To generate this figure, we ran a 10-fold cross-validation on the whole dataset and show the averaged performance metric and the respective standard deviation across the folds. In both settings “0 bits” corresponds to a single expert, i.e. a single neural network trained on the task.

4.3 Meta Reinforcement Learning

Task Distribution

Parameter           $\mathcal{T}$       $\mathcal{T}'$
Distance Penalty    []                  []
Goal Position       [0.3, 0.4]          [0, 3]
Start Position      [-0.15, 0.15]       [-0.25, 0.25]
Motor Torques
Motor Actuation     [185, 215]          [175, 225]
Inverted Control
Gravity             [0.01, 4.9]         [4.9, 9.8]

Table 2: All parameters are sampled uniformly from the specified range for each environment. $\mathcal{T}$ is used for training and $\mathcal{T}'$ for meta evaluation.

We create a set of RL tasks by sampling the parameters for the Inverted Double Pendulum problem [Sutton, 1996] implemented in OpenAI Gym [Brockman et al., 2016]. The task is to balance a two-link pendulum in an upward position. We modify inertia, motor torques, reward function, and goal position, and invert the control signal (see Table 2 for details). The control signal is continuous in the interval [-1, 1] and is generated by a neural network that outputs the mean $\mu$ and standard deviation $\sigma$ of a Gaussian. The action is sampled by re-parameterizing the distribution to $a = \mu + \sigma \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$, so that the sampled action is differentiable w.r.t. the network outputs.
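The re-parameterization can be illustrated in a few lines. The sketch draws the noise separately from a standard normal distribution and forms the action as the mean plus the standard deviation times the noise, so the action is a deterministic function of the network outputs once the noise is fixed. The numeric values stand in for network outputs and are hypothetical; clipping to the interval [-1, 1] is an assumption, since the text does not say how the bound on the control signal is enforced.

```python
import numpy as np

def sample_action(mu, sigma, eps):
    """Reparameterized Gaussian sample a = mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, sigma = 0.3, 0.2                      # hypothetical network outputs
eps = rng.standard_normal(100_000)        # noise is drawn independently of mu, sigma
actions = sample_action(mu, sigma, eps)

# Keeping the control signal inside [-1, 1] by clipping is an assumption here.
controls = np.clip(actions, -1.0, 1.0)

# For fixed noise the derivative w.r.t. mu is 1 and w.r.t. sigma is eps,
# which a finite-difference check confirms.
h, e = 1e-6, 0.7
d_mu = (sample_action(mu + h, sigma, e) - sample_action(mu, sigma, e)) / h
d_sigma = (sample_action(mu, sigma + h, e) - sample_action(mu, sigma, e)) / h
```

Because the randomness enters only through the noise term, gradients of a loss on the action can flow back to both network outputs.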

The meta task set $\mathcal{T}'$ is based on the same environment, but the parameter distributions and ranges are different, providing new but similar reinforcement learning problems. In each episode a batch of environments is sampled and the system is updated accordingly. After training is concluded the system is evaluated on tasks sampled from $\mathcal{T}'$. We trained the system for 1000 episodes with 64 tasks from $\mathcal{T}$ and evaluated for 100 system updates on tasks from $\mathcal{T}'$. We report the results in Figure 6, where we can see improving performance as more experts are added, and the mutual information in the selection stage indicates that the tasks can be assigned to their respective expert policy.

5 Discussion

We have introduced and evaluated a novel information-theoretic approach to meta-learning. In particular we leveraged an information-theoretic approach to bounded rationality [Leibfried et al., 2017, Grau-Moya et al., 2019, Hihn et al., 2019, Schach et al., 2018, Gottwald and Braun, 2019b, Lindig-Leon et al., 2019]. Our results show that our method is able to identify sub-regions of the problem set with expert networks. In effect, this equips the system with several initializations covering the problem space and thus enables it to adapt quickly to new but similar tasks. To reliably identify such tasks, we have proposed feature extraction methods for classification, regression and reinforcement learning that could simply be replaced and improved in the future. The strength of our model is that it follows from simple principles that can be applied to a large range of problems. Moreover, the system performance can be interpreted in terms of the information processing of the selection stage and the expert decision-makers.

Most other methods for meta-learning, such as [Finn et al., 2017] and [Ravi and Larochelle, 2017], try to find an initial parametrization of a single learner, such that it is able to adapt quickly to new problems. This initialization can be interpreted as a compression of the most common task properties over all tasks. Our method, however, learns to identify task properties over a subset of tasks and provides several initializations. Task-specific information is thus directly available instead of becoming available only after several iterations, as in [Finn et al., 2017] and [Ravi and Larochelle, 2017]. In principle, this can help to adapt within fewer iterations. Thus our method can be seen as the general case of such monolithic meta-learning algorithms.

Another hierarchical approach to meta-learning is the work of [Yao et al., 2019], where the focus is on learning similarities between completely different problems (e.g. different classification datasets). In this way the partitioning is largely governed by the different tasks. Our study, however, focuses on discovering meta-information within the same task family, where the meta-partitioning is determined solely by the optimization process and can thus potentially discover unknown dynamics and relations within a task family.

Although our method is widely applicable, it suffers from low sample efficiency in the RL domain. An interesting research direction would be to combine our system with model-based RL, which is known to improve sample efficiency. Another research direction would be to investigate our system's performance in continual adaptation tasks, such as in [Yao et al., 2019]. There the system is continuously provided with data sets (e.g. additional classes and samples). Another limitation is the restriction to binary meta classification tasks, which we leave for future work.


This work was supported by the European Research Council Starting Grant BRISC, ERC-STG-2015, Project ID 678082.


  • [Arimoto, 1972] Arimoto, S. (1972). An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20.
  • [Bellmann et al., 2018] Bellmann, P., Thiam, P., and Schwenker, F. (2018). Multi-classifier-systems: Architectures, algorithms and applications. In Computational Intelligence for Pattern Recognition, pages 83–113. Springer.
  • [Blahut, 1972] Blahut, R. E. (1972). Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473.
  • [Botvinick et al., 2019] Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in cognitive sciences.
  • [Braun et al., 2009] Braun, D. A., Aertsen, A., Wolpert, D. M., and Mehring, C. (2009). Motor task variation induces structural learning. Current Biology, 19(4):352–357.
  • [Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • [Caruana, 1997] Caruana, R. (1997). Multitask learning. Machine learning, 28(1):41–75.
  • [Chen et al., 2019] Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. (2019). A closer look at few-shot classification. International Conference on Learning Representations.
  • [Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [Finn et al., 2017] Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org.
  • [Genewein et al., 2015] Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. (2015). Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27.
  • [Gottwald and Braun, 2019a] Gottwald, S. and Braun, D. A. (2019a). Bounded rational decision-making from elementary computations that reduce uncertainty. Entropy, 21(4).
  • [Gottwald and Braun, 2019b] Gottwald, S. and Braun, D. A. (2019b). Systems of bounded rational agents with information-theoretic constraints. Neural computation, 31(2):440–476.
  • [Grau-Moya et al., 2019] Grau-Moya, J., Leibfried, F., and Vrancx, P. (2019). Soft q-learning with mutual-information regularization. International Conference on Learning Representations.
  • [Hihn et al., 2018] Hihn, H., Gottwald, S., and Braun, D. A. (2018). Bounded rational decision-making with adaptive neural network priors. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pages 213–225. Springer.
  • [Hihn et al., 2019] Hihn, H., Gottwald, S., and Braun, D. A. (2019). An information-theoretic on-line learning principle for specialization in hierarchical decision-making systems. arXiv preprint arXiv:1907.11452.
  • [Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
  • [Jankowski et al., 2011] Jankowski, N., Duch, W., and Grąbczewski, K. (2011). Meta-learning in computational intelligence, volume 358. Springer Science & Business Media.
  • [Koch et al., 2015] Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2.
  • [Kuncheva, 2004] Kuncheva, L. I. (2004). Combining pattern classifiers: methods and algorithms. John Wiley & Sons.
  • [Lake et al., 2011] Lake, B., Salakhutdinov, R., Gross, J., and Tenenbaum, J. (2011). One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.
  • [Lake et al., 2015] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
  • [Lan et al., 2019] Lan, L., Li, Z., Guan, X., and Wang, P. (2019). Meta reinforcement learning with task embedding and shared policy. International Joint Conference on Artificial Intelligence.
  • [Leibfried et al., 2017] Leibfried, F., Grau-Moya, J., and Ammar, H. B. (2017). An information-theoretic optimality principle for deep reinforcement learning. Deep Reinforcement Learning Workshop NIPS 2018.
  • [Lindig-Leon et al., 2019] Lindig-Leon, C., Gottwald, S., and Braun, D. A. (2019). Analyzing abstraction and hierarchical decision-making in absolute identification by information-theoretic bounded rationality. Frontiers in Neuroscience, 13:1230.
  • [Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
  • [Ortega and Braun, 2013] Ortega, P. A. and Braun, D. A. (2013). Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 469(2153).
  • [Ortega et al., 2019] Ortega, P. A., Wang, J. X., Rowland, M., Genewein, T., Kurth-Nelson, Z., Pascanu, R., Heess, N., Veness, J., Pritzel, A., Sprechmann, P., et al. (2019). Meta-learning of sequential strategies. arXiv preprint arXiv:1905.03030.
  • [Ravi and Larochelle, 2017] Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. International Conference on Learning Representations.
  • [Schach et al., 2018] Schach, S., Gottwald, S., and Braun, D. A. (2018). Quantifying motor task performance by bounded rational decision theory. Frontiers in neuroscience, 12.
  • [Schmidhuber, 2015] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61:85–117.
  • [Schmidhuber et al., 1997] Schmidhuber, J., Zhao, J., and Wiering, M. (1997). Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning, 28(1):105–130.
  • [Schulman et al., 2015] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations.
  • [Snell et al., 2017] Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.
  • [Sutton, 1996] Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pages 1038–1044.
  • [Sutton and Barto, 2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • [Sutton et al., 2000] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
  • [Thrun and Pratt, 2012] Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
  • [Vezhnevets et al., 2019] Vezhnevets, A. S., Wu, Y., Leblond, R., and Leibo, J. (2019). Options as responses: Grounding behavioural hierarchies in multi-agent RL. arXiv preprint arXiv:1906.01470.
  • [Vinyals et al., 2016] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
  • [Von Neumann and Morgenstern, 2007] Von Neumann, J. and Morgenstern, O. (2007). Theory of games and economic behavior (commemorative edition). Princeton university press.
  • [Watkins and Dayan, 1992] Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8(3-4):279–292.
  • [Yao et al., 2019] Yao, H., Wei, Y., Huang, J., and Li, Z. (2019). Hierarchically structured meta-learning. In International Conference on Machine Learning, pages 7045–7054.
  • [Yuksel et al., 2012] Yuksel, S. E., Wilson, J. N., and Gader, P. D. (2012). Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193.
  • [Zintgraf et al., 2018] Zintgraf, L. M., Shiarlis, K., Kurin, V., Hofmann, K., and Whiteson, S. (2018). CAML: Fast context adaptation via meta-learning. International Conference on Learning Representations.