Deep Tile Coder: an Efficient Sparse Representation Learning Approach with applications in Reinforcement Learning
Representation learning is critical to the success of modern large-scale reinforcement learning systems. Previous work shows that sparse representations can effectively reduce catastrophic interference and hence provide relatively stable and consistent bootstrap targets when training reinforcement learning algorithms. Tile coding is a well-known sparse feature generation method in reinforcement learning. However, its application is largely restricted to small, low-dimensional domains, as its computational and memory requirements grow exponentially with the input dimension. This paper proposes a simple and novel tile coding operation, the deep tile coder, which adapts tile coding to the deep learning setting and scales easily to high-dimensional problems. The key distinction between our method and previous sparse representation learning methods is that we generate sparse features by construction, while most previous works focus on designing regularization techniques. We can theoretically guarantee sparsity and, importantly, our method ensures sparsity from the beginning of learning, without the need to tune a regularization weight. Furthermore, our approach maps from a low-dimensional feature space to a high-dimensional sparse feature space without introducing any additional training parameters. Our empirical demonstration covers classic discrete-action control and Mujoco continuous robotics control problems. We show that reinforcement learning algorithms equipped with our deep tile coder achieve superior performance. To the best of our knowledge, our work is the first to demonstrate a successful application of a sparse representation learning method in online deep reinforcement learning algorithms for challenging tasks without using a target network.
Key words: tile coding; reinforcement learning; deep neural networks; representation learning
Preprint. Under review. 2020.
Department of Computing Science, University of Alberta
1 Introduction
Representation learning plays a significant role in the learning efficiency of large machine learning systems. In particular, the performance of reinforcement learning (RL) agents in the function approximation setting can be largely affected by the quality of the representation Talvitie and Bowling (2015); Heravi (2019); Le et al. (2017); Liu et al. (2018); Chandak et al. (2019); Caselles-Dupré et al. (2018); Madjiheurem and Toni (2019). One reason is that many RL algorithms use training targets that typically involve bootstrapping estimates from the function approximator itself. This imposes a strong requirement on the robustness of the representation. As pointed out by several previous works McCloskey and Cohen (1989); French (1991); Liu et al. (2018); Ghiassian et al. (2018), the desired properties of a good representation include at least: 1) reducing catastrophic interference, and 2) preventing forgetting. Learning a sparse representation is a promising direction for achieving these two properties, as updated parameters are likely to have only a local effect, i.e., only a small part of the function domain is affected. Besides, encouraging sparsity has some other benefits. For example, it provides a way to identify interesting features, since only a few entries in a feature vector can be activated and they have to be informative enough to express the quantities of interest. From another perspective, sparse features are more likely to be linearly independent Cover (1965) than dense ones. The traditional machine learning community has been actively studying sparse feature learning, including radial basis functions, distributed representations Kanerva (1988); Ratitch and Precup (2004); Bengio et al. (2013); O'Reilly and Munakata (2000), and tile coding, encouraged by reinforcement learning research Sutton (1996a); Whiteson et al. (2007); Sutton and Barto (2018). Incremental sparse feature learning methods are typically formulated as a matrix factorization problem Mairal et al. (2009b, a); Le et al. (2017).
It should be noted that, in reinforcement learning, although several works Le et al. (2017); Liu et al. (2018); Rafati and Noelle (2019) have shown the potential benefits of using sparse features, to the best of our knowledge, no existing work utilizes sparse features to solve challenging tasks. There are two reasons: first, some methods are computationally too expensive to be used in an incremental manner for high-dimensional problems; second, many sparse feature learning algorithms are themselves difficult optimization problems and are therefore too complex to be adapted to the online reinforcement learning setting.
In order to solve challenging RL problems, deep neural networks are typically used as function approximators Silver et al. (2016); Mnih et al. (2015). Arguably, one of the most important and popular techniques for successfully training a deep RL algorithm is to use a target network, i.e. a separate slowly-updating network used for computing the bootstrap target Mnih et al. (2015), which is inspired by neural fitted Q iteration Riedmiller (2005). Such a technique, however, can substantially slow down learning, as the updated information is not used immediately when computing the target Ghiassian et al. (2018); Liu et al. (2018). Previous work by Liu et al. (2018) suggests that sparse representation is a promising direction for removing the need for a target network. Liu et al. (2018) empirically study the utility of sparse features in RL problems and propose a regularization technique for learning sparse features. However, their proposed strategy requires pretraining the neural network and does not enable online reinforcement learning for control problems. Rafati and Noelle (2019) suggest that sparse features can reduce the chance of failure of an RL algorithm by reducing catastrophic interference, and provide some empirical evidence.
This paper proposes the deep tile coder, inspired by tile coding, which is well known in RL and is typically used in the linear function approximation setting on small domains Sutton (1996a); Sutton and Barto (2018). Our method leverages the power of a neural network and can be easily scaled to high-dimensional problems. We develop a differentiable tile coding method, which is therefore compatible with backpropagation. Our deep tile coder is general enough to be used in any deep learning algorithm. In addition, our method does not introduce additional training parameters in the process of generating high-dimensional sparse features, and hence can be naturally scaled to extremely high sparse feature dimensions. We conduct rigorous experiments to empirically demonstrate the utility of our algorithm on a variety of challenging reinforcement learning problems, ranging from benchmark discrete control to continuous robotics control.
We use bold letters for vectors (e.g. $\mathbf{a}$) and bold capitals for matrices (e.g. $\mathbf{A}$). A subscript on a vector indicates the scalar at the corresponding location, i.e. $\mathbf{a}_i$ is the $i$th element of the vector. $[n]$ indicates the set of integers $\{1, 2, \dots, n\}$.
2 Background
In this section, we briefly review tile coding, which inspires our Deep Tile Coder (DTC) method introduced in Section 3. Then we review some background in reinforcement learning and highlight some particular challenges in deep RL algorithms, as it is the main empirical demonstration in our work.
2.1 Tile coding
Tile coding (we refer readers to http://www.incompleteideas.net/tiles.html for software and specific usage examples, and to https://medium.com/criteo-labs/tile-coding-an-efficient-sparse-coding-method-for-real-valued-data-e787eddf630a for a more detailed explanation) is a well-known sparse feature generation method in reinforcement learning Sherstov and Stone (2005); Sutton (1996b); Sutton and Barto (2018), and it works as follows. Suppose our sample is a scalar and we intend to convert it to a sparse feature vector. Tile coding specifies a set of tilings. All tilings have the same length, and each tiling can be denoted as a segment shifted by an offset value, which is typically much smaller than the segment length. Then we specify the number of tiles, which can be thought of as the resolution of the discretization on each of these segments/tilings. In Figure 1, we show two tilings, which can be denoted as two segments. Each tiling has the same number of tiles, meaning that we discretize each of the segments into small intervals of equal length. The scalar input hits the second tile on both tilings, which gives one onehot vector per tiling, and hence it can be coded by concatenating the two vectors.
A nice property of tile coding is that, with multiple tilings, we likely increase the generalization power of our tile coded feature. To see why, consider Figure 1 again and assume that we only use the tiling for coding. Then the two inputs do not share any information. However, with two tilings, benefiting from the offset value , the two inputs hit the same tile on the second tiling, from which they can share the same weight unit in learning tasks.
In general, given a scalar input and an arbitrary fixed tiling, the input must fall into at least one of the intervals of that tiling (at most two if it lands exactly on the boundary between two tiles, depending on the implementation). As a result, on each tiling we can define a binary vector with one entry per tile, where each element indicates whether the input falls into the corresponding interval or not. By concatenating all such vectors, we acquire a sparse feature representation for our scalar input, with at most one nonzero entry per tiling (two in the boundary case) in the concatenated vector. One can see that tile coding has the nice property that sparsity can be guaranteed by choosing an appropriate parameter setting.
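The scalar procedure above can be sketched in a few lines of plain Python. This is a minimal illustration and not the reference implementation linked earlier: the function name `tile_code_scalar`, the convention that a tiling starts at `low - offset`, and the concrete numbers (four tiles of width 0.25, offsets 0 and 0.1) are all assumptions made for the example.

```python
import math


def tile_code_scalar(z, low, tile_width, num_tiles, offsets):
    """Concatenate one binary vector per tiling for the scalar input z.

    Assumed convention: the tiling with a given offset is the segment
    starting at low - offset, split into num_tiles intervals of tile_width.
    """
    features = []
    for offset in offsets:
        start = low - offset
        index = math.floor((z - start) / tile_width)
        one_hot = [0] * num_tiles
        if 0 <= index < num_tiles:  # inputs outside this tiling hit no tile
            one_hot[index] = 1
        features.extend(one_hot)
    return features


# Two tilings with four tiles each: [0, 1) and [-0.1, 0.9).
a = tile_code_scalar(0.3, 0.0, 0.25, 4, [0.0, 0.1])  # [0, 1, 0, 0, 0, 1, 0, 0]
b = tile_code_scalar(0.2, 0.0, 0.25, 4, [0.0, 0.1])  # [1, 0, 0, 0, 0, 1, 0, 0]
```

In this example the inputs 0.2 and 0.3 hit different tiles on the first tiling but share a tile on the offset tiling, which is the generalization effect of multiple tilings described above.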
This procedure generalizes to high dimensional input , in which case each tiling in becomes a high dimensional object. For example, when , we can define each tiling as a square with tiles on it. Then, given an input , we identify the tile hit by . Then a dimensional binary vector can be defined on each square and concatenating all those binary vectors gives us a dimensional binary vector with at most nonzero entries. One can see that, the computational and memory cost of tile coding grows exponentially as the input dimension increases.
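To make the exponential growth concrete, the following sketch counts features under two schemes: tiling the full input space with a grid, and tile coding each coordinate independently. The function names and the numbers used (20 tiles per dimension, one tiling) are hypothetical and chosen only for illustration.

```python
def grid_tile_coding_size(num_tilings, tiles_per_dim, input_dim):
    """Feature count when every tiling is a full grid over all input dimensions."""
    return num_tilings * tiles_per_dim ** input_dim


def per_dimension_tile_coding_size(num_tilings, tiles_per_dim, input_dim):
    """Feature count when each input dimension is tile coded independently."""
    return num_tilings * tiles_per_dim * input_dim


# Grid tilings grow exponentially with the input dimension ...
sizes_grid = [grid_tile_coding_size(1, 20, d) for d in (1, 2, 10)]
# ... while independent per-dimension coding grows only linearly.
sizes_independent = [per_dimension_tile_coding_size(1, 20, d) for d in (1, 2, 10)]
```

With these assumed settings, a 10-dimensional input already needs 20^10 grid features per tiling, against 200 when the coordinates are coded independently.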
If we know the two elements in are always independent, then it is reasonable to tile code them independently and concatenate the corresponding sparse binary vectors. Sutton (1996b) demonstrates such usage of tile coding by choosing subsets of input variables and tile code those lower dimensional input vectors independently. Such strategy requires significant engineering or preprocessing based on domain knowledge as dependency among input variables is usually unknown Bellemare et al. (2012). Furthermore, numerous combinations may be needed for tile coding in high dimensional case.
In section 3, we introduce our deep tile coder which takes advantage of a deep neural network to learn a hidden representation, and we design a differentiable tile coding operation to generate sparse feature from the hidden representation without introducing additional training parameters. The differentiability of deep tile coder allows it to be used in backpropagation algorithms when training neural networks. Similar to the vanilla tile coding, the sparsity can be guaranteed for deep tile coder.
2.2 Reinforcement Learning
Reinforcement learning is typically formulated within the discounted Markov Decision Process (MDP) framework Sutton and Barto (2018); Szepesvári (2010). A discounted MDP is described by a tuple $(\mathcal{S}, \mathcal{A}, \mathbb{P}, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathbb{P}$ is the transition probability kernel, $R$ is the reward function, and $\gamma$ is the discount factor. At each time step $t$, the agent observes a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. Then the environment transitions to the next state according to the transition probability distribution, i.e., $s_{t+1} \sim \mathbb{P}(\cdot \mid s_t, a_t)$, and the agent receives a scalar reward $r_{t+1}$ according to the reward function $R$. A policy $\pi$ is a mapping from a state to an action (distribution). For a given state-action pair $(s, a)$, the action-value function under policy $\pi$ is defined as $Q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$, where $G_t = \sum_{i=0}^{\infty} \gamma^i r_{t+1+i}$ is the return of a sequence of transitions obtained by following the policy $\pi$.
In the control setting, the goal of an agent is to find an optimal policy such that some performance measure is optimized. Policy gradient methods typically use the mean reward or the initial state value as the objective and directly perform gradient ascent with respect to (w.r.t.) the policy parameters Sutton et al. (1999); Sutton and Barto (2018). Value-based methods compute the value function (e.g., by performing approximate value iteration and its variants), and obtain a near-optimal policy based on the obtained value function Watkins and Dayan (1992); Szepesvári (2010). A popular value-based deep RL algorithm is Deep Q-Network (DQN) Mnih et al. (2015), which updates the parameter $\theta$ of the Q-network $Q_\theta$ by sampling a mini-batch of experiences from an experience replay buffer Lin (1992) at each environment time step. That is, we sample a mini-batch of $m$ transitions of the form $(s_i, a_i, r_i, s'_i)$ to update $\theta$ by minimizing:
$$\frac{1}{m} \sum_{i=1}^{m} \left( y_i - Q_\theta(s_i, a_i) \right)^2,$$
where $y_i$ is the one-step bootstrap target computed with a separate target network $Q_{\theta^-}$ parameterized by $\theta^-$:
$$y_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s'_i, a').$$
The target network parameter $\theta^-$ is updated by copying $\theta$ periodically. It should be noted that the target network technique is also popular in policy-based approaches, particularly in actor-critic algorithms Konda (2002); Haarnoja et al. (2018b, a), where the target network is frequently used for learning a critic. For example, a popular actor-critic continuous control algorithm is deep deterministic policy gradient (DDPG) by Lillicrap et al. (2016), which is built upon the deterministic policy gradient theorem Silver et al. (2014). Let $\pi_\phi$ be the actor network parameterized by $\phi$, and $Q_\theta$ be the critic. The critic is updated with a target network in a similar way to DQN, except that the maximum action value is computed by using the actor's output action: $y_i = r_i + \gamma Q_{\theta^-}(s'_i, \pi_{\phi^-}(s'_i))$, where $\theta^-, \phi^-$ are the target network parameters of the critic and the actor respectively.
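As a rough illustration of the one-step bootstrap target and the squared error it feeds, consider the following plain-Python sketch. It is not the authors' code: the function names are hypothetical, the target-network action values are passed in as precomputed lists, and the `dones` flag that suppresses bootstrapping at terminal transitions is a common convention assumed here rather than something stated in the text.

```python
def dqn_targets(rewards, next_q_values, dones, gamma):
    """One-step targets r + gamma * max_a' Q_target(s', a') for a mini-batch.

    next_q_values[i] holds the target network's action values at the i-th
    next state; dones[i] marks terminal transitions (assumed convention).
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


def mean_squared_td_error(q_taken, targets):
    """Average squared difference between the targets and Q(s, a)."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


targets = dqn_targets([1.0, 0.0], [[0.5, 2.0], [1.0, 3.0]], [False, True], 0.5)
loss = mean_squared_td_error([1.0, 1.0], targets)
```

Removing the target network, as in the NoTarget variants studied later, amounts to computing `next_q_values` with the current parameters instead of a periodically copied set.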
The target network technique is not in accordance with the spirit of fully incremental, online reinforcement learning, and it can potentially slow down learning Kim et al. (2019); Liu et al. (2018); Ghiassian et al. (2018), because the updated information is not immediately reflected when computing the bootstrap target. However, it is empirically considered a successful strategy for stabilizing the training process van Hasselt et al. (2018); Mnih et al. (2015); Yang et al. (2019). In Section 4, we empirically demonstrate that DQN with our deep tile coder can significantly outperform vanilla DQN, whether or not a target network is used.
3 Deep Tile Coder
In this section, we first introduce our main approach for generating sparse representations, called Deep Tile Coder (DTC), in Section 3.1. Then we provide a simple theoretical guarantee regarding the sparsity of the learned representation in Section 3.2. We attach the Python code of our TensorFlow-based Abadi et al. (2015) implementation in Appendix 6.1.
3.1 Algorithm description
We leverage the representation power of a neural network and consider the outputs of a hidden layer as conditionally independent; tile coding can then be applied to each individual feature unit, and the final sparse feature vector is acquired by concatenating all of the corresponding onehot vectors. Importantly, in order to conveniently train neural networks by backpropagation, we develop a differentiable tile coding operation which maps the learned hidden representation to a sparse representation, and this mapping does not introduce any additional training parameters. The sparse representation can then be used to solve the desired task, such as regression, classification, or decoding (typically through a linear operation). Notice that treating hidden layer outputs as independent is not new; it has been frequently used in factor analysis Weber (1970), mean field approximation for variational inference Blei et al. (2016), etc.
Given an input $\mathbf{x}$, let $\phi_\theta(\mathbf{x}) \in \mathbb{R}^d$ be some hidden layer whose output values are determined by the parameter vector $\theta$. We write $\phi$ as shorthand unless clarification is needed. Now we would like to convert $\phi$ to a sparse feature vector by deep tile coding. Assume that the layers before $\phi$ are powerful enough to capture the dependencies, so that we can treat the hidden units as independent of each other. Consider first that we use a bounded activation function, so we have $\phi \in [l, u]^d$. For example, choosing the sigmoid activation gives $[l, u] = [0, 1]$. Given a tile width $\delta$, we denote a tiling as a $k$-dimensional vector $\mathbf{c}$ (i.e. $k$ tiles, $\mathbf{c} \in \mathbb{R}^k$), where the integer $k = (u - l)/\delta$ (the activation function type and the tile width are chosen by the user; we assume they are chosen such that $k$ is an integer):
$$\mathbf{c} = (l,\; l + \delta,\; l + 2\delta,\; \dots,\; u - \delta). \quad (1)$$
Compared with the tiling introduced in Section 2.1, the above definition corresponds to using the cut-off points between tiles. One difference from vanilla tile coding is that we take the offset values to be the same across tilings, since those offset values are constants. We design the following function to map a scalar $z$ to a onehot vector:
$$h(z) = 1 - I_+\big(\max(\mathbf{c} - z, 0) + \max(z - \delta - \mathbf{c}, 0)\big), \quad (2)$$
where $I_+(\cdot)$ is an indicator function which operates on the input element-wise. It returns $1$ if the corresponding element is positive and $0$ otherwise. A vector minus a scalar is computed by subtracting the scalar from each element of the vector. We can then convert each element of $\phi$ to a onehot vector through this function.
We slightly abuse the notion of a onehot vector here. The function (2) gives a vector with exactly one nonzero entry except when its scalar input is exactly equal to one of the cut-off points between two tiles, in which case there are two nonzero entries. This means that when the input hits the boundary between two neighboring tiles, we regard it as activating both tiles. We analyze this in Lemma 1. In Appendix 6.2, we provide an operation for generating a vector with exactly one nonzero entry and discuss the possibility of using a more complicated/general DTC, which may be of independent interest to certain research communities.
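Consider that we use an activation function bounded in $[0, 1]$. Let the tile width be $\delta = 0.25$; the tiling with four tiles then has length $1$ and can be denoted as the vector $\mathbf{c} = (0, 0.25, 0.5, 0.75)$. Assume we now want to convert $z = 0.3$ to a onehot vector. As $0.25 < 0.3 < 0.5$, the desired output onehot vector is $(0, 1, 0, 0)$ (i.e. $z$ hits the second tile on the tiling). We now verify that we can acquire this onehot vector through the above function (2), which computes $1 - I_+(\max(\mathbf{c} - z, 0) + \max(z - \delta - \mathbf{c}, 0))$:
$$\max(\mathbf{c} - z, 0) = (0, 0, 0.2, 0.45), \qquad \max(z - \delta - \mathbf{c}, 0) = (0.05, 0, 0, 0),$$
$$1 - I_+\big((0.05, 0, 0.2, 0.45)\big) = 1 - (1, 0, 1, 1) = (0, 1, 0, 0),$$
which is the desired result. For a hidden layer with $d$ units, going through the same operation for each unit gives us $d$ onehot vectors, and the concatenation of these vectors is a $kd$-dimensional vector with $d$ nonzero entries.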
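The verification above can also be checked numerically. The following plain-Python sketch evaluates the onehot function (2) in the form used by the code in Appendix 6.1 with the sparsity parameter set to zero. The name `onehot_tile` and the concrete values (four tiles of width 0.25 on $[0, 1]$, inputs 0.3, 0.5 and 0) are illustrative assumptions.

```python
def onehot_tile(z, cutoffs, delta):
    """Evaluate 1 - I_+(max(c - z, 0) + max(z - delta - c, 0)) for each cut-off c."""
    output = []
    for c in cutoffs:
        # Zero exactly when c <= z <= c + delta, i.e. when z lies in the tile starting at c.
        distance = max(c - z, 0.0) + max(z - delta - c, 0.0)
        output.append(0 if distance > 0 else 1)
    return output


cutoffs = [0.0, 0.25, 0.5, 0.75]
inside = onehot_tile(0.3, cutoffs, 0.25)     # [0, 1, 0, 0]: second tile
boundary = onehot_tile(0.5, cutoffs, 0.25)   # [0, 1, 1, 0]: both neighboring tiles
left_edge = onehot_tile(0.0, cutoffs, 0.25)  # [1, 0, 0, 0]: first tile only
```

The boundary input reproduces the two-nonzero-entry case noted in Remark 1, while the left edge of the tiling still yields a single active tile.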
Instead of using a binary vector, in which the activated unit is $1$, we can use another hidden layer to supply an activation strength, which makes the feature more expressive. That is, we can acquire the final sparse vector by taking the product between the onehot vector given by (2) and a hidden unit from another hidden layer. Notice that this scalar-vector product does not increase the number of nonzero entries, and hence sparsity is still guaranteed.
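It should be noted that the above function is problematic for training the neural network with backpropagation, since the indicator $I_+$ has zero derivative everywhere except at the non-differentiable point $0$. We now propose to approximate $I_+$ by the following function:
$$I_{\eta,+}(\mathbf{x}) = \mathbb{I}(\mathbf{x} \le \eta) \circ \mathbf{x} + \mathbb{I}(\mathbf{x} > \eta), \quad (6)$$
where $\eta$ is some small constant parameter for controlling the sparsity, $\mathbb{I}(\cdot)$ is an indicator function that operates element-wise and returns $1$ if the input Boolean variable is true and $0$ otherwise, and $\circ$ is the element-wise product. Note that the original indicator function $I_+$ is recovered by setting $\eta = 0$. When $\eta > 0$, gradients can be backpropagated through this approximation for all entries which are less than or equal to $\eta$. Replacing $I_+$ by $I_{\eta,+}$ in (2), where $\mathbf{c}$ is the vector of tile cut-off points, $\delta$ the tile width and $z$ the scalar input, we can approximate the function as:
$$h_\eta(z) = 1 - I_{\eta,+}\big(\max(\mathbf{c} - z, 0) + \max(z - \delta - \mathbf{c}, 0)\big). \quad (7)$$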
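To see the effect of the sparsity parameter, the sketch below implements the approximate indicator in the same form as the `Iplus` function of Appendix 6.1 and applies it to one scalar. It is written in plain Python for illustration only; the names `indicator_eta` and `soft_tile` and the numeric values are assumptions.

```python
def indicator_eta(x, eta):
    """Approximate indicator: identity on x <= eta, constant 1 on x > eta."""
    return x if x <= eta else 1.0


def soft_tile(z, cutoffs, delta, eta):
    """Function (7): the onehot map with I_+ replaced by its eta-approximation."""
    return [
        1.0 - indicator_eta(max(c - z, 0.0) + max(z - delta - c, 0.0), eta)
        for c in cutoffs
    ]


cutoffs = [0.0, 0.25, 0.5, 0.75]
hard = soft_tile(0.3, cutoffs, 0.25, 0.0)  # [0.0, 1.0, 0.0, 0.0]
soft = soft_tile(0.3, cutoffs, 0.25, 0.1)  # the first tile is now partially active
```

With the sparsity parameter at 0.1, the neighboring first tile receives an activation of roughly 0.95 because the input lies within 0.1 of it, so a larger value trades some sparsity for entries through which gradients can flow.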
We summarize the algorithm which maps a vector to a -dimensional sparse vector in Algorithm 1 by using the above function (7). The algorithm takes two input vectors and . The former is used to compute which entry in the sparse vector should be activated, and the latter is used to give a specific activation strength for activated units as discussed in the previous Remark 2. Notice that, our DTC can be plugged into any neural network architecture and be trained with any loss function in an end-to-end manner. Furthermore, our DTC algorithm itself does not introduce any additional training parameters, in contrast to the regularization-based methods. Figure 2 shows an example of using DTC in a feedforward neural network. The example shows that after sharing the first hidden layer, two-stream second hidden layers are used for computing which tiles to hit and for computing the activation strength respectively. In practice, users can flexibly design the hidden layer for computing activation strength. We provide some suggestions in Appendix 6.1.
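The loop structure of the mapping just summarized can be sketched as follows. This is a plain-Python illustration under assumptions, not a transcription of Algorithm 1: the function name `deep_tile_code` is hypothetical, the argument names `shoot` and `strength` are borrowed from the code in Appendix 6.1, and the numeric values are made up.

```python
def deep_tile_code(shoot, strength, cutoffs, delta, eta):
    """Map d hidden units to a d * k sparse vector (k = number of tiles).

    shoot[i] selects which tiles unit i activates; strength[i] scales them.
    """
    sparse = []
    for z, s in zip(shoot, strength):  # one pass per hidden unit
        for c in cutoffs:
            distance = max(c - z, 0.0) + max(z - delta - c, 0.0)
            soft = distance if distance <= eta else 1.0  # approximate indicator
            sparse.append((1.0 - soft) * s)
    return sparse


cutoffs = [0.0, 0.25, 0.5, 0.75]
features = deep_tile_code([0.3, 0.8], [2.0, -1.0], cutoffs, 0.25, 0.0)
num_active = sum(1 for value in features if value != 0)  # one per hidden unit here
```

No trainable quantity appears inside the function: the cut-off points, tile width and sparsity parameter are fixed constants, which is why the mapping adds no training parameters.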
In principle, we need to assume that the hidden units must be bounded. This may limit the generality of the neural network. However, in Section 4.2, we empirically show that an unbounded function can be also used in practice with an additional loss to penalize the out of boundary values.
3.2 Theoretical Analysis
We now provide some simple theoretical analysis for our deep tile coding method. The first lemma verifies that the function indeed gives the desired onehot vector; the second lemma provides sparsity guarantee for our DTC algorithm.
Onehot vector verification. Let the tile width be some reasonable value so that is an vector with evenly spaced increasing elements as defined in (1). Fixed arbitrary .
1) If for some (define the out of boundary value ), then the function gives an onehot vector where the th entry is ;
2) If for some , returns a vector with two nonzero entries at th, th positions.
By assumption, in either case, the first maximum operation in is
For the second maximum operation,
1) since , ; and this implies:
So has positive entry everywhere except the th position. Hence gives a vector where every entry is except the th entry which is . Then is a onehot vector where the th entry is one.
2) when , then and
It follows that the vector has two zero entries at th and th entry, and gives us a vector with ones at th and th entry.
Particularly in the second case, when , is a zero vector and is positive everywhere except the first entry, so still gives a onehot vector where the only nonzero entry is the first entry. ∎
Denote . Consider first, since the case simply gives us one more nonzero entry. Similar to the above proof for Lemma 1,
taking the sum of the two equations gives us a vector as following:
We count the number of entries less than in this vector from the th position where the corresponding entry is zero.
First, count number of entries on the left side of the th position. Since the th position is zero, which indicates , hence and it follows that . Then . Then the total number of entries on the left side of the th position is at most .
Second, count the number of entries on the right side of the th position. Since , . Hence the maximum number of entries on the right side of th position is .
As a result, taking into consideration the case that , the number of nonzero entries is at most by processing a single element in (i.e. a single for loop in Algorithm 1). After the for loop, we would have at most nonzero elements and hence the corresponding proportion of nonzero entries is at most
Typically is chosen as some small value. Consider . Then even for a tiling with tiles, the proportion of nonzero entries is no more than . As we empirically verify later in Section 4.1, the proportion of nonzero entries is very low. Furthermore, we want to emphasize that DTC achieves sparsity by construction/design instead of learning through some regularization. Hence, sparsity is guaranteed at the beginning of learning.
4 Empirical Demonstration in Reinforcement Learning
In this section, we first show empirical results on benchmark discrete-action domains with extensive runs. Then we demonstrate the utility of our algorithm on challenging Mujoco robotics continuous control domains. The naming rule for the baselines we use is as follows.
(NoTarget)DQN: the DQN algorithm with or without using target network respectively.
(NoTarget)DQN-DTC: (NoTarget)DQN equipped with our DTC to acquire sparse feature and an action value is defined as a linear function of the sparse feature.
4.1 Discrete control
The purposes of the experiment are: 1) rigorously compare DQN using our DTC with several baselines with extensive runs and sufficient number of training steps to ensure that the algorithm does not diverge in the long term; 2) show that using DTC can significantly improve performance and can learn stably in the sense that it has low standard error across runs; 3) verify the sparsity of the sparse feature obtained by DTC. We use hidden units on MountainCar, CartPole, Acrobot and use hidden units on LunarLander as it is a more difficult task. Across all experiments, for DTC setting, we use , and hence number of tiles . We use units for second hidden layer to ensure the feature before DTC is bounded within . Note that our DTC maps from a -dimensional vector to a -dimensional sparse feature for computing action-values (and from to dimensions on LunarLander domain). In order to ensure fair comparison, for DQN and NoTargetDQN, we sweep over hidden unit types between ReLU, tanh to optimize the algorithms. Furthermore, we test DQN/NoTargetDQN: 1) with the same size -network as we have for DTC versions before DTC operation; 2) with a larger -network whose second layer hidden size is the same as sparse feature dimension generated by DTC.
Figure 3 shows the learning curves of different algorithms on benchmark domains from OpenAI Gym Brockman et al. (2016): MountainCar (MCar), CartPole, Acrobot, LunarLander (Lunar). From this figure, one can see that: 1) with or without a target network, DQN with DTC significantly outperforms the version without DTC; 2) the DTC versions have significantly lower standard errors/variances in most of the figures; 3) NoTargetDQN-DTC outperforms DQN-DTC in general, which indicates a potential gain from removing the target network; 4) without DTC, NoTargetDQN does not perform well in general, which clearly indicates the utility of sparse features and coincides with the conclusions/conjectures of several previous works Liu et al. (2018); Rafati and Noelle (2019); 5) simply using a larger neural network does not guarantee a performance improvement.
Given our DTC setting, Lemma 2 guarantees that the proportion of nonzero entries in our learned feature representation should be no more than . This measure of sparsity is typically called instance sparsity. In below Table 1, one can see that our learned feature has lower proportion of nonzero entries than the upper bound provided by the lemma. Additionally, we report overlap sparsity French (1991) which is defined as
given two sparse vectors . It can be thought of as a measure of the level of representation interference. Low overlap sparsity potentially indicates less feature interference between different input samples. We compute an estimate of sparsity by sampling a mini-batch of samples from experience replay buffer and taking the average of them. The reported numbers in the table are acquired by taking the average of those estimates across k training steps. It should be noted that, the sparsity of DTC is achieved by the design of tile coding, i.e. the choice of the number of tiles and . This explains that the sparsity achieved on each domain is very similar to each other, since we use the same and number of tiles across all tests.
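The two sparsity measures can be computed as in the sketch below. Because the overlap formula did not survive in this copy of the text, the definition used here (the fraction of coordinates that are nonzero in both vectors) is an assumption modeled on the description above, and both function names are hypothetical.

```python
def instance_sparsity(phi):
    """Proportion of nonzero entries in a single feature vector."""
    return sum(1 for value in phi if value != 0) / len(phi)


def overlap_sparsity(phi_a, phi_b):
    """Assumed definition: share of coordinates active in both vectors."""
    both_active = sum(1 for a, b in zip(phi_a, phi_b) if a != 0 and b != 0)
    return both_active / len(phi_a)


phi_a = [0, 1, 0, 0, 0, 1, 0, 0]
phi_b = [1, 0, 0, 0, 0, 1, 0, 0]
density = instance_sparsity(phi_a)        # 0.25
overlap = overlap_sparsity(phi_a, phi_b)  # 0.125
```

An estimate over a mini-batch, as described above, would average these quantities over the sampled feature vectors (or pairs of vectors).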
4.2 Continuous control
The purpose of our robotics continuous control experiments is to show that: 1) DTC can work even with unbounded activation function by adding an additional loss for out of boundary values; 2) our sparse feature can be used to solve challenging continuous Mujoco Todorov et al. (2012) control problems which indicates the practical utility of DTC. To our best knowledge, this is the first time sparse representation learning method is used in an online manner to solve challenging continuous control problems.
In order to use unbounded activation function, we introduce an additional out of boundary loss:
where indicates the hidden layer right before using DTC and is the bound for the tilings. The intuition of this loss is as following. The activation function is unbounded (i.e. linear), but we have to use a bound to do tile coding. Consider that we use a tiling: . Then we enforce a constraint so that most of the values in should be within the boundary. It should be noted that, for those values which are out of boundary, they do not activate any tile; as a result, the effect of going out of boundary does not increase the density of the representation.
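One way such a penalty could look is sketched below. The exact loss did not survive in this copy of the text, so this is only a plausible form under stated assumptions: it charges each hidden value for the amount by which it leaves an assumed tiling range `[low, high]`, and the function name is hypothetical.

```python
def out_of_boundary_loss(hidden, low, high):
    """Sum over hidden units of how far each value lies outside [low, high].

    Assumed form for illustration; values inside the range contribute zero.
    """
    return sum(max(low - value, 0.0) + max(value - high, 0.0) for value in hidden)


penalty = out_of_boundary_loss([0.5, 1.5, -2.0], -1.0, 1.0)  # 0.5 + 1.0 = 1.5
no_penalty = out_of_boundary_loss([0.0, -1.0, 1.0], -1.0, 1.0)  # 0.0
```

A term of this kind would be added to the main training loss, so that gradient descent pulls the linear hidden units back into the range covered by the tiling.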
We are able to keep exactly the same DTC setting across all continuous control experiments: we use (i.e. tiles on each tiling, the same as we do for the above discrete control experiments). We use ReLU units for all algorithms except for NoTargetDDPG-DTC, whose second hidden layer activation function is linear (no activation function). Then DTC is applied to this hidden layer to acquire sparse feature and the action value is a linear function of it. Figure 4 shows that our DDPG equipped with DTC can always achieve superior performance than, or at least comparable performance with, vanilla DDPG; while vanilla DDPG without using a target network performs significantly worse than our algorithm on most of the domains. This further verifies the practical effectiveness of using DTC for RL problems.
We propose a novel and simple sparse representation learning method, Deep Tile Coder (DTC), which can efficiently map a dense representation to a high-dimensional sparse representation without introducing additional training parameters. We design a differentiable tile coding operation so that DTC can be conveniently incorporated into any neural network architecture and trained with any loss function in an end-to-end manner. We empirically study the utility of DTC in RL algorithms on various benchmark domains. Our experimental results show that RL algorithms equipped with DTC are able to learn with lower variance and do not need a target network. We believe DTC is an important step towards fully incremental, online reinforcement learning.
We would like to point out several interesting future directions. First, the complex tile coding discussed in Appendix 6.2 may have special utility for improving the interpretability of deep learning, because defining tilings with different resolutions implicitly forces a neural network to learn finer sensors on those hidden units which are assigned finer resolutions. Second, it is worth investigating the properties of our onehot operation by considering it as a special type of activation function. Note that if the previous hidden layer is linear (without any activation function), our function can be thought of as a special composition of ReLUs which maps a one-dimensional scalar to a high-dimensional vector.
6.1 Reproducible Research
We now provide details to reproduce our experimental results.
All discrete action domains are from OpenAI Gym Brockman et al. (2016) with version . Deep learning implementation is based on TensorFlow Abadi et al. (2015) with version . We use Adam optimizer Kingma and Ba (2014), Xavier initializer Glorot and Bengio (2010), mini-batch size , buffer size k, and discount rate across all experiments. All activation functions are ReLU except: the output layer of the value is linear, and the second hidden layer is using tanh for our DTC versions. The output layers were initialized from a uniform distribution .
We set the episode length limit as for MountainCar and keep all other episode limit as default settings. We use warm-up steps for populating the experience replay buffer before training. Exploration noise is without decaying. For DQN, we use target network moving frequency , i.e. update target network parameters every training steps. The learning rate is for all algorithms. For each random seed, we evaluate one episode every environment time steps and keep a small noise when taking action.
On Mujoco domains, we use exactly the same settings as in the original DDPG paper Lillicrap et al. (2016), except that we use a smaller neural network size relu units for DDPG and NoTargetDDPG. For our algorithm, we use a linear second hidden layer before the DTC operation. We use warm-up time steps to populate the experience replay buffer. We evaluate each algorithm every environment time steps, starting after k time steps.
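We attach the core part of our DTC Python code below.

import tensorflow as tf

def Iplus(x, eta):
    # Approximate indicator of x > 0: returns x itself when x <= eta, and 1 when x > eta.
    return tf.cast(x <= eta, tf.float32) * x + tf.cast(x > eta, tf.float32)

# sparse_dim = d * k
def dtc(shoot, strength, c, d, sparse_dim, delta, eta):
    # Reshape so that each of the d hidden units broadcasts against the tiling vector c.
    x = tf.reshape(shoot, [-1, d, 1])
    strength = tf.reshape(strength, [-1, d, 1])
    # An entry stays nonzero only when x lies in (or within eta of) the tile [c, c + delta].
    sparsevec = (1.0 - Iplus(tf.nn.relu(c - x) + tf.nn.relu(x - delta - c), eta)) * strength
    return tf.reshape(sparsevec, [-1, sparse_dim])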
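As a sanity check on the listing above, the following is a minimal NumPy sketch of the same computation; it is not the authors' code. The tiling `c`, tile width `delta`, threshold `eta`, and the input `h` are illustrative values chosen here, and the activation strength is fixed to 1 for simplicity.

```python
import numpy as np

def iplus(x, eta):
    # Mirrors Iplus in the listing: x itself when x <= eta, 1 when x > eta.
    return np.where(x <= eta, x, 1.0)

def dtc_numpy(h, c, delta, eta):
    # h: (batch, d) dense features; c: (k,) left edges of the tiles.
    x = h[:, :, None]                                              # (batch, d, 1)
    dist = np.maximum(c - x, 0.0) + np.maximum(x - delta - c, 0.0)  # (batch, d, k)
    onehot = 1.0 - iplus(dist, eta)
    return onehot.reshape(h.shape[0], -1)                          # (batch, d * k)

# Assumed example: 4 tiles of width 0.5 covering [-1, 1].
c = np.array([-1.0, -0.5, 0.0, 0.5])
delta, eta = 0.5, 0.1
h = np.array([[-0.75, 0.25]])      # one sample with d = 2 hidden units
phi = dtc_numpy(h, c, delta, eta)
print(phi)
```

With these inputs each hidden unit falls strictly inside one tile, so each block of four entries contains a single nonzero value.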
A brief discussion on the additional activation strength.
We found empirically that, in our reinforcement learning experiments, using a separate activation strength layer makes no clear difference. There may be at least two reasons: 1) an optimal value function is not always necessary to obtain an optimal policy (consider shifting all values by a constant, or adding small perturbations that do not change the maximizing action); 2) the catastrophic interference problem may matter more than learning an accurate action value function in the reinforcement learning setting. However, we do find that using activation strength can significantly improve performance in regular machine learning tasks.
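The first reason can be illustrated with a small numerical sketch. The action values below are made-up numbers for a single state, used only to show that the greedy action is unchanged by a constant shift or by small perturbations that preserve the maximizer.

```python
import numpy as np

# Hypothetical action values for one state (illustrative numbers only).
q = np.array([1.0, 2.5, 0.3])

shifted = q + 10.0                           # constant offset on every action
perturbed = q + np.array([0.2, -0.1, 0.4])   # small errors that keep the same maximizer

greedy = int(np.argmax(q))
greedy_shifted = int(np.argmax(shifted))
greedy_perturbed = int(np.argmax(perturbed))
print(greedy, greedy_shifted, greedy_perturbed)
```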
6.2 More Complicated Tiling Design
As we mentioned in Remark 1, the function can actually give two nonzero entries and the interpretation is that hitting the middle between two tiles can activate both tiles. For completeness, we also provide the below function which yields a rigorous onehot vector for each tiling and is differentiable:
The idea is quite simple: we basically use an indicator function to judge whether is equal to any one of the cut off values on the tiling or not. Since we use approximation for the indicator function which would finally give multiple nonzero entries for each tiling, we do not see the necessity of using the above more complicated function.
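The boundary behaviour mentioned in this subsection can be checked numerically. The sketch below applies the basic tile activation from the code listing in Section 6.1 to a single scalar; the tile edges `c`, width `delta`, and threshold `eta` are assumed values picked for illustration.

```python
import numpy as np

def tile_activation(x, c, delta, eta):
    # 1 - Iplus(relu(c - x) + relu(x - delta - c)) for a single scalar x.
    dist = np.maximum(c - x, 0.0) + np.maximum(x - delta - c, 0.0)
    return 1.0 - np.where(dist <= eta, dist, 1.0)

c = np.array([-1.0, -0.5, 0.0, 0.5])   # assumed tile edges, tiles of width 0.5
delta, eta = 0.5, 0.1

inside = tile_activation(-0.75, c, delta, eta)    # strictly inside the first tile
boundary = tile_activation(-0.5, c, delta, eta)   # on the edge shared by the first two tiles
near = tile_activation(-0.45, c, delta, eta)      # within eta of that shared edge

print(np.count_nonzero(inside), np.count_nonzero(boundary), np.count_nonzero(near))
```

In this sketch, an input on or within `eta` of a shared edge activates both neighbouring tiles, while an input well inside a tile activates only that tile.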
Another type of complex deep tile coding.
Throughout the paper, we use a constant tiling represented by a vector for all elements in the vector . In fact, we can make this more general by defining tilings with different tiles/resolutions (s). That is, for each , we can have a specialized and . Then this would give a sparse feature vector with dimension . This is a way to incorporate human knowledge, as we can assume that some of the feature units should be more important, and providing a finer tiling for those units should force the neural network to learn to make those units informative.
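One way to realize per-unit tilings is sketched below in NumPy. This is an illustration under assumptions rather than the authors' implementation: the helper names, the two example tilings, and `eta` are all hypothetical. Each hidden unit gets its own tile edges and width, and the per-unit blocks are concatenated, so the output dimension in this sketch is the sum of the per-unit tile counts.

```python
import numpy as np

def tile_unit(x, c, delta, eta):
    # x: (batch,) values of one hidden unit; c: (k,) tile edges for that unit.
    dist = np.maximum(c - x[:, None], 0.0) + np.maximum(x[:, None] - delta - c, 0.0)
    return 1.0 - np.where(dist <= eta, dist, 1.0)   # (batch, k)

def dtc_per_unit(h, tilings, eta):
    # tilings: one (c_i, delta_i) pair per hidden unit, possibly of different sizes.
    blocks = [tile_unit(h[:, i], c_i, delta_i, eta)
              for i, (c_i, delta_i) in enumerate(tilings)]
    return np.concatenate(blocks, axis=1)

# Assumed tilings over [-1, 1]: coarse (2 tiles) for unit 0, fine (8 tiles) for unit 1.
coarse = (np.linspace(-1.0, 1.0, 2, endpoint=False), 1.0)
fine = (np.linspace(-1.0, 1.0, 8, endpoint=False), 0.25)

h = np.array([[0.35, 0.35], [-0.6, -0.6]])
phi = dtc_per_unit(h, [coarse, fine], eta=0.05)
print(phi.shape)
```

The same input value is resolved into one of two tiles for the coarse unit and one of eight tiles for the fine unit.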
6.3 Additional Results on Image Autoencoder
The purpose of the additional experiments is to investigate alternative ways of approximating the indicator function other than . We test another possibility to approximate . Recall that the function outputs if the input is positive and outputs if it is zero. As a result, we can also use the hyperbolic tangent function tanh in the observance of for any . In implementation, we choose some large values. Note that this approach is not as good as previously defined in terms of sparsity control: we can no longer provide a rigorous bound on the sparsity. When is too small we get a dense representation, and when is too large the tanh units can “die” (become inactive).
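The trade-off described above can be seen in a short NumPy sketch. The scale parameter `a`, the threshold `eta`, and the sample inputs are assumptions made for illustration; the threshold function mirrors `Iplus` from the code listing in Section 6.1.

```python
import numpy as np

def indicator_threshold(x, eta):
    # Threshold approximation from the DTC listing: x on [0, eta], 1 above eta.
    return np.where(x <= eta, x, 1.0)

def indicator_tanh(x, a):
    # tanh alternative: tanh(a * x) is 0 at x = 0 and approaches 1 for x > 0 as a grows.
    return np.tanh(a * x)

x = np.array([0.0, 0.05, 0.5, 2.0])        # nonnegative inputs to the indicator
small_scale = indicator_tanh(x, a=1.0)     # well below 1 at x = 0.5, so features stay dense
large_scale = indicator_tanh(x, a=100.0)   # close to 1 for every x > 0 shown here
thresholded = indicator_threshold(x, eta=0.1)

print(small_scale, large_scale, thresholded)
```

With a small scale, one minus the tanh output remains clearly nonzero away from the active tile, which is the dense regime. With a large scale, the output saturates almost everywhere, so gradients through tanh become very small.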
We conduct experiments in a regular machine learning setting using the popular image dataset Fashion-MNIST by Xiao et al. (2017). We learn an autoencoder to reconstruct the images from the Fashion-MNIST dataset. Because the images in the dataset are relatively low-dimensional, we use two fully connected ReLU layers to encode an input image to a -dimensional vector. For decoding, we include several intuitive baselines, described below, which may interest researchers from related areas.
RPLinear: after encoding an image to , we use Gaussian random projection Bingham and Mannila (2001) (RP) to project it to the same dimension as the sparse feature dimension achieved by our DTC, and the recovered image is linear in the projected feature, i.e. . This is contrary to the regular usage of random projection, which is typically to reduce feature dimension. The rationale for our design is twofold: 1) the neural network can be viewed as trying to learn a low-dimensional embedding which is compatible with the random projection, and 2) similar to our method, this projection process does not introduce additional training parameters.
L1SparseNonLinear: add one more hidden ReLU layer (NonLinear) after the encoded feature; this layer uses an L1 penalty to enforce sparse features. We sweep the regularization weight from . Note that this baseline has roughly more training parameters than the other algorithms.
TC-Ind/Tanh: use DTC to map to sparse feature and the reconstructed image is linear in the sparse feature. Ind and Tanh indicate or approximation for function in DTC respectively.
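To make the RPLinear baseline concrete, here is a minimal NumPy sketch of its forward pass. The sizes `d` and `sparse_dim`, the variance scaling of the projection, and the random stand-in for the encoder output are assumptions for illustration; only the 28 x 28 image size comes from Fashion-MNIST itself.

```python
import numpy as np

rng = np.random.default_rng(0)

d, sparse_dim, n_pixels = 4, 32, 784     # assumed sizes; 784 = 28 * 28 pixels
# Fixed Gaussian projection: sampled once and never trained.
P = rng.normal(0.0, 1.0 / np.sqrt(sparse_dim), size=(d, sparse_dim))
# Linear decoder: the only trainable weights after the encoder in this baseline.
W = np.zeros((sparse_dim, n_pixels))

z = rng.normal(size=(5, d))              # stand-in for a batch of encoded images
projected = z @ P                        # (5, sparse_dim), dense in general
reconstruction = projected @ W           # (5, n_pixels), linear in the projected feature

print(projected.shape, reconstruction.shape)
```

Unlike the DTC output, the projected feature has the same dimension but is dense, which is the contrast this baseline is meant to probe.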
From Figure 5, we can see that: 1) even though L1SparseNonLinear has a larger number of training parameters, our algorithm TC-Ind/Tanh still significantly outperforms it, which highlights the advantage of our DTC method; 2) the utility of using either tanh approximation or seems to depend on the activation function type of the low dimensional embedding; 3) a naive random projection to a high dimensional space performs significantly worse than our DTC sparse projection.
- The arcade learning environment: an evaluation platform for general agents. CoRR abs/1207.4708. Cited by: §2.1.
- Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
- Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. Cited by: 1st item.
- Variational Inference: A Review for Statisticians. arXiv e-prints. Cited by: §3.1.
- OpenAI gym. CoRR. Cited by: §4.1, §6.1.
- Continual state representation learning for reinforcement learning using generative replay. CoRR abs/1810.03880. Cited by: §1.
- Learning action representations for reinforcement learning. CoRR abs/1902.00183. Cited by: §1.
- Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Trans. Electronic Computers. Cited by: §1.
- TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §3, §6.1.
- Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Annual Cognitive Science Society Conference, Cited by: §1, §4.1.
- Two geometric input transformation methods for fast online reinforcement learning with neural nets. CoRR abs/1805.07476. Cited by: §1, §1, §2.2.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Cited by: §6.1.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, pp. 1861–1870. Cited by: §2.2.
- Soft actor-critic algorithms and applications. CoRR. Cited by: §2.2.
- Learning representations in reinforcement learning. Ph.D. Thesis, University of California, Merced. Cited by: §1.
- Sparse Distributed Memory. MIT Press. Cited by: §1.
- DeepMellow: removing the need for a target network in deep q-learning. pp. 2733–2739. Cited by: §2.2.
- Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §6.1.
- Actor-critic algorithms. Cited by: §2.2.
- Learning sparse representations in reinforcement learning with sparse coding. arXiv:1707.08316. Cited by: §1, §1.
- Continuous control with deep reinforcement learning. In ICLR, Cited by: §2.2, §6.1.
- Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Machine Learning. Cited by: §2.2.
- The utility of sparse representations for control in reinforcement learning. CoRR abs/1811.06626. Cited by: §1, §1, §1, §2.2, §4.1.
- Representation learning on graphs: A reinforcement learning application. CoRR abs/1901.05351. Cited by: §1.
- Supervised dictionary learning. In Advances in Neural Information Processing Systems, Cited by: §1.
- Online dictionary learning for sparse coding. In International Conference on Machine Learning, Cited by: §1.
- Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation. Cited by: §1.
- Human-level control through deep reinforcement learning. Nature. Cited by: §1, §2.2, §2.2.
- Computational explorations in cognitive neuroscience understanding the mind by simulating the brain. Cited by: §1.
- Learning sparse representations in reinforcement learning. arXiv e-prints. Cited by: §1, §1, §4.1.
- Sparse distributed memories for on-line value-based reinforcement learning. In Machine Learning: ECML PKDD, Cited by: §1.
- Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, Cited by: §1.
- Function approximation via tile coding: automating parameter choice. pp. 194–205. Cited by: §2.1.
- Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–503. Cited by: §1.
- Deterministic policy gradient algorithms. In ICML, pp. I–387–I–395. Cited by: §2.2.
- Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Cited by: §2.2.
- Reinforcement learning: an introduction. Second edition, The MIT Press. Cited by: §1, §1, §2.1, §2.2, §2.2.
- Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, Cited by: §1, §1.
- Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pp. 1038–1044. Cited by: §2.1, §2.1.
- Algorithms for reinforcement learning. Morgan Claypool Publishers. Cited by: §2.2, §2.2.
- Pairwise relative offset features for atari 2600 games. Cited by: §1.
- MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Cited by: §4.2.
- Deep reinforcement learning and the deadly triad. CoRR. Cited by: §2.2.
- Q-learning. Machine Learning. Cited by: §2.2.
- Modern factor analysis. Biometrische Zeitschrift 12 (1), pp. 67–68. Cited by: §3.1.
- Adaptive tile coding for value function approximation. Cited by: §1.
- External Links: Cited by: §6.3.
- A theoretical analysis of deep q-learning. CoRR abs/1901.00137. Cited by: §2.2.