Visualizing Dynamics: from tSNE to SEMIMDPs

Nir Ben Zrihem* bentzinir@gmail.com
Tom Zahavy* tomzahavy@campus.technion.ac.il
Shie Mannor shie@ee.technion.ac.il
Electrical Engineering Department, The Technion - Israel Institute of Technology, Haifa 32000, Israel

Abstract
Deep Reinforcement Learning (DRL) is a trending field of research, showing great promise in challenging problems such as playing Atari, solving Go and controlling robots. While DRL agents perform well in practice, we are still missing the tools to analyze their performance and visualize the temporal abstractions that they learn. In this paper, we present a novel method that automatically discovers an internal Semi Markov Decision Process (SMDP) model in the Deep Q Network's (DQN) learned representation. We suggest a novel visualization method that represents the SMDP model as a directed graph drawn above a tSNE map. We show how to interpret the agent's policy and give evidence for the hierarchical state aggregation that DQNs learn automatically. Our algorithm is fully automatic, does not require any domain-specific knowledge and is evaluated with a novel likelihood-based criterion.
DQN is an off-policy learning algorithm that uses a Convolutional Neural Network (CNN) (Krizhevsky et al., 2012) to represent the action-value function, and has shown superior performance on a wide range of problems (Mnih et al., 2015). The success of DQN, and of Deep Neural Networks (DNNs) in general, is explained by their ability to learn good representations of the data automatically. Unfortunately, this high representational power also makes them complex to train and hampers their wide use.
Visualization can play an essential role in understanding DNNs. Current methods mainly focus on understanding the spatial structure of the data.
For example, Zeiler & Fergus (2014) searched for training examples that cause high neural activation at specific neurons, Erhan et al. (2009) created training examples that maximize the neural activity of a specific neuron, and Yosinski et al. (2014) interpreted each layer as a group. However, none of these methods analyzed the temporal structure of the data.
Good temporal representation of the data can speed up the performance of Reinforcement Learning (RL) algorithms (Dietterich, 2000; Dean & Lin, 1995; Parr, 1998; Hauskrecht et al., 1998), and indeed there is a growing interest in developing hierarchical DRL algorithms. For example, Tessler et al. (2016) pretrained skill networks using DQNs and developed a Hierarchical DRL Network (HDRLN). Their architecture learned to choose between options operating at different temporal scales and demonstrated superior performance over the vanilla DQN in solving tasks in Minecraft. Kulkarni et al. (2016) took a different approach: they manually predefined subgoals for a given task and developed a hierarchical DQN (hDQN) that operates at different time scales. This architecture managed to learn how to solve both the subgoals and the original task and outperformed the vanilla DQN in the challenging Atari game 'Montezuma's Revenge'. Both these methods used prior knowledge about the hierarchy of a task in order to solve it. However, it is still unclear how to automatically discover the hierarchy a priori.
Interpretability of DQN policies is a pressing issue with many important applications. For example, it may help to distill a cumbersome model into a simple one (Rusu et al., 2015) and can increase human confidence in the performance of DRL agents. By understanding what the agent has learned, we can also decide where to grant it control and where to take over. Finally, we can improve learning algorithms by finding their weaknesses.
The internal model principle (Francis & Wonham, 1975), "Every good key must be a model of the lock it opens", was formulated mathematically for control systems by Sontag (2003), claiming that if a system solves a control task, it must necessarily contain a subsystem capable of predicting the dynamics of the system. In this work we follow the same line of thought and claim that DQNs learn an underlying spatiotemporal model of the problem, without explicitly being trained to. We identify this model as a Semi Aggregated Markov Decision Process (SAMDP), an approximation of the true MDP that allows human interpretability.
Zahavy et al. (2016) showed that by using handcrafted features, they can interpret the policies learned by DQN agents using a manual inspection of a t-Distributed Stochastic Neighbor Embedding (tSNE) map (Van der Maaten & Hinton, 2008). They also revealed that DQNs automatically learn temporal representations such as hierarchical state aggregation and temporal abstractions. However, their approach relies on manual reasoning over a tSNE map, a tedious process that requires careful inspection as well as an experienced eye.
In contrast, we suggest a method that is fully automatic. Instead of manually designing features, we use clustering algorithms to reveal the underlying structure of the tSNE map. Rather than naively applying classical methods, we designed novel time-aware clustering algorithms that take into account the temporal structure of the data. Using this approach we are able to automatically reveal the underlying dynamics and rediscover the temporal abstractions shown in (Zahavy et al., 2016). Moreover, we show that our method reveals an underlying SMDP model and confront this hypothesis qualitatively, by designing a novel visualization tool, and quantitatively, by developing likelihood criteria which we later test empirically.
The result is an SMDP model that gives a simple explanation of how the agent solves the task: by automatically decomposing it into a set of subproblems and learning a specific skill for each. Thus, we claim to have found an internal model in the DQN's representation, which can be used for automatic subgoal detection in future work.

Learn: Train a DQN agent.
Evaluate: Run the agent; record visited states, neural activations and Q-values.
Reduce: Apply tSNE.
Cluster: Apply clustering on the data.
Model: Fit an SMDP model. Estimate the transition probabilities and reward values.
Visualize: Visualize the SMDP above the tSNE map.
We train DQN agents using the vanilla DQN algorithm (Mnih et al., 2015). When training is done, we evaluate the agent over multiple episodes, using an ε-greedy policy. We record all visited states and their neural activations, as well as the Q-values and other manually extracted features. We keep the states in their original visitation order so as to maintain temporal relations. Since the neural activations are high dimensional, we apply tSNE dimensionality reduction in order to visualize them.
We define an SMDP model over the set of tSNE points using a vector of cluster labels $c \in \{1,\dots,k\}^n$ and a transition probability matrix $P$, where $P_{i,j}$ indicates the empirical probability of moving from cluster $i$ to cluster $j$. We define the entropy of a model by $e = \sum_i \frac{|C_i|}{\sum_{j}|C_j|}\big(-\sum_j P_{i,j}\log P_{i,j}\big)$, i.e., the average entropy of the transition probabilities from each cluster, weighted by its size.
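Assuming the clustered trajectory is available as an integer label sequence, the transition matrix and the weighted entropy can be estimated as in the following numpy sketch (the function name and the choice to skip self-transitions when estimating $P$ are our assumptions, not prescribed by the text):

```python
import numpy as np

def smdp_entropy(labels):
    """Estimate the SMDP transition matrix from a temporally ordered
    sequence of cluster labels, and compute the size-weighted entropy."""
    labels = np.asarray(labels)
    k = labels.max() + 1
    counts = np.zeros((k, k))
    for a, b in zip(labels[:-1], labels[1:]):
        if a != b:  # count only cluster-to-cluster moves (SMDP semantics)
            counts[a, b] += 1
    row = counts.sum(axis=1, keepdims=True)
    P = np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)
    # entropy of each row, weighted by the relative cluster size
    weights = np.bincount(labels, minlength=k) / len(labels)
    ent = np.array([-np.sum(p[p > 0] * np.log(p[p > 0])) for p in P])
    return P, float(np.dot(weights, ent))
```

A deterministic label cycle yields zero entropy, matching the remark below that a deterministic policy in a deterministic environment should be an entropy minimizer.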
We note that throughout this paper, by an SMDP model we refer only to the Markov Reward Process induced by the DQN policy. Recall that the DQN agent learns a deterministic policy; therefore, in deterministic environments (e.g., the Atari2600 emulator), the underlying SMDP should in fact be deterministic and an entropy minimizer.
The data that we collect from the DQN agent is highly correlated since it was generated from an MDP. However, standard clustering algorithms assume the data is drawn i.i.d., and therefore produce clusters that overlook the temporal information. This results in high-entropy SMDP models that are too complicated to analyze and are not consistent with the data. To this end, we use a variant of K-means that incorporates the temporal information in the data, such that a point $x_t$ is assigned to a cluster $c_j$ only if its neighbours along the trajectory are also close to $c_j$, using a temporal window.
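A minimal sketch of such a time-aware assignment step: each point is labeled by the centroid closest on average over a window of $2w+1$ consecutive trajectory points. The function names and the deterministic initialization are our own; the paper does not specify these details.

```python
import numpy as np

def spatiotemporal_assign(X, centroids, w=2):
    """Assign each point to the cluster whose centroid is closest on
    average over a temporal window of 2*w+1 consecutive points."""
    n = len(X)
    # d[t, j] = squared distance of point t to centroid j
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = np.empty(n, dtype=int)
    for t in range(n):
        lo, hi = max(0, t - w), min(n, t + w + 1)
        labels[t] = d[lo:hi].sum(axis=0).argmin()
    return labels

def spatiotemporal_kmeans(X, k, w=2, iters=20):
    """K-means with window-smoothed assignments (a sketch, not the
    paper's exact algorithm)."""
    X = np.asarray(X, dtype=float)
    # simple deterministic init: centroids spread evenly along the trajectory
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = spatiotemporal_assign(X, centroids, w)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

The windowed sum discourages single-frame label flips inside an otherwise coherent trajectory segment, which is exactly what lowers the SMDP entropy.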
We follow the analysis of Hallak et al. (2013) and define criteria to measure the fitness of a model empirically. We define the Value Mean Square Error (VMSE) as the normalized distance between the two value estimates:
$$\mathrm{VMSE} = \frac{\|v^{DQN} - v^{SMDP}\|}{\|v^{DQN}\|}.$$
The SMDP value is given by
$$v^{SMDP}(c_i) = r(c_i) + \gamma^{\tau_i}\sum_j P_{i,j}\, v^{SMDP}(c_j), \qquad (1)$$
where $\tau_i$ is the average number of MDP states visited before leaving cluster $c_i$, and the DQN value is evaluated by averaging the DQN value estimates over all MDP states in a given cluster (SMDP state): $v^{DQN}(c_i) = \frac{1}{|C_i|}\sum_{s\in C_i} v^{DQN}(s)$. Finally, the greedy policy with respect to the SMDP value is given by:
$$\pi^{greedy}(c_i) = \arg\max_{j:\, P_{i,j}>0} v^{SMDP}(c_j). \qquad (2)$$
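Since the value equation is linear in the cluster values, it can be evaluated with a single linear solve. The sketch below assumes an estimated transition matrix `P`, per-cluster rewards `r`, and mean holding times `tau` (all taken from the recorded trajectories); the function names are ours:

```python
import numpy as np

def smdp_value(P, r, tau, gamma=0.99):
    """Solve v = r + diag(gamma**tau) P v for the SMDP cluster values."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    D = np.diag(gamma ** np.asarray(tau, dtype=float))
    return np.linalg.solve(np.eye(len(r)) - D @ P, r)

def vmse(v_dqn, v_smdp):
    """Normalized distance between the two value estimates."""
    v_dqn, v_smdp = np.asarray(v_dqn), np.asarray(v_smdp)
    return np.linalg.norm(v_dqn - v_smdp) / np.linalg.norm(v_dqn)

def greedy_policy(P, v):
    """Greedy next-cluster choice: among reachable clusters, pick the
    one with the highest SMDP value."""
    P, v = np.asarray(P), np.asarray(v)
    return np.array([np.argmax(np.where(row > 0, v, -np.inf)) for row in P])
```

For a two-cluster chain with `P = [[0,1],[1,0]]`, `r = [1,0]`, `tau = [1,1]` and `gamma = 0.5`, solving the system gives `v = [4/3, 2/3]`, which is easy to verify by hand from equation (1).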
The Minimum Description Length (MDL; (Rissanen, 1978)) principle is a formalization of the celebrated Occam's Razor. It copes with the overfitting problem for the purpose of model selection. According to this principle, the best hypothesis for a given data set is the one that leads to the best compression of the data. Here, the goal is to find a model that explains the data well, but is also simple in terms of the number of parameters. In our work we follow a similar logic and look for a model that best fits the data but is still 'simple'.
Instead of considering 'simple' in terms of the number of parameters, we measure the simplicity of the spatiotemporal state aggregation. For spatial simplicity we define the Inertia, $I = \sum_t \min_j \|x_t - \mu_j\|^2$, which measures the variance of MDP states inside a cluster (SMDP state). For temporal simplicity we use the entropy $e$ defined above, and the Intensity Factor, which measures the fraction of in/out cluster transitions.
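These simplicity measures can be computed directly from the clustered data; a minimal sketch follows. Here the Intensity Factor is interpreted as the fraction of consecutive transitions that change cluster, which is our reading of the definition, and the function names are ours:

```python
import numpy as np

def inertia(X, labels, centroids):
    """Within-cluster variance: mean squared distance of each MDP state
    to its assigned cluster centroid."""
    X, centroids = np.asarray(X, float), np.asarray(centroids, float)
    labels = np.asarray(labels)
    return float(np.mean(((X - centroids[labels]) ** 2).sum(axis=1)))

def intensity_factor(labels):
    """Fraction of consecutive transitions that leave the current cluster."""
    labels = np.asarray(labels)
    moves = labels[1:] != labels[:-1]
    return float(moves.mean())
```

Low inertia indicates spatially tight clusters, while a low intensity factor indicates temporally coherent ones; together they play the role of MDL's parameter-count penalty.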
Above we explained how to create a tSNE map from DQN's neural activations and how to automatically design an SMDP model using time-aware clustering methods. In this section we explain how to fuse the SMDP model with the tSNE map for a clear visualization of the dynamics.
In our approach, an SMDP is represented by a directed graph. Each SMDP state is represented by a node in the graph and corresponds to a cluster of tSNE points (game states). In addition, the transition probabilities between SMDP states are represented by weighted edges between the graph nodes. We draw the graph on top of the tSNE map such that it reveals the underlying dynamics. Choosing a good layout mechanism for a graph is a hard task when dealing with high dimensional data (Tang et al., 2016). We considered different layout algorithms for the positions of the nodes, such as the spring layout, which positions nodes using the Fruchterman-Reingold force-directed algorithm, and the spectral layout, which uses the eigenvectors of the graph Laplacian (Hagberg et al., 2008). However, we found that simply positioning each node at the average coordinates of its tSNE cluster gives a clearer visualization. The intuition is that the tSNE algorithm was designed to solve the crowding problem and therefore outputs clusters that are well separated from each other.
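The node-placement rule described above can be sketched as follows; the resulting positions and weighted edges can then be fed to a graph-drawing package such as NetworkX. The function names and the edge-pruning threshold are our own choices:

```python
import numpy as np

def smdp_layout(tsne_xy, labels):
    """Place each SMDP node at the mean tSNE coordinates of its cluster."""
    tsne_xy, labels = np.asarray(tsne_xy, float), np.asarray(labels)
    return {int(j): tsne_xy[labels == j].mean(axis=0)
            for j in np.unique(labels)}

def smdp_edges(P, min_prob=0.05):
    """Weighted directed edges, dropping near-zero transition probabilities
    and self-loops for a readable drawing."""
    P = np.asarray(P, float)
    return [(i, j, P[i, j])
            for i in range(P.shape[0]) for j in range(P.shape[1])
            if i != j and P[i, j] >= min_prob]
```

Because tSNE already separates the clusters well, this trivial layout avoids the node-overlap problems that force-directed layouts try to solve.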
Experimental setup.
We evaluated our method on two Atari2600 games, Breakout and Pacman. For each game we collected 120k game states. We apply the tSNE algorithm directly on the collected neural activations of the last hidden layer, similar to Mnih et al. (2015). The input consists of 120k game states with 512 features each (the size of our DQN's last hidden layer). Since this data is relatively large, we preprocessed it using Principal Component Analysis to a dimensionality of 50 and used the Barnes-Hut tSNE approximation (Van Der Maaten, 2014). The input to the clustering algorithm consists of 120k game states with three features each (two tSNE coordinates and the value estimate). We applied the Spatio-Temporal Cluster Assignment with k=20 clusters and a temporal window of size w=2. We ran the algorithm for multiple iterations and chose the best SMDP in terms of minimum entropy (we will consider other measures in future work). Finally, we visualize the SMDP using the visualization method explained above.
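The PCA preprocessing step can be sketched with a plain numpy/SVD implementation; in practice the 50-dimensional output would then be passed to a Barnes-Hut tSNE implementation such as `sklearn.manifold.TSNE(method='barnes_hut')` (an assumed choice of library, not specified by the text):

```python
import numpy as np

def pca_reduce(X, d=50):
    """PCA preprocessing before Barnes-Hut tSNE: project the activations
    onto the top-d principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions,
    # ordered by decreasing singular value (i.e., decreasing variance)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T
```

Reducing 512-dimensional activations to 50 dimensions first makes the pairwise-distance computations inside Barnes-Hut tSNE tractable for 120k states.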
Simplicity.
Looking at the resulting SMDPs, it is interesting to note that the transition probability matrix is very sparse, i.e., the transition probability from each state is nonzero only for a small subset of the states, indicating that our clusters are localized in time. Inspecting the mean image of each cluster, we can see that the clusters are also highly spatially localized, meaning that the states in each cluster share a similar game position. Figure 1 shows the SMDP for Breakout. The mean image of each cluster shows the ball location and direction (in red), thus characterizing the game situation in each cluster. We also observe that states with low entropy follow a well-defined skill policy. For example, cluster 10 has one main transition and shows a well-defined skill of carving the left tunnel (see the mean image). In contrast, clusters 6 and 16 have transitions to more clusters (and therefore higher entropy) and a much less defined skill policy (reflected by their relatively confusing mean states). Figure 2 shows the SMDP for Pacman. The mean image of each cluster shows the agent's location (in blue), thus characterizing the game situation in each cluster. We can see that within each cluster the agent spends its time in a well-defined area of the state space. For example, in cluster 19 it is located in the northwest part of the screen and in cluster 9 in the southeast. We also observe that clusters with more transitions, e.g., clusters 0 and 2, suffer from less defined mean states.
Model Evaluation. We evaluate our model using three different methods. First, the VMSE criterion (Figure 3, top): high correlation between the DQN values and the SMDP values gives a clear indication of the fitness of the model to the data. Second, we evaluate the correlation between the transitions induced by the policy improvement step and the trajectory reward. To do so, we measure for each trajectory $t$ the empirical frequency of choosing the greedy policy at cluster $c_i$, and present the correlation coefficient between this frequency and the total trajectory reward at each state (Figure 3, center). Positive correlation indicates that following the greedy policy leads to high reward. Indeed, for most of the states we observe positive correlation, supporting the consistency of the model. The third evaluation is close in spirit to the second one. We create two transition matrices using the k top-rewarded trajectories and the k least-rewarded trajectories, respectively. We measure the correlation of the greedy policy with each of the transition matrices for different values of k (Figure 3, bottom). As clearly seen, the correlation of the greedy policy with the top trajectories is higher than the correlation with the bad trajectories.
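The per-state correlation used in the second evaluation can be computed as in this sketch, assuming a matrix of per-trajectory greedy-transition frequencies (the input layout and function name are our assumptions):

```python
import numpy as np

def greedy_reward_correlation(greedy_freq, rewards):
    """Pearson correlation, per SMDP state, between how often each
    trajectory follows the greedy transition at that state and the
    total trajectory reward.
    greedy_freq: (num_trajectories, num_states) empirical frequencies.
    rewards:     (num_trajectories,) total reward per trajectory."""
    G = np.asarray(greedy_freq, dtype=float)
    r = np.asarray(rewards, dtype=float)
    out = np.empty(G.shape[1])
    for i in range(G.shape[1]):
        out[i] = np.corrcoef(G[:, i], r)[0, 1]
    return out
```

A positive entry for state $i$ means trajectories that follow the greedy transition at that state tend to collect higher reward, which is the consistency signal reported in Figure 3 (center).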
In this work we considered the problem of visualizing dynamics: starting with a tSNE map of the neural activations of a DQN and ending with an SMDP model describing the underlying dynamics. We developed clustering algorithms that take into account the temporal aspects of the data and defined quantitative criteria to rank candidate SMDP models based on the likelihood of the data and an entropy simplicity term. Finally, we showed in the experiments section that our method can successfully be applied to two Atari2600 benchmarks, resulting in a clear interpretation of the agent's policy.
Our method is fully automatic and does not require any manual or game-specific work. We note that this is work in progress; it is mainly missing quantitative results for the different likelihood criteria. In future work we will complete the implementation of the different criteria, followed by the relevant simulations.
References
 Dean & Lin (1995) Dean, Thomas and Lin, ShieuHong. Decomposition techniques for planning in stochastic domains. 1995.
 Dietterich (2000) Dietterich, Thomas G. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res.(JAIR), 13:227–303, 2000.
 Duda et al. (2012) Duda, Richard O, Hart, Peter E, and Stork, David G. Pattern classification. John Wiley & Sons, 2012.
 Erhan et al. (2009) Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Visualizing higherlayer features of a deep network. Dept. IRO, Université de Montréal, Tech. Rep, 4323, 2009.
 Francis & Wonham (1975) Francis, Bruce A and Wonham, William M. The internal model principle for linear multivariable regulators. Applied mathematics and optimization, 2(2), 1975.
 Hagberg et al. (2008) Hagberg, Aric A., Schult, Daniel A., and Swart, Pieter J. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), August 2008.
 Hallak et al. (2013) Hallak, Assaf, DiCastro, Dotan, and Mannor, Shie. Model selection in markovian processes. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.
 Hauskrecht et al. (1998) Hauskrecht, Milos, Meuleau, Nicolas, Kaelbling, Leslie Pack, Dean, Thomas, and Boutilier, Craig. Hierarchical solution of Markov decision processes using macroactions. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 220–229. Morgan Kaufmann Publishers Inc., 1998.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Kulkarni et al. (2016) Kulkarni, Tejas D, Narasimhan, Karthik R, Saeedi, Ardavan, and Tenenbaum, Joshua B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.
 Lin (1993) Lin, LongJi. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
 Lunga et al. (2014) Lunga, Dalton, Prasad, Santasriya, Crawford, Melba M, and Ersoy, Ozan. Manifoldlearningbased feature extraction for classification of hyperspectral data: a review of advances in manifold learning. Signal Processing Magazine, IEEE, 31(1):55–66, 2014.
 MacQueen et al. (1967) MacQueen, James et al. Some methods for classification and analysis of multivariate observations. 1967.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540), 2015.
 Parr (1998) Parr, Ronald. Flexible decomposition algorithms for weakly coupled Markov decision problems. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 422–430. Morgan Kaufmann Publishers Inc., 1998.
 Rissanen (1978) Rissanen, Jorma. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
 Rusu et al. (2015) Rusu, Andrei A, Colmenarejo, Sergio Gomez, Gulcehre, Caglar, Desjardins, Guillaume, Kirkpatrick, James, Pascanu, Razvan, Mnih, Volodymyr, Kavukcuoglu, Koray, and Hadsell, Raia. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
 Sontag (2003) Sontag, Eduardo D. Adaptation and regulation with signal detection implies internal model. Systems & control letters, 50(2):119–126, 2003.
 Stolle & Precup (2002) Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation. Springer, 2002.
 Sutton et al. (1999) Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semiMDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1), August 1999.
 Tang et al. (2016) Tang, Jian, Liu, Jingzhou, Zhang, Ming, and Mei, Qiaozhu. Visualizing largescale and highdimensional data. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.
 Tenenbaum et al. (2000) Tenenbaum, Joshua B, De Silva, Vin, and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2000.
 Tessler et al. (2016) Tessler, Chen, Givony, Shahar, Zahavy, Tom, Mankowitz, Daniel J, and Mannor, Shie. A deep hierarchical approach to lifelong learning in minecraft. arXiv preprint arXiv:1604.07255, 2016.
 Van Der Maaten (2014) Van Der Maaten, Laurens. Accelerating tSNE using treebased algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
 Van der Maaten & Hinton (2008) Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using tSNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 Ward (1963) Ward, Joe H. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
 Yi et al. (2000) Yi, TauMu, Huang, Yun, Simon, Melvin I, and Doyle, John. Robust perfect adaptation in bacterial chemotaxis through integral feedback control. Proceedings of the National Academy of Sciences, 97(9):4649–4653, 2000.
 Yosinski et al. (2014) Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, and Lipson, Hod. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.
 Zahavy et al. (2016) Zahavy, Tom, Zrihem, Nir Ben, and Mannor, Shie. Graying the black box: Understanding dqns. arXiv preprint arXiv:1602.02658, 2016.
 Zeiler & Fergus (2014) Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818–833. Springer, 2014.