Affect-based Intrinsic Rewards for Exploration and Learning


Abstract

Positive affect has been linked to increased interest, curiosity and satisfaction in human learning. In reinforcement learning, extrinsic rewards are often sparse and difficult to define; intrinsically motivated learning can help address these challenges. We argue that positive affect is an important intrinsic reward that effectively helps drive exploration that is useful in gathering experiences. We present a novel approach leveraging a task-independent intrinsic reward function trained on spontaneous smile behavior that captures positive affect. To evaluate our approach, we trained models for several downstream computer vision tasks on data collected with our policy and with several baseline methods. We show that the policy based on intrinsic affective rewards successfully increases the duration of episodes, increases the area explored and reduces collisions. The impact is an increased speed of learning for several downstream computer vision tasks.


I Introduction

Fig. 1: We present a novel approach leveraging a positive affect-based intrinsic reward to motivate exploration. We use this policy to collect data for self-supervised pre-training and then use the learned representations for multiple downstream computer vision tasks. The red regions highlight the parts of the architecture trained at each stage.

Reinforcement learning (RL) is most commonly achieved via policy-specific rewards that are designed for a predefined task or goal. Such extrinsic rewards can be sparse, difficult to define, and/or apply only to the task at hand. We are interested in exploring the hypothesis that RL frameworks can be designed in a task-agnostic fashion and that this will enable us to efficiently learn general representations that are useful in solving several tasks related to perception in robotics. In particular, we consider intrinsic rewards that are akin to affect mechanisms in humans and encourage efficient and safe exploration. These rewards are task-independent; thus, the experiences they gather are not specific to any particular activity and can be harnessed to build general representations. Furthermore, intrinsically motivated learning can have advantages over extrinsic rewards, as it can reduce the sample complexity by producing reward signals that indicate success or failure before the episode ends [29].

A key question we seek to answer is how to define such an intrinsic policy. We propose a framework that comprises mechanisms motivated by human affect. The core insight is that learning agents motivated by drives such as delight, fear, curiosity, hunger, etc. can garner rich experiences that are useful in solving multiple types of tasks. For instance, in reinforcement learning contexts there is a need for an agent to adequately explore its environment [16]. This can be performed by randomly selecting actions, or by employing a more intelligent strategy, directed exploration, that incentivizes visiting unexplored regions [10]. Curiosity is often defined by using the prediction error as the reward signal [6, 32]. As such, the uncertainty, or mistakes, made by the system are assumed to represent what the system should want to learn more about. However, this is a simplistic view as it fails to take into account that new stimuli are not always very informative or useful [36]. Savinov et al. [36] use the analogy of becoming glued to a TV, channel surfing, while the rest of the world waits outside the window. Their work proposed a new novelty bonus that features episodic memory. McDuff and Kapoor [29] took another approach focusing on safe exploration, proposing intrinsic rewards mimicking the responses of a human's sympathetic nervous system (SNS) to avoid catastrophic mistakes during the learning phase, but not necessarily promoting exploration.

In this paper, we specifically focus on the role of positive emotions and study how such intrinsic motivations can enable learning agents to explore efficiently and learn useful representations. In research on education, positive affect has been shown to be related to increased interest, involvement and arousal in learning contexts [28]. Kort et al.’s [25] model of emotions in learning posits that the states of curiosity and satisfaction are associated with positive affect in the context of a constructive learning experience. Human physiology is informative about underlying affective states. Smile behavior [22] and physiological signals [29] have been effectively used as feedback in learning systems but not in the context of intrinsic motivation or curiosity. We leverage facial expressions as an unobtrusive measure of expressed positive affect. The key challenges here entail both designing a system that can first model the intrinsic reward appropriately, and then building a learning framework that can efficiently use the data to solve multiple downstream tasks related to perception in robotics.

The core contributions of this paper (summarized in Fig. 1) are to (1) present a novel learning framework in which reward mechanisms motivated by positive affect mechanisms in humans are used to carry out exploration while remaining agnostic to any specific task, (2) show how the data collected in such an unsupervised manner can be used to build general representations useful for solving downstream tasks with minimal task-specific fine-tuning, and (3) report results of experiments showing that the framework improves exploration and enables efficient learning for solving multiple data-driven tasks. In summary, we argue that such an intrinsically motivated learning framework inspired by affective mechanisms can be effective in increasing coverage during exploration and decreasing the number of catastrophic failures, and that the garnered experiences can help us learn general representations for solving tasks including depth estimation, scene segmentation, and sketch-to-image translation.

II Related Work

Our work is inspired by intrinsically motivated learning [12, 19, 41]. One key property of intrinsic rewards is that they are non-sparse [36], which helps aid learning even if the signal is weak. Much of the work in this domain uses a combination of intrinsic and extrinsic rewards in learning. Curiosity is one example of an intrinsic reward that grants a bonus when an agent discovers something new and is vital for discovering successful behavioral strategies. For example, [6, 32] model curiosity via the prediction error as a surrogate and show that such an intrinsic reward mechanism performs similarly to hand-designed extrinsic rewards in many environments. Similarly, Savinov et al. [36] defined a different curiosity metric based on how many steps it takes to reach the current observation from those in memory, thus capturing environment dynamics. Their motivation was that previous approaches were too simplistic in assuming that all changes in the environment should be considered equal. McDuff and Kapoor [29] provided an example of how an intrinsic reward mechanism can motivate safer learning, utilizing human physiological responses to shorten the training time and avoid critical states. Our work is inspired by this prior art, but with the key distinction that we specifically aim to build intrinsic reward mechanisms that are visceral and trained on signals correlated with human affective responses.

Imitation learning (IL) is a popular method for deriving policies. In IL, the model is trained on previously generated examples to imitate the recorded behavior. It has been successfully implemented using data collected from many domains [2, 3, 4, 7, 33, 34]. Simulated environments have been successfully used for training and evaluating IL systems [13, 40]. We use IL as a baseline in our work and perform experiments to show how a combination of IL and positive affect-based rewards can lead to greater and safer exploration.

One of our goals is to explore whether our intrinsic motivation policy can help us learn general representations. Fortunately, the rise of unsupervised generative models, such as generative adversarial networks (GANs) [18] and variational auto-encoders (VAEs) [24], has led to progress across many interesting challenges in computer vision that involve image-to-image translation [21, 42]. In our work, we use three tasks in our evaluation process: scene segmentation [8], depth estimation [17], and sketch-to-image translation [11]. The first two tasks are common in driving scenarios and augmented reality [31, 38, 39], while the third is known for helping people render visual content [14] or synthesize imaginary images [9]. In this paper, we show that by first pre-training a VAE in a self-supervised manner on data gathered with our exploration policy, we can obtain better results on all three tasks.

III Our Framework

Fig. 1 describes the overall framework. The core idea behind the proposed methodology is that the agents have intrinsic motivations that lead to extensive exploration. Consequently, an agent on its own is able to gather experiences and data in a wide variety of conditions. Note that unlike traditional machine intelligence approaches, the agent is not fixated on a given task: all it is encouraged to do is explore as extensively as possible without getting into perilous situations. The rich data that is gathered then needs to be harnessed into building representations that will eventually be useful in solving many perception tasks. Thus, the framework consists of three core components: (1) a positive affect-based exploration policy, (2) a self-supervised representation learning component and (3) mechanisms that utilize these representations efficiently to solve various tasks.

III-A Affect-Based Exploration Policy

The approach here is to create a model that encourages the agent to explore the environment. We want a reward mechanism that positively reinforces behaviors that mimic a human's affective responses and lead to discovery and joy.

Positive Intrinsic Reward Mechanism: In our work, we use a Convolutional Neural Network (CNN) to model the affective responses of a human, as if they were in the same scenario as the agent. We train the CNN model to predict human smile responses as the exploration evolves. Based on the fact that positive affect plays a central role in curiosity and learning [25], we chose to measure smiles as an approximate measure of positive affect. Smiles are consistently linked with positive emotional valence [5, 23] and have a long history of study [26] using electromyography [5] and automated facial coding [27]. We must emphasize that in this work we were not attempting to explicitly model the psychological processes that cause people to smile. We are only using smiles as an outward indicator of situations that are correlated with positive affect as people explore new environments. In particular, the CNN was trained to infer the reward $\mathcal{R}(s_t, a_t)$ directly, given that action $a_t$ was taken at state $s_t$. We defer the details of the network architecture, the data collection process, and the training procedure to Section IV and the Appendix.
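The exact layer dimensions of this reward network are listed in Table IV of the Appendix; the sketch below assumes illustrative kernel sizes, strides and input resolution (these values are not taken from the paper) and shows how such a network might be assembled and trained to regress the detected smile probability with an L2 loss.

```python
import tensorflow as tf

def build_reward_model(input_shape=(84, 84, 3), n_outputs=2):
    """CNN mapping a single RGB frame to intrinsic reward values.
    Following Table IV, the first output is the positive-affect (smile)
    value and the second is the signal used for the [29] baseline.
    Kernel sizes, strides and input resolution here are assumptions."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, strides=2, activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2048, activation="relu"),
        tf.keras.layers.Dense(n_outputs, activation="linear"),
    ])

# Regress the smile probability (and the baseline signal) with an L2 loss.
model = build_reward_model()
model.compile(optimizer="adam", loss="mse")
```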

Choosing Actions with Intrinsic Rewards: Given the intrinsic reward mechanism, we can use any off-the-shelf sequential decision-making framework, such as RL [29], to learn a policy. It is also feasible to modify an existing policy that is trained to explore or collect data. While the former approach is theoretically desirable, it requires a very large number of training episodes to return a useful policy. We focus on the latter, where we assume there exists a function $f_\pi$ which predicts a vector of action probabilities when the agent observes state $s_t$. Formally, given an observation $s_t$ and a model $f_\pi$, the next action $a_t$ is selected as $a_t = \arg\max_a f_\pi(s_t)_a$. Such a function can be trained on human demonstrations as they explore the environment.

We then use the intrinsic positive affect model to change the action selection such that it biases toward actions that promise better intrinsic rewards. Intuitively, instead of simply using the output of the pre-trained policy to decide on the next action, we consider the impact of the intrinsic motivation for every possible action under consideration. Formally, given the positive affect model $\mathcal{R}$, a pre-trained exploration policy $f_\pi$, and an observation $s_t$, the next action $a_t$ is selected as:

$a_t = \arg\max_{a} \left[ f_\pi(s_t)_a + \gamma \, \mathcal{R}(s_t, a) \right]$   (1)

The above equation adds a weighted intrinsic motivation component to the action probabilities from the original model $f_\pi$. The weighting parameter $\gamma$ defines the trade-off between the original policy and the effect of the intrinsic reward.
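A minimal sketch of this action selection is shown below. The usage comments assume hypothetical `policy_model`, `reward_model` and `candidate_views` objects, and the $\gamma$ value is illustrative; in the paper, the reward is evaluated on the view looking toward each candidate steering direction (Section IV-A).

```python
import numpy as np

def select_action(action_probs, intrinsic_rewards, gamma):
    """Eq. (1): bias the imitation-learned action probabilities f_pi(s_t)
    by the positive-affect reward R(s_t, a) predicted for each candidate
    action, weighted by gamma."""
    return int(np.argmax(np.asarray(action_probs)
                         + gamma * np.asarray(intrinsic_rewards)))

# Hypothetical usage:
# probs = policy_model(obs_stack[None])[0].numpy()          # f_pi(s_t)
# rewards = [float(reward_model(view[None])[0, 0])          # R(s_t, a)
#            for view in candidate_views]
# action = select_action(probs, rewards, gamma=0.5)         # gamma is illustrative
```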

Fig. 2: An example of the smile response for a six-minute (360s) period during one of the driving sessions. Frames from the environment and from the webcam video are shown as a reference.

III-B Self-Supervised Learning

Given the exploration policy, the agent has the ability to explore and collect rich data. The next component aims to use this data to build rich representations that eventually could be used for various visual recognition and understanding tasks.

The challenge here is that since the collected data is task agnostic, there are no clear labels that could be used for supervised learning. Consequently, we use the task of jointly learning an encoder and a decoder through a low-dimensional latent representation. Formally, we use a variational autoencoder (VAE) to build such representations. For example, a VAE can be trained to restore just the input image, with the loss constructed as the combination of the negative log-likelihood and the KL divergence, as follows:

$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$   (2)

where the encoder is denoted by $q_\phi(z \mid x)$, the decoder is denoted by $p_\theta(x \mid z)$, and $z$ denotes the low-dimensional projection of the input $x$. The key intuition here is that if the VAE can successfully encode and decode frames, then implicitly it is considering aspects such as depth, segmentation, and textures that are critical to making successful predictions. Thus, it should be possible to tweak and fine-tune these VAE networks to solve a host of visual recognition and perception tasks with minimal effort.
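A minimal sketch of the loss in Eq. (2), assuming an encoder that returns a mean and log-variance and a sigmoid-output decoder (as in Table V); the pixel-wise binary cross-entropy used as the reconstruction term is one common choice, not necessarily the exact likelihood used in the paper.

```python
import tensorflow as tf

def vae_loss(encoder, decoder, x):
    """Eq. (2): reconstruction negative log-likelihood plus the KL
    divergence between q_phi(z|x) and a unit Gaussian prior."""
    z_mean, z_log_var = encoder(x)                       # q_phi(z|x)
    eps = tf.random.normal(tf.shape(z_mean))
    z = z_mean + tf.exp(0.5 * z_log_var) * eps           # reparameterization trick
    x_rec = decoder(z)                                   # p_theta(x|z), sigmoid output
    nll = tf.reduce_sum(
        tf.keras.losses.binary_crossentropy(x, x_rec), axis=[1, 2])
    kl = -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
    return tf.reduce_mean(nll + kl)
```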

III-C Fine-tuning for Vision Tasks

Given the VAE representation, our goal now is to reuse the learned weights to solve standard machine perception tasks. Formally, given some labeled data $(x, y)$ corresponding to a visual task, similar to supervised learning, we optimize the negative log-likelihood:

$\mathcal{L}_{\text{task}} = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(y \mid z)\right]$   (3)

Note that the goal is to minimally modify the network. In our experiments, we show how we can solve depth map estimation and scene segmentation by tweaking only the weights of the first or last few layers just before the decoder output. We also show how we can use those weights for sketch-to-image translation, even with a small number of annotated samples.
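A sketch of how the decoder can be mostly frozen so that only the top few layers are updated for a new task (cf. Fig. 4, right); the function name and the default of two trainable layers are illustrative assumptions.

```python
def prepare_for_finetuning(vae_decoder, n_trainable=2):
    """Freeze all decoder layers except the last `n_trainable` ones, then
    fine-tune on the labeled task data by minimizing the task-specific
    loss of Eq. (3). For sketch-to-image translation the encoder is
    retrained as well, since the input domain changes."""
    for layer in vae_decoder.layers[:-n_trainable]:
        layer.trainable = False
    for layer in vae_decoder.layers[-n_trainable:]:
        layer.trainable = True
    return vae_decoder
```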

IV Experiments

We conducted experiments to analyze (1) the potential advantages of the affect-based exploration policy over other heuristics, (2) the ability to learn general representations and (3) whether it is possible to solve diverse computer vision tasks by building upon and minimally modifying the model.

We used a high-fidelity simulation environment for autonomous systems [37], which contains a customized 3D maze (dimensions: 2,490 meters by 1,500 meters); a top-down view can be seen in Fig. 3. The maze is composed of walls and ramps; frames from the environment can be seen in Fig. 2. The agent we used was a vehicle capable of maneuvering comfortably within the maze. To generate random starting points that allowed us to deploy the agent into the maze, we constructed a navigable area according to the vehicle dimensions and surroundings (the green region in Fig. 3).

IV-A Data and Model Training

Affect-based Intrinsic Reward: We collected data from five subjects (four males, one female; ages: 23-40 years) exploring the simulated environment. All participants were qualified drivers with multiple years of driving experience. Simultaneously, we collected synchronized videos of their faces. The participants drove for an average of 11 minutes each, providing a total of over 64,000 frames. The protocol was approved by our institutional review board. The participants were told to explore the environment but were given no additional instruction about other objectives. We used a well-validated and open-source algorithm to calculate the smile response of the drivers from the webcam videos [1]. When evaluated on a large set of videos of naturalistic facial expressions (very similar to ours), the smile detection had a 0.85 correlation with human coding. An example of a smile response from one subject can be seen in Fig. 2. Using these data we trained our affect-based intrinsic motivation model. The image frames from the camera sensor in the environment served as input to the network, and the smile probability in the corresponding webcam frame served as output. The input frames were downsampled and normalized before training. The model architecture is described in the Appendix.
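As a concrete illustration of how the (frame, smile probability) training pairs might be assembled, the sketch below aligns each environment frame with the temporally nearest webcam detection. The timestamp variables and the nearest-neighbor alignment are assumptions; the paper only states that the videos were synchronized.

```python
import numpy as np

def align_smiles_to_frames(frame_times, smile_times, smile_probs):
    """Return, for each simulator frame timestamp, the smile probability of
    the temporally nearest webcam frame (all timestamps on a shared clock)."""
    idx = np.searchsorted(smile_times, frame_times)
    idx = np.clip(idx, 1, len(smile_times) - 1)
    # Pick whichever neighbor (idx-1 or idx) is closer in time.
    left_closer = (frame_times - smile_times[idx - 1]) < (smile_times[idx] - frame_times)
    idx = idx - left_closer.astype(int)
    return smile_probs[idx]
```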

Fig. 3: Visualization of the experiment from Table I using heat maps, with panels showing the map and the Random, Straight, IL, IL + [29] and Affect-based policies. From this visualization we can observe that the better the policy, the longer the paths recorded during the trials.
# Method Duration Coverage Coverage/sec Collisions
(1) Random 7.57 107.79 14.23 230
(2) Straight 8.32 115.33 13.86 206
(3) IL 52.87 727.46 13.75 38
(4) IL + [29] 87.63 952.82 10.87 23
(5) Affect-based 79.76 1059.29 13.28 27
TABLE I: Evaluation of the driving policies. Given a random starting point, duration is the average time the car drove before a collision and coverage is the average area the car covered.

Exploration Policy: We first trained the base policy $f_\pi$ by imitation learning, recording data while a single human driver drove the vehicle in the simulation. The dataset contains 50,000 images, normalized to [0, 1], along with the corresponding human actions. The model is a CNN trained to classify the desired steering angle. The input to the model is four consecutive down-sampled images, similar to many DQN applications [30]. The action space is discrete and composed of five different steering angles: 40, 20, 0, -20 and -40 degrees. The model architecture is described in the Appendix. To increase the variation in the collected data and cope better with sharp turns, we shifted sampled frames and post-processed the steering angle accordingly [40]. The final exploration policy embeds this model as described in Section III-A and considers the affective rewards. Specifically, the reward mechanism was computed for each one of the steering angles, so the positive intrinsic values represent the values inferred when looking directly toward the respective driving directions. We set $\gamma$ via cross-validation, as described in the Appendix.
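The frame-shift augmentation with steering correction [40] could look roughly like the sketch below; the shift-to-angle correction factor and the simple wrap-around shift are assumptions for illustration, not values from the paper.

```python
import numpy as np

STEERING_ANGLES = np.array([40, 20, 0, -20, -40])  # degrees, the discrete action set

def shift_and_relabel(frame, angle_deg, shift_px, deg_per_px=0.2):
    """Horizontally shift a training frame and correct the steering label,
    a common imitation-learning augmentation [40]. `deg_per_px` is an
    assumed correction factor."""
    shifted = np.roll(frame, shift_px, axis=1)        # crude horizontal shift
    corrected = angle_deg + deg_per_px * shift_px     # steer back toward center
    # Relabel with the closest discrete action used by the policy.
    label = int(np.argmin(np.abs(STEERING_ANGLES - corrected)))
    return shifted, label
```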

Representation Learning: As the vehicle explores the environment, the data it sees is used to train the VAE. Each episode was initiated by placing the vehicle at a random starting point and letting it drive until a collision. Here we use the task of frame restoration to train the VAE model to restore down-sampled images. The model architecture is described in the Appendix. For evaluating performance on depth map estimation and scene segmentation, we collected images with ground truth, captured by placing simulated cameras randomly in the environment. For sketch-to-image translation, we used the same method, except that the sketches were computed by finding the image contours.

Fig. 4: (Left) Test loss as a function of the number of episodes for depth map estimation, scene segmentation, and sketch-to-image translation. Results are averaged over 30 trials. The error bars reflect the standard error. (Right) Test loss as we vary the number of layers tuned for scene segmentation. Fine-tuning just a couple of layers is enough for this task.
Fig. 5: Samples generated using VAEs trained on each of the three tasks (depth map estimation, scene segmentation, sketch-to-image translation). Rows show the input, the ground truth (GT), the IL result, and the affect-based result. Notice that there are fewer distortions in the depth map estimations, better classification of structures in the scene segmentation, and better generation of the images from sketches when using our proposed policy.

IV-B Results

How good is affect-driven exploration? We compare the proposed method to four additional methods: random, straight, IL and IL + [29]. For the random policy, we simply draw a random action at each timestep according to a uniform distribution, regardless of the input. For the straight policy, the model drives straight without changing course. The IL policy is simply the base policy $f_\pi$, without the intrinsic affect motivation. For IL + [29], we combined their intrinsic reward function with the same base policy, in the same way as in our method.

For this experiment, we select a starting point randomly and let the policy drive until a collision is detected. Then, we reset the vehicle to a random starting position. We continue this process for a fixed amount of time, with the vehicle driving at a constant speed. We then consider the mean duration and the mean total area covered during exploration per episode. Longer episodes reflect that the policy is able to reason about free spaces together with the vehicle dynamics, whereas higher coverage suggests that the policy indeed encourages novel experiences and that the vehicle is not simply going in circles (or remaining stationary). Coverage is defined as the area of a union of circles of fixed radius centered around the car.
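The coverage metric can be approximated by rasterizing the union of circles onto a grid, e.g. as in the sketch below; the radius and cell size are illustrative, not the values used in the paper.

```python
import numpy as np

def coverage_area(positions, radius=1.0, cell=0.25):
    """Approximate the area of the union of circles of `radius` meters
    centered on the visited (x, y) positions by counting occupied grid
    cells of side `cell` meters."""
    occupied = set()
    r_cells = int(np.ceil(radius / cell))
    for x, y in positions:
        cx, cy = int(round(x / cell)), int(round(y / cell))
        for i in range(cx - r_cells, cx + r_cells + 1):
            for j in range(cy - r_cells, cy + r_cells + 1):
                if (i * cell - x) ** 2 + (j * cell - y) ** 2 <= radius ** 2:
                    occupied.add((i, j))
    return len(occupied) * cell ** 2
```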

Fig. 3 shows the map of the environment and how the different policies explore the space. The heat signature (yellow is more) indicates the amount of time the vehicle spends at each location. We observe that our policy driven by the intrinsic reward is able to go further and cover a significantly bigger space in the allocated time. We present the numerical results in Table I, which show that although IL + [29] is better at staying on the track, it is less effective at exploring the field and covers less area per second on average.

How well can we solve other tasks? In this experiment, we explore how well the VAE trained on the task-agnostic data collected via exploration (as described in Section III-B) can help solve depth map estimation, scene segmentation, and sketch-to-image translation. First, we explore how much we really need to perturb the original VAE model to get reasonable performance on these tasks by retraining only the top few layers of the decoder. Fig. 4 plots the log loss, where the loss is the task-specific reconstruction L2 loss, as training evolves over several epochs. The figure shows curves for different numbers of retrained layers. We find that the biggest gains in performance come from retraining just the top two layers. Note that since the loss is on a log scale, the difference in performance between retraining the top two layers and retraining more layers is small. Note that for sketch-to-image translation it was necessary to retrain the encoder as well, since the input sketches differ from images.

Next, we also study the effect of the exploration policies on the three tasks. For these experiments, as the vehicle explores the environment, we train the VAE model in parallel on the data gathered so far and measure the performance on the three tasks. We report results averaged over 30 trials. Fig. 4 shows the mean test L2 loss as the number of episodes grows for the various exploration policies. While for most of the policies the error goes down with the number of episodes, the affect-based policy achieves lower errors with fewer episodes. For example, on the task of scene segmentation, the proposed method requires approximately half as many episodes as the IL policy to achieve a loss of 0.006. We can also see that although IL + [29] has longer episodes, the affect-based policy reaches convergence faster.

Finally, besides the L2 loss, we also examine the realism of the output using the Frechet Inception Distance (FID) [20], a metric that is frequently used to evaluate the realism of images generated using GANs [18]. The results are presented in Table II and show that better FID scores are obtained using the proposed framework.
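For reference, the FID [20] compares Gaussians fitted to Inception features of the generated and ground-truth image sets. A minimal sketch of the distance computation is shown below; extracting the InceptionV3 pool features that yield the means and covariances is omitted.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians N(mu1, sigma1) and
    N(mu2, sigma2) fitted to Inception activations [20]."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard small imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```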

How efficient is the framework? Given that our reward mechanism requires additional computation, we also consider the performance penalty we might be paying. We conducted timed runs for each policy and logged the average frame rate. Our code is implemented using TensorFlow 2.0, with CUDA 10.0 and CUDNN 7.6. For our framework, the average frame rate on a mobile GTX 1060 and on an RTX 2080 Ti was only slightly lower than with the IL policy. The results were averaged over a fixed duration of driving.
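The frame-rate measurement amounts to timing the control loop; a minimal sketch, where `step_fn` is a hypothetical function running one observation-to-action iteration:

```python
import time

def measure_fps(step_fn, n_steps=1000):
    """Average frames per second of one control-loop iteration
    (observation -> policy (+ reward model) -> action)."""
    start = time.time()
    for _ in range(n_steps):
        step_fn()
    return n_steps / (time.time() - start)
```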

V Discussion and Conclusions

This paper explores how using positive affect as an intrinsic motivation for an agent can spur exploration. Greater exploration by the agent can lead to a better representation of the environment, and this, in turn, leads to improved performance on a range of downstream tasks. We argued that positive affect is an important drive that spurs safer and more curious behavior. Modeling positive affect as an intrinsic reward led to an exploration policy with 51% longer duration, 46% greater coverage and 29% fewer collisions in comparison to IL. Fig. 3 shows this as a heat map.

Central to our argument is that affective responses to stimuli are intrinsic sources of feedback that lead to exploration and the discovery of examples that generalize across contexts. We used our general representations to perform experiments across multiple perception tasks. Comparing performance with and without our affective reward, we found a large benefit in using the policy with intrinsic motivation based on the positive affect signal. Qualitative examples (see Fig. 5) show that this led to better reconstruction of the respective outputs across the tasks. Furthermore, exploration was greater using positive affect as a reward signal than when using a physiological fight-or-flight response [29], which fits with our hypothesis that positive affect is one drive for curiosity.

Method Frame Rest. Sketch-to-Img.
Random 276.1 273.6
Straight 260.9 271.4
IL 275.4 276.3
IL + [29] 248.7 260.7
Affect-based 242.7 257.8
TABLE II: FID scores calculated for the image generation tasks. The FID was calculated on 2000 test images. We compute the metric between the reconstruction and the ground truth. The results are averaged over 30 runs.

Here we were not attempting to mimic affective processes, but rather to show that functions trained on affect-like signals can lead to improved performance. Smiles are complex nonverbal behaviors that are both common and nuanced; while smiles are interpreted as expressing positive emotion, they communicate a variety of interpersonal states [15, 35]. This work establishes the potential for affect-like mechanisms in robotics. Extension to other physiological signals presents a further opportunity worth exploring.

Appendix

V-A Network Architectures

Table III shows the network architecture for the driving policy described in Subsection III-A. The architecture contains three convolutional layers and two dense layers. The input consists of four consecutive grayscale images. The output is a vector of probabilities with dimension equal to the number of possible actions.

Table IV shows the network architecture for the affect-based reward function described in Subsection III-A. The architecture contains three convolutional layers and two dense layers. Batch normalization is applied prior to each intermediate layer. The input is a single RGB image. The output is a vector of two intrinsic values. The first is the positive intrinsic value used by our affect-based policy, and the second is the intrinsic value from [29], which is examined using our model architecture in the fourth baseline described in Subsection IV-B.

Layer Act. Out. shape Parameters
Conv2D ReLU 4.1k
Conv2D ReLU 8.2k
Conv2D ReLU 9.2k
Dense ReLU 256 401k
Dense Softmax 5 1.2k
Total trainable parameters 424k
TABLE III: CNN architecture for the navigation policy.
Layer Act. Out. shape Parameters
Conv2D ReLU 2.4k
Conv2D ReLU 24.6k
Conv2D ReLU 49.2k
Dense ReLU 2048 8.4M
Dense Linear 2 4k
Total trainable parameters 8.4M
TABLE IV: CNN architecture for the affect-based reward function.
Layer Act. Out. shape Parameters
Encoding Layers
Conv2D ReLU 3.1k
Conv2D ReLU 131k
Conv2D ReLU 524k
Dense ReLU 1024 9.4M
Dense Linear 16 16.4k
Decoding Layers
Dense ReLU 1024 9.2k
Dense ReLU 6272 6.4M
Conv2D Transpose ReLU 262k
Conv2D Transpose ReLU 131k
Conv2D Transpose Sigmoid 3k
Total trainable parameters 16.9M
TABLE V: Architecture for the convolutional VAE model.

Table V shows the network architecture for the VAE model described in Subsection III-B. The encoder and the decoder are composed of five layers each, with batch normalization prior to each intermediate layer. The input is a single RGB image. The output of the encoder is an 8-dimensional latent space representation. The decoder outputs a single RGB/segmentation image, or a depth estimation map.

For more details about the training procedure and parameters, our code is publicly available on https://github.com/microsoft/affectbased.

V-B Reward Multiplication Factor

Our method relies on adding the reward component on top of the action probabilities such that it maximizes the exploration results. To find the best $\gamma$, we performed the experiment shown in Table I for a range of potential values, and then used the best value to perform the experiment shown in Fig. 4. An example of an experiment to find $\gamma$ can be seen in Fig. 6.
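The search itself is a simple grid search over the reward weight; a sketch, where `run_exploration` is a hypothetical function that runs the Table I coverage experiment for a given $\gamma$ and returns the mean coverage:

```python
def search_gamma(candidate_gammas, run_exploration):
    """Evaluate each candidate reward weight and return the one with the
    highest mean coverage (cf. Fig. 6), together with all results."""
    results = {g: run_exploration(gamma=g) for g in candidate_gammas}
    return max(results, key=results.get), results
```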

Fig. 6: Mean coverage as a function of $\gamma$ when searching for the right value for IL + [29]. The plot shows which value of $\gamma$ resulted in the highest coverage.

Acknowledgment

We thank the subjects who helped us create the dataset for the affect-based reward model, and the rest of the team for their help with the simulation and the experiments.

Footnotes

  1. Our implementation is available at https://github.com/microsoft/affectbased.

References

  1. T. Baltrusaitis, A. Zadeh, Y. C. Lim and L. Morency (2018) OpenFace 2.0: facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66.
  2. A. Billard and M. J. Matarić (2001) Learning human arm movements by imitation: evaluation of a biologically inspired connectionist architecture. Robotics and Autonomous Systems 37 (2-3), pp. 145–160.
  3. V. Blukis, N. Brukhim, A. Bennett, R. A. Knepper and Y. Artzi (2018) Following high-level navigation instructions on a simulated quadcopter with imitation learning. arXiv preprint arXiv:1806.00047.
  4. M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller and J. Zhang (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
  5. S. Brown and G. E. Schwartz (1980) Relationships between facial electromyography and subjective experience during affective imagery. Biological Psychology 11 (1), pp. 49–62.
  6. Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell and A. A. Efros (2019) Large-scale study of curiosity-driven learning. In ICLR.
  7. L. Cardamone, D. Loiacono and P. L. Lanzi (2009) Learning drivers for TORCS through imitation using supervised methods. In 2009 IEEE Symposium on Computational Intelligence and Games, pp. 148–155.
  8. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence; arXiv:1606.00915.
  9. T. Chen, M. Cheng, P. Tan, A. Shamir and S. Hu (2009) Sketch2Photo: internet image montage. In ACM Transactions on Graphics (TOG), Vol. 28, pp. 124.
  10. T. Chen, S. Gupta and A. Gupta (2019) Learning exploration policies for navigation. In International Conference on Learning Representations.
  11. W. Chen and J. Hays (2018) SketchyGAN: towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9416–9425.
  12. N. Chentanez, A. G. Barto and S. P. Singh (2005) Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1281–1288.
  13. F. Codevilla, M. Müller, A. López, V. Koltun and A. Dosovitskiy (2018) End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9.
  14. M. Eitz, J. Hays and M. Alexa (2012) How do humans sketch objects? ACM Transactions on Graphics 31 (4), pp. 44:1.
  15. P. Ekman and W. V. Friesen (1982) Felt, false, and miserable smiles. Journal of Nonverbal Behavior 6 (4), pp. 238–252.
  16. M. Frank, J. Leitner, M. Stollenga, A. Förster and J. Schmidhuber (2014) Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in Neurorobotics 7, pp. 25.
  17. C. Godard, O. Mac Aodha and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279.
  18. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  19. N. Haber, D. Mrowca, S. Wang, L. F. Fei-Fei and D. L. Yamins (2018) Learning to play with intrinsically-motivated, self-aware agents. In Advances in Neural Information Processing Systems, pp. 8388–8399.
  20. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  21. P. Isola, J. Zhu, T. Zhou and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
  22. N. Jaques, J. McCleary, J. Engel, D. Ha, F. Bertsch, R. Picard and D. Eck (2018) Learning via social awareness: improving a deep generative sketching model with facial feedback. arXiv preprint arXiv:1802.04877.
  23. K. S. Kassam (2010) Assessment of emotional experience through facial expression. Harvard University.
  24. D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  25. B. Kort, R. Reilly and R. W. Picard (2001) An affective model of interplay between emotions and learning: reengineering educational pedagogy - building a learning companion. In Proceedings IEEE International Conference on Advanced Learning Technologies, pp. 43–46.
  26. M. LaFrance (2011) Lip service: smiles in life, death, trust, lies, work, memory, sex, and politics. WW Norton & Company.
  27. B. Martinez, M. F. Valstar, B. Jiang and M. Pantic (2017) Automatic analysis of facial actions: a survey. IEEE Transactions on Affective Computing.
  28. J. C. Masters, R. C. Barden and M. E. Ford (1979) Affective states, expressive behavior, and learning in children. Journal of Personality and Social Psychology 37 (3), pp. 380.
  29. D. McDuff and A. Kapoor (2019) Visceral machines: risk-aversion in reinforcement learning with intrinsic physiological rewards. International Conference on Learning Representations (ICLR).
  30. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  31. X. Pan, Y. You, Z. Wang and C. Lu (2017) Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.
  32. D. Pathak, P. Agrawal, A. A. Efros and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML).
  33. S. Ross, G. Gordon and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635.
  34. S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell and M. Hebert (2013) Learning monocular reactive UAV control in cluttered natural environments. In 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772.
  35. M. Rychlowska, R. E. Jack, O. G. B. Garrod, P. G. Schyns, J. D. Martin and P. M. Niedenthal (2017) Functional smiles: tools for love, sympathy, and war. Psychological Science 28 (9), pp. 1259–1270.
  36. N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, T. Lillicrap and S. Gelly (2018) Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274.
  37. S. Shah, D. Dey, C. Lovett and A. Kapoor (2018) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pp. 621–635.
  38. J. E. Swan, A. Jones, E. Kolstad, M. A. Livingston and H. S. Smallman (2007) Egocentric depth judgments in optical, see-through augmented reality. IEEE Transactions on Visualization and Computer Graphics 13 (3), pp. 429–442.
  39. D. Wang, C. Devin, Q. Cai, P. Krähenbühl and T. Darrell (2019) Monocular plan view networks for autonomous driving. arXiv preprint arXiv:1905.06937.
  40. D. Zadok, T. Hirshberg, A. Biran, K. Radinsky and A. Kapoor (2019) Explorations and lessons learned in building an autonomous formula SAE car from simulations. SIMULTECH.
  41. Z. Zheng, J. Oh and S. Singh (2018) On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654.
  42. J. Zhu, T. Park, P. Isola and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.