Information Minimization In Emergent Languages
There is a growing interest in studying the languages emerging when neural agents are jointly trained to solve tasks that require communication through discrete messages. We investigate here the information-theoretic complexity of such languages, focusing on the most basic two-agent, one-symbol, one-exchange setup. We find that, under common training procedures, the emergent languages are subject to an information minimization pressure: The mutual information between the communicating agent’s inputs and the messages is close to the minimum that still allows the task to be solved. After verifying this information minimization property, we perform experiments showing that a stronger discrete-channel-driven information minimization pressure leads to increased robustness to overfitting and to adversarial attacks. We conclude by discussing the implications of our findings for the studies of artificial and natural language emergence, and for representation learning.
Eugene Kharitonov Facebook AI Research email@example.com Rahma Chaabouni Facebook AI Research / LSCP firstname.lastname@example.org Diane Bouchacourt Facebook AI Research email@example.com Marco Baroni Facebook AI Research / ICREA firstname.lastname@example.org
Preprint. Under review.
The ability to communicate in a human-like, discrete language is considered an important stepping stone towards a general AI (Mikolov et al., 2016; Chevalier-Boisvert et al., 2019). This vision revived research interest in artificial neural agents communicating with each other through discrete messages (that is, at each time step, the communicating agent must produce one symbol from a fixed vocabulary) to jointly solve a task (Lazaridou et al., 2016; Kottur et al., 2017; Havrylov and Titov, 2017; Choi et al., 2018; Lazaridou et al., 2018). These simulations might also help to shed light on how human language itself evolved (Kirby, 2002; Hurford, 2014; Graesser et al., 2019).
We study here the communication biases that arise in this setup. Specifically, we are interested in the properties of the emergent protocols (languages), namely their complexity and redundancy w.r.t. the task being solved. We use simple signaling games (Lewis, 1969), as introduced to the current language emergence literature by Lazaridou et al. (2016) and adopted in several later studies (Havrylov and Titov, 2017; Bouchacourt and Baroni, 2018; Lazaridou et al., 2018). There are two agents, Sender and Receiver, provided with individual inputs at the beginning of each episode. Sender sends a single message to Receiver, and Receiver has to perform an action based on its input and the received message. Importantly, there is no direct supervision on the message protocol. We consider agents that are deterministic functions of their inputs (after training).
As an example, consider an image classification task where an image is shown to Sender, and Receiver has to choose its class out of $N_l$ equiprobable classes, solely based on the message from Sender. Would Sender’s messages encode the target class? In this case, the protocol would be “simple” and only transmit $\log_2 N_l$ bits. Alternatively, Sender could try to describe the full image, and let Receiver decide on its class. The emergent protocol would be more “complex”, encoding more information about the image.
We find that, once the agents are successfully trained to jointly solve a task, the emergent protocol tends to minimize mutual information between Sender input and messages or, equivalently in our setup, the entropy of the messages. In other words, the agents consistently approximate the simplest successful protocol. In the example above, the protocol will transmit only the $\log_2 N_l$ bits needed to identify one of the $N_l$ equiprobable classes.
More precisely, since the agents are deterministic, we can connect entropy and mutual information between Sender’s input , messages , and Receiver’s output (the chosen action) by using the standard Data Processing Inequalities (Cover and Thomas, 2012):
Once an agent pair is successfully trained to solve the task, our empirical measurements indicate that the messages in the emergent protocol tend to approach the lower bounds and :
This effect holds even if there is a large gap to the upper bound $H(I_s)$, the entropy of Sender’s input. Moreover, by controlling the amount of redundant information that is provided to Receiver on the side, without changing other parameters, we observe that the emergent protocol becomes simpler (lower message entropy $H(M)$ and mutual information $I(I_s; M)$) as the amount of information directly provided to Receiver grows. In other words, the emergent protocol ‘adapts’ to minimize the information that passes through it.
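The chain of bounds discussed above can be sketched as follows. This is a hedged derivation under assumed notation (upper-case $I_s$, $I_r$, $M$, $O$ for the random variables, with $M = S(I_s)$ and $O = R(M, I_r)$ deterministic after training); it is not necessarily the exact form of the numbered equations in the paper.

```latex
% Assumed notation: M = S(I_s) and O = R(M, I_r) are deterministic maps.
\begin{align*}
  H(M \mid I_s) = 0 \;&\Rightarrow\; I(I_s; M) = H(M) - H(M \mid I_s) = H(M) \le H(I_s), \\
  H(O \mid M, I_r) = 0 \;&\Rightarrow\; H(O \mid I_r) \le H(M \mid I_r) \le H(M).
\end{align*}
% Combined: H(I_s) >= H(M) = I(I_s; M) >= H(O | I_r).
```

Read this way, the empirical finding is that $H(M)$ sits near the right-hand end of the chain, $H(O \mid I_r)$, rather than near $H(I_s)$.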
Our paper is split into two parts. First, we study the complexity of the protocols emerging in two setups, as a function of various factors. Next, having empirically established the aforementioned information minimization property, we show that neural networks that rely on a discrete communication channel exhibit desirable properties shared with architectures incorporating the Information Bottleneck (Tishby et al., 1999) principle: robustness to overfitting and against adversarial noise. In the conclusion, we outline the implications of our findings for the study of discrete-channel architectures in AI as well as in the investigation of the origin and usage of human language.
We study two signaling games. In Guess Number, the agents are trained to recover an integer-representing vector with uniform Bernoulli-distributed components. This simple setup gives us full control over the amount of information needed to solve the task. The second task uses more naturalistic data, as the agents are jointly trained to classify MNIST (LeCun et al., 1998) images.
Throughout the text, we denote the input to Sender (Receiver) by $i_s$ ($i_r$); $m$ denotes messages; $o$ is Receiver’s output; $l$ is the ground-truth output.
Guess Number We sample uniformly at random an 8-bit integer $z$, by sampling its 8 bits independently from the uniform Bernoulli distribution. All bits are revealed to Sender as an 8-dimensional binary vector $i_s$. The last $k$ bits are revealed to Receiver ($0 \leq k \leq 8$) as its input $i_r$. Sender sends a single-symbol message $m$ to Receiver. In turn, Receiver outputs a vector $o$ that recovers all the bits of $z$ and should be equal to $i_s$. By changing $k$, we control the minimal information that Sender has to send Receiver so that it can answer perfectly: $H(Z \mid I_r) = 8 - k$ bits (similarly, if the task is solved, $H(O \mid I_r) = 8 - k$).
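The lower bound for Guess Number can be checked by brute force. The sketch below enumerates all 8-bit inputs and computes the entropy of the integer given the last $k$ bits that Receiver sees; the function name and structure are illustrative assumptions, not the authors' code.

```python
import itertools
import math
from collections import Counter

N_BITS = 8


def conditional_entropy_bits(k: int) -> float:
    """H(z | last k bits of z) in bits, for z uniform over all 8-bit vectors."""
    inputs = list(itertools.product([0, 1], repeat=N_BITS))
    # Group inputs by what Receiver observes (the last k bits).
    groups = Counter(bits[N_BITS - k:] for bits in inputs)
    total = len(inputs)
    # Within each group z is uniform, so H(z | i_r) = sum_g p(g) * log2(|g|).
    return sum((size / total) * math.log2(size) for size in groups.values())


# Minimal number of bits Sender must transmit for each configuration k.
lower_bounds = {k: conditional_entropy_bits(k) for k in range(N_BITS + 1)}
```

For every $k$ the computed value equals $8 - k$, which is the lower bound used in the analysis.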
In this game, Sender has a linear layer that maps the input vector to a hidden representation of size 10, followed by a leaky ReLU activation. Next is a linear layer followed by a softmax over the vocabulary. Receiver linearly maps both its input and the message to 10-dimensional vectors, concatenates them, and applies a fully connected layer with output size 20, followed by a leaky ReLU. Finally, another linear layer and a sigmoid nonlinearity are applied. When training with REINFORCE or the Stochastic Computation Graph approach (see Section 2.2), we increased the sizes of the hidden layers threefold, as this led to more robust convergence. When training with the Gumbel-Softmax relaxation, the hidden sizes are left unchanged.
Image Classification In this game, the agents are jointly trained to classify 28x28 MNIST images. As input $i_s$, Sender receives the entire image. Receiver also receives 28x28 images as $i_r$, but only the bottom $k$ rows ($0 \leq k \leq 28$) are informative, as we zero out the top $28 - k$ rows. Sender sends a single-symbol message $m$ to Receiver, which, based on the message and its own input $i_r$, has to select the image class (one digit out of 10). All pixels are scaled to be in $[0, 1]$.
The agents generally have the same architecture as in Guess Number. Input images are embedded by LeNet-1 instances (LeCun et al., 1990) into 400-dimensional vectors. In Sender, this embedded vector is passed to a fully connected layer, followed by a softmax selecting a vocabulary symbol. Similarly, Receiver embeds the input image into a 400-dimensional vector by applying a LeNet-1 instance and concatenates it with a 400-dimensional embedding of the message from Sender. This concatenated vector is passed to a fully connected layer, followed by a softmax.
In both games, we fix the vocabulary to 1024 symbols (experiments with other vocabulary sizes are in Supplementary). No parts of the agents are pre-trained or shared. The optimized loss depends on the gradient estimation method used (see Section 2.2). We denote it $L(o, l)$, as it is a function of Receiver’s output $o$ and the ground-truth output $l$. When training with REINFORCE, we use 0/1 loss: both agents get 0 only if all bits of $z$ were correctly recovered (Guess Number) or the correct image class was chosen (Image Classification). When training with the Gumbel-Softmax relaxation or the Stochastic Computation Graph approach, we use binary cross-entropy (Guess Number) and negative log-likelihood (Image Classification).
2.2 Training with discrete channel
Training to communicate with discrete messages is non-trivial, as backpropagation through communicated messages is not possible. Current language emergence work mostly uses the Gumbel-Softmax relaxation (e.g., Havrylov and Titov, 2017) or REINFORCE (e.g., Lazaridou et al., 2016) to get gradient estimates. We also explore the Stochastic Graph optimization approach. We plug the obtained gradient estimates into the Adam optimizer (Kingma and Ba, 2014).
Gumbel-Softmax relaxation The core observation is that samples from the Gumbel-Softmax (Maddison et al., 2016; Jang et al., 2016) distribution (a) are reparameterizable, hence allow gradient-based training, and (b) approximate samples from the corresponding Categorical distribution. To get a sample that approximates an $n$-dimensional Categorical distribution with probabilities $p_1, \ldots, p_n$, we draw $n$ i.i.d. samples $g_1, \ldots, g_n$ from Gumbel(0,1) and use them to calculate a vector $s$ with components:
$$s_i = \frac{\exp\left((\log p_i + g_i)/\tau\right)}{\sum_{j=1}^{n} \exp\left((\log p_j + g_j)/\tau\right)},$$
where $\tau$ is the temperature hyperparameter. As $\tau$ tends to $0$, the samples get closer to one-hot samples; as $\tau \to \infty$, the components $s_i$ become uniform. During training, we use these relaxed samples as messages from Sender, making the entire Sender/Receiver setup differentiable.
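The relaxed sampling step described above can be sketched in a few lines of NumPy. This is an illustrative stand-alone version, assuming strictly positive probabilities; the function name and the small lower clamp on the uniform draw are our own choices, not taken from the paper.

```python
import numpy as np


def gumbel_softmax_sample(probs, tau, rng):
    """Draw one relaxed sample approximating Categorical(probs) at temperature tau."""
    probs = np.asarray(probs, dtype=np.float64)
    # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1).
    # The lower clamp (an assumption of this sketch) avoids log(0).
    u = rng.uniform(low=1e-12, high=1.0, size=probs.shape)
    g = -np.log(-np.log(u))
    logits = (np.log(probs) + g) / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


rng = np.random.default_rng(0)
near_one_hot = gumbel_softmax_sample([0.2, 0.3, 0.5], tau=0.01, rng=rng)
near_uniform = gumbel_softmax_sample([0.2, 0.3, 0.5], tau=1e6, rng=rng)
```

Low temperatures typically yield vectors close to one-hot, while very high temperatures yield vectors close to uniform, matching the two limits of the relaxation.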
REINFORCE (Williams, 1992) is a standard reinforcement learning algorithm. In our setup, it estimates the gradient of the expectation of the loss $L(o, l)$ w.r.t. the parameter vector $\theta$ as follows:
The expectations are estimated by sampling $m$ from Sender and, after that, sampling $o$ from Receiver. We use the running mean baseline $b$ (Greensmith et al., 2004; Williams, 1992) as a control variate. Another common trick we adopt is to add an entropy regularization term (Williams and Peng, 1991; Mnih et al., 2016) that favors higher entropy. We impose entropy regularization on the outputs of the agents with coefficients $\lambda_s$ (Sender) and $\lambda_r$ (Receiver).
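To illustrate the score-function estimator with a baseline, the sketch below uses a deliberately simplified toy setting: a single softmax policy over symbols, a fixed per-symbol loss, and a constant baseline. These simplifications, and all names, are assumptions of the example; the paper's setup additionally samples Receiver's action, uses a running mean baseline, and adds entropy regularization.

```python
import numpy as np


def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()


def reinforce_gradient(theta, losses, baseline, n_samples, rng):
    """Monte Carlo estimate of d E_m[loss(m)] / d theta for a softmax policy."""
    probs = softmax(theta)
    # Sample message symbols m ~ Categorical(probs).
    m = rng.choice(len(probs), size=n_samples, p=probs)
    # For a softmax policy, grad_theta log p(m) = onehot(m) - probs.
    grad_log_p = np.eye(len(probs))[m] - probs
    # Score-function estimator; the baseline acts as a control variate.
    return ((losses[m] - baseline)[:, None] * grad_log_p).mean(axis=0)


def exact_gradient(theta, losses):
    """Closed-form gradient of sum_i p_i * loss_i, for comparison."""
    probs = softmax(theta)
    return probs * (losses - probs @ losses)


rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0])
losses = np.array([1.0, 0.0, 1.0])  # 0/1 loss: only symbol 1 solves the toy task
estimate = reinforce_gradient(theta, losses, baseline=0.5, n_samples=200_000, rng=rng)
```

Because the baseline does not depend on the sampled symbol, it leaves the expectation of the estimate unchanged and only affects its variance.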
Stochastic computational graph estimator In our setup, the gradient estimate approach of Schulman et al. (2015) reduces to computing the gradient of the following surrogate function:
In this case, we do not sample actions of Receiver, and the gradients of its parameters are obtained with standard backpropagation (the first term in Eq. 5). Sender’s messages are sampled, and its gradients are calculated akin to REINFORCE (the second term in Eq. 5). We apply entropy regularization on Sender’s output (with coefficient $\lambda_s$) and use the mean baseline $b$.
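One surrogate consistent with the two-term description above can be written as follows. This is a hedged sketch, not necessarily the paper's exact Eq. 5: the stop-gradient operator $\bot(\cdot)$, the symbol $P_S$ for Sender's output distribution, and the functional notation $R(m, i_r)$ for Receiver are assumed here for illustration.

```latex
% Assumed notation: \bot(.) blocks gradients; P_S is Sender's distribution over symbols.
\begin{equation*}
  \mathbb{E}_{m \sim P_S(\cdot \mid i_s)}\Big[\,
      L\big(R(m, i_r),\, l\big)
      \;+\; \Big(\bot\big(L(R(m, i_r),\, l)\big) - b\Big)\,
            \log P_S(m \mid i_s)
  \Big]
\end{equation*}
% First term: differentiable w.r.t. Receiver's parameters (plain backpropagation).
% Second term: score-function (REINFORCE-like) gradient w.r.t. Sender's parameters.
```

Differentiating the first term only reaches Receiver, since the sampled symbol $m$ cuts the path to Sender; differentiating the second term only reaches Sender, since the loss inside it is held constant.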
2.3 Experimental protocol
At test time, we select Sender’s message symbol greedily, hence the messages are discrete and Sender represents a (deterministic) function $S$ of its input $i_s$, $m = S(i_s)$. Consequently, $H(M \mid I_s) = 0$, and message entropy and mutual information between the messages and Sender’s inputs are equal:
$$H(M) = I(I_s; M). \tag{6}$$
Calculating the entropy of the distribution of discrete messages is straightforward. In Guess Number, we enumerate all 256 possible values of $z$ as inputs, save the messages from Sender and calculate the entropy $H(M)$. As MNIST contains a test set, we use it for the Image Classification analysis. Due to a higher computational complexity, we use the first 16k test images.
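The entropy measurement for Guess Number can be sketched as below. The toy Sender is a hypothetical stand-in for a trained agent (it simply transmits the top three bits of the integer), and the function names are ours.

```python
import math
from collections import Counter


def message_entropy_bits(messages):
    """Empirical entropy (in bits) of a list of discrete message symbols."""
    counts = Counter(messages)
    total = len(messages)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def toy_sender(z):
    """Hypothetical deterministic Sender: emits the top 3 bits of an 8-bit integer."""
    return z >> 5


# Enumerate all 256 inputs, collect the greedy messages, and measure H(M).
messages = [toy_sender(z) for z in range(256)]
entropy = message_entropy_bits(messages)
```

Since the toy Sender is deterministic, this value is also the mutual information between its inputs and messages; here it is 3 bits, because the 8 symbols are used equally often.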
We run the training procedure for each task and agent hyper-parameter combination. We select the runs that achieved a high level of task-specific performance (training accuracy above 0.99 for Guess Number and validation accuracy above 0.98 for the harder MNIST classification task). Next, for each selected run, we calculate message entropy as outlined above. We thus study the typical behavior of the agents provided they succeeded at the game.
3.1 Information minimization property
Guess Number In Figure 1, the horizontal axes span the number of bits (binary digits) of $z$ that Receiver lacks, $8 - k$. The vertical axis reports the information content of the protocol, measured by the mutual information between input and messages, $I(I_s; M)$ ($= H(M)$, Eq. 6). Each integer on the horizontal axis corresponds to a game configuration, and for each such configuration we aggregate multiple (successful) runs with different hyperparameters and random seeds. The Lower bound lines indicate the minimal number of bits Sender has to send in a particular configuration for the task to be solvable, $8 - k$. The upper bound (not shown) is equal to $H(I_s) = 8$ bits.
Firstly, consider the configurations where Receiver’s input is insufficient to answer correctly in all cases (at least one binary digit hidden, $k \leq 7$). From Figure 1(a), we observe that the transmitted information is strictly monotonically increasing with the number of binary digits hidden from Receiver. Thus, even if Sender sees the very same input in all configurations, a more nuanced protocol only develops when it is necessary. Moreover, for every such configuration, the information transmitted by the protocol remains close to the lower bound. This information minimization property holds for all the considered training approaches across all configurations.
Consider next the configuration where Receiver is getting the whole integer $z$ as its input ($k = 8$, leftmost configuration in Figure 1, corresponding to 0 on the x axis). Based on the observations above, one would expect that the protocol would transmit nearly zero information in this case (as no information needs to be transmitted). However, the measurements indicate that the protocol is encoding considerably more information. It turns out that this information is entirely ignored by Receiver. To demonstrate this, we fed all possible distinct inputs to Sender, obtained the corresponding messages, and shuffled them to destroy any information about the inputs they might carry. The shuffled messages were then passed to Receiver alongside its own (un-shuffled) inputs. The overall performance was not affected by this manipulation, confirming the hypothesis that Receiver ignores messages. We conclude that in this case there is no apparent information minimization pressure on Sender simply because there is no communication. This experiment is reported in Supplementary.
We further consider the effect of various hyperparameters. In Figure 1(b), we split the results obtained with Gumbel-Softmax by relaxation temperature. As discussed in Section 2.2, lower temperatures more closely approximate discrete communication, hence providing a convenient control of the level of discreteness imposed at training time (recall that at test time we select the symbol greedily from the softmax layer of Sender in all cases). The figure shows that lower temperatures consistently lead to lower $I(I_s; M)$ values. This implies that, as we increase the “level of discreteness” at training time, we get stronger information minimization pressures.
Similarly, in Figures 1(c) & 1(d), we report mutual information when training with Stochastic Graph optimization and REINFORCE across degrees of entropy regularization. We report curves corresponding to the entropy regularization coefficient values which converged in more than three configurations. For REINFORCE, we see that there is a weak tendency for a higher coefficient to cause a higher information content of the protocol (only violated by ).
Image Classification As the models are more complex, we only had consistent success when training with Gumbel-Softmax. In contrast to Guess Number, it is hard to estimate the lower bound on the information Sender has to convey to allow success in all configurations. The upper bound is $\log_2 1024 = 10$ bits (bounded by the vocabulary size). The entropy of the labels is bits, same as .
In Figure 2(a) we aggregate all successful runs. As before, we observe that the information encoded by the protocol only grows when Receiver’s own input becomes less informative. In the configuration where Receiver has no input, message entropy is not significantly higher than bits, compatible with the “simplest protocol” scenario exemplified in Section 1, where Sender is directly passing image labels. In Figure 2(b), we split the runs by temperature. As in Guess Number, lower temperatures increase information minimization pressure.
Summarizing, when communicating through a discrete channel, there is consistent pressure for the emergent protocol to encode as little information about the input as necessary. This holds across games, training methods and hyperparameters. Moreover, when training with Gumbel-Softmax, temperature controls the strength of the minimization pressure, confirming the relation between the latter and discreteness.
3.2 Robustness of discrete channel
Some recent approaches, inspired by the Information Bottleneck of Tishby et al. (1999), try to control the amount of information about the input that is stored in a representation (Achille and Soatto, 2018). Experiments show that this kind of regularization is beneficial, leading to robustness to overfitting (Fischer, 2019) and to adversarial attacks (Alemi et al., 2016; Fischer, 2019). We demonstrate here that our relaxed emergent discrete protocols also possess these properties.
We focus on the Image Classification game. We use the same architecture as above, but Receiver has no input apart from the messages and, consequently, no “vision” module. The agents are trained with Gumbel-Softmax relaxation. However, for practicality, we do not switch to fully discrete communication at test time. We only remove the noise at test time, effectively reducing Sender’s output to a softmax with temperature $\tau$. We refer to this architecture as GS.
We consider two baseline architectures without a relaxed discrete channel. In Linear, the fully connected output layer of Sender is directly connected to the linear embedding input of Receiver. Softmax (SM) places a softmax activation (with temperature $\tau$) after Sender’s output layer and passes the result to Receiver. At test time, SM coincides with GS with the same temperature.
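The three channel variants can be contrasted with a small NumPy sketch. The function names and the `training` flag are illustrative assumptions; only the behavior described above (GS adds Gumbel noise during training and reduces to a tempered softmax at test time) is taken from the text.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def linear_channel(logits):
    """Linear baseline: Sender's output layer feeds Receiver directly."""
    return logits


def sm_channel(logits, tau):
    """Softmax (SM) baseline: tempered softmax, no sampling noise."""
    return softmax(logits / tau)


def gs_channel(logits, tau, rng=None, training=False):
    """GS: Gumbel noise is added only at training time."""
    if training:
        u = rng.uniform(1e-12, 1.0, size=logits.shape)
        logits = logits - np.log(-np.log(u))  # logits + Gumbel(0, 1) noise
    return softmax(logits / tau)


logits = np.array([2.0, 0.5, -1.0])
test_time_gs = gs_channel(logits, tau=1.5)
test_time_sm = sm_channel(logits, tau=1.5)
train_time_gs = gs_channel(logits, tau=1.5, rng=np.random.default_rng(0), training=True)
```

In this sketch the only difference between GS and SM is the training-time noise, which is the factor the comparison between the two is meant to isolate.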
Learning in presence of random labels Following Zhang et al. (2016), we study how successful the agents are in learning to classify MNIST images in presence of randomly-shuffled training examples (the test set is untouched). We vary temperature and amount of training examples with shuffled labels. We use temperatures and , having checked that the agents trained with these temperatures reach a test accuracy of 0.98 when trained on the original training set.
In Figure 3(a) we report training accuracy when all labels are shuffled. Linear and SM with fit the random labels almost perfectly within the first 150 epochs. With , GS and SM achieve the accuracy of 0.8 in 150 epochs. When GS with is considered, the agents only start to improve over random guessing after 150 epochs, and accuracy is well below 0.2 after 200 epochs. As expected, test set performance is at chance level (Figure 3(b)). In the next experiment, we shuffle labels for a randomly selected half of the training instances. Train and test accuracies are shown in Figures 3(c) and 3(d), respectively. All models initially fit the true-label examples (train accuracy , test accuracy ). With more training, the baselines (e.g. Linear) and GS with start fitting the randomly labeled examples, too: train accuracy grows, while test accuracy falls. In contrast, GS with does not fit random labels in 200 epochs, and its test accuracy stays high.
We interpret the results as follows. For “successful” overfitting, the agents need to coordinate label memorization. This requires passing large amounts of information through the channel. With low temperature (more closely approximating a discrete channel), this is hard, due to the stronger information minimization pressure. To support the hypothesis, we ran an experiment where one fully connected layer of size 400x400 is added either to Sender (just before the channel) or to Receiver (just after the channel). We predict that, with the higher $\tau$ (less pressure), the training curves will be very close, as in both cases the capacity can be used for memorization equally easily. With the lower $\tau$ (more pressure), the curves would be more distant. Figures 3(e) & 3(f) bear out the prediction. Finally, on comparing GS and SM in the above experiments, we conclude that the training-time discretization noise in GS is instrumental for the information minimization behavior that we observe.
Adversarial attacks To study the robustness of our architectures to undirected adversarial attacks, we train them with different random seeds and implement white-box attacks on the trained models, varying temperature $\tau$ and the allowed perturbation norm, $\epsilon$. We use the standard Fast Gradient Sign Method (FGSM) of Goodfellow et al. (2014). The original image $i_s$ is perturbed to $i_s'$ along the direction that maximizes the loss of Receiver’s output $o$ w.r.t. the ground-truth class $l$:
$$i_s' = i_s + \epsilon \cdot \mathrm{sign}\left(\nabla_{i_s} L(o, l)\right),$$
where $\epsilon$ controls the $L_\infty$ norm of the perturbation. Under an attack with a fixed $\epsilon$, a more robust method would have a smaller accuracy drop. To avoid the numerical stability issues akin to those reported by Carlini and Wagner (2016), all computations are done in 64-bit floats. Figure 4(a) shows that, as the relaxation temperature decreases, the accuracy drop also decreases. The highest robustness is achieved with . Comparison with the baselines (Figure 4(b)) confirms that relaxed discrete training with improves robustness. However, this robustness comes at the cost of harder training: 2 out of 5 random seeds did not reach a high performance level (0.98) after 250 epochs.
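The FGSM step can be sketched on a toy model whose input gradient is available in closed form. The logistic-regression classifier below is a stand-in chosen for the example, not the Sender/Receiver architecture, and clipping to the pixel range is an assumption of this sketch.

```python
import numpy as np


def logistic_loss_and_grad(x, w, b, y):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w  # d loss / d x for the logistic model
    return loss, grad_x


def fgsm_perturb(x, grad_x, eps, clip_min=0.0, clip_max=1.0):
    """One FGSM step: move each input coordinate by eps in the loss-increasing direction."""
    # Clipping keeps the perturbed input in the valid pixel range (assumed here).
    return np.clip(x + eps * np.sign(grad_x), clip_min, clip_max)


# 64-bit floats throughout, mirroring the precision choice mentioned in the text.
x = np.full(4, 0.5, dtype=np.float64)
w = np.array([1.0, -2.0, 0.5, 3.0])
loss_clean, grad = logistic_loss_and_grad(x, w, b=0.1, y=1)
x_adv = fgsm_perturb(x, grad, eps=0.1)
loss_adv, _ = logistic_loss_and_grad(x_adv, w, b=0.1, y=1)
```

In this toy case the perturbed input has a strictly higher loss than the clean one, while no coordinate moves by more than `eps`.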
4 Related Work
Recent literature on emergent language analysis concentrates on making sense of individual utterances, linking them to higher-level concepts (e.g., ImageNet categories) and studying properties such as their compositionality (e.g., Lazaridou et al., 2016, 2018; Havrylov and Titov, 2017; Kottur et al., 2017; Bouchacourt and Baroni, 2018). Our work is rooted in this line of research; but, in contrast, we focus on the information-theoretic properties of the learned protocols. Evtimova et al. (2018) use information theory to study the progress of the protocol during learning.
Discrete latent representations are studied in many places (e.g., van den Oord et al., 2017; Jang et al., 2016; Rolfe, 2016). However, these works focus on ways to learn discrete representations, rather than analyzing the properties of representations that are independently emerging on the side. Other studies, inspired by the Information Bottleneck method of Tishby et al. (1999), control the complexity of neural-network-induced representations by regulating their information content (Strouse and Schwab, 2017; Fischer, 2019; Alemi et al., 2016; Achille and Soatto, 2018). While they externally impose an information bottleneck, we observe that information minimization is an intrinsic feature in learning to communicate with a discrete channel.
Finally, Tieleman et al. (2018) show that communication-based learning helps to build robust representations. They use community-based autoencoders with a continuous latent space, where random pairs of encoders and decoders ‘talk’ to each other.
5 Discussion and future work
We have shown in controlled experiments that, when two neural networks learn to solve a task jointly through a discrete communication code, the latter stays close to the lower bound on the amount of information that needs to be passed through for task success. We further presented empirical evidence that this information minimization property is beneficial, as the communication protocol naturally acts as a useful bottleneck. In particular, imposing progressively more discreteness on the communication channel leads to robustness to noise and to adversarial attacks.
The practical motivation that is usually given for discrete communication protocols is that we eventually want intelligent machines to be able to communicate with humans, and the latter use a discrete protocol. Indeed, discrete messages are not required in multi-agent scenarios where no human in the loop is foreseen (Sukhbaatar et al., 2016). Our experiments suggest that, long before agent communication reaches the level of complexity that would allow AIs to have conversations with people, there are independent reasons to encourage discreteness, as it provides a source of robustness in a noisy world. Future applied work should test, in more general settings, discrete communication as a form of representation learning. On the other hand, if the goal is to develop an advanced communication protocol to mimic human language, it is important to design complex tasks that require a large amount of information to be exchanged, as the agents will converge to the simplest protocol they can get away with.
The regularization properties of the emergent discrete channel could also have shaped human language evolution. The discrete nature of the latter is often traced back to the fact that it allows us to produce an infinite number of expressions by combining a finite set of primitives (e.g., Berwick and Chomsky, 2016). However, it is far from clear that the need to communicate an infinite number of concepts could have provided the initial pressure to develop a discrete code. More probably, once such code independently emerged, it made it possible to develop an infinitely expressive language (Bickerton, 2014; Collier et al., 2014). Our work suggests that discrete coding is advantageous already when communication is about a limited number of concepts and just one symbol is transmitted at a time. We would like next to compare performance of human subjects equipped with novel continuous vs. discrete non-linguistic communication protocols, adopting the methods of experimental semiotics (Galantucci, 2009). We expect discrete protocols to favor generalization and robustness.
Finally, information minimization trends have also been observed in many natural language phenomena, ranging from an organization of color terms that minimizes average bit transmission to the avoidance of redundant coding of syntactic information (Gibson et al., 2019). Many have sought a functional explanation for this tendency, as one motivated by the need to minimize speakers’ effort (e.g., Zipf, 1949; Futrell et al., 2015; Fedzechkina et al., 2017; Mahowald et al., 2018). We discovered the same trend in neural network agents that solve a task jointly through a discrete communication channel. One could see this as a challenge to functional explanations of analogous phenomena in language. Perhaps, information minimization is a general property of emerging discrete-channel communication, due to some yet-to-be understood mathematical properties of the setup. Alternatively, information minimization in neural networks might also be due to least-effort factors. We will pursue this topic in future studies.
- Mikolov et al.  Tomas Mikolov, Armand Joulin, and Marco Baroni. A roadmap towards machine intelligence. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 29–61. Springer, 2016.
- Chevalier-Boisvert et al.  Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. In ICLR, 2019.
- Lazaridou et al.  Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
- Kottur et al.  Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge ‘naturally’ in multi-agent dialog. arXiv preprint arXiv:1706.08502, 2017.
- Havrylov and Titov  Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In NIPS, 2017.
- Choi et al.  Edward Choi, Angeliki Lazaridou, and Nando de Freitas. Compositional obverter communication learning from raw visual input. arXiv preprint arXiv:1804.02341, 2018.
- Lazaridou et al.  Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984, 2018.
- Kirby  Simon Kirby. Natural language from artificial life. Artificial life, 8(2):185–215, 2002.
- Hurford  James Hurford. The Origins of Language. Oxford University Press, Oxford, UK, 2014.
- Graesser et al.  Laura Graesser, Kyunghyun Cho, and Douwe Kiela. Emergent linguistic phenomena in multi-agent communication games. https://arxiv.org/abs/1901.08706, 2019.
- Lewis  David Lewis. Convention. Harvard University Press, Cambridge, MA, 1969.
- Bouchacourt and Baroni  Diane Bouchacourt and Marco Baroni. How agents see things: On visual representations in an emergent language game. In EMNLP, 2018.
- Cover and Thomas  Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
- Tishby et al.  N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing. University of Illinois Press, 1999.
- LeCun et al.  Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- LeCun et al.  Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Maddison et al.  Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- Jang et al.  Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
- Williams  Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Greensmith et al.  Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. JMLR, 5(Nov):1471–1530, 2004.
- Williams and Peng  Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
- Mnih et al.  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
- Schulman et al.  John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In NIPS, 2015.
- Achille and Soatto  Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE TPAMI, 40(12):2897–2905, 2018.
- Fischer  Ian Fischer. The conditional entropy bottleneck, 2019. URL https://openreview.net/forum?id=rkVOXhAqY7.
- Alemi et al.  Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- Zhang et al.  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Goodfellow et al.  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Carlini and Wagner  Nicholas Carlini and David Wagner. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311, 2016.
- Evtimova et al.  Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. Emergent communication in a multi-modal, multi-step referential game. In ICLR, 2018.
- van den Oord et al.  Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, 2017.
- Rolfe  Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
- Strouse and Schwab  DJ Strouse and David J Schwab. The deterministic information bottleneck. Neural computation, 29(6):1611–1630, 2017.
- Tieleman et al.  Olivier Tieleman, Angeliki Lazaridou, Shibl Mourad, Charles Blundell, and Doina Precup. Shaping representations through communication. 2018.
- Sukhbaatar et al.  Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In NIPS, 2016.
- Berwick and Chomsky  Robert Berwick and Noam Chomsky. Why Only Us: Language and Evolution. MIT Press, Cambridge, MA, 2016.
- Bickerton  Derek Bickerton. More than Nature Needs: Language, Mind, and Evolution. Harvard University Press, Cambridge, MA, 2014.
- Collier et al.  Katie Collier, Balthasar Bickel, Carel van Schaik, Marta Manser, and Simon Townsend. Language evolution: Syntax before phonology? Proceedings of the Royal Society B: Biological Sciences, 281(1788):1–7, 2014.
- Galantucci  Bruno Galantucci. Experimental semiotics: A new approach for studying communication as a form of joint action. Topics in Cognitive Science, 1(2):393–410, 2009.
- Gibson et al.  Edward Gibson, Richard Futrell, Steven Piantadosi, Isabelle Dautriche, Kyle Mahowald, Leon Bergen, and Roger Levy. How efficiency shapes human language. Trends in Cognitive Sciences, 2019. In press.
- Zipf  George Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Boston, MA, 1949.
- Futrell et al.  Richard Futrell, Kyle Mahowald, and Edward Gibson. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341, 2015.
- Fedzechkina et al.  Maryia Fedzechkina, Elissa Newport, and Florian Jaeger. Balancing effort and information transmission during language acquisition: Evidence from word order and case marking. Cognitive Science, 41:416–446, 2017.
- Mahowald et al.  Kyle Mahowald, Isabelle Dautriche, Edward Gibson, and Steven Piantadosi. Word forms are structured for efficient use. Cognitive Science, 42:3116–3134, 2018.
Appendix A Supplementary
Appendix B Hyperparameters
In our experiments, we used the following hyperparameter grids.
Guess Number (Gumbel-Softmax): Vocab. size: [256, 1024, 4096]; temperature: [0.5, 0.75, 1.0, 1.25, 1.5]; learning rate: [0.001, 0.0001]; max. number of epochs: 250; random seeds: [0, 1, 2, 3]; batch size: 8; early stopping thr.: 0.99; bits shown to Receiver: [0, 1, 2, 3, 4, 5, 6, 7, 8].
Guess Number (REINFORCE): Vocab. size: [256, 1024, 4096]; Sender entropy regularization coef.: [0.01, 0.05, 0.025, 0.1, 0.5, 1.0]; Receiver entropy regularization coef.: [0.01, 0.1, 0.5, 1.0]; learning rate: [0.0001, 0.001, 0.01]; max. number of epochs: 1000; random seeds: [0, 1, 2, 3]; batch size: 2048; early stopping thr.: 0.99; bits shown to Receiver: [0, 1, 2, 3, 4, 5, 6, 7, 8].
Guess Number (Stochastic Computation Graph approach): Vocab. size: [256, 1024, 4096]; Sender entropy regularization coef.: [0.01, 0.025, 0.05, 0.075, 0.1, 0.25]; learning rate: [0.0001, 0.001]; max. number of epochs: 1000; random seeds: [0, 1, 2, 3]; batch size: 2048; early stopping thr.: 0.99; bits shown to Receiver: [0, 1, 2, 3, 4, 5, 6, 7, 8].
Image Classification experiments: Vocab. size: [16, 64, 256, 1024, 4096]; temperature: [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 2.0]; learning rate: [0.01, 0.001, 0.0001]; max. number of epochs: 100; random seeds: [0, 1, 2, 3]; batch size: 32; early stopping thr.: 0.98; rows shown to Receiver: [0, 4, 8, 12, 16, 20, 24, 28].
Fitting random labels experiments: Vocab. size: 1024; temperature: [1.0, 10.0]; learning rate: 1e-4; max. number of epochs: 200; random seeds: [0, 1, 2, 3, 4]; batch size: 32; early stopping thr.: ; prob. of label corruption: [0.0, 0.5, 1.0].
Adversarial attack experiments: Vocab. size: 1024; temperature: [0.1, 1.0, 10.0]; learning rate: 1e-4; max. number of epochs: 200; random seeds: [0, 1, 2, 3, 4]; batch size: 32; early stopping thr.: 0.98.
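As an illustration of how such a grid can be turned into individual runs, the following minimal sketch expands the Guess Number (Gumbel-Softmax) grid into concrete configurations. It assumes that the grid is explored exhaustively (a full Cartesian product), which the text above does not state explicitly; the dictionary keys and the helper `expand_grid` are hypothetical names chosen for this example.

```python
from itertools import product

# Swept values, copied from the Guess Number (Gumbel-Softmax) grid.
# The key names are illustrative, not taken from the original code base.
guess_number_gs_grid = {
    "vocab_size": [256, 1024, 4096],
    "temperature": [0.5, 0.75, 1.0, 1.25, 1.5],
    "learning_rate": [0.001, 0.0001],
    "random_seed": [0, 1, 2, 3],
    "receiver_bits": [0, 1, 2, 3, 4, 5, 6, 7, 8],
}

# Values that are held fixed for every run of this grid.
guess_number_gs_fixed = {
    "max_epochs": 250,
    "batch_size": 8,
    "early_stopping_thr": 0.99,
}


def expand_grid(grid, fixed):
    """Yield one configuration dict per point of the Cartesian product."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        config = dict(zip(keys, values))
        config.update(fixed)
        yield config


configs = list(expand_grid(guess_number_gs_grid, guess_number_gs_fixed))
# Under the exhaustive-search assumption: 3 * 5 * 2 * 4 * 9 = 1080 runs.
print(len(configs))
```

Under this assumption, each configuration corresponds to one training run, and results can be aggregated over the random seeds afterwards.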
Appendix C Does vocabulary size affect the results?
We repeat the experiments of Section 3 of the main text while varying the vocabulary size. Note that, to make Guess Number solvable in every configuration, the vocabulary has to contain at least 256 symbols. Similarly, for Image Classification, the vocabulary size must be at least 10. We tried vocabulary sizes of 256, 1024, and 4096 for Guess Number, and 16, 64, 256, 1024, and 4096 for Image Classification. The results are reported in Figures 5 (Guess Number) and 6 (Image Classification). We observe little qualitative variation across vocabulary sizes, hence the conclusions of Section 3 are robust to variations of this parameter.
Appendix D Communication analysis
D.1 How much does Receiver rely on messages?
We supplement the experiments of Section 3 of the main text by studying the degree to which Receiver relies on messages. In particular, we show that: (a) when Receiver has the full input, it ignores the messages; (b) in Image Classification, a higher relaxation temperature results not only in more informative messages, but also in Receiver depending on them more strongly.
We measure the degree to which Receiver relies on the messages from Sender by constructing a setup where we break communication, but still let Receiver rely on its own input. More precisely, we first enumerate all test inputs for Sender and Receiver. We obtain the messages that correspond to Sender’s inputs, and shuffle them. Next, we feed the shuffled messages to Receiver alongside its own (unshuffled) inputs and compute the accuracy, as a measure of Receiver’s dependence on the messages. This procedure preserves the marginal distribution of the messages Receiver receives, but destroys all the information Sender transmits.
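The shuffling procedure can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function `accuracy_with_shuffled_messages` and the toy agents are hypothetical, and the toy setup assumes a Guess Number-like task in which the target is an 8-bit integer and Receiver sees its k most significant bits.

```python
import random


def accuracy_with_shuffled_messages(sender, receiver, sender_inputs,
                                    receiver_inputs, targets, seed=0):
    """Accuracy of Receiver when Sender's messages are randomly permuted.

    The permutation keeps the marginal distribution of messages intact,
    but decouples each message from the input it was produced for.
    """
    messages = [sender(x) for x in sender_inputs]
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    correct = sum(
        receiver(message, own_input) == target
        for message, own_input, target in zip(shuffled, receiver_inputs, targets)
    )
    return correct / len(targets)


def make_toy_agents(k, n_bits=8):
    """Hand-written stand-ins for trained agents (an assumption of this sketch)."""
    hidden = n_bits - k  # number of bits Receiver cannot see

    def sender(number):
        return number  # the message is simply the number itself

    def receiver(message, visible_bits):
        # Trust own input for the top k bits, the message for the rest.
        low_bits = message & ((1 << hidden) - 1)
        return (visible_bits << hidden) | low_bits

    return sender, receiver


numbers = list(range(256))
for k in (0, 4, 8):
    sender, receiver = make_toy_agents(k)
    receiver_inputs = [number >> (8 - k) for number in numbers]
    accuracy = accuracy_with_shuffled_messages(
        sender, receiver, numbers, receiver_inputs, numbers
    )
    print(k, accuracy)
```

In this toy setup, a Receiver that sees all 8 bits is unaffected by the shuffling, whereas a Receiver that sees no bits depends entirely on the messages and its accuracy collapses.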
Guess Number It is clear that, without the messages, Receiver with $k$ bits of input can only reach an accuracy of $2^k/256$. In Figure 6(a) we report results aggregated by training method. Receiver's accuracy is extremely close to this upper bound in all configurations. Moreover, when Receiver gets the entire input, the accuracy drop after the shuffling is tiny, hence its reliance on the messages is minimal.
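The no-message upper bound for Guess Number can be spelled out as a short worked calculation. It assumes that the target $z$ is an 8-bit integer drawn uniformly from $\{0, \dots, 255\}$ and that Receiver observes $k$ of its bits; the symbols $z$, $\hat{z}$, and $k$ are notation chosen for this sketch.

```latex
% Given k observed bits, 2^{8-k} values of z remain equally likely,
% so no guess \hat{z} can be correct with probability above 1 / 2^{8-k}.
\[
  P(\hat{z} = z \mid k \text{ bits of } z)
  \;\le\; \frac{1}{2^{8-k}}
  \;=\; \frac{2^{k}}{256},
\]
\[
  k = 0:\ \tfrac{1}{256}, \qquad
  k = 4:\ \tfrac{16}{256} = \tfrac{1}{16}, \qquad
  k = 8:\ 1 .
\]
```

The bound is attained by any Receiver that copies the bits it sees and guesses the remaining ones.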
Image Classification In contrast to Guess Number, it is hard to derive analytically an upper bound on the performance of Receiver without messages. To get an estimate, we train a Sender/Receiver pair where Sender has all its input zeroed. In Figure 6(b) we report the results of the shuffling experiments, split by the relaxation temperature used during training. We first observe that, again, when Receiver has the entire image as its input, its reliance on messages is small. Further, for all configurations, we see that as the relaxation temperature gets lower, so does Receiver’s reliance on the messages.
D.2 Visual prototypes of Image Classification messages
To visualize the emerging representations associated with messages, we run the following experiment. For each ground-truth class, we iterate over the test set images and find the message that is most likely associated with the class. (In all cases, the distribution of messages for a fixed class is extremely peaky.) Next, for each of these messages, we generate an image that maximizes the probability of this message being sent. For that, we start with a random image and optimize its pixels while the model is kept fixed. (We use Adam [Kingma and Ba, 2014] with learning rate 1e-3 for 5000 steps. To ensure that the image pixels are within the [0, 1] interval, the image is parameterized as a sigmoid of a real-valued vector of an appropriate size.) We report the resulting images in Figure 8. The images obtained for the lower-temperature model are somewhat visually closer to prototypical digits than those of the higher-temperature model. This qualitatively suggests that the representations are more disentangled in the lower-temperature regime.
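The prototype-generation step can be sketched in a few lines. The sketch below is not the experimental code: the trained Sender is replaced by a toy linear-softmax model with hand-picked weights, Adam is written out by hand with its usual default coefficients, and the names `message_log_probs` and `optimize_prototype` are hypothetical. It keeps the two ingredients described above: the pixels are the sigmoid of unconstrained real values, and only these values are updated while the model stays fixed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def message_log_probs(weights, pixels):
    """Toy stand-in for a frozen Sender: linear scores, then log-softmax."""
    scores = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def optimize_prototype(weights, target, init_logits, steps=5000, lr=1e-3):
    """Maximize log p(target message | image) over the image only.

    The image is sigmoid(theta), so every pixel stays within [0, 1].
    """
    theta = list(init_logits)
    first_moment = [0.0] * len(theta)
    second_moment = [0.0] * len(theta)
    beta1, beta2, eps = 0.9, 0.999, 1e-8  # common Adam defaults (assumed)
    for step in range(1, steps + 1):
        pixels = [sigmoid(z) for z in theta]
        probs = [math.exp(lp) for lp in message_log_probs(weights, pixels)]
        for i in range(len(theta)):
            # d log p(target) / d pixel_i for the linear-softmax toy model.
            d_pixel = weights[target][i] - sum(
                p * row[i] for p, row in zip(probs, weights)
            )
            # Chain rule through the sigmoid; the loss is -log p(target).
            grad = -d_pixel * pixels[i] * (1.0 - pixels[i])
            first_moment[i] = beta1 * first_moment[i] + (1 - beta1) * grad
            second_moment[i] = beta2 * second_moment[i] + (1 - beta2) * grad ** 2
            m_hat = first_moment[i] / (1 - beta1 ** step)
            v_hat = second_moment[i] / (1 - beta2 ** step)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return [sigmoid(z) for z in theta]


# Three messages over a four-pixel "image"; the weights are illustrative only.
toy_weights = [
    [2.0, -2.0, 2.0, -2.0],
    [-2.0, 2.0, -2.0, 2.0],
    [0.0, 0.0, 0.0, 0.0],
]
prototype = optimize_prototype(toy_weights, target=0, init_logits=[0.0] * 4)
print(prototype)
```

With a neural Sender, the hand-derived gradient would be replaced by automatic differentiation, but the sigmoid parameterization and the frozen model would play the same roles.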