Language Acquisition Test for Human-Level Artificial Intelligence
Despite recent advances in many application-specific domains, we do not know how to build a human-level artificial intelligence (HLAI). We conjecture that learning from others’ experience through language is the essential characteristic that distinguishes human intelligence from the rest. Humans can update their action-value function from a verbal description alone, as if they had experienced the sequence of states, actions, and corresponding rewards first hand. In this paper, we present our ongoing effort to build an environment that facilitates research on models of this capability. In this environment, there are no explicit definitions of tasks and no rewards given for accomplishing them. Rather, the models undergo the experiences of a human infant from the fetal stage to 12 months of age. The agent should learn to speak its first words as a human child does. We expect the environment to contribute to research on HLAI.
1 Introduction
We have made a lot of progress in artificial intelligence (AI). Despite this, the limitations of the current state of the art are most apparent in robotics. When laypersons think about an AI robot, they expect to interact with it verbally and to receive many services from it, as they would from a human butler. However, we do not yet know how to program such a robot.
In this paper, we try to answer the following questions.
What is the fundamental difference between human intelligence and that of other animals?
What does it mean to understand language?
How can we build an environment for human-like learning?
We also introduce our ongoing effort to build a language acquisition environment for human-level artificial intelligence (HLAI). We explain why such an environment is required and how it differs from existing language acquisition environments. Let us begin by explaining what distinguishes human-level intelligence from the rest.
2 Level of Intelligence
Let us start our discussion with the following question:
Is an earthworm intelligent?
The answer depends on the definition of intelligence. Legg and Hutter proposed the following definition of intelligence after considering more than 70 prior definitions Legg and Hutter (2007b, a).
Intelligence measures an agent’s ability to achieve goals in a wide range of environments.
This definition is universal in that it can be applied to a diverse range of agents such as earthworms, rats, humans, and even computer systems. For biological agents, maximizing gene replication or inclusive fitness is generally accepted as the ultimate goal Dawkins (2016). Earthworms have light receptors and vibration sensors, and they move according to these sensors to avoid the sun or moles Darwin (1892). This behavior increases their chance of survival and their inclusive fitness Hamilton (1964). Therefore, we can say that earthworms are intelligent.
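For reference, Legg and Hutter (2007b) also give a formal counterpart of this informal definition. As we recall it, the universal intelligence of an agent is a complexity-weighted sum of its expected performance over all computable environments:

```latex
\Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

Here $\pi$ is the agent, $E$ is the set of computable environments with bounded total reward, $K(\mu)$ is the Kolmogorov complexity of environment $\mu$, and $V_{\mu}^{\pi}$ is the expected total reward that $\pi$ obtains in $\mu$. The weight $2^{-K(\mu)}$ gives simpler environments more influence, and the sum over $E$ captures the "wide range of environments" in the informal definition. Note that $V_{\mu}^{\pi}$ is defined through rewards delivered by the environment.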
However, there are differences in intelligence between earthworms and more advanced agents such as rats and humans. In this paper, we propose three levels of intelligence to guide AI research. Figure 1 shows a summary of this idea.
Level 1 Intelligence
In this categorization, earthworms have Level 1 intelligence, where no learning occurs at the individual level. Their central nervous system (CNS), or brain, has a hard-coded mapping from sensory input to the corresponding action. This hard-coded function is often called an instinct and is updated only through evolution Tinbergen (1951). While this is an improvement over the random behaviors of bacteria or plankton, the problem with this approach is that adaptation is very slow, because updates to the neural circuit happen only through evolution. For example, if there is an abrupt climate change due to a meteor impact, agents with Level 1 intelligence will have difficulty adapting to the new environment.
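The hard-coded mapping described above can be pictured as a fixed function from sensor readings to actions. The following is only a minimal sketch under our own assumptions: the sensor ranges, the thresholds, and the action names are hypothetical and are not taken from any earthworm model.

```python
from typing import NamedTuple


class Senses(NamedTuple):
    light: float      # hypothetical light-receptor reading in [0, 1]
    vibration: float  # hypothetical vibration-sensor reading in [0, 1]


# Hypothetical thresholds. In a Level 1 agent they are fixed for the
# whole lifetime and can change only across generations.
LIGHT_THRESHOLD = 0.5
VIBRATION_THRESHOLD = 0.5


def level1_policy(senses: Senses) -> str:
    """Hard-coded instinct: the same input always yields the same action."""
    if senses.vibration > VIBRATION_THRESHOLD:
        return "move_away_from_vibration"
    if senses.light > LIGHT_THRESHOLD:
        return "move_away_from_light"
    return "stay"
```

Nothing in this function changes with experience, which is why adaptation has to wait for evolution to produce a different function.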
Level 2 Intelligence
The next level of intelligence is individual-level learning. Relying on evolution for new rules is too slow. If an individual agent can learn new rules, such as recognizing a new type of food, its probability of surviving and spreading its genes increases. Agents with Level 2 intelligence can learn new rules during their lifetime and thus show higher intelligence than Level 1 agents. However, learning at the individual level requires two functional modules.
The first is a memory to store newly learned rules. The evolutionarily recent neocortex serves this memory function. The second module is a reward system to judge the merit of a state. We stated that the goal of a biological agent is to spread its genes. However, an individual agent cannot assess this goal directly. For example, an agent may lay eggs in a hostile environment in which no descendant will survive. Still, the agent cannot know this, because it would die before the outcome becomes apparent. Therefore, an agent with Level 2 intelligence requires a function that estimates, during the agent’s life, whether the current stimulus or state is good or bad.
We point out that the environment does not provide a reward. Instead, it is the agent that produces the reward signal, which is the agent’s estimate of the value of the current state. A dollar bill can be rewarding in some cultures but might not generate any reward for a member of a tribe who has never seen money before. As another example, when we eat three burgers for lunch, the rewards for the first and the third burger will differ, even though from the environment’s point of view they are the same object.
This view differs from the standard framework for reinforcement learning (RL), where the reward is determined by the environment. Legg and Hutter used the standard RL framework for their formal definition of universal intelligence. However, they acknowledged that a more accurate framework would consist of an agent, an environment, and a separate goal system that interprets the state of the environment and rewards the agent appropriately. Another way to resolve this conflict is to reconsider what we regard as the agent. An agent might represent a whole rat or human. But for the sake of AI research, we are mostly interested in the subset of the brain where learning occurs. Therefore, we might call this subset the agent. In that case, the environment might include the parts of the organism where learning does not happen, such as the rest of the body, the sensory organs, the reward system, and the old brain.
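To make the distinction concrete, the sketch below shows a toy agent whose reward is computed by an internal reward system rather than returned by the environment. It is an illustration under our own assumptions only: the `hunger` state, the `knows_money` flag, and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Toy agent whose reward comes from an internal reward system."""
    hunger: float = 1.0       # hypothetical internal state in [0, 1]
    knows_money: bool = True  # hypothetical learned association

    def reward(self, observation: str) -> float:
        """Return the agent's own estimate of how good the observation is."""
        if observation == "burger":
            r = self.hunger                            # value depends on current hunger
            self.hunger = max(0.0, self.hunger - 0.5)  # eating reduces hunger
            return r
        if observation == "dollar_bill":
            return 1.0 if self.knows_money else 0.0
        return 0.0


agent = Agent()
# The environment presents the same object three times,
# but the agent assigns it a different reward each time.
burger_rewards = [agent.reward("burger") for _ in range(3)]
print(burger_rewards)
```

The same observation maps to different rewards depending on the agent’s internal state and history, which a reward function owned by the environment cannot express without access to that state.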
Level 3 Intelligence
While learning with rewards is faster than relying on evolution to improve brains, an agent must experience a stimulus to learn from it, and learning from direct experience has limitations. For example, a rabbit cannot try random actions in front of a lion to learn the optimal behavior. It would be too late for the rabbit to adjust its action-value function, and the experience cannot be transferred to others. Level 3 intelligence overcomes this limitation by learning from others’ experiences. Language is a tool for learning from others. Humans’ technological achievements were possible because we can learn from others and contribute new knowledge. Isaac Newton said, “If I have seen further, it is by standing on the shoulders of Giants.” Language is the invention that enabled this. Therefore, the main feature of Level 3 intelligence is learning from others’ experiences using language.
3 Clarifying Language Skill
However, we need to clarify what we mean by learning with language. For example, dolphins are known to use vocal signals to coordinate Janik and Sayigh (2013). Monkeys have been taught sign language Arbib et al. (2008). Do dolphins and monkeys have Level 3 intelligence? Similarly, many previous works have demonstrated various aspects of language skills. Voice agents can understand spoken language and answer simple questions Kepuska and Bohouta (2018). Agents have been trained to follow verbal commands to navigate Hermann et al. (2017); Chaplot et al. (2018); Chen et al. (2019); Das et al. (2018); Shridhar et al. (2020). GPT-3 by OpenAI can generate articles, one of which was published as an op-ed in The Guardian Brown et al. (2020); GPT-3 (2020). Some models can perform multiple language tasks, as evaluated in the GLUE benchmark or decaNLP Wang et al. (2018); McCann et al. (2018). Models outperform humans in all categories except the Winograd Schema Challenge Levesque et al. (2012), where models perform slightly worse than humans Raffel et al. (2020). Do these models have Level 3 intelligence?
Language use has many aspects. In this paper, we claim that learning from others’ experiences is the essential function of language. We explain this with a simple example and then formalize it in the context of reinforcement learning.
Let’s say that you have never tried cola before. Now, for the first time in your life, you see this dark, sparkling liquid that somehow looks dangerous. You have a few available actions, including drinking it and running away. You might randomly choose to drink it. It tastes good, which rewards you. Now your action value for the same situation has changed, such that you will choose to drink it more deliberately next time. This is the change induced by direct experience.
Learning with language means that a similar change should occur in your mind when you hear someone say, “Cola is a black, sparkling drink. I drank it, and it tasted good.” Figure 3 shows this using the notation of a Markov decision process (MDP) Sutton and Barto (1998).
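The cola example can be written as a small tabular sketch. It is an illustration under our own assumptions, not the formulation in the figure: the learning rate, the state and action names, and the simplified one-step update (no bootstrapping from a next state) are hypothetical choices, and `parse_description` is a hard-coded stand-in for the language understanding that a Level 3 agent would actually have to learn.

```python
from collections import defaultdict

ALPHA = 0.5  # hypothetical learning rate


def update(q, state, action, reward, alpha=ALPHA):
    """One-step action-value update toward the experienced or reported reward."""
    q[(state, action)] += alpha * (reward - q[(state, action)])


def parse_description(sentence):
    """Stand-in for language understanding: maps one known sentence to (s, a, r)."""
    known = {
        "Cola is a black, sparkling drink. I drank it, and it tasted good.":
            ("dark_sparkling_liquid", "drink", 1.0),
    }
    return known[sentence]


# Direct experience: the agent drinks the cola and feels the reward itself.
q_direct = defaultdict(float)
update(q_direct, "dark_sparkling_liquid", "drink", 1.0)

# Verbal experience: the agent only hears about someone else's experience.
q_verbal = defaultdict(float)
update(q_verbal, *parse_description(
    "Cola is a black, sparkling drink. I drank it, and it tasted good."))

print(q_direct == q_verbal)
```

In this sketch both routes leave the action-value table in the same state, which is the property we take to characterize learning with language. All of the difficulty is hidden inside `parse_description`.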
Judged by this aspect of language use, previous environments fall short in the following respects.
Use of Rewards: Reward signals generated by the environment are sufficient for implementing Level 2 intelligence. However, for Level 3 intelligence, the reward system is also part of the agent, as shown in the previous example. Many previous environments rely on explicit rewards given by the environment when tasks are done correctly Hermann et al. (2017); Chaplot et al. (2018); Chen et al. (2019); Das et al. (2018); Shridhar et al. (2020). This results in over-fitting to those specific tasks, which makes the transfer of verbal skills challenging.
Grounded Language and Embodied Exploration: Language symbols need to bring about changes in the policy. This means that language symbols need to be grounded in the sensory input and the actions of embodied agents. Some environments that use only text lack this grounding Narasimhan et al. (2015); Côté et al. (2018).
Shallow interaction with a large number of items and a large vocabulary: Previous environments tend to pour a large number of items and words into training. However, as Smith and Slone pointed out, human infants begin by learning a lot about a few things Smith and Slone (2017). We need to build upon basic concepts before we can learn advanced ones.
Therefore, we claim that we need a new environment that addresses these shortcomings.
4 An Environment for Language Acquisition like a Human Child Does
We introduce our ongoing effort to build a Simulated Environment for Developmental Robotics (SEDRo) Pothula et al. (2020). SEDRo provides diverse experiences similar to those of human infants from the fetal stage to 12 months of age Turing (1950). SEDRo also simulates developmental psychology experiments to evaluate the progress of intellectual development in multiple domains. SEDRo contains a caregiver character, surrounding objects (e.g., toys, cribs, and walls), and the agent. The agent interacts with the simulated environment by controlling its body muscles according to its sensor signals. Interaction between the agent and the caregiver allows cognitive bootstrapping and social learning, while interactions between the agent and the surrounding objects increase gradually as the agent reaches more developed stages. The caregiver character can also interact with the surrounding objects to introduce them to the agent at the earlier developmental stages.
In SEDRo, the agent can develop verbal capacity up to that of a 12-month-old infant, who speaks first words. As a concrete example, let us review how the agent will learn the word “water.” The agent has a sensor that indicates thirst. When the sensor value exceeds a threshold, the agent chooses the crying behavior through its pre-programmed instincts. When the mother hears the crying, she investigates and brings water, which the agent drinks. At the same time, the mother says words such as “Water!” The agent therefore associates the auditory signal, the visual signal, the action sequence, and the reward generated by relieving thirst. More specifically, we conjecture that the agent will learn to predict a sequence of vectors, where the vectors encode the auditory, visual, and somatosensory signals. After enough associations have been established, the agent might say “Wada,” and the mother brings water. Please note that there are no explicit rewards in this scenario. SEDRo will support this kind of language learning.
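The interaction loop in this scenario can be sketched as follows. This is a toy illustration and not SEDRo code: the threshold value, the function names, and the caregiver’s behavior are hypothetical, and the learning step is replaced by a plain co-occurrence count rather than the sequence prediction we conjecture above.

```python
from collections import Counter

THIRST_THRESHOLD = 0.7  # hypothetical threshold on the thirst sensor


def instinct(thirst: float) -> str:
    """Pre-programmed behavior: cry when thirst exceeds the threshold."""
    return "cry" if thirst > THIRST_THRESHOLD else "rest"


def caregiver(agent_action: str):
    """Hypothetical caregiver: answers crying with water and the spoken word."""
    if agent_action == "cry":
        return {"object": "water", "utterance": "Water!"}
    return None


def simulate(thirst_levels):
    """Count how often an utterance co-occurs with the object that relieves thirst."""
    associations = Counter()
    for thirst in thirst_levels:
        response = caregiver(instinct(thirst))
        if response is not None:
            # Relief is felt inside the agent; the environment hands out no reward.
            associations[(response["utterance"], response["object"])] += 1
    return associations


associations = simulate([0.2, 0.9, 0.8, 0.4])
print(associations)
```

Only the two readings above the threshold trigger crying, so the pairing of the heard word with the delivered water is counted twice. A learning agent would replace the counter with a model that predicts the upcoming auditory, visual, and somatosensory vectors.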
Verbal speech is approximated by sparse distributed representations (SDR). Each letter of the alphabet is encoded as a 512-dimensional vector in which about 10 randomly selected bits are active. The speech signal is then represented as a sequence of these vectors, one per timestep. Noise can be added by randomly changing some of the bits.
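A minimal sketch of such an encoding is given below. The dimension and the number of active bits follow the description above. Everything else is our assumption for illustration: exactly 10 active bits per letter, a lowercase English alphabet, a seeded random codebook, and the function names, none of which are taken from the SEDRo implementation.

```python
import random
import string

DIM = 512         # vector size stated in the text
ACTIVE_BITS = 10  # approximate number of active bits per letter stated in the text


def make_codebook(seed: int = 0):
    """Assign each lowercase letter a fixed random set of active bit positions."""
    rng = random.Random(seed)
    return {ch: frozenset(rng.sample(range(DIM), ACTIVE_BITS))
            for ch in string.ascii_lowercase}


def encode(word: str, codebook) -> list:
    """Encode a word as a sequence of binary vectors, one per letter (timestep)."""
    return [[1 if i in codebook[ch] else 0 for i in range(DIM)]
            for ch in word.lower() if ch in codebook]


def add_noise(vector, n_flips: int, rng: random.Random):
    """Return a copy of the vector with n_flips randomly chosen bits inverted."""
    noisy = list(vector)
    for i in rng.sample(range(DIM), n_flips):
        noisy[i] = 1 - noisy[i]
    return noisy


codebook = make_codebook()
water = encode("water", codebook)  # five timesteps, one vector each
noisy_w = add_noise(water[0], n_flips=2, rng=random.Random(1))
```

Because only about 2% of the bits are active, two randomly chosen letters rarely share many active positions, so a small number of flipped bits usually leaves the letter recoverable by overlap with the codebook.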
The take-away messages of our paper for researchers who develop or use language acquisition environments are the following:
RL researchers: Reward signals generated by the environment are effective for implementing Level 2 intelligence. For Level 3 intelligence, however, the reward system is also part of the agent, and using prediction error or intrinsic errors can be a viable option, as researchers in language modeling do.
Language model researchers: Language symbols need to bring about changes in the policy. This means that language symbols need to be grounded in the sensory input, the actions, and the corresponding rewards of embodied agents.
References
- Arbib et al. (2008). Primate vocalization, gesture, and the evolution of human language. Current Anthropology 49 (6), pp. 1053–1076.
- Brown et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Chaplot et al. (2018). Gated-attention architectures for task-oriented language grounding. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Chen et al. (2019). Touchdown: natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12538–12547.
- Côté et al. (2018). TextWorld: a learning environment for text-based games. In Workshop on Computer Games, pp. 41–75.
- Darwin (1892). The formation of vegetable mould through the action of worms: with observations on their habits. Vol. 37, Appleton.
- Das et al. (2018). Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2054–2063.
- Dawkins (2016). The selfish gene. Oxford University Press.
- GPT-3 (2020). A robot wrote this entire article. Are you scared yet, human?. The Guardian.
- Hamilton (1964). The genetical evolution of social behaviour. II. Journal of Theoretical Biology 7 (1), pp. 17–52.
- Hermann et al. (2017). Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551.
- Janik and Sayigh (2013). Communication in bottlenose dolphins: 50 years of signature whistle research. Journal of Comparative Physiology A 199 (6), pp. 479–489.
- Kepuska and Bohouta (2018). Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home). In 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), pp. 99–103.
- Legg and Hutter (2007a). A collection of definitions of intelligence. Frontiers in Artificial Intelligence and Applications 157, pp. 17.
- Legg and Hutter (2007b). Universal intelligence: a definition of machine intelligence. Minds and Machines 17 (4), pp. 391–444.
- Levesque et al. (2012). The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
- McCann et al. (2018). The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730.
- Narasimhan et al. (2015). Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941.
- Pothula et al. (2020). SEDRo: a simulated environment for developmental robotics.
- Raffel et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67.
- Shridhar et al. (2020). ALFRED: a benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10740–10749.
- Smith and Slone (2017). A developmental approach to machine learning?. Frontiers in Psychology 8, pp. 2124.
- Sutton and Barto (1998). Introduction to reinforcement learning. Vol. 135, MIT Press, Cambridge.
- Tinbergen (1951). The study of instinct.
- Turing (1950). Computing machinery and intelligence. Mind 59, pp. 433–460.
- Wang et al. (2018). GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.