A Neural Network Architecture for Learning Word-Referent Associations in Multiple Contexts


This article proposes a biologically inspired neurocomputational architecture that learns associations between words and referents in different contexts, considering evidence collected from the literature of Psycholinguistics and Neurolinguistics. The multi-layered architecture takes as input raw images of objects (referents) and streams of word phonemes (labels), builds an adequate representation, recognizes the current context, and associates labels with referents incrementally, by employing a Self-Organizing Map which creates new association nodes (prototypes) as required, adjusts the existing prototypes to better represent the input stimuli, and removes prototypes that become obsolete or unused. The model takes into account the current context to retrieve the correct meaning of words with multiple meanings. Simulations show that the model can reach up to 78% of word-referent association accuracy in ambiguous situations and approximates well the learning rates of humans as reported by three different authors in five Cross-Situational Word Learning experiments, also displaying similar learning patterns in the different learning conditions.

Self-Organizing Maps, Cross-Situational Word Learning, Context, Learning Representations, Neurocomputational Model.


1 Introduction

Language is surely a vital and distinctive trait of human beings. Even though language acquisition by young children is an active research topic in cognitive sciences, a number of open issues persist, despite the achievements of the field. For instance, we do not know exactly how humans acquire the meaning of words, an essential part of the language acquisition process. In this article, we propose a word learning model composed of a set of neural modules, or schemes (Arbib, 2008), that simultaneously compete and cooperate to perform higher-level tasks. The model was proposed considering the evidence brought by the literature of neurolinguistics and psycholinguistics about the characteristics of the word learning capabilities displayed by humans. With that, the proposed model is able to simulate multiple statistical characteristics displayed by humans when they learn new words.

We assume that word learning may be studied disregarding the interference of other aspects of language acquisition, such as the acquisition of grammar, semantics, and pragmatics. Therefore, according to Bloom (2002), in order to learn the meaning of a word, an individual must learn three different elements: (i) the concept or meaning of the word (referent); (ii) the sound or lexical representation of the word (label); and (iii) the association between referent and label. Each of these challenging tasks will be addressed in this article.

A classic example (Quine, 1960) illustrates the difficulties that children and foreign language learners have to handle to correctly match words and referents. When a native speaker of an unknown language sees a white rabbit and pronounces “gavagai”, one might understand this as clear evidence that the word “gavagai” means rabbit. However, such a sound could also mean “white”, “furry”, “food”, “let’s go hunting” or even something completely unrelated to rabbits, such as “it is going to rain”. The expression “gavagai” could even be a composition of two or three words with their own meanings.

One possible strategy to address the problem described by Quine (1960) is known as “cross-situational word learning” (CSWL) (Yu and Smith, 2007). In this type of learning, words are not learned after a single exposure. Instead, the learning process considers information from multiple learning trials. Thus, a learner who is unable to decide unambiguously the meaning of a word after a single trial forms a tentative association that is strengthened or weakened as new evidence arrives.

Currently, we can argue that word learning requires a set of cognitive abilities that are not yet fully understood (Bloom, 2002), such as theory of mind (the ability to simulate and understand the thought of others), concept acquisition, and fast mapping (the ability to associate referents and labels after few trials, or even one). In this article, we focus on the last two abilities of this list.

Concept acquisition may be seen as the ability to recognize and group similar referents together so that the category itself (concept) could be further associated with a label. Harnad (2005) points out that “To Cognize is to Categorize” and Perlovsky (2006) describes the mind as a hierarchy of multiple layers of concept-models, from simple elements like edges or moving dots to more abstract concept-models of objects, relationships, complete scenes, and so on.

The proposed model is compatible with these views because it defines the learning tasks mentioned above as a subspace clustering problem (Kriegel et al., 2005; Bassani and Araujo, 2015; Hu and Pei, 2018), in which the cluster prototypes capture the concept-models. At the current state, the model focuses on the lower levels of the concept-model hierarchy mentioned by Perlovsky, learning the referents, labels, and their associations for concrete nouns that can be depicted in static images, such as chair, table, and pen, in their different usage contexts (basic concept-models). The model learns such elements incrementally by creating new prototype nodes as required, adjusting the existing prototypes to better represent the auditory and visual input stimuli or removing prototypes that become obsolete/unused.

To achieve this, we specify a neurocomputational architecture composed of four layers: (i) the first layer extracts the perceptions from raw visual data (the referents) and auditory data (the labels); (ii) the second layer creates a more suitable representation for labels and referents; (iii) the third layer recognizes the current context and; (iv) the fourth layer creates the associations between labels and referents in the different contexts in which they are used, thus forming the prototypes representing the basic concept-models learned by the model.

In order to evaluate the proposed model, we simulate the CSWL experiments carried out with human beings by Yu and Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013). These experiments provide sound evidence on the operation of word learning mechanisms. Any model aiming to represent the functioning of these learning mechanisms must be able to reproduce to some extent the word learning patterns described in the following paragraphs.

Yu and Smith (2007) designed experiments to evaluate the abilities of humans in acquiring correct word-referent pairings and they have found compelling evidence that adult humans are able to learn label-referent pairings through CSWL. In their experiments, the stimuli consisted of slides containing 2, 3, or 4 pictures of unusual objects paired with 2, 3, or 4 pseudowords presented in the auditory form. These artificial words were generated by a computer program using standard phonemes in English. In this case, the label-referent pairs were formed by single and unique objects randomly chosen, used in three different training conditions of ambiguity.

Figure 1: Illustration of a trial in the 4x4 condition. The pictures of four objects (referents) are shown in the monitor while the sound of four pseudowords is presented auditorily over the speakers (labels).

The training conditions differ only in the number of labels and referents simultaneously presented to the subjects. Figure 1 illustrates a 4x4 condition, in which four objects (referents) were presented simultaneously on the screen, while the sounds of four pseudowords (labels) were heard from the speakers. The results showed that the individuals were able to discover on average more than 16 out of the 18 pairs in the 2x2 condition and more than 13 out of the 18 pairs in the 3x3 condition.

Yurovsky et al. (2013) expanded the previous experiment including situations in which labels could be associated with more than one referent. They were interested in evaluating if there was competition occurring in the learning process and if it was local (among referents presented in the same trial) or global (among referents presented in different trials). Their results suggested that global competition is most likely to occur.

The computational models proposed in the literature for CSWL can be divided into two categories (Yu and Smith, 2007): the Hypothesis-Testing Models, in which the learner maintains a list of hypothesized pairings to be further confirmed or rejected due to a mutual exclusivity constraint, and the Associative Models, a basic form of Hebbian learning which strengthens associations between observed word-referent pairs.

Trueswell et al. (2013) designed experiments to compare the two hypotheses, and their results suggested that subjects did not keep track of multiple candidate meanings for each label. Hence, according to the authors, such experiments weaken the hypothesis that humans employ some kind of statistical learning of the word-referent pairings.

Current studies have focused on comparing these two modeling approaches in terms of how well they fit experimental data, but no consensus has emerged yet. For instance, Kachergis et al. (2017) found that an associative model which includes competition between familiarity and uncertainty biases reproduces better the individual and combined effects of frequency and contextual diversity on human learning. Khoe et al. (2019) found that this associative model better captures the full range of individual differences and conditions when learning is cross-situational, although the hypothesis testing approach outperforms it when there is no referential ambiguity during training.

The model proposed in this article differs from these studies by focusing on dealing with real-world data (raw images and phoneme sequences) and on employing a neural network architecture that can be used to simulate models of both categories, though in the present work the associative approach was considered.

The obtained results show that the proposed model is able to replicate the patterns of CSWL presented by humans. Additionally, the proposed model was also tested in scenarios in which there was ambiguity about the correct word-referent pairings, with more than one association. We show that the model can take into account the context to solve ambiguity and choose the correct referent for ambiguous words.

The following sections of this article are structured as follows: Section 2 discusses the Associationism theory and presents the experimental evidence on word-referent associations. Section 3 describes correlated models for language acquisition. Section 4 presents the proposed modular architecture for replicating the CSWL experiments while Section 5 and Section 6 detail the two neural network models employed in the learning tasks, LARFDSSOM, and ART2 with Context. Section 7 describes the CSWL experiments performed by Yu and Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013) along with the simulations carried out with the proposed model for replicating them. Finally, Section 8 discusses and summarizes the main conclusions drawn from the obtained results.

2 Associationism and Experimental Evidence About How Humans Learn Word-Referent Associations

Associationism is one of the most widely held theories of learning, dating back to John Locke (1700). According to it, learning is based on the sensitivity of the human brain to covariation. Richards and Goldfarb (1986) proposed that children could learn the meaning of a word by repeatedly associating its verbal label with their perceptual experience at the time that the label is used. For those perceptual properties that repeatedly co-occur with the label, the association strengthens.

We can find several pieces of evidence supporting Associationism in word learning. For instance, children’s first words often refer to things that they can see and touch; words are learned best in conditions in which an associative match would be easier to make. Additionally, the results of cross-situational word learning show that adults can learn word-referent associations with repeated co-occurrence. However, Associationism cannot explain all the observed word learning phenomena. Below, we list the most significant points collected by Bloom (2002) against a pure associationist theory of word learning.

  1. Associationism requires that label and referent are simultaneously present in the environment. However, studies show that about 30-50% of the time a word is used, young children are not attending to the object the adult is talking about (Collins, 1977; Harris et al., 1983; Bunce and Scott, 2017).

  2. Associationism predicts that before children have enough data to retrieve the right associations they would often make mapping errors unless they wait until having collected strong statistical evidence. However, it was observed that in certain situations, children can learn a new word even after a single exposure (Markson and Bloom, 1997; Frank and Goodman, 2014).

  3. Association between labels and perceptions does not explain how children learn labels of more abstract referents that they cannot see or touch. A significant number of children’s words refer to abstract conceptual categories such as “morning” or “day” (Nelson et al., 1993; Feijoo et al., 2017).

The view of the authors of this work is that the capability of statistical association is necessary, though not sufficient, for word learning, and it can serve as a basis for other higher cognitive functions. We are interested in verifying how well we can model the human word learning behavior in cross-situational word learning with a modular neural network that learns statistical correlations.

This modular network was built considering evidence collected from the literature of Psycholinguistics and Neurolinguistics, organized in a modular architecture which presents similarities to those employed in Computational Linguistics (Allen, 1994).

Below, we present the evidence that we collected from the literature, separated by field. In Section 4 we present the proposed architecture and discuss how each piece of evidence was taken into account in its specification.

2.1 Evidence from Psycholinguistics

Cross-situational word learning: There is plenty of work (Yu and Smith, 2007; Yurovsky et al., 2013; Trueswell et al., 2013; Bunce and Scott, 2017) showing that human adults can robustly figure out the correct word-referent associations in ambiguous learning situations, in which the correct mapping of a word to an intended referent cannot be guaranteed. The learning rates and patterns presented by humans in different conditions of ambiguity provide valuable information for evaluating word learning models.

Correcting feedback is not a requirement: Correcting feedback may help learning; however, children do not require it to learn word meanings. Lieven (1994) reviews works showing that there are cultures in which adults do not even speak directly to children until they are using words in a meaningful manner. This suggests a computational model considering unsupervised or reinforcement learning.

“New word, new object” preference: Studies suggest that children are biased to consider that each word is associated with a single referent (Kagan, 1981; Markman and Wachtel, 1988). Therefore, if they are presented with a new word they will prefer to associate it with a currently unlabeled referent. This is also known as “mutual exclusivity”.

Object categorization can be biased by labels: Most labels are associated not with a singular object but with a category of similar objects (that share certain properties). For instance, the word “car” refers to a set of different types of vehicles that share certain features. Plunkett et al. (2008) show that the choice of labels presented to children as names for new objects can affect how they categorize these objects, biasing them to create certain categories that they would not create otherwise. Mayor and Plunkett (2010) created a neurocomputational model that successfully reproduced this behavior for simulated data.

Different features are relevant for each category: The properties young children attend to when categorizing a novel entity depend on its type: object versus non-solid substance (Soja et al., 1991), plant or rock (Keil, 1994), real or toy monkey (Carey, 1995), animal or tool (Becker and Ward, 1991). This suggests the employment of subspace clustering methods in the categorization of items to form the referent concepts. In subspace clustering, certain attributes can be more relevant than others for each category, and an item may belong to more than one category. For instance, consider the categorization of a red hexagon. This object belongs to different categories depending on the features that are taken into account. Regarding its color, it belongs to the category of red objects, while regarding its shape it belongs to the category of hexagonal objects. Finally, it belongs to a third category when taking both features into account.
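The red hexagon example can be made concrete with a relevance-weighted distance. The snippet below is only a minimal sketch: the two-feature encoding (hue, number of sides), the relevance vectors, the threshold, and the names `weighted_distance`, `categories`, and `memberships` are all hypothetical choices made for illustration, not parameters of the proposed model.

```python
import math

def weighted_distance(x, center, relevance):
    """Euclidean distance in which each dimension is scaled by a relevance in [0, 1]."""
    return math.sqrt(sum(r * (a - b) ** 2 for a, b, r in zip(x, center, relevance)))

# Hypothetical encoding: (hue in [0, 1], number of sides / 10).
red_hexagon = (0.0, 0.6)

# Each category has a prototype and a relevance vector (illustrative values only).
categories = {
    "red objects":       {"center": (0.0, 0.3),  "relevance": (1.0, 0.0)},
    "hexagonal objects": {"center": (0.5, 0.6),  "relevance": (0.0, 1.0)},
    "red hexagons":      {"center": (0.0, 0.6),  "relevance": (1.0, 1.0)},
    "blue squares":      {"center": (0.66, 0.4), "relevance": (1.0, 1.0)},
}

def memberships(x, threshold=0.05):
    """Return every category whose weighted distance to x is below the threshold."""
    return [name for name, c in categories.items()
            if weighted_distance(x, c["center"], c["relevance"]) < threshold]

print(memberships(red_hexagon))
```

Because a dimension with zero relevance is ignored, the same item can fall into several categories at once, which is the property the paragraph above attributes to subspace clustering.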

Fast Mapping: Other studies (Carey and Bartlett, 1978; Dollaghan, 1985; Heibeck and Markman, 1987; Rice, 1990; Markson and Bloom, 1997) show that children and adults can learn word-referent associations after a few exposures (even one), without any explicit training or feedback, and even without any explicit act of naming.

Context can affect retrieved memories: Brainerd and Reyna (1998, 2008) have shown that, when subjects memorize a list of words with a shared central meaning, they are later induced to recognize words related to this central meaning as having been on the list even when they were not (false memories). These experiments suggest that the contextual meaning formed during the pattern presentations plays an important role in memorization and is taken into account during recognition (Matzen and Benjamin, 2009). This behavior was modeled and reproduced by Araujo et al. (2010) with a modular neural network.

2.2 Evidence from Neurolinguistics

Hierarchical perceptual processing: Sensory information is processed to extract information that is relevant to the individual (perceptions), through innate or self-adaptive processes, probably in inferior cortical regions such as the visual cortex (Miikkulainen et al., 2005) and auditory cortex (Pasley et al., 2012). Superior cortical areas, such as V5 and the posterior parietal cortex integrate information to form more complete perceptions (Udesen and Madsen, 1992; Born and Bradley, 2005).

Mirror neurons: Certain neurons respond to correlated perceptual information from different modalities, such as verbal, visual and motor information about the same action or event, as observed in the sensory-motor cortex (Rizzolatti and Craighero, 2004; Pulvermuller, 2005).

Context recognition: Hippocampus and amygdala keep a historical record of the input stimuli, forming a kind of context (Fletcher et al., 1997; Aggleton and Brown, 1999).

Topography-preserving input mapping: Nearby neurons respond to inputs with similar features in areas of the brain where topographic maps are formed, especially the primary motor, visual, and somatosensory cortical areas (Haykin, 1998; Spitzer, 1999; Miikkulainen et al., 2005).

3 Previous Language Acquisition Models Based on Self-Organizing Maps

Considering that children are able to acquire language without explicit feedback, several language acquisition models are based on unsupervised learning methods. Self-Organizing Maps (SOM) (Kohonen, 1982) and Adaptive Resonance Theory (ART) (Grossberg, 1976, 1976) are two of the most prominent unsupervised learning neural networks. ART was employed for modeling human behavior in the task of memorization of word lists (Pacheco, 2004; Araujo et al., 2010), while several computational models for word learning are based on SOM (Ritter and Kohonen, 1989; Miikkulainen, 1997; Plunkett et al., 1992; Plunkett, 1997; Li et al., 2004; Silberman et al., 2007; Li et al., 2007). Refer to Li and Zhao (2013) for a review of SOM-based language acquisition models.

Ritter and Kohonen (1989) applied SOM to capture the semantic structure of words. Their pioneering work showed that implicit categories in the linguistic environment can be recognized by SOM.

Guenther and Gjaja (1996) have shown that a SOM fed with formant representations of different phonemic categories can simulate the perceptual magnet effect (Kuhl, 1991), an effect characterized by a warping of the perceptual space near the center of a phonemic category, which allows certain sound categories to be considered as more similar to each other than to those patterns further away from the center.

The associative hypothesis is explicitly modeled by Hebbian learning in DISLEX, DevLex, and DevLex II models. The basic idea is that the activation of co-occurring lexical and semantic representations in each map leads to an adaptive formation of associative connections between them.

Miikkulainen (1997) introduced the DISLEX model to simulate dyslexia and aphasia. The model was the first to connect different SOMs through associative links. Each SOM represents a different type of linguistic information, such as phonological, orthographic and semantic. DISLEX has also been shown to be able to simulate patterns of bilingual language recovery in aphasic patients (Kiran et al., 2013).

Following this structure, two models, DevLex (Li et al., 2004) and DevLex II (Li et al., 2007), were proposed to simulate children’s early lexical development. Instead of employing maps with a fixed structure, in the DevLex family, new nodes are inserted in the map when required, to improve the accuracy of learning. DevLex has been shown to model patterns of lexical confusion as a function of word density and semantic similarity, simulating age-of-acquisition effects while learning a growing lexicon. DevLex II has been shown to simulate several empirical phenomena, including patterns of vocabulary spurt, the relationship between comprehension and production, fast mapping, lexical category development, and lexical overextension.

Silberman et al. (2007) employed a single-layer SOM for simulating the associations between words and concepts in a semantic network that extracts semantic information from the CHILDES database (MacWhinney, 2010). The model was able to replicate learning patterns such as the effects of semantic priming, in which a word semantically related to the information in the episodic memory is recognized faster than an unrelated word.

Mayor and Plunkett (2010) presented a model for simulating fast mapping in early word learning. Their model included two SOMs, one fed with visual input representing artificial objects and the other fed with acoustic input representing words. The connections between the two SOMs were also adjusted by Hebbian learning. The model displayed learning patterns of early lexical category development, such as the tendency to attribute to a new object a known name of another object in the same category.

Despite the notable achievements of these models, none of them was designed to replicate the CSWL experiments, which are an excellent source of data about word-referent associations. In this regard, Yu and Smith (2012) described and compared two competing types of models for CSWL: Hypothesis-Testing Models and Associative Models. In Associative Models (Yu and Smith, 2007), the representation is a large word-object matrix in which each cell contains the associative strength between one word and one object, and a basic form of Hebbian learning is employed to strengthen associations between observed word-referent pairs. In Hypothesis-Testing Models (Medina et al., 2011; Trueswell et al., 2013), the learner maintains a list of hypothesized pairings (a single hypothesis for each word) to be further confirmed or rejected due to a mutual exclusivity constraint. Both types of models were shown to be able to replicate the patterns of CSWL and the main conclusion of the authors was that it is necessary to look at the components of models to understand how they contribute to overall learning.
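To illustrate the word-object matrix used by Associative Models, the following is a minimal sketch in which the simplest possible Hebbian-style rule (adding one unit of strength per co-occurrence) is assumed. The trial data, the pseudowords, the object identifiers, and the function names `train_associative` and `best_referent` are hypothetical and are not taken from the cited models.

```python
from collections import defaultdict

def train_associative(trials):
    """Accumulate co-occurrence strength for every word-object pair seen together."""
    strength = defaultdict(float)
    for words, objects in trials:
        for w in words:
            for o in objects:
                strength[(w, o)] += 1.0  # assumed rule: one unit per co-occurrence
    return strength

def best_referent(strength, word, candidates):
    """Pick the candidate object with the highest associative strength for the word."""
    return max(candidates, key=lambda o: strength[(word, o)])

# Hypothetical 2x2 trials: each trial presents two words together with two objects,
# without indicating which word names which object.
trials = [
    (["bosa", "gasser"], ["obj_A", "obj_B"]),
    (["bosa", "manu"],   ["obj_A", "obj_C"]),
    (["gasser", "manu"], ["obj_B", "obj_C"]),
]
strength = train_associative(trials)
print(best_referent(strength, "bosa", ["obj_A", "obj_B", "obj_C"]))
```

No single trial disambiguates a word here, yet the accumulated strengths do: each word co-occurs twice with its intended object and only once with each distractor.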

Such models, however, were not modular and were not developed to work with real-world input data, such as images and sounds. This limits their ability to replicate the details of experiments carried out with humans. The next section describes the modular architecture we proposed to address those issues.

4 Proposed Modular Architecture

Figure 2 illustrates the proposed architecture, which is stratified in four layers. The first two layers are comprised of parallel modules that are specialized for each kind of stimuli (auditory or visual), while the third and fourth layers present one module each performing multisensory integration. Below we present a general description of each layer:

  1. – Perception: It extracts relevant information (perceptions) from the sensory data. The sensory-perception mapping modules present in this layer are specialized for each kind of input. The auditory module extracts phonemes from a sound (or from a text, for convenience), while the visual module extracts descriptions of interest points from image patches.

  2. – Representation: It consolidates perceptions that are distributed in space or time, creating a representation that is suitable for understanding a given stimulus. This layer contains representation modules specialized for each type of stimulus (visual or auditory). For instance, an isolated phoneme may carry little meaning; however, a sequence of phonemes could represent a word or a lemma (temporal consolidation). Similarly, in visual processing, the description of a small patch of an image may carry little meaning; however, the description of a set of patches can carry enough information to represent an object or a scene (spatial consolidation).

  3. – Context: This layer contains the context module that receives the multisensory perceptions as input, accumulates sequences of these inputs, and clusters them to form a “temporal context” that can be recognized afterward. The context recognition is important, for instance, to disambiguate the meaning of homophone/homograph words, such as mouse (animal or computer device). The recognized context is forwarded to the next layer together with the inputs received.

  4. – Association: The module in this layer associates (or integrates) the perceptions of words, visual objects and contexts. This association is achieved by means of perception clustering. Therefore, each cluster represents an association. For instance, each word can be associated with different meanings that occur in different contexts by being represented in more than one cluster. In the same way, a visual object can be associated with more than one word by being represented in more than one cluster. For instance, the object car can be associated with the words car and vehicle, in two different clusters.
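The role of the context in the association layer can be sketched with a deliberately simplified, symbolic stand-in. In the model, clusters are formed over numeric perception vectors; here each cluster is reduced to a (word, referent, context) triple, and the class `Association`, the cluster contents, and the functions `retrieve` and `words_for` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Association:
    word: str
    referent: str
    context: str

# Hypothetical clusters: "mouse" appears in two clusters, one per context, and the
# referent "automobile" appears in two clusters, one per word.
clusters = [
    Association("mouse", "small rodent", "animals"),
    Association("mouse", "pointing device", "computers"),
    Association("car", "automobile", "traffic"),
    Association("vehicle", "automobile", "traffic"),
]

def retrieve(word, context):
    """Return referents for the word, preferring clusters that match the context."""
    matches = [c for c in clusters if c.word == word]
    in_context = [c for c in matches if c.context == context]
    # Fall back to every known meaning when the context is unfamiliar.
    return [c.referent for c in (in_context or matches)]

def words_for(referent):
    """Return all words associated with a referent, across clusters."""
    return sorted(c.word for c in clusters if c.referent == referent)

print(retrieve("mouse", "computers"))
print(words_for("automobile"))
```

The fallback to all meanings for an unknown context is a design choice of this sketch, not a behavior claimed for the model.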

Figure 2: Illustration of the processing layers of the model. A - Perception Acquisition; B - Representation; C - Context formation and recognition; and D - Association and context-dependent recognition.

Figure 2 indicates the learning models employed in each module, as well as how the information flows through the whole architecture. In the following subsections, we describe each module in more detail. The learning models are described afterward.

4.1 Sensory-Perceptive Mapping Modules

In the CSWL experiments, visual and auditory stimuli are simultaneously presented to the subjects, as depicted in Figure 1. In the proposed model, these two kinds of stimuli are processed in parallel in the first layer to produce a numeric representation of the perceptions as output, as described in the following subsections.

The Auditory Sensory-Perceptive Mapping

The auditory input data consists of a stream of text representing the name of each object displayed on the screen. For instance, the string “mixer, canister, rasp, goblet” would describe the objects in Figure 1.

In order to obtain a numeric representation of the auditory data, we followed a procedure similar to that described by Araujo et al. (2010). First, we convert each word to its respective phonetic representation. This step employs the CMU Pronouncing Dictionary (CMUdict) (Lenzo, 2007). Therefore, the example above is translated into “M IH K S ER, K AE N AH S T ER, R AE S P, G AA B L AH T”, in which each phoneme is represented by its ARPAbet symbol, separated by spaces.

Afterward, each phoneme is translated into a vector of 12 real values ranging from -1 to +1 (see Table 2 in Appendix B). This numeric representation was built considering the place of pronunciation of each phoneme in the International Phonetic Alphabet (IPA) charts for vowels and consonants, encoding specific features for vowels (4 of them) and for consonants (8 of them). Therefore, when a vowel is represented, the features for consonants are set to zero, and when a consonant is represented, the features for vowels are set to zero. The rationale behind this procedure is to obtain similar representations for phonemes with similar sounds.

Finally, the representation of any sequence of words is a list of vectors, each vector describing the characteristics of one phoneme in the sequence. This list represents the perception output by the Auditory Sensory-Perceptive Mapping.
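The auditory mapping described above can be sketched as a two-step lookup. This is only an illustration under stated assumptions: the small `LEXICON` stands in for CMUdict and contains just the four transcriptions given in the text, the feature values in `VOWEL_FEATURES` and `CONSONANT_FEATURES` are invented placeholders (the actual values are those of Table 2), and placing the 4 vowel features before the 8 consonant features is an assumed ordering.

```python
# Stand-in for CMUdict: only the four transcriptions given in the text.
LEXICON = {
    "mixer":    ["M", "IH", "K", "S", "ER"],
    "canister": ["K", "AE", "N", "AH", "S", "T", "ER"],
    "rasp":     ["R", "AE", "S", "P"],
    "goblet":   ["G", "AA", "B", "L", "AH", "T"],
}

# Placeholder feature tables covering only the phonemes of "rasp".
# All numeric values are invented for illustration.
VOWEL_FEATURES = {
    "AE": [-0.5, 1.0, -1.0, 0.0],  # 4 vowel features
}
CONSONANT_FEATURES = {
    "R": [0.2, -1.0, 1.0, 0.0, 0.0, -1.0, 1.0, 0.5],   # 8 consonant features
    "S": [0.1, 1.0, -1.0, 0.0, 1.0, -1.0, 0.0, -0.5],
    "P": [-1.0, 1.0, -1.0, 1.0, 0.0, -1.0, 0.0, 0.0],
}

def encode_phoneme(phoneme):
    """Return a 12-value vector; the unused feature group is filled with zeros."""
    if phoneme in VOWEL_FEATURES:
        return VOWEL_FEATURES[phoneme] + [0.0] * 8
    return [0.0] * 4 + CONSONANT_FEATURES[phoneme]

def encode_text(text):
    """Map a comma-separated string of words to a list of phoneme vectors."""
    words = [w.strip() for w in text.split(",")]
    return [encode_phoneme(p) for w in words for p in LEXICON[w]]

vectors = encode_text("rasp")
print(len(vectors), len(vectors[0]))
```

With the placeholder tables, only words whose phonemes are all listed can be encoded; any other phoneme raises a `KeyError`, which makes the missing entries explicit rather than silently guessing values.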

The Visual Sensory-Perceptive Mapping

The extraction of visual perceptions consists of detecting and describing numerically the parts of the object present in the image. In this article, we follow the literature of Unsupervised Object Discovery (Weber et al., 2000; Tuytelaars et al., 2010; Kinnunen et al., 2012), and we use the Scale Invariant Feature Transform (SIFT) to detect Points of Interest (POIs) and describe each POI as a vector of 128 values (Lowe, 1999), called “POI descriptor”. These POI descriptors are normalized by an L2 normalization.

In this module, each object on the screen is represented by a list of descriptors of the POIs detected and described by SIFT. For instance, in the 4x4 condition exemplified in Figure 1, we have four objects on the screen that will result in four lists of POI descriptor vectors, one list per object. These lists represent the perception output by the Visual Sensory-Perceptive Mapping.
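The L2 normalization applied to each POI descriptor can be written in a few lines. The sketch below assumes a descriptor is available as a plain list of 128 numbers; the synthetic `descriptor` is a stand-in for the output of a SIFT implementation, and the function name `l2_normalize` is hypothetical.

```python
import math

def l2_normalize(descriptor):
    """Scale a descriptor to unit Euclidean length (all-zero input is returned as is)."""
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm == 0.0:
        return list(descriptor)
    return [v / norm for v in descriptor]

# Synthetic stand-in for one 128-value SIFT descriptor.
descriptor = [float(i % 7) for i in range(128)]
unit = l2_normalize(descriptor)

# The normalized descriptor keeps its 128 dimensions and has length close to 1.
print(len(unit), math.sqrt(sum(v * v for v in unit)))
```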

The outputs of both Sensory-Perceptive Mapping modules in Layer I are, then, fed as inputs to the respective Representation Modules in Layer II.

4.2 Representation Modules

A feature vector produced by either of the modules described above, considered in isolation, is not enough to identify the auditory or visual elements. For instance, one phoneme is not enough to identify a word; analogously, the descriptor of one POI of an image cannot identify an object. Therefore, it is necessary to compose the information from several feature vectors to properly describe an element of interest, thus allowing its recognition.

The basic idea employed in this module is to build a Bag-of-Features (BoF) representation, by determining and stringing the features distributed in space or time. This approach was used for Unsupervised Visual Object Discovery (UVOC) from Images by Tuytelaars et al. (2010) and Kinnunen et al. (2012). It derives from the Bag-of-Words (BoW) approach, a way to represent text (Salton and McGill, 1986) for categorization tasks. The BoF approach consists of two steps: first, similar features are clustered to create a dictionary of features called “codebook”, in which the number of clusters determines the size of the feature vector produced to represent the objects. After creating this dictionary, the objects are described by counting the number of features mapped to each cluster, thus forming a histogram of occurrence, which is usually normalized.
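The second BoF step (mapping features to codebook entries and building a normalized histogram) can be sketched as follows. The codebook is assumed to be given, since the clustering step that produces it is not shown here. The two-dimensional toy features, the nearest-center assignment by squared Euclidean distance, the L1 normalization, and the names `nearest` and `bag_of_features` are illustrative assumptions.

```python
def nearest(feature, codebook):
    """Index of the codebook entry closest to the feature (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(feature, codebook[k])))

def bag_of_features(features, codebook):
    """Count how many features fall in each cluster and normalize the histogram."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest(f, codebook)] += 1
    total = sum(counts)
    # L1 normalization is one common choice; other normalizations are possible.
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy codebook with three cluster centers and four features from one object.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.2), (0.1, 0.9)]
histogram = bag_of_features(features, codebook)
print(histogram)
```

The histogram length equals the number of codebook entries regardless of how many features an object produces, which is what makes objects with different numbers of POIs (or words with different numbers of phonemes) comparable.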

In Tuytelaars et al. (2010), several clustering methods and types of histogram normalization were evaluated. The authors concluded that when there is one object category per image, even k-means yields good results, being outperformed only by spectral clustering. Kinnunen et al. (2012) considered SOM to be a viable alternative clustering method for BoF. The authors obtained results similar to those presented by Tuytelaars et al. (2010). However, they found SOM to be more robust to the type of normalization applied to the histogram.

Instead of the traditional SOM, we employ LARFDSSOM in the representation layer to generate the codebook. LARFDSSOM is a suitable method for this task because it is capable of subspace clustering and it employs a locally weighted distance metric to adjust the relevances of the input dimensions. This is an important property when the input data present high dimensionality, since it is able to identify, for instance, which kinds of image patches are relevant for determining each object category and its associated phonetic representation. A detailed description of LARFDSSOM is provided in Section 5.

The representation module maps were pre-trained to learn a codebook, forming 28 clusters in the phonetic representation map and 37 clusters in the visual representation map. This training has occurred in advance since these maps represent the previous knowledge that each individual has about the phonetic structure of its native language and about the basic perceptual elements necessary to recognize objects.

4.3 Context Module

This module should associate a context with each newly received input, so as to distinguish the same stimulus presented under distinct contexts and also to approximate different inputs when presented in similar contexts. In the brain, this role is played by the hippocampus, and several recurrent connections are observed in the cortical regions of memorization; recurrent neural networks therefore seem to be a suitable approach. Hence, we applied the ART2 with Context described in Section 6.

The visual and auditory representations are given as input to the ART2 with Context, which recognizes the current context or creates a new context if necessary. The outputs of the context module consist of the visual and auditory inputs, unchanged, associated with the context representation recognized by ART2 with Context and stored by its context units, UC.

4.4 Association Module

The Association Module takes as input the three outputs of the context module (visual, auditory, and contextual information) and associates them. In this article, this task is also carried out by a LARFDSSOM. The map computes the activation of all existing nodes, and the node with the highest activation, the winner node, represents the best association found. If its activation is above the threshold parameter, $a_t$, this node is updated to slightly modify the previous association. Otherwise, a new node is inserted in the map to represent the new association as it is presented in its inputs.

It is worth pointing out that, as the nodes on the map are updated, they learn which input dimensions are relevant. This allows the nodes to take into account only the aspects of the visual, auditory and contextual information that present a certain level of correlation. For instance, if a word occurs frequently with the same sound in several different contexts, the node can learn that the context is irrelevant for this association. In another example, if certain aspects of the image correlate with a certain sound while others do not, the uncorrelated aspects are taken as irrelevant.

In our simulations, the LARFDSSOM was initialized with a single neuron randomly positioned in the input space, and no limit was applied to the number of nodes created, so that the network could grow as much as required to represent the associations found. The output of the association module is the activation of the winner node. If this value is above the threshold parameter, $a_t$, it indicates that the pattern presented by the inputs of the network was recognized; thus, the visual, contextual, and auditory information are considered associated, and the higher this value, the stronger the association made by the map. This allows us to compare object-sound associations in different contexts and to identify the strongest association.

During the recognition phase of the cross-situational word-learning simulations, all pairings of objects and sounds are presented as input to the model, and the pair with the highest activation is considered the strongest association made by the network.

4.5 How the Evidence was Taken Into Account

Each piece of evidence collected in the literature and described in Sections 2.1 and 2.2 was somehow taken into account in the proposition of the architecture, as indicated below:

Cross-situational word learning: The proposed model was designed to replicate the CSWL experiments, while keeping the main aspects of the structure of previous SOM-based language acquisition models.

Correcting feedback is not a requirement: The proposed model was developed based on unsupervised learning models, therefore it does not require correcting feedback.

“New word, new object” preference: Though this was not evaluated in our experiments, the similarity-based competition employed in the learning model used in the association layer (LARFDSSOM) causes stimuli that are significantly different from what was previously seen (novel stimuli) to be stored in new association nodes.

Object categorization can be biased by labels: The proposed architecture was specially designed to take this into account by making both labels and referents as inputs to the association layer. This allows labels to affect the categorization of referents, by making their representations more similar/different. This is also true for the contextual information.

Different features are relevant for each category: This is a feature of LARFDSSOM, which learns the relevance of each input dimension for each category during the self-organization process.

Fast Mapping: LARFDSSOM can learn new associations in one shot.

Context can affect retrieved memories: In the proposed architecture, the current context is recognized and affects the information stored and retrieved, since it is part of the representation sent for the association layer.

Hierarchical perceptual processing: This inspired the layered architecture proposed, which takes raw sensory data as input, extracts perceptions, converts it to a more suitable representation which is fed to the context formation layer, and finally, forwarded to the association layer.

Mirror neurons: The nodes in the association layer perform the multisensory integration and can be activated by information of different modalities, similarly to mirror neurons.

Topographic-preserving input mapping: This inspired the employment of a SOM-based model with a topographic-preserving characteristic in layers B and D.

The following two sections provide details about the implementation of LARFDSSOM (employed in the representation and association layers) and ART2 with Context (employed in the context layer). All the source code and the datasets produced in the perception layer are available online (footnote 2).

5 Subspace Clustering with Self-Organizing Maps

The Self-Organizing Map (SOM), proposed by Kohonen (1982), is a neural network trained with unlabeled data (unsupervised learning). It maps high-dimensional data onto a lower-dimensional (usually two-dimensional) grid of nodes (or neurons), compressing information while preserving the topological relationships of the original data.

The following characteristics of SOM are worth highlighting here:

  • It creates an abstraction and a simplified representation of the input data distribution (Haykin, 1998). Each node can be seen as a prototype representing similar input data.

  • Its topological properties correlate with what is observed in the sensory processing regions of the brain, where the input stimuli are represented in topologically ordered neural maps (Miikkulainen et al., 2005). In particular, sensory inputs such as tactile (Kaas et al., 1983), visual (Hubel and Wiesel, 1962, 1977), and acoustic (Suga, 1985) inputs are mapped to different areas of the cerebral cortex in a topologically orderly manner.

  • SOM-based models were applied to a variety of problems involving sensory processing, including voice recognition and image processing (Kangas, 1991; Venkateswarlu and Kumari, 2011; Abdelsamea et al., 2015; Chen et al., 2017).

These characteristics have made SOM a good candidate for modeling the processing of perceptions. However, as we mentioned in the previous section, traditional clustering algorithms (SOM included) are not adequate to create abstract representations in the form of perceptions, because they weight all input dimensions equally and because they map each input stimulus to a single cluster. These limitations prevent SOM from being able to correctly cluster this kind of data and create prototypes that represent the several possible abstractions associated with the same stimulus, as in the example of the red hexagon given above. Therefore, other SOM-based subspace clustering methods that address these limitations are considered here.

The Dimension Selective Self-Organizing Map (DSSOM) (Bassani and Araujo, 2012) was one step towards making SOM adequate for subspace clustering. By using a weighted Euclidean distance (Equation 1) to compare samples and prototypes, each grid node is able to adjust the relevance it gives to each dimension when the winning node is determined. Thus, the model allows the weight of some dimensions to be reduced even to zero, so that these dimensions do not influence the selection of the data clustered by a given node. The adjustment of these weights is done adaptively during the self-organization process.


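$$D_\omega(x, c_j) = \sqrt{\sum_{i=1}^{m} \omega_{ji}\,(x_i - c_{ji})^2} \qquad (1)$$

where x is an input stimulus, $c_j$ is the $j$-th prototype on the map, $\omega_{ji}$ is the weighting factor that the $j$-th prototype applies to the $i$-th input dimension, and $m$ is the number of input dimensions.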

These weighting factors are estimated from the variance of the input patterns clustered by each node on the grid. The higher the variance, the lower its weighting factor is. Moreover, DSSOM allows more than one node to win for a given input stimulus, so that, nodes that apply a set of weighting factors different from those considered by the previous winners can also group that stimulus.

DSSOM presented solid results, comparable to or better than previous subspace clustering methods from the data mining field. However, the fixed grid topology of DSSOM requires strong prior knowledge about the data, and may not adequately represent the neighborhood topology of clusters that live in different subspaces. This issue was addressed in the map described in the next section, which is the method that we have chosen to employ in the proposed model for learning word-referent associations.

5.1 Local Adaptive Receptive Field Dimension Selective Self-Organizing Map - LARFDSSOM

LARFDSSOM (Bassani and Araujo, 2015) preserves the main characteristics of SOM and DSSOM. However, in LARFDSSOM the nodes are not organized in a fixed grid. Instead, it introduces a time-varying structure with a mechanism that inserts new nodes into the map whenever the winner node is not similar enough to the current input pattern. In order to achieve this, it defines an activation function (Equation 2), inversely related to the distance presented in Equation 1, and a threshold parameter ($a_t$). When the activation of the winner node in response to an input pattern is below this threshold, a new node is inserted into the map at the position of the input pattern.


where $\epsilon$ is a small value to avoid division by zero, $\|\omega_j\|$ is the norm of the relevance vector, and $D_\omega(x, c_j)$ is the weighted distance function shown in Equation 1.
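The relation between the weighted distance and the activation can be illustrated with the sketch below. The exact algebraic form is an assumption made for illustration: the activation is taken as the relevance norm divided by the sum of that norm, the weighted distance, and a small constant, and the norm is taken as the sum of the (non-negative) relevances. The function names are hypothetical.

```python
import math

def weighted_distance(x, c, omega):
    """Relevance-weighted Euclidean distance between input x and prototype c."""
    return math.sqrt(sum(w * (xi - ci) ** 2 for xi, ci, w in zip(x, c, omega)))

def activation(x, c, omega, eps=1e-9):
    """Assumed form: norm / (norm + distance + eps), so it decays with distance."""
    rel = sum(omega)  # assumption: L1 norm of the non-negative relevance vector
    return rel / (rel + weighted_distance(x, c, omega) + eps)

prototype = [1.0, 2.0]
print(activation([1.0, 2.0], prototype, [1.0, 1.0]))    # close to 1 at the prototype
print(activation([4.0, 6.0], prototype, [1.0, 1.0]))    # lower for a distant input
print(activation([1.0, 100.0], prototype, [1.0, 0.0]))  # second dimension is ignored
```

With a relevance of zero, a dimension does not contribute to the distance, which is what allows a node to respond to a stimulus that differs from its prototype only in irrelevant dimensions.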

The relevance vector is computed as an inverse function of the average distance of each node to the input patterns that it clusters, $\delta_j$, i.e., the greater the average distance in a dimension, the smaller the respective relevance (Equation 3).


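where $e$ is the learning rate, given by $e = e_b$ if $j$ is the winner node and $e = e_n$ if $j$ is a neighbor of the winner node; $\delta_{j\max}$, $\delta_{j\min}$, and $\delta_{j\,mean}$ are, respectively, the maximum, the minimum, and the mean of the components of the distance vector $\delta_j$; and $s$, $\beta$, $e_b$, and $e_n$ are parameters.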
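A node update of this kind can be sketched as follows. The formulas in the code are assumptions chosen to match the description (a moving average of the per-dimension distance, and a logistic function that maps larger average distances to smaller relevances); they are not taken verbatim from Equation 3. The name `update_node` and the argument names are hypothetical.

```python
import math

def update_node(c, delta, x, e, beta, s):
    """Return updated (c, delta, omega) for one node; e is the winner or neighbor rate."""
    # Assumed moving average of the per-dimension distance to the clustered inputs.
    delta = [(1 - e * beta) * d + e * beta * abs(xi - ci)
             for d, xi, ci in zip(delta, x, c)]
    d_max, d_min = max(delta), min(delta)
    d_mean = sum(delta) / len(delta)
    if d_max == d_min:
        omega = [1.0] * len(delta)  # no dimension stands out: all fully relevant
    else:
        # Assumed logistic mapping: above-average distance gives low relevance.
        omega = [1.0 / (1.0 + math.exp((d - d_mean) / (s * (d_max - d_min))))
                 for d in delta]
    # Move the prototype towards the input.
    c = [ci + e * (xi - ci) for ci, xi in zip(c, x)]
    return c, delta, omega

c, delta, omega = update_node([0.0, 0.0], [0.0, 0.0], [0.0, 1.0], e=0.5, beta=0.5, s=0.1)
print(c, delta, omega)
```

In this toy call the input differs from the prototype only in the second dimension, so that dimension receives a relevance close to zero while the first stays close to one.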

Initialize parameters $a_t$, $lp$, $\beta$, ... ;
Initialize the map with one node $j$, with $c_j$ initialized at the first input stimulus, $\omega_j \leftarrow 1$, and $wins_j \leftarrow 0$;
Initialize the variable $nwins \leftarrow 1$;
foreach input stimulus (x) do
       Present x to the map;
       Compute the activation of all nodes (Equation 2);
       Find the winner $s$ with the highest activation ($a_s$);
       if $a_s < a_t$ and $N < N_{max}$ then
              Create new node $j$ setting: $c_j \leftarrow x$, $\omega_j \leftarrow 1$, and $wins_j \leftarrow lp \times nwins$;
              Setup the neighborhood of node $j$;
       else
              Update the vectors c, $\delta$, and $\omega$ of the winner and of its neighbors (Equation 3);
              Set $wins_s \leftarrow wins_s + 1$;
       end if
       if $nwins = maxcomp$ then
              Remove nodes with $wins_j < lp \times maxcomp$;
              Update the connections of the remaining nodes;
              Reset the number of wins of the remaining nodes: $wins_j \leftarrow 0$;
              $nwins \leftarrow 0$;
       end if
       $nwins \leftarrow nwins + 1$;
end foreach
Algorithm 1 Self-Organization Phase
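The insertion and removal logic of Algorithm 1 can be illustrated with the simplified sketch below. It is not the full method: relevances and neighborhood connections are omitted, only the winner is updated, the activation is a stand-in based on the plain Euclidean distance, and the rule that gives a new node an initial win count proportional to the elapsed competitions is an assumption. All names are hypothetical.

```python
import math

def inverse_distance_activation(x, c):
    """Stand-in activation in (0, 1]; an assumption, not Equation 2."""
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))))

def organize(stimuli, a_t, lp, maxcomp, lr, activation=inverse_distance_activation):
    """Grow, update, and prune prototypes; returns the list of remaining nodes."""
    nodes = [{"c": list(stimuli[0]), "wins": 0}]
    nwins = 1
    for x in stimuli:
        winner = max(nodes, key=lambda n: activation(x, n["c"]))
        if activation(x, winner["c"]) < a_t:
            # Winner not similar enough: insert a node at the input position.
            nodes.append({"c": list(x), "wins": lp * nwins})
        else:
            winner["c"] = [ci + lr * (xi - ci) for ci, xi in zip(winner["c"], x)]
            winner["wins"] += 1
        if nwins == maxcomp:
            # Periodically drop nodes that clustered too few inputs.
            kept = [n for n in nodes if n["wins"] >= lp * maxcomp]
            nodes = kept if kept else [max(nodes, key=lambda n: n["wins"])]
            for n in nodes:
                n["wins"] = 0
            nwins = 0
        nwins += 1
    return nodes

two_groups = [[0, 0], [0.1, 0], [5, 5], [5, 5.1], [0, 0.1], [5.1, 5]]
print(len(organize(two_groups, a_t=0.5, lp=0.1, maxcomp=100, lr=0.1)))    # 2
with_outlier = [[0, 0], [0.1, 0], [9, 9], [0, 0.1], [0.1, 0.1], [0, 0]]
print(len(organize(with_outlier, a_t=0.5, lp=0.3, maxcomp=6, lr=0.1)))    # 1
```

In the second call, the node created for the isolated stimulus wins too few competitions and is removed at the periodic check.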

Also, in LARFDSSOM, nodes that do not cluster a minimum percentage ($lp$) of the input patterns are periodically removed from the map (every $maxcomp$ competitions). Additionally, the neighborhood connects only nodes that take into account a similar subset of the input dimensions.

The operation of the map comprises three phases: organization, convergence, and clustering. In the organization phase, the nodes compete to cluster each new input pattern, so that the winner and its neighbors are updated to approximate it, and new nodes are created whenever the most activated node does not reach the threshold $a_t$. The convergence phase is similar to the organization phase, with the exception that node insertion is not allowed. Finally, in the clustering phase, the consolidated map is no longer changed and is used only for clustering.

In this article, for simulating the learning process of a subject going through the CSWL experiments, we employ the organization phase (shown in Alg. 1) without limiting the number of nodes in the map and with nodes being updated as per Equation 3, while the convergence phase is not used.

The clustering phase (shown in Alg. 2) is used for testing what the simulated subjects have learned to recognize.

foreach input pattern (x) in the dataset do
       Present x to the map;
       Compute the activation of all nodes (Equation 2);
       Find the winner $s$ with the highest activation ($a_s$);
       if $a_s \geq a_t$ then
              repeat
                     Assign x to the cluster of the winner node $s$;
                     Find the next winner $s$ disregarding the previous winners;
              until $a_s < a_t$;
       else
              x was not recognized;
       end if
end foreach
Algorithm 2 Clustering with LARFDSSOM
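The multiple-winner assignment of Algorithm 2 can be sketched as below. The sketch ranks the nodes once by activation and keeps those that reach the threshold, which yields the same assignments as repeatedly searching for the next winner. The activation used in the example is a stand-in based on the plain Euclidean distance, and the function name `cluster` is hypothetical.

```python
import math

def cluster(x, prototypes, activation, a_t):
    """Indices of all nodes whose activation reaches a_t, strongest first."""
    ranked = sorted(range(len(prototypes)),
                    key=lambda j: activation(x, prototypes[j]), reverse=True)
    assigned = []
    for j in ranked:
        if activation(x, prototypes[j]) < a_t:
            break  # all remaining nodes are weaker still
        assigned.append(j)
    return assigned  # an empty list means x was not recognized

def toy_activation(x, c):
    """Stand-in activation (an assumption, not Equation 2)."""
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))))

prototypes = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]
print(cluster([0.05, 0.0], prototypes, toy_activation, a_t=0.8))  # [0, 1]
print(cluster([9.0, 9.0], prototypes, toy_activation, a_t=0.8))   # []
```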

6 Context Formation and Recognition with ART2

Since words can have different meanings in different contexts, taking context into account when recognizing words is a fundamental task in word learning. In this work, we employ for this task a neural network called ART2 with Context (Araujo et al., 2010), based on ART2 (Carpenter and Grossberg, 1987), which is a model from Adaptive Resonance Theory. This unsupervised incremental learning model is capable of grouping patterns, associating stimuli of different natures, and adjusting the degree of similarity of the grouped patterns; it balances plasticity and stability and presents some biological plausibility. Araujo et al. (2010) adapted ART2 by inserting context units with recurrent connections. These context units aim to store a history of the input patterns and make this context affect both the pattern search and the recognition phases.

The ART2 with Context (Figure 3) presents the same input ($F_1$) and output ($F_2$) layers of ART2; however, context units UC and PC were added to the model. Each UC unit contains a kind of average of the input values: it stores the intensity of the occurrence of a property in the input pattern, in the internal representation of the ART2 network, i.e., properly rescaled and with noise suppression. Each UC unit receives two connections: the new input pattern from $F_1$ and a feedback from itself with its own previously stored value. This feedback has a parameter which controls the weight of the previous value of each unit. At the end of the presentation of a sequence of stimuli, it is expected that the context formed and stored in the UC units approximates an average representation of similar stimuli present in the sequence. The PC units serve as an interface between the $F_1$ and $F_2$ layers, and they have a role equivalent to the P units of the original ART2 model.
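The effect of the recurrent feedback on a context unit can be illustrated with the sketch below. It assumes, for illustration only, that the stored value is a convex combination of the previous value and the new pattern (a leaky average); the actual update and rescaling used by ART2 with Context may differ, and the names `update_context` and `feedback_weight` are hypothetical.

```python
def update_context(context_prev, pattern, feedback_weight):
    """Blend the stored context with the new pattern (assumed leaky average)."""
    return [feedback_weight * c + (1.0 - feedback_weight) * v
            for c, v in zip(context_prev, pattern)]

# Repeated presentations of similar stimuli pull the context towards them.
context = [0.0, 0.0]
for _ in range(50):
    context = update_context(context, [1.0, 0.0], feedback_weight=0.9)
print(context)
```

A feedback weight close to one makes the context change slowly, so that it reflects the recent history of stimuli rather than only the last one.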

Figure 3: Architecture of ART2 with Context, composed of two layers, where $F_1$ is the input layer and $F_2$ is the output layer, and of the context units: UC, with a feedback loop, responsible for creating the context representation, and the PC units, serving as an interface between the $F_1$ and $F_2$ layers.

Algorithm 3 presents all the steps needed for training the ART2 with Context included in the proposed model, for which the parameters are:

  • $n$: number of nodes in the $F_1$ layer. It is equal to the number of semantic features.

  • $a$ and $b$: fixed weights in $F_1$. We set $a = 10$ and $b = 10$.

  • $c$: fixed weight used by the reset condition, in the [0,1] interval.

  • $d$: activation of the winner unit in $F_2$, within the [0,1] interval. The value 0.9 was used.

  • $e$: parameter to avoid division by zero when the norm of a vector is zero. The value 0.0001 was used.

  • $\theta$: parameter of noise suppression, typically $1/\sqrt{n}$. The input vector components with values lower than $\theta$ will have their values taken to zero.

  • $\alpha$: learning rate. Used value: 0.001.

  • $\rho$: surveillance (vigilance) parameter, which determines the number of groups to be formed. Values in the [0.7,1] interval produce effective control over the number of groups formed. Used value: 1.

  • Maximum number of epochs: we set it to 1.

  • Maximum number of iterations. Used value: 1.

  • Weight of the context, in the interval [0,1]. Used value: 0.9.

  • Influence rate of the contextual information over the reset mechanism, inside the [0,1] interval. Used value: 0.

  • Effect equivalent to $d$, used for the context units. Used value: 0.9.

  • Context learning rate. Used value: 0.8.

The lowercase variables indexed by $i$ are the $i$-th elements of the vectors P, Q, R, S, U, W, X, Y, UC, and PC. J: the node in $F_2$ with the highest activation. reset: indicates whether the winner node in the $F_2$ layer cannot learn the presented pattern. T: the top-down matrix of weights. B: the bottom-up matrix of weights. The function $f$ is defined as: $f(x) = x$ if $x \geq \theta$, and $f(x) = 0$ otherwise.

1 Initialize the parameters listed above;
2 for each epoch do
3        for each input pattern do
4               Initialize activations in the $F_1$ layer:
5               ; ;
6               Update activations in the $F_1$ layer:
7               ;
8               ;
9               Propagate values to UC:
10               ;
11               Rescale the context units:
12               ;
13               Propagate the context values to PC:
14               ;
15               Update activations in the $F_2$ layer:
16               ;
17               ;
18               while reset do
19                      Find the unit J in $F_2$ with the highest activation:
20                      ;
21                      if the activation of unit J is -1 then
22                             ;
23                             ;
25                      end if
26                     if reset then
27                             ;
28                             ;
29                             if the winner is not similar enough to the input (test with $\rho$) then
30                                    ;
32                            else
33                                    ;
34                                    ;
36                             end if
38                     else
39                             for each iteration do
40                                    Update the weights of the winner unit J:
41                                    ;
42                                    ;
43                                    ;
44                                    ;
45                                    Rescale the updated vectors:
46                                    ; ;
47                                    ; ;
48                                    Update activations in the $F_1$ layer:
49                                    ; ; ;
50                                    ; ;
51                                    ;
53                             end for
55                      end if
57               end while
59        end for
61 end for
Algorithm 3 Training ART2 with Context.

The training algorithm (Algorithm 3) consists of the following. After the variable initializations (line 1), a loop is executed for each training epoch. For each input pattern, the activations of the units in layers U, W, P, Q, X, and V are initialized (line 5) and updated to reflect the effects of the input pattern (lines 7 and 8). Then the values computed are propagated to the context units UC (line 10), and the new values are rescaled (line 12) and copied to the PC units (line 14). Next, the values stored in the P and PC units are propagated to the $F_2$ layer, where a competition occurs among the groups. Each group responds with an activation (lines 16-17), and the loop started in line 19 repeats until a winner group is defined and updated. First, the group with the highest activation is found (line 20). If this group was disabled (activation = -1), all groups were deactivated because of a reset signal, and a new group is created (lines 21-24). Otherwise, it is verified whether the winner group is similar enough to the presented pattern (using the $\rho$ parameter). If not, the group is disabled and a reset occurs so that another group can be found (lines 25-33). If the winner group is considered similar enough to the input pattern, it is approximated to it (lines 36-40), the updated vectors are normalized (lines 42 and 43), and finally the activations in the $F_1$ layer are updated (lines 45-47).

Pattern recognition is done in a way very similar to the network training. The main difference is that there is no storage in the $F_2$ layer. Moreover, the $\rho$ parameter is adapted: it starts with an initial value close to 1 and is slightly reduced until a group is found in the $F_2$ layer.

The next section describes the simulations carried out with the proposed model.

7 Simulations

The simulations aimed to reproduce the CSWL experiments available in the literature, following the methodology introduced by Yu and Smith (2007) and further extended by others. Section 7.2 describes the dataset used in the simulations. Sections 7.3 to 7.7 briefly describe the CSWL experiments considered in this work and the respective simulations carried out with the proposed model; the details of these experiments are given in Appendix A. Notice that we employ the term “Experiment” to refer to the actual experiments carried out by Yu and Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013) with humans. The term “Simulation” refers to the simulations carried out with the proposed model, aiming to replicate each particular experiment. Each of these subsections is divided into the following parts: first, (i) a description of the experiment conducted by the authors is presented; then, (ii) the procedures used to simulate the experiment are described; and finally, (iii) the results produced by the simulations are presented in comparison with the results obtained in the original experiment.

Furthermore, in Section 7.8, the model with the adjusted set of parameters is evaluated in a last simulation, which aims to cover a part of the model that was not evaluated in the previous experiments: the Context Module and its role in providing the correct meaning for words with different meanings in different contexts. Since no work with this objective was found in the literature, an experimental design is first proposed to evaluate this ability in individuals; then, the results produced by the simulations of this experiment are presented. Figure 4 illustrates the workflow for simulating the experiments. We start with the description of the parametric setup.

Figure 4: Workflow of simulations: the process for generating the visual and auditory representations is illustrated in the Dataset preparation box. This process was executed only once and the same representations were used in all experiments. The Experiment simulation box illustrates the process for simulating an experiment. This process was repeated for each experiment with different inputs, selected from the representations dataset according to the experiment design.

7.1 Parameter Adjustment

The parameters of each module of the proposed model were adjusted only once to minimize the differences between the results of all experiments and their respective simulations. The exploration of possible parameter values was made by employing Latin Hypercube Sampling (LHS) (Saltelli et al., 2009), and the best parameter set is presented in Table 1.
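A basic form of Latin Hypercube Sampling can be sketched as follows. This is a generic illustration of the technique, not the sampling code used for the simulations: each parameter range is divided into as many equal strata as there are samples, one value is drawn per stratum, and the strata are shuffled independently per parameter. The parameter ranges in the example are hypothetical.

```python
import random

def latin_hypercube(n_samples, bounds, rng=None):
    """Return n_samples parameter tuples, one value per stratum in every dimension."""
    rng = rng or random.Random()
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        # One random point inside each of the n_samples strata of this dimension.
        points = [low + (k + rng.random()) * width for k in range(n_samples)]
        rng.shuffle(points)  # decouple the strata across dimensions
        columns.append(points)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

# Hypothetical ranges for two parameters, e.g. a threshold and a learning rate.
samples = latin_hypercube(5, [(0.9, 1.0), (0.0, 0.5)], rng=random.Random(0))
for s in samples:
    print(s)
```

Each candidate parameter set produced this way would then be scored by running the simulations and comparing their results with the experimental ones.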

The dataset used in the experiments and the way that each stimulus was presented to the model is detailed in the next section.

Table 1: Best parameter values obtained with the LHS adjustment.

Parameter                                        Value

Visual Representation Module – LARFDSSOM
  Activation threshold                           0.985
  Lowest cluster percentage                      0.15%
  Relevance rate                                 0.10
  Max competitions                               0.021
  Winner learning rate
  Neighbors learning rate
  Relevance smoothness                           0.007581760
  Connection threshold                           0.50

Auditory Representation Module – LARFDSSOM
  Activation threshold                           0.935
  Lowest cluster percentage                      0.001%
  Relevance rate                                 0.10
  Max competition                                2
  Winner learning rate                           0.10
  Neighbors learning rate
  Relevance smoothness                           0.00394
  Connection threshold                           0.50

Context Module – ART2 With Context
  Fixed weight in F1                             10
  Fixed weight in F1                             10
  Reset weight condition                         0.10
  Winning Unit Activity in F2                    0.9
  A parameter to avoid division by zero          0.0001
  Noise suppression parameter                    0.0739221
  Learning rate                                  0.8
  Surveillance parameter                         0.999
  Number of Epochs                               1
  Number of Iterations                           1
  Backpropagation context parameter              0.90
  Context influences above reset mechanism       0.0002
  Winner Unit Activity in F2 for the context     0.9
  Context learning rate                          0.80

Association Module – LARFDSSOM
  Activity threshold                             0.999
  Lowest cluster percentage                      17.5211%
  Relevance rate                                 0.870879
  Maximum competition                            10000
  Winner learning rate                           0.465091
  Neighbors learning rate                        0.0134102
  Relevance smoothness                           1.31357
  Connection threshold                           0.986745

7.2 The Real World Object Image and Label Dataset

In order to simulate the stimuli provided to the participants in the experiments of Yu and Smith (2007), we used the names of 18 objects commonly found at home (armoire, bed, bowl, canister, chair, clock, computer, cooker, cup, desk, door, dresser, fork, knife, refrigerator, sofa, spoon, and telephone). In addition, 18 object images associated with these names were obtained from Google Image Search ®, using the respective word as the search term.

Figure 1 displays a sample of the object images collected. The complete dataset is available online (see footnote 1 on page 1). This dataset was used in all simulations presented in the following subsections.

7.3 Experiment 1: Word Learning Under Uncertainty

Yu and Smith (2007) evaluated the CSWL abilities of 38 undergraduate students dealing with slides containing pictures of unusual objects paired with pseudowords presented in auditory form. There were 3 groups of 18 pairs trained under different conditions concerning the number of labels and pictures presented in each trial (2 and 2, 3 and 3, or 4 and 4). In the test, each subject was presented with one word and four pictures and asked to choose the picture labeled by that word. The details of the experiments are in Appendix A.1.

Procedures for Simulation 1

In the cross-situational experiments, the auditory stimuli (the sounds of the words) formed a single stream; thus, in each trial, a single auditory representation was created by chaining the representations of the sequences of phonemes of the words presented.

For example, assuming that the following four words are used in a test: bed, chair, bowl and fork, the representation of the respective sequence of phonemes: /b e d t e b f k/ formed the auditory input as described in Section 4.1.1. On the other hand, in Yu and Smith (2007), individuals could pay attention to one image at a time, observing them individually. Moreover, since there is no strong correlation between the images, they make more sense when individually observed. Therefore, in our simulations, each image was represented individually, as described in Section 4.1.2. Then, the input stimuli of a trial were constructed by pairing the auditory stimulus with each one of the visual stimuli.

For instance, in each trial of the 2x2 condition, two inputs were given to the model: one built by pairing the auditory representation with the first image and another one built by pairing it with the second image.

After the learning trials, analogously to Yu and Smith (2007), the testing consisted of presenting the sound of one word and four images. One of them is the correct association and the others are randomly chosen foils. The input stimuli are built similarly to the training, with the only difference that now there is only one word, whose representation is paired with the representation of each one of the four images. To identify the association made by the model, each input pair is presented in a random sequence and the level of activity of the winner node in the association layer is registered. Then, the input pair that produced the highest activation is considered the strongest association made by the model.
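The construction of the paired inputs and the selection of the strongest association can be sketched as below. The function `winner_activation` is a hypothetical placeholder for the trained association layer: it is assumed to return the activation of the winner node for one auditory-visual pair. The word and image names in the example are toy values.

```python
def build_inputs(auditory, images):
    """Pair one auditory representation with each image representation."""
    return [(auditory, image) for image in images]

def strongest_association(auditory, images, winner_activation):
    """Return the index of the image with the highest winner activation, and all scores."""
    scores = [winner_activation(a, v) for a, v in build_inputs(auditory, images)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy stand-in for the association layer: fixed activations per (word, image) pair.
toy_scores = {("bed", "img_bed"): 0.97, ("bed", "img_cup"): 0.41,
              ("bed", "img_door"): 0.38, ("bed", "img_fork"): 0.52}
best, scores = strongest_association(
    "bed", ["img_cup", "img_bed", "img_door", "img_fork"],
    lambda a, v: toy_scores[(a, v)])
print(best, scores)  # 1 [0.41, 0.97, 0.38, 0.52]
```

Because the map is not modified during testing, the order in which the pairs are presented does not change which pair yields the highest activation.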

The model was trained and tested 38 times, each time initialized with a different random seed, representing 38 different individuals.

Results of Experiment 1 and Simulation 1

Figure 5 shows that, in the results obtained by Yu and Smith (2007), the individuals correctly guessed significantly more pairs in all conditions (, in condition 2x2, , in 3x3 and in 4x4) than they would have by chance (1/4 = 0.25). Even in the most difficult condition (4x4), with 16 possible associations per trial, the individuals guessed on average 10 of the 18 pairs (0.55). The authors argue that humans are good at guessing the correct word-referent associations in situations of ambiguity, and the results clearly show that increasing the level of ambiguity inside the trials negatively affects learning. This is confirmed by comparing the averages in conditions 2x2 and 4x4 in a t-test with a significance level of 1%.

Figure 5: Experiments of Yu and Smith (2007) in comparison with the results of our simulations. The strong horizontal dashed line indicates the probability of guessing by chance, while the error bars indicate the standard deviation.

Although there are visible differences, analogous conclusions can be drawn from the results of our simulations. The model could guess the correct associations better than chance and displayed a similar pattern of decay of learning as a function of the ambiguity inside trials (, in condition 2x2, , in 3x3, and , in 4x4). The most significant difference is observed in condition 2x2, in which the model learns around 78% of the pairs on average, while the individuals were able to learn about 89%. Yet, the same t-test confirms that the learning rates in conditions 2x2 and 4x4 are statistically different.

7.4 Experiment 2: Word Learning with More Than One Referent

The experiments of Yurovsky et al. (2013) aimed to assess the behavior of individuals for words with two correct associations. A total of 48 students were tested on 18 word-referent pairs under 3 distinct conditions: each set of 6 words was associated with one, two, or no referents. In each of the 27 learning trials, the subject had to deal with 3 different word combinations. Then, each test consisted of providing the subjects with 4 word-referent pairs to rank the most likely associations. The details of the experiments are in Appendix A.2.

Procedures for Simulation 2

In order to simulate the stimuli given to the participants of this experiment, the same 18 objects of the previous experiment were used. The six single words were bed, chair, bowl, fork, door, and canister, presented together with their respective images. The six double words were clock, computer, desk, refrigerator, sofa, and cooker, with their six respective images used as their first referents. The second referents of double words were images of different objects: respectively goblet, mat, mixer, crib, blender, and shaker. Finally, the six noise words were spoon, telephone, knife, armoire, cup, and dresser.

The paired input stimuli were built exactly as in the 4x4 condition of Experiment 1, and, in each testing trial, each one of the four testing words was selected (in a random order) and paired with each one of the four referents. The stimulus built for each pair was presented as input for the model and the activation of the winner node in the association layer was computed. Then, the activation levels were used to rank the pairs for computing the single, double, and either scores.
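The ranking step described above can be sketched as follows. This is only an illustration: `rank_referents` and `toy_activation` are hypothetical names, and the activation values are invented stand-ins for the activation of the winner node in the association layer, which is not reproduced here.

```python
def rank_referents(word, referents, activation):
    """Sort the referents from highest to lowest activation for `word`."""
    scored = [(activation(word, referent), referent) for referent in referents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [referent for _, referent in scored]


# Hypothetical activations standing in for the association layer output.
toy_activations = {
    ("clock", "clock_image"): 0.9,
    ("clock", "goblet_image"): 0.7,
    ("clock", "bed_image"): 0.2,
    ("clock", "fork_image"): 0.1,
}


def toy_activation(word, referent):
    """Look up the invented activation of a word-referent pair."""
    return toy_activations[(word, referent)]


ranking = rank_referents(
    "clock",
    ["bed_image", "goblet_image", "fork_image", "clock_image"],
    toy_activation,
)
print(ranking)
```

In this toy case both referents of the double word "clock" end up in the first two positions, which is the situation counted by the double score.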

This training and testing procedure was repeated 48 times with random initializations, representing the 48 participants.

Results of Experiment 2 and Simulation 2

The results obtained by Yurovsky et al. (2013) (Figure 6) show that participants displayed a better-than-chance knowledge of the referents of single words (), of one of the referents of double words (), and even of both referents of double words (), a difference statistically verified by a t-test with a significance level of 1%.

Figure 6: Comparison of the results obtained by Yurovsky et al. (2013) with the result of the simulation with the proposed model in Experiment 2. Dashed lines indicate the chance levels of performance. The error bars indicate the Standard Error (SE), not Standard Deviation (SD), where .

Yurovsky et al. (2013) found that participants were significantly less likely to learn both referents of a double word than one referent of single words (t(47) = 3.68, p < .001). This suggests that two mappings composed of a single word and two different referents do not act like two independent mappings (two words and two different referents), which points to some kind of competition for the mappings of a word.

The same conclusions can be drawn from our simulations for single words (), one referent () and both referents () of double words. The model was also less likely to learn both referents of double words than one referent of single words (t(47) = 3.5267, p < .001).

Yurovsky et al. (2013) also pointed out that, while this experiment allows concluding there is some kind of competition for the mappings, it is not clear which type of competition, local (within trials) or global (across trials), since both referents were shown in each trial. The next experiment addresses this issue.

7.5 Experiment 3: Local vs Global Competition

Yurovsky et al. (2013) ran experiments with 48 subjects who were trained with a single correct referent of double words in each trial. The individuals were asked to take the same test as in the previous experiment. The details of the experiments are in Appendix A.3.

Procedures for Simulation 3

As in Simulation 2, the six single words (bed, chair, bowl, fork, door, and canister) and the six double words (clock, computer, desk, refrigerator, sofa, and cooker) were used with their respective images. The images of the same different objects (goblet, mat, mixer, crib, blender, and shaker) were used as the second meaning of the double words. Noise words were not used, and the testing procedure was the same as in Simulation 2.

Results of Experiment 3 and Simulation 3

The results of this experiment (Figure 7) showed that, although participants knew all types of mappings above chance (single words: ; double words one referent: ; and both referents: ), they again showed better knowledge of single word referents than of both referents of double words (t(47) = 3.81, p < 0.001). This result suggests competition across trials.

Figure 7: Comparison of results obtained by Yurovsky et al. (2013) with the results obtained with the model in Experiment 3. Dashed lines indicate the chance levels of performance. The error bars indicate the Standard Error.

The simulations presented an analogous behavior with above-chance accuracy (single words: ; double words one referent: ; and both referents: ). The largest difference was observed for the recognition rate of both referents of double words, which could not be considered statistically equivalent to the results displayed by humans. In spite of that, the simulated participants also showed a better knowledge of single word referents than of both referents of double words (t(47) = 4.5613, p < 0.0001), which also points to global competition.

7.6 Experiment 4: Online vs Batch Learning

Yurovsky et al. (2013) designed an experiment similar to Experiment 3 to assess the degree of globality of the competition process, i.e., they evaluated the influence of the temporal order of the individual trials upon accuracy. The details of the experiments are in Appendix A.4.

Procedures for Simulation 4

For the simulations, the same stimuli of the previous experiment were used for training and testing. The only change was in the order of presentation of the referents of double words along the trials: one of the referents of each double word was randomly chosen to be presented earlier, while the second referent was presented only after all presentations of the first referent.
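The early/late ordering of the referents of a double word can be sketched as below. The function name `early_late_sequence` is hypothetical, and the value of six presentations per referent is an assumption made for the example; the sketch only orders the referents of one double word and does not build complete 4x4 trials.

```python
import random


def early_late_sequence(referents, presentations_per_referent, rng):
    """Order the presentations of a double word: all early, then all late."""
    # Randomly decide which of the two referents is presented earlier.
    early, late = rng.sample(referents, 2)
    return [early] * presentations_per_referent + [late] * presentations_per_referent


rng = random.Random(0)  # fixed seed only to make the example repeatable
sequence = early_late_sequence(["clock_image", "goblet_image"], 6, rng)
print(sequence)
```

Repeating the call with a different seed for each simulated participant yields a different random assignment of early and late referents.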

Results of Experiment 4 and Simulation 4

Figure 8 shows the obtained results. Participants displayed similar results for single words () and for one referent of double words (). However, they learned both referents of double words () as well as the referent of single words. Therefore, in contrast with previous experiments, the results did not show evidence of competition.

A possible explanation given by Yurovsky et al. (2013) is that, while global competition protects old mappings from noisy information, local competition leverages knowledge of prior mappings to speed up the acquisition of new mappings.

Figure 8: Comparison of results obtained by Yurovsky et al. (2013) with the results obtained with the model in Experiment 4 for single, either and both words learning accuracy. Dashed lines indicate the chance levels of performance. The error bars indicate the Standard Error.

The simulations have shown similar results for single words () and for one referent of double words (). However, differently from what participants have shown, the model did perform worse for both referents of double words (). This was actually an expected result, since the model, in its present form, does not take any advantage of known mappings to speed up the acquisition of new mappings.

Figure 9: Comparison of results obtained by Yurovsky et al. (2013) with the results obtained with the model in Experiment 4 for the frequency that Early and Late referents were ranked first. The error bars indicate the Standard Error and the dashed lines indicate the chance levels of performance in the experiment with humans (0.2) and in simulation (0.142).

Regarding the ordering factor, the results presented in Figure 9 show that when participants picked up both correct referents for double words, they were slightly more likely () to rank the early referent first () than the late referent (). The model presented a similar pattern, though more strongly (early first: ; late first: , ).

In the next section, we evaluate the capability of the model to reproduce the results of the experiments designed by Trueswell et al. (2013) to verify other learning aspects.

7.7 Experiment 5: Statistical Association vs Propose-but-Verify

Trueswell et al. (2013) proposed the “Propose-but-Verify” hypothesis, in which learning results from a one-trial procedure that links word-referent pairs, which can be unlinked after contrary observations. To test it, they designed experiments to verify whether participants retain one or more association mappings for each word. Fifty students heard sentences and chose the object referred to by each one. The individuals were supposed to learn associations between phrases and images. The details of the experiments are in Section A.5.

Procedures for Simulation 5

For simulating the stimuli given to the participants in this experiment, the following 12 randomly chosen words were used among the 18 of Experiment 3: bed, chair, bowl, fork, door, canister, clock, computer, desk, refrigerator, sofa, and cooker. Also, the same 12 respective images were used as referents for these words.

In each trial, the model was trained similarly as in previous experiments. The input stimuli were produced exactly as in previous experiments, combining the representation of the word with the representation of each referent, though now they were presented in the 1x5 condition, i.e., five combined input stimuli per trial.

Differing from the previous simulations, this time, in order to match the procedure of the experiment, the level of activation of the winner node in the association layer was registered after each input stimulus. The referent that has resulted in the highest activation is considered as the choice of the model for the best association in the trial. The simulation was repeated 50 times with different random seeds to simulate the 50 participants.

Results of Experiment 5 and Simulation 5

Figure 10 shows the percentage of correct answers along the five learning cycles. As expected for a 1x5 condition, the average results suggest that the learning was more difficult than in previous experiments, though still viable. By analyzing the growth of the learning curve, Trueswell et al. (2013) have shown that there was a significant increase in accuracy throughout the learning cycles. The simulations have presented an analogous behavior. A t-test with a significance level of 1% confirms that both the participants and the model present an accuracy above chance at the last learning cycle.

Figure 10: Comparison of the accuracy growth through the learning cycles obtained by Trueswell et al. (2013) with the results obtained with the model in Experiment 5. Dashed lines indicate the chance level of performance. The error bars show a confidence interval of 95%.

Since the previous result shows that learning still occurs in the 1x5 condition, the next step was to evaluate the hypothesis raised about the type of learning (Statistical Association vs Propose-but-Verify). As can be seen in Figure 11, the participants have identified the correct referent with an above-chance accuracy () only after assigning the correct referent in the previous cycle. When the participants missed the correct referent in the previous cycle, they seem to choose a random referent ( 0.20), presenting an accuracy near to 1 in 5 (randomly guessing). Therefore, even when a word has co-occurred before with the correct referent, participants show no sign of remembering it when they missed it in a previous cycle.
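The analysis described above, accuracy conditioned on whether the previous choice for the same word was right or wrong, can be sketched as follows. The function name `conditional_accuracy` and the toy outcomes are hypothetical and are not data from Trueswell et al. (2013) or from the simulations.

```python
def conditional_accuracy(correct_by_cycle):
    """Accuracy split by the outcome for the same word in the previous cycle.

    `correct_by_cycle[word]` is a list of booleans, one per learning cycle.
    Returns (accuracy after a right choice, accuracy after a wrong choice).
    """
    after_right, after_wrong = [], []
    for outcomes in correct_by_cycle.values():
        # Pair each cycle (from the second on) with the one before it.
        for previous, current in zip(outcomes, outcomes[1:]):
            (after_right if previous else after_wrong).append(current)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(after_right), mean(after_wrong)


# Hypothetical outcomes for two words over five learning cycles.
toy_outcomes = {
    "word_1": [True, True, False, True, True],
    "word_2": [False, False, False, True, True],
}
right, wrong = conditional_accuracy(toy_outcomes)
print(right, wrong)  # 0.75 0.5
```

Since the first cycle has no previous cycle, only cycles two to five contribute to either accuracy.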

Figure 11: Comparison of the results shown by Trueswell et al. (2013) with the results obtained with the model in Experiment 5. The “Wrong” label indicates the accuracy displayed when the wrong referent was chosen in the previous learning cycle and the “Right” label indicates the accuracy displayed when the correct referent was previously chosen. Dashed lines indicate the chance level of performance. The error bars show a confidence interval of 95%

With this, Trueswell et al. (2013) conclude that participants did not retain multiple associations through the learning cycles. However, the proposed model has presented an analogous behavior, displaying an above chance accuracy only in the “Right” condition (), while in the “Wrong” condition the results approach a random guess ().

We know, though, that the model can generate multiple hypotheses in each trial (up to five, in this case). This way, we are left with two possibilities: (a) the model did not generate multiple associations in each trial, or (b) the model did generate multiple associations, however, they were not strong enough to affect the accuracy significantly in the following cycle. In the model, the number of new associations generated in each trial is represented by the number of nodes created in the Association Module. Therefore, observing this value we can elucidate what has actually happened.

Figure 12: Number of nodes created in the Association Module after each trial (1-5), through the learning cycles in Experiment 5. The error bars show the standard deviation.

Figure 12 shows the evolution of the number of nodes created in the Association Module in each trial, through the learning cycles. In the first cycle, the model creates about 3 nodes per trial on average, and this number decays through the learning cycles, reaching a value below 1 in the last cycle. This is an expected behavior since, after some learning, the model has already created the associations required to represent most mappings. This indicates that hypothesis (b) is the correct one: the model generates multiple associations, though not as many as possible, but they are not strong enough to affect the accuracy in the next learning cycle.

We argue that two factors may explain the results observed by Trueswell et al. (2013) without disregarding the hypothesis of multiple associations. One is that global competition may insert noise in the associations formed, degrading weak associations (seen only once). The other factor is that in the experimental design of Trueswell et al. (2013), the number of incorrect associations is computed from the second to the fifth cycle, when the number of associations created by trial may have decreased, as our simulations suggest. Therefore, in our model, a chance level accuracy for words that were incorrectly associated in the previous cycle is not a result of retaining a single association hypothesis.

7.8 Experiment 6 Design: The Role of The Context in Word Disambiguation

In all previous simulations, the context module was active and functional. The results obtained in those simulations show that it does not interfere with the learning in the evaluated conditions. However, the role of the context module itself was not directly evaluated. When a word has different meanings, they are usually employed in different situations (contexts). Therefore, our hypothesis is that learning the context in which words are used can help the model to learn their different meanings.

In order to evaluate this, we designed the following experiment, based on the 1x5 condition proposed by Trueswell et al. (2013): The stimuli are composed of two lists of six words (A and B), sharing exactly one word, the ambiguous label (AL), that is associated with a different referent in each list, i.e., in list A the label AL is associated with the referent RA while in list B it is associated with a different referent, RB.

The training should be carried out in six cycles of 14 trials each: the odd cycles are done with words from list A (including AL), while the even cycles are done with words from list B (including AL). Therefore, the cycles are interleaved in the form (A, B, A, B, A, B). This training aims to induce the creation of two different contexts associated with the words in each list. Since the context changes slowly and words from the same list are presented consecutively, the associations created with words in the same list tend to be similar.
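The alternation of lists across training cycles can be sketched as below. The function name `training_schedule` is hypothetical, and the sketch only records which list each trial draws from, not the words or referents presented in the trial.

```python
def training_schedule(n_cycles=6, trials_per_cycle=14):
    """Return the source list ('A' or 'B') of every training trial, in order."""
    schedule = []
    for cycle in range(1, n_cycles + 1):
        # Odd cycles use list A, even cycles use list B.
        source = "A" if cycle % 2 == 1 else "B"
        schedule.extend([source] * trials_per_cycle)
    return schedule


schedule = training_schedule()
print(len(schedule))               # 84
print(schedule[0], schedule[14])   # A B
```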

The testing procedure aims to verify whether the recovered referent for the AL matches the context induced by the stimuli previously given as input, and how many stimuli are necessary to induce the context. Thus, in order to induce the context, six trials in the condition 1x4 are done using words from the lists (excluding AL), before testing the association of AL, also in condition 1x4, with both referents (RA and RB) and two other randomly chosen referents (lures), one from list A and the other from list B. The context-inducing conditions are: 3a+3b, 3b+3a, 4a+2b, 4b+2a, 5a+1b and 5b+1a. The condition 3a+3b, for instance, indicates that three trials with labels and referents from list A were presented, followed by three trials with labels and referents from list B. In this condition, it is expected that the context of list B (the late list) is induced; thus, the correct association to be retrieved is with referent RB.

In each testing trial, three results are possible: (i) the referent of the list presented later is chosen (expected association); (ii) the referent from the list presented earlier is chosen; or (iii) one of the lures is chosen. The prior for the situation (i) is 0.25 (one in four referents), while the prior for selecting one of both associations, (i) or (ii), is 0.5 (two in four).
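The condition labels and the three possible outcomes can be made concrete with a small sketch. The names `induction_sequence` and `classify_choice`, as well as the strings used to encode the choices, are hypothetical conventions adopted only for this example.

```python
def induction_sequence(condition):
    """Expand a label such as '4a+2b' into ['A', 'A', 'A', 'A', 'B', 'B']."""
    sequence = []
    for part in condition.split("+"):
        count, source = int(part[:-1]), part[-1].upper()
        sequence.extend([source] * count)
    return sequence


def classify_choice(condition, chosen):
    """Classify a choice ('RA', 'RB' or 'lure') as 'late', 'early' or 'lure'."""
    late_list = induction_sequence(condition)[-1]
    expected = "R" + late_list  # referent of the list presented later
    if chosen == expected:
        return "late"   # outcome (i): the expected association
    if chosen in ("RA", "RB"):
        return "early"  # outcome (ii): referent of the earlier list
    return "lure"       # outcome (iii)


# Priors under random guessing among the four referents of a testing trial.
prior_expected = 1 / 4  # outcome (i)
prior_either = 2 / 4    # outcome (i) or (ii)

print(induction_sequence("3a+3b"))
print(classify_choice("3a+3b", "RB"))  # late
```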

Procedures for Simulation 6

The words chosen to simulate Experiment 6 were: armoire, snake, dog, cat, cheese, trap, and mouse for the first list, and speaker, printer, computer, notebook, monitor, keyboard, and mouse for the second list. Note that the ambiguous label, AL, is the word mouse. The referents for both lists consisted again of images downloaded from Google Image Search®, using the respective word as the search term. The two referents for the AL consisted of an image of the animal (RA) and an image of the computer device (RB).

Training and testing were done according to the experimental design described above, and the input stimuli combining the word representation with the representation of each referent were produced exactly as in previous experiments. The simulation was repeated 48 times with different random seeds to simulate 48 participants.

Results of Simulation 6

The obtained results are shown in Figure 13. In conditions 3+3 (3a+3b and 3b+3a) and 4+2 (4a+2b and 4b+2a) the context was effective in inducing the recovery of the correct referent with a high accuracy (respectively, and ), with an expected decay in accuracy from conditions 3+3 to 4+2. In conditions 5+1 (5a+1b and 5b+1a), however, the accuracy falls to , which means that the contextual information is not enough to induce the correct association. The model seems to have difficulty choosing between the two possible referents RA and RB, though it can easily discard the lures. A t-test with a significance level of 1% confirms that these results differ from one another and that they are above chance.

Figure 13: Accuracy of the model in choosing the referent induced by the context in each condition: In 3+3, the last three words induce the desired context; in 4+2, the last two words induce the desired context; and in 5+1, only the last word induces the desired context.

These results emphasize the role of the context, showing that it can help to recover the correct meaning for ambiguous words.

8 Discussion

The experimental paradigm of cross-situational word learning has proven to be a very useful tool for evaluating hypotheses about the mechanisms that allow us to learn word-referent associations. The model described in this article has been proposed considering pieces of evidence accumulated in the studies of psycholinguistics and neurolinguistics, organized in a modular architecture that allows us to better understand and communicate about the functions required for word-referent associations.

The results obtained in Experiment 1 are similar in terms of accuracy to the results of the models evaluated by Yu and Smith (2012). However, this article also considered other conditions not evaluated in previous works and introduces advances in terms of model architecture in comparison with previous models. One improvement is the use of a Time-Varying Self-Organizing Map (LARFDSSOM) as the point of connection between the visual and auditory layers. In previous models, this was done via associative connections trained by Hebbian learning. In the proposed model, LARFDSSOM learns the correlations between the different input dimensions from the co-variations observed in the input data by the means of its relevance learning mechanism. This is similar to Hebbian learning, however, it has other useful features such as the topological representation of input data and the activation levels produced by the nodes, which allowed us to model the CSWL experiments. Moreover, the same kind of map is used in different levels of the architecture, which seems to be more plausible.

The proposed architecture is far from being a complete model of cross-situational word learning. It is, however, a step towards a model that allows us to simulate and evaluate hypotheses about the mechanisms behind this characteristic of human nature. The fact that it can deal with real-world inputs, images for referents and text or sound for words, gives it enormous flexibility for simulating more accurately several types of experiments carried out with human beings. We notice that in several conditions its association accuracy is a little below the accuracy of humans. For instance, in the 2x2 condition of Experiment 1, humans reach 89% accuracy while the model achieved only 78%. This could be due to the fact that handcrafted feature extractors were used to represent images and words in the first layer. This can be further improved by employing modern representation techniques such as word embeddings (Mikolov et al., 2013) and convolutional neural networks, already shown to work well in combination with LARFDSSOM (Medeiros et al., 2019) and to achieve human-level performance in certain image classification scenarios (He et al., 2015).

In spite of that, the simulations show that the model is suitable for replicating most of the experiments considered in this work, allowing us to draw similar conclusions. However, in Experiment 4, it seems that it is much easier for humans to learn a second referent of a double word after having learned the first referent than to learn both simultaneously, while this was not the case for the proposed model. We believe that this happens because the model cannot take advantage of other known associations within a trial to reduce ambiguity. For instance, in a 4x4 condition, if a human participant knows three of the four associations, they might easily guess the fourth association by disregarding the words and referents in the other three. The model, on the other hand, only strengthens the currently strongest association for each pair. Therefore, in future work, the model should be modified to take this into account.

Nevertheless, the main conclusions obtained by Yu and Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013) with their experiments, including those in Experiment 4, could also be drawn from the simulation results, summarized below:


The model was able to simulate the remarkable ability of participants in learning associations between labels and referents in different levels of ambiguity, including the fact that it decays with the increase of ambiguity;


The model was able to replicate the greater difficulty that participants present to learn two referents of the same label than to learn only one referent;


Global competition is also the most relevant type of interference that degrades the learning of the model as seems to be the case with the individuals;


The model employs online learning instead of batch learning, which matches the type of learning identified by Yurovsky et al. (2013) in their experiments. However, the model does not benefit from the knowledge of previously known mappings when forming new associations, which prevented it from reproducing part of the results observed by the authors.


Though more difficult, learning still occurs in the high ambiguity condition 1x5, and the model could replicate this fact accurately in each one of the five learning cycles.

It is also worth noting that in Experiment 5, the simulations allowed us to verify how many associations were created in each trial, which has shown that it is possible for a method that makes multiple associations to achieve the results observed by Trueswell et al. (2013), in opposition to the authors’ assumption. This is an example of how such kind of modeling can be useful in the evaluation of new hypotheses.

Another improvement of the current architecture, in relation to previous models, was the introduction of the context module. With this, we could evaluate how context can be used to retrieve the correct meaning of ambiguous words. The results obtained in the simulation of Experiment 6 can and should be evaluated in experiments with human beings to verify how well the model predicts the effects of the context in the disambiguation of word meanings.

Finally, the proposed model may be applied in the proposition and test of new hypotheses and experimental paradigms, contributing to understanding the mechanisms involved in word learning, and can be used as a component for developing agents that learn natural language. Still, it is important to emphasize that although this model was developed taking into consideration the current knowledge provided by neuroscience and cognitive psychology, it is a high level computational model and may not reflect the real learning and representation mechanisms that occur in the brain.


The authors would like to thank CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) and FACEPE (Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco) for supporting project #APQ-0880-1.03/14.


Appendix A More Information on CSWL Experiments

A.1 Exp.1: Word Learning Under Uncertainty

Yu and Smith (2007) evaluated the CSWL abilities of 38 undergraduate students. The stimuli provided consisted of slides containing 2, 3 or 4 pictures of unusual objects paired respectively with 2, 3 or 4 pseudowords presented in auditory form. These artificial words were generated by a computer program using standard phonemes in the English language, the native language of the participants. In this case, there were 54 label-referent pairs formed by single and unique objects randomly chosen and divided into three groups of 18 pairs, which were used in three different training conditions.

The distinct training conditions differ only in the number of labels and referents simultaneously presented to the test subjects. In the 2x2 condition, two labels and two pictures were presented in each trial; in the 3x3 condition, three labels and three pictures were presented in each trial; and, in the 4x4 condition, four labels and four pictures were presented in each trial. During the trials, there was no indication of which label goes with each picture. However, the underlying label-referent mappings guarantee that an individual label was present in a training trial if and only if its referent was also present. Figure 1 illustrates a 4x4 condition.
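To illustrate why such ambiguous trials still carry enough information, the sketch below counts word-referent co-occurrences across trials. This is a generic co-occurrence baseline used only for illustration, not the model proposed in this article; the name `count_cooccurrences` and the toy trials are hypothetical.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(trials):
    """Count how often each (word, referent) pair appears in the same trial."""
    counts = Counter()
    for words, referents in trials:
        # Within a trial every word is a candidate for every referent,
        # so an n x n trial contributes n * n candidate pairs.
        counts.update(product(words, referents))
    return counts


# Toy 2x2 trials; lowercase letters are words, uppercase letters referents.
toy_trials = [
    (["a", "b"], ["A", "B"]),
    (["a", "c"], ["C", "A"]),
    (["b", "c"], ["B", "C"]),
]
counts = count_cooccurrences(toy_trials)
best = {word: max("ABC", key=lambda ref: counts[(word, ref)]) for word in "abc"}
print(best)  # {'a': 'A', 'b': 'B', 'c': 'C'}
```

Each single trial is ambiguous, but the correct pairs co-occur twice while every spurious pair co-occurs only once, so the counts accumulated across trials single out the correct mapping.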

In the test procedures, the participants were told that multiple words and pictures would co-occur in each trial and that their task was to figure out across trials which word went with which picture. They were not told that there was one referent per word. After training in each condition, subjects received a four-alternative forced-choice test of learning, in which, they were presented with 1 word and 4 pictures and asked to indicate the picture named by that word. The target picture and the 3 foils were all drawn from a set of 18 training pictures.

A.2 Exp.2: Word Learning with More Than One Referent

Yurovsky et al. (2013) performed a series of experiments to evaluate the behavior of individuals when there are two correct associations. In the first experiment, 48 grad students were evaluated, also with 18 word-referent pairs. However, the pairs were split in different conditions: six words were associated with a single referent (single words), six words were associated with two referents (double words), and the last six words had no associated referents (noise words).

The single words play the same role as those in the previous experiment, always co-occurring with their referents in each trial. The double words, however, co-occur with both referents in each trial. Since both single and double words co-occur six times with their referents, the total number of occurrences is the same for both types of words. The noise words occur with the same frequency for all referents, thus, they are not consistently mapped to any referent. They serve only for producing an equal number of words in all the trials.

Each trial consists of presenting the stimuli in the 4x4 condition. From a total of 27 trials (Figure 14), in two of them the stimuli were composed of four single words; in 14 trials the stimuli were composed of two single words, one double word, and one noise word; and in 11 trials the stimuli were composed of two double words, and two noise words. This way, although in all trials there were always four words and four referents, the mapping structure varied considerably across the trials, and in only two of them it consisted exclusively of one-to-one mappings.
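The trial composition described above can be checked with simple arithmetic. The sketch below only re-derives totals from the numbers given in the text; the variable names are hypothetical, and the even split of occurrences among the six words of each type is an assumption for the noise words.

```python
# Trial types in Experiment 2 as described above:
# (number of trials, single words, double words, noise words) per trial.
trial_types = [
    (2, 4, 0, 0),
    (14, 2, 1, 1),
    (11, 0, 2, 2),
]

total_trials = sum(n for n, _, _, _ in trial_types)
single_slots = sum(n * s for n, s, _, _ in trial_types)
double_slots = sum(n * d for n, _, d, _ in trial_types)
noise_slots = sum(n * z for n, _, _, z in trial_types)

words_per_type = 6
print(total_trials)                   # 27
print(single_slots / words_per_type)  # 6.0 occurrences per single word
print(double_slots / words_per_type)  # 6.0 occurrences per double word
print(noise_slots / words_per_type)   # 6.0 occurrences per noise word
```

The totals agree with the statement that single and double words each co-occur six times with their referents, and every trial type adds up to four words.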

Figure 14: Structure of Experiment 2. The lowercase letters represent words and the uppercase letters represent referents. Single words are in bold (ex.: b-B and c-C), double words are in white (ex.: a-A1 and a-A2, f-F1 and f-F2), and noise words are in gray (ex.: d and g).

After the learning trials, the learning rates of each individual were evaluated similarly as in Yu and Smith (2007). Every single word was presented with its referent and three other randomly chosen referents and each double word was presented with both their referents and two other randomly chosen referents.

The individuals were asked to rank the four objects from the most to the least likely meaning of the word. To compute the scores of single words, one correct guess is computed when the correct referent was ranked first. For double words, two types of scores were computed: a Double score is computed when the participant ranks both correct referents (in either order) in the first and second positions, and an Either score is computed when the participant ranks one of the correct referents in the first position and an incorrect referent in the second position.
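The three scoring rules can be written down directly from the definitions above. The function names and the toy rankings below are hypothetical and serve only as an illustration.

```python
def score_single(ranking, correct):
    """Single score: the correct referent is ranked first."""
    return ranking[0] == correct


def score_double(ranking, correct_pair):
    """Double score: both correct referents fill the first two positions."""
    return set(ranking[:2]) == set(correct_pair)


def score_either(ranking, correct_pair):
    """Either score: a correct referent first, an incorrect one second."""
    return ranking[0] in correct_pair and ranking[1] not in correct_pair


# Toy rankings of four referents, from most to least likely.
print(score_single(["B", "X", "Y", "Z"], "B"))            # True
print(score_double(["A2", "A1", "X", "Y"], ("A1", "A2")))  # True
print(score_either(["A1", "X", "A2", "Y"], ("A1", "A2")))  # True
```

By construction, a ranking of a double word cannot receive both a Double and an Either score.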

A.3 Exp.3: Local vs Global Competition

To evaluate whether global or local competition had occurred, Yurovsky et al. (2013) exposed another 48 participants to only one correct referent of double words in each trial, while the testing procedure was the same as in the previous experiment. If only local competition were occurring during training, the participants of this experiment should be able to learn both referents of double words as well as they learn the referent of single words. Otherwise, global competition was occurring. The stimuli were presented as illustrated in Figure 15. Noise words were not necessary for this experiment since two single words and two double words were presented in each trial with their respective referents (4x4 condition).

Figure 15: Structure of Experiment 3. Unlike in Experiment 2, only one correct referent of double words is presented in each trial. The co-occurrence frequency of correct associations was the same as in Experiment 2.

A.4 Exp.4: Online vs Batch Learning

Yurovsky et al. (2013) conjectured that if the competition is primarily global, and occurs only after all training information has been accumulated (batch learning), there should be no effect of the temporal order of the individual trials. However, if global competition emerges trial-by-trial (online learning), and does not interact with other local mappings within a trial, then it is expected that a decrement of the accuracy will be observed for the second referents of double words presented later in relation to the knowledge of referents presented earlier.

Yurovsky et al. (2013) designed an experiment to evaluate this, with the organization shown in Figure 16 for a new group of participants. Notice that this experiment is similar to Experiment 3. However, one of the referents of each double word is randomly chosen to be presented earlier, while the second referent is presented only after all co-occurrences with the first referent have been carried out. Notice also that both referents have the exact same frequency of co-occurrence with their respective double word.

Figure 16: Structure of Experiment 4. Unlike Experiment 3, one of the referents of each double word is presented first (A1), while the other is presented in later trials (A2). The co-occurrence frequency of correct associations was the same as in Experiments 2 and 3.

a.5 Exp.5: Statistical Association vs Propose-but-Verify

Although the results of previous experiments suggest that learning under such conditions derives from some kind of statistical-associative learning mechanism, such as the one the proposed model employs, Trueswell et al. (2013) hypothesize that learning is instead the product of a one-trial procedure in which a single hypothesized word-referent pairing is made in one shot and retained across learning instances, being abandoned only if a subsequent observation fails to confirm the pairing. The authors called this hypothesis “Propose-but-Verify”.

In order to test this, Trueswell et al. (2013) designed experiments to explicitly verify if participants retain a set of association mappings for each word or if they keep a single conjecture about the association.

In each of the trials designed by Trueswell et al. (2013), five images were used as referents, while the auditory stimuli consisted of phrases such as “Oh! look, a …!” with one label (condition 1x5). In total, 12 artificial words were used as labels and 12 images of objects were used as referents. In such a scenario, there is a high degree of uncertainty about the correct referent.

The trials were divided into five learning cycles. In the first cycle, each word was presented once, in random order. The other four cycles repeated the first cycle in the same order.

Fifty undergraduate students participated in the tests. They were instructed that, after hearing the phrase, they should click on the object referred to by the phrase.

Since the participants were tested in every trial, this allowed the authors to register the evolution of the learning rates of the individuals after each learning cycle. The rationale is that if participants store only one association, and the referent is not the correct one, then when finding the same word in a subsequent trial, they should choose randomly between the available referents and should not show any bias for the correct referent, since there should be no trace in memory of such association. A bias for the correct referent should be observed if the participants can keep track of multiple possible associations.
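The rationale above can be illustrated with a small simulation. The following Python sketch is not the procedure of Trueswell et al. (2013) nor the proposed model; it is a simplified illustration under stated assumptions: referent i is taken to be the correct referent of word i, distractors are redrawn at random in every cycle, and the two learners (`propose_but_verify`, which stores one guess per word, and `associative`, which counts all co-occurrences) are hypothetical minimal versions of each account. It measures accuracy only on trials in which the previous response to the same word was wrong.

```python
import random
from collections import defaultdict

N_WORDS, N_SHOWN, N_CYCLES = 12, 5, 5  # 12 words, 1x5 trials, 5 cycles


def make_trials(rng):
    """Word order is fixed across cycles; distractors are redrawn (assumption)."""
    order = rng.sample(range(N_WORDS), N_WORDS)
    trials = []
    for _ in range(N_CYCLES):
        for word in order:
            others = [r for r in range(N_WORDS) if r != word]
            shown = rng.sample(others, N_SHOWN - 1) + [word]  # referent i matches word i
            rng.shuffle(shown)
            trials.append((word, shown))
    return trials


def propose_but_verify(trials, rng):
    """Keep a single guess per word; guess again at random if it is not on screen."""
    guess, responses = {}, []
    for word, shown in trials:
        if guess.get(word) not in shown:
            guess[word] = rng.choice(shown)
        responses.append(guess[word])
    return responses


def associative(trials, rng):
    """Count every word-referent co-occurrence; choose the most frequent shown referent."""
    counts, responses = defaultdict(int), []
    for word, shown in trials:
        best = max(counts[(word, r)] for r in shown)
        responses.append(rng.choice([r for r in shown if counts[(word, r)] == best]))
        for r in shown:
            counts[(word, r)] += 1
    return responses


def hits_after_error(trials, responses):
    """Return (hits, total) over trials whose previous response to the same word was wrong."""
    last_correct, hits, total = {}, 0, 0
    for (word, _), response in zip(trials, responses):
        correct = response == word
        if last_correct.get(word) is False:
            hits += correct
            total += 1
        last_correct[word] = correct
    return hits, total


def simulate(learner, n_participants=500, seed=0):
    """Pooled accuracy after a previous error, over simulated participants."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n_participants):
        trials = make_trials(rng)
        h, t = hits_after_error(trials, learner(trials, rng))
        hits, total = hits + h, total + t
    return hits / total


if __name__ == "__main__":
    print("propose-but-verify:", round(simulate(propose_but_verify), 3))
    print("associative:       ", round(simulate(associative), 3))
```

In this sketch the single-hypothesis learner cannot exceed the 1-in-5 chance level after an error, because nothing about the correct referent was stored, whereas the counting learner stays above chance because the correct referent has co-occurred with the word on every previous trial.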

Appendix B Numeric Representation of Phonemes

| Phoneme (example word) | IPA | ARPAbet | Numeric Representation |
|---|---|---|---|
| father | ɑ | AA | 1 0.5 1 -1 0 0 0 0 0 0 0 0 |
| at | æ | AE | 1 -0.5 -1 -1 0 0 0 0 0 0 0 0 |
| but, sofa | ʌ, ə | AH | 0.67 0 -1 -1 0 0 0 0 0 0 0 0 |
| off | ɔ | AO | 0.33 1 1 1 0 0 0 0 0 0 0 0 |
| how | aʊ | AW | 0 0.5 0 0 0 0 0 0 0 0 0 0 |
| my | aɪ | AY | 0 0 -0.5 0 0 0 0 0 0 0 0 0 |
| red | ɛ | EH | 0.33 -0.5 -1 -1 0 0 0 0 0 0 0 0 |
| her, coward | ɝ, ɚ | ER | 0.33 0 1 0 0 0 0 0 0 0 0 0 |
| big | ɪ | IH | -0.67 -0.5 -1 -1 0 0 0 0 0 0 0 0 |
| bee | i | IY | -1 -1 1 -1 0 0 0 0 0 0 0 0 |
| boy | ɔɪ | OY | 0 0 0 0 0 0 0 0 0 0 0 0 |
| show | oʊ | OW | -0.33 1 1 1 0 0 0 0 0 0 0 0 |
| say | eɪ | EY | -0.33 -1 1 0 0 0 0 0 0 0 0 0 |
| should | ʊ | UH | -0.67 0.5 -1 0 0 0 0 0 0 0 0 0 |
| you | u | UW | -1 1 1 1 0 0 0 0 0 0 0 0 |
| buy | b | B | 0 0 0 0 1 -1 1 -1 1 -1 -1 -1 |
| chair | tʃ | CH | 0 0 0 0 0.27 -1 -1 0 -1 -1 -1 -1 |
| day | d | D | 0 0 0 0 0.45 -1 1 -1 1 -1 -1 -1 |
| that | ð | DH | 0 0 0 0 0.64 -1 -1 1 1 -1 -1 -1 |
| for | f | F | 0 0 0 0 0.82 -1 -1 1 -1 -1 -1 -1 |
| go | g | G | 0 0 0 0 -0.27 -1 1 -1 1 -1 -1 -1 |
| house | h | HH | 0 0 0 0 -1 -1 -1 1 0 -1 -1 -1 |
| just | dʒ | JH | 0 0 0 0 0.45 -1 -1 0 1 -1 -1 -1 |
| key | k | K | 0 0 0 0 -0.27 -1 1 -1 -1 -1 -1 -1 |
| late | l | L | 0 0 0 0 0.45 -1 -1 -1 1 -1 -1 1 |
| man | m | M | 0 0 0 0 1 1 -1 -1 1 -1 -1 -1 |
| knee | n | N | 0 0 0 0 0.45 1 -1 -1 1 -1 -1 -1 |
| sing | ŋ | NG | 0 0 0 0 -0.27 1 -1 -1 1 -1 -1 -1 |
| pay | p | P | 0 0 0 0 1 -1 1 -1 -1 -1 -1 -1 |
| run | ɹ | R | 0 0 0 0 0.27 -1 -1 -1 1 -1 1 -1 |
| say | s | S | 0 0 0 0 0.45 -1 -1 1 -1 -1 -1 -1 |
| show | ʃ | SH | 0 0 0 0 0.27 -1 -1 1 -1 -1 -1 -1 |
| take | t | T | 0 0 0 0 0.45 -1 1 -1 -1 -1 -1 -1 |
| thanks | θ | TH | 0 0 0 0 0.64 -1 -1 1 -1 -1 -1 -1 |
| very | v | V | 0 0 0 0 0.82 -1 -1 1 1 -1 -1 -1 |
| way | w | W | 0 0 0 0 1 -1 -1 -1 1 1 -1 -1 |
| yes | j | Y | 0 0 0 0 -0.09 -1 -1 -1 1 1 -1 -1 |
| zoo | z | Z | 0 0 0 0 0.45 -1 -1 1 1 -1 -1 -1 |
| measure | ʒ | ZH | 0 0 0 0 0.27 -1 -1 1 1 -1 -1 -1 |
| silent | | # | 0 0 0 0 0 0 0 0 0 0 0 0 |
Table 2: Correspondence between IPA and ARPAbet symbols and the respective numeric representation of each phoneme.

The numeric representation of the auditory data was constructed following a procedure similar to the one described in Araujo et al. (2010). First, each word is converted to its phonetic representation according to the CMU Pronouncing Dictionary (Lenzo, 2007), in which each phoneme is represented by its ARPAbet symbol (Table 2). Then, each phoneme is translated into a vector of 12 real values ranging from -1 to +1, according to its place of pronunciation in the International Phonetic Alphabet (IPA) charts for vowels (4 features) and consonants (8 features). For example, the word “ball” is converted as follows:
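A minimal Python sketch of this conversion is given here. It assumes the CMU pronunciation B AO1 L for “ball” and that the stress digit is dropped before the lookup. Only the three rows of Table 2 that this word needs are copied, and the names `PHONEME_VECTORS`, `PRONUNCIATIONS`, `strip_stress` and `word_to_vectors` are hypothetical, not taken from the published code.

```python
# Subset of Table 2: ARPAbet symbol -> 12-dimensional vector
# (4 vowel features followed by 8 consonant features).
PHONEME_VECTORS = {
    "B":  [0, 0, 0, 0, 1, -1, 1, -1, 1, -1, -1, -1],
    "AO": [0.33, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "L":  [0, 0, 0, 0, 0.45, -1, -1, -1, 1, -1, -1, 1],
}

# Assumed dictionary entry, written in the style of the CMU Pronouncing Dictionary.
PRONUNCIATIONS = {"ball": ["B", "AO1", "L"]}


def strip_stress(symbol):
    """Drop the trailing stress digit attached to vowels (assumed preprocessing step)."""
    return symbol.rstrip("012")


def word_to_vectors(word):
    """Return one 12-value vector per phoneme of `word`, in pronunciation order."""
    phonemes = PRONUNCIATIONS[word.lower()]
    return [PHONEME_VECTORS[strip_stress(p)] for p in phonemes]


if __name__ == "__main__":
    for symbol, vector in zip(PRONUNCIATIONS["ball"], word_to_vectors("ball")):
        print(symbol, vector)
```

Under these assumptions, “ball” becomes a sequence of three 12-dimensional vectors: the consonant B (vowel features set to zero), the vowel AO (consonant features set to zero), and the consonant L.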


  1. journal: Neural Networks
  2. Available on GitHub: https://github.com/hfbassani/word-referent-association


  1. An effective image feature classiffication using an improved som. CoRR abs/1501.01723. Cited by: item 3.
  2. Episodic memory, amnesia, and the hippocampal-anterior thalamic axis. Behavioral and Brain Sciences 22 (3), pp. 425–44. Cited by: §2.2.
  3. Natural language understanding (2nd edition). Addison-Wesley. External Links: ISBN 0805303340 Cited by: §2.
  4. Occurrence of false memories: a neural module considering context for memorization of words lists. In IEEE International Joint Conference on Neural Networks, pp. 1–8. Cited by: Appendix B, §2.1, §3, §4.1.1, §6.
  5. From grasp to language: embodied concepts and the challenge of abstraction. Journal of Physiology-Paris 102 (1-3), pp. 4 – 20. Note: Links and Interactions Between Language and Motor Systems in the Brain External Links: ISSN 0928-4257 Cited by: §1.
  6. Dimension selective self-organizing maps for clustering high dimensional data. In IEEE International Joint Conference on Neural Networks, Brisbane. Cited by: §5.
  7. Dimension selective self-organizing maps with time-varying structure for subspace and projected clustering. Neural Networks and Learning Systems, IEEE Transactions on 26 (3), pp. 458–471. Cited by: §1, §5.1.
  8. Children’s use of shape in extending novel labels to animate objects: identity versus postural change. Cognitive Development 6 (1), pp. 3 – 16. External Links: ISSN 0885-2014 Cited by: §2.1.
  9. How children learn the meanings of words. The MIT Press. Cited by: §1, §1, §2.
  10. Structure and function of visual area MT. Annu. Rev. Neurosci. 28, pp. 157–189. Cited by: §2.2.
  11. Finding meaning in a noisy world: exploring the effects of referential ambiguity and competition on 2.5-year-olds’ cross-situational word learning. Journal of child language 44 (3), pp. 650–676. Cited by: item 1, §2.1.
  12. Acquiring a single new word. Papers and Reports on Child Language Development 15, pp. 17–29. Cited by: §2.1.
  13. Conceptual change in childhood. MIT Press. Cited by: §2.1.
  14. Art-2 - self-organization of stable category recognition codes for analog input patterns. Applied Optics 26 (23), pp. 4919–4930. Cited by: §6.
  15. A self organizing map optimization based image recognition and processing model for bridge crack inspection. Automation in Construction 73, pp. 58 – 66. External Links: ISSN 0926-5805 Cited by: item 3.
  16. Visual co-orientation and maternal speech. In Studies in mother-infant interaction, H. R. Schaffer (Ed.), London: Academic Press. Cited by: item 1.
  17. Child meets word: “fast mapping” in preschool children. J Speech Hear Res 28 (3), pp. 449–454. Cited by: §2.1.
  18. When meaning is not enough: distributional and semantic cues to word categorization in child directed speech. Frontiers in psychology 8, pp. 1242. Cited by: item 3.
  19. The functional neuroanatomy of episodic memory. Trends in Neurosciences 20 (5), pp. 213–218. External Links: PII S0166-2236(96)01013-2 Cited by: §2.2.
  20. Inferring word meanings by assuming that speakers are informative. Cognitive psychology 75, pp. 80–96. Cited by: item 2.
  21. Adaptive pattern classification and universal recoding: i. parallel development and coding of neural feature detectors. Biological Cybernetics 23 (3), pp. 121–134. External Links: ISSN 1432-0770 Cited by: §3.
  22. Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biological Cybernetics 23, pp. 187–202. Cited by: §3.
  23. The perceptual magnet effect as an emergent property of neural map formation. J. Acoust. Soc. Am. 100 (2 Pt 1), pp. 1111–1121. Cited by: §3.
  24. To cognize is to categorize: cognition is categorization. In Handbook of Categorization in Cognitive Science, C. L. Henri Cohen (Ed.), pp. 20–46. External Links: ISBN 0080446124 Cited by: §1.
  25. The nonverbal content of mothers’ speech to infants. First Language 4, pp. 21–31. Cited by: item 1.
  26. Neural networks: a comprehensive foundation. Prentice Hall. Cited by: §2.2, item 1.
  27. Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, Washington, DC, USA, pp. 1026–1034. External Links: ISBN 978-1-4673-8391-2 Cited by: §8.
  28. Word learning in children: an examination of fast mapping. Child Development 58, pp. 1021–1034. Cited by: §2.1.
  29. Subspace multi-clustering: a review. Knowledge and Information Systems 56 (2), pp. 257–284. External Links: ISSN 0219-3116 Cited by: §1.
  30. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology 160, pp. 106–154. Cited by: item 2.
  31. Functional architecture of macaque visual cortex. Proceedings of the Royal Society B 198, pp. 1–59. Cited by: item 2.
  32. The reorganization of somatosensory cortex following peripheral nerve damage in adult and developing mammals. Annual Review of Neuroscience 6, pp. 325–356. Cited by: item 2.
  33. A bootstrapping model of frequency and context effects in word learning. Cognitive Science 41 (3), pp. 590–622. Cited by: §1.
  34. The second year. Cambridge, MA: Harvard University Press. Cited by: §2.1.
  35. Time-dependent self-organizing maps for speech recognition. In Artificial Neural Networks, T. Kohonen (Ed.), pp. 1591 – 1594. External Links: ISBN 978-0-444-89178-5 Cited by: item 3.
  36. Explanation, association, and the acquisition of word meaning. Lingua 92 (Supplement C), pp. 169 – 196. External Links: ISSN 0024-3841 Cited by: §2.1.
  37. Modeling individual performance in cross-situational word learning. PsyArXiv. Cited by: §1.
  38. Unsupervised object discovery via self-organisation. Pattern Recogn. Lett. 33 (16), pp. 2102–2112. Cited by: §4.1.2, §4.2, §4.2.
  39. A Computational Account of Bilingual Aphasia Rehabilitation. Biling (Camb Engl) 16 (2), pp. 325–342. Cited by: §3.
  40. Self-organized formation of topologically correct feature maps. Biological Cybernetics 43 (1), pp. 59–69 (English). Cited by: §3, §5.
  41. A generic framework for efficient subspace clustering of high-dimensional data. In ICDM, pp. 250–257. Cited by: §1.
  42. Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Percept Psychophys 50 (2), pp. 93–107. Cited by: §3.
  43. The cmu pronouncing dictionary. Cited by: Appendix B, §4.1.1.
  44. Early lexical development in a self-organizing neural network. Neural Networks 17 (8–9), pp. 1345 – 1362. Note: New Developments in Self-Organizing Systems External Links: ISSN 0893-6080 Cited by: §3, §3.
  45. Dynamic self-organization and early lexical development in children. Cognitive science 31 (4), pp. 581–612. Cited by: §3, §3.
  46. Self-organizing map models of language acquisition. Frontiers in Psychology 4 (828). External Links: ISSN 1664-1078 Cited by: §3.
  47. Crosslinguistic and crosscultural aspects of language addressed to children. In Input and interaction in language acquisition, C. Gallaway & B. J. Richards (Eds.), Cited by: §2.1.
  48. Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision - ICCV, Vol. 2, pp. 1150–1157 vol.2. Cited by: §4.1.2.
  49. Computational models of child language learning: an introduction. J Child Lang 37 (3), pp. 477–485. Cited by: §3.
  50. Children’s use of mutual exclusivity to constrain the meaning of words. Cognitive Psychology 20, pp. 121–157. Cited by: §2.1.
  51. Evidence against a dedicated system for word learning in children. Nature 385 (6619), pp. 813–815. Cited by: item 2, §2.1.
  52. Remembering words not presented in sentences: how study context changes patterns of false memories. Memory & Cognition 37 (1), pp. 52–64. Cited by: §2.1.
  53. A neurocomputational account of taxonomic responding and fast mapping in early word learning. Psychol Rev 117 (1), pp. 1–31. Cited by: §2.1, §3.
  54. Dynamic topology and relevance learning som-based algorithm for image clustering tasks. Computer Vision and Image Understanding 179, pp. 19 – 30. External Links: ISSN 1077-3142 Cited by: §8.
  55. How words can and cannot be learned by observation. Proceedings of the National Academy of Sciences 108 (22), pp. 9014–9019. External Links: ISSN 0027-8424 Cited by: §3.
  56. Computational maps in the visual cortex. Vol. 1, Springer. Cited by: §2.2, §2.2, item 2.
  57. Dyslexic and category-specific aphasic impairments in a self-organizing feature map model of the lexicon. Brain and Language, pp. 334–366. Cited by: §3, §3.
  58. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 3111–3119. Cited by: §8.
  59. Nouns in early lexicons: evidence, explanations and implications. J Child Lang 20 (1), pp. 61–84. Cited by: item 3.
  60. Módulos neurais para modelagem de falsas memórias. Ph.D. Thesis, Universidade Federal de São Carlos. Cited by: §3.
  61. Reconstructing speech from human auditory cortex. PLoS Biology 10 (1). Cited by: §2.2.
  62. Modeling field theory of higher cognitive functions. In Artificial Cognition Systems, A. Loula (Ed.), pp. 65–106. External Links: ISBN 1599041111 Cited by: §1.
  63. Symbol grounding or the emergence of symbols? Vocabulary growth in children and a connectionist net. Connection Science 4, pp. 293–312. Cited by: §3.
  64. Theories of early language acquisition. Trends in Cognitive Sciences 1, pp. 146–153. Cited by: §3.
  65. Labels can override perceptual categories in early infancy. Cognition 106 (2), pp. 665 – 681. External Links: ISSN 0010-0277 Cited by: §2.1.
  66. Brain mechanisms linking language and action. Nature Reviews Neuroscience 6 (7), pp. 576–582. Cited by: §2.2.
  67. Word and object. Cambridge, MA: MIT Press. Cited by: §1, §1.
  68. Preschooler’s QUIL: Quick incidental learning of words. In Children’s language (Vol. 7), G. Conti-Ramsden & C. Snow (Eds.), Hillsdale, NJ. Cited by: §2.1.
  69. The episodic memory model of conceptual development: an integrative viewpoint. Cognitive Development 1, pp. 183–219. Cited by: §2.
  70. Self-organizing semantic maps. Biological Cybernetics 61 (4), pp. 241–254 (English). External Links: ISSN 0340-1200 Cited by: §3, §3.
  71. The mirror-neuron system. Annual Review of Neuroscience 27 (1), pp. 169–192. External Links: ISSN 0147-006X Cited by: §2.2.
  72. Sensitivity analysis. Wiley. External Links: ISBN 0470743824 Cited by: §7.1.
  73. Introduction to modern information retrieval. McGraw-Hill, Inc., New York, NY, USA. Cited by: §4.2.
  74. Semantic boost on episodic associations: an empirically-based computational model. Cognitive Science 31 (4), pp. 645–671. Cited by: §3, §3.
  75. Ontological categories guide young children’s inductions of word meaning: object terms and substance terms. Cognition 38 (2), pp. 179–211. Cited by: §2.1.
  76. The mind within the net: models of learning, thinking, and acting. A Bradford book, MIT Press. External Links: ISBN 9780262692366, LCCN 98010911 Cited by: §2.2.
  77. The extent to which biosonar information is represented in the bat auditory cortex. In Dynamic Aspects of Neocortical Function, W.M. Cowan (Ed.), pp. 653–695. Cited by: item 2.
  78. Propose but verify: fast mapping meets cross-situational word learning. Cognitive Psychology 66 (1), pp. 126–156. Cited by: §A.5, §A.5, §A.5, §1, §1, §1, §2.1, §3, Figure 10, Figure 11, §7.6.2, §7.7.2, §7.7.2, §7.7.2, §7.7, §7.8, §7, §8, §8.
  79. Unsupervised object discovery: a comparison. Int. J. Comput. Vision 88 (2), pp. 284–302. Cited by: §4.1.2, §4.2, §4.2.
  80. Balint’s syndrome–visual disorientation. Ugeskr. Laeg. 154 (21), pp. 1492–1494. Cited by: §2.2.
  81. Novel approach for speech recognition by using self-organized maps. In 2011 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), pp. 215–222. Cited by: item 3.
  82. Unsupervised learning of models for recognition. In European Conference on Computer Vision - ECCV, Part I, London, UK, pp. 18–32. Cited by: §4.1.2.
  83. Rapid word learning under uncertainty via cross-situational statistics. Psychol Sci 18 (5), pp. 414–420. Cited by: §A.1, §A.2, §1, §1, §1, §1, §1, §2.1, §3, Figure 5, §7.2, §7.3.1, §7.3.1, §7.3.2, §7.3, §7, §8.
  84. Modeling cross-situational word-referent learning: prior questions. Psychol Rev 119 (1), pp. 21–39. Cited by: §3, §8.
  85. Competitive processes in cross-situational word learning. Cognitive Science 37 (5), pp. 891–921. External Links: ISSN 1551-6709 Cited by: §A.2, §A.3, §A.4, §A.4, §1, §1, §1, §2.1, Figure 6, Figure 7, Figure 8, Figure 9, §7.4.2, §7.4.2, §7.4.2, §7.4, §7.5, §7.6.2, §7.6, §7, item Exp.4:, §8.