LETTER
Communicated by Gary Cottrell
On the Emergence of Rules in Neural Networks

Stephen José Hanson
[email protected]
Michiro Negishi
[email protected]
Psychology Department, Rutgers University, Newark, NJ 07102, U.S.A.
A simple associationist neural network learns to factor abstract rules (i.e., grammars) from sequences of arbitrary input symbols by inventing abstract representations that accommodate unseen symbol sets as well as unseen but similar grammars. The neural network is shown to have the ability to transfer grammatical knowledge to both new symbol vocabularies and new grammars. Analysis of the state-space shows that the network learns generalized abstract structures of the input and is not simply memorizing the input strings. These representations are context sensitive, hierarchical, and based on the state variable of the finite-state machines that the neural network has learned. Generalization to new symbol sets or grammars arises from the spatial nature of the internal representations used by the network, allowing new symbol sets to be encoded close to symbol sets that have already been learned in the hidden unit space of the network. The results are counter to the arguments that learning algorithms based on weight adaptation after each exemplar presentation (such as the long-term potentiation found in the mammalian nervous system) cannot in principle extract symbolic knowledge from positive examples as prescribed by prevailing human linguistic theory and evolutionary psychology.
1 Introduction

A basic puzzle in the cognitive neurosciences is how simple associationist learning at the synaptic level of the brain can be used to construct known properties of cognition that appear to require abstract reference, variable binding, and symbols. The ability of humans to parse sentences and to abstract knowledge from specific examples appears to be inconsistent with local associationist algorithms for knowledge representation (Pinker, 1994; Fodor & Pylyshyn, 1988; but see Hanson & Burr, 1990). Part of the puzzle is how neuron-like elements could, from simple signal processing properties, emulate symbol-like behavior. Symbols and symbol systems have
previously been defined by many different authors. Harnad's version (1990) is a useful summary of many such properties: (1) a set of arbitrary physical tokens (scratches on paper, holes on a tape, events in a digital computer, etc.) that are (2) manipulated on the basis of explicit rules that are (3) likewise physical tokens and strings of tokens. The rule-governed symbol-token manipulation is based (4) purely on the shape of the symbol tokens (not their “meaning”); that is, it is purely syntactic and consists of (5) rulefully combining and recombining symbol tokens. There are (6) primitive atomic symbol tokens and (7) composite symbol-token strings. The entire system and all its parts—the atomic tokens, the composite tokens, the syntactic manipulations (both actual and possible), and the rules—are all (8) semantically interpretable: the syntax can be systematically assigned a meaning (e.g., as standing for objects, as describing states of affairs). As this definition implies, a key element in the acquisition of symbolic structure involves a type of independence between the task the symbols are found in and the vocabulary they represent. Fundamental to this type of independence is the ability of the learning system to factor the generic nature (or rules) of the task from the symbols, which are arbitrarily bound to the external referents of the task. Consider the simple problem of learning a grammar from valid, “positive set only” sentences consisting of strings of symbols drawn randomly from an infinite population of such valid strings. This sort of learning might very well underlie the acquisition of language in children from exposure to grammatically correct sentences during normal discourse with their language community.1 As an example, we will examine the learning of strings generated by finite-state machines (FSMs; for an example, see Figure 1), which are known to correspond to regular grammars. Humans are known to gain a memorial advantage from exposure to strings drawn from an FSM over random strings (Miller & Stein, 1963; Reber, 1967), as though they are extracting abstract knowledge of the grammar. A more stringent test of knowledge of a grammar would be to expose subjects to an FSM with one external symbol set and see if they transfer knowledge to a novel external symbol set assigned to the same FSM.
1 Although controversial, language acquisition must surely involve the exposure of children to valid sentences in their language. Chomsky (1957) and other linguists have stressed the importance of the a priori embodiment of the possible grammars in some form more generic than the exact target grammar. Although not the main point of this article, it must surely be true of the distribution of possible grammars that some learning bias must exist that helps guide the acquisition and selection of one grammar over another in the presence of data. What the nature of this learning bias is might be a more profitable avenue of research in language acquisition than the recent polarizations inherent in the nativist-empiricist dichotomy (Pinker, 1997; Elman et al., 1996).
Figure 1: A vocabulary transfer task. The figure shows two FSM representations, each of which has three states (1, 2, 3), with transitions indicated by arrows and legal transition symbols (A, B, . . . , F) for each state. The network’s first task is to learn the FSM on the left-hand side to criterion and then transfer to the second FSM on the right-hand side of this figure. This task involves no possible generalization from the transition symbols. Rather, all that is available are the state configuration geometries. This type of transfer task explicitly forces the network to process the symbol set independently from the transition rule. States with tailless incoming arrows indicate initial states (state 1 in the FSMs). States within circles indicate accepting states (state 3 in the FSMs).
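To make the setup in Figure 1 concrete, the sketch below writes a three-state machine as a transition table in Python. The particular arcs are illustrative assumptions (the text specifies the states, the symbol inventory, the initial state, and the accepting state, but not the individual transitions); the point is that vocabulary transfer relabels the arcs while leaving the state geometry untouched.

```python
# A hypothetical three-state FSM in the spirit of Figure 1. The arcs are
# illustrative, not a transcription of the figure: state 1 is initial and
# state 3 is accepting, as in the text, but the transitions are assumed.
LEFT = {
    (1, 'A'): 2,
    (2, 'A'): 2,
    (2, 'B'): 3,
    (3, 'C'): 1,
}
ACCEPTING = frozenset({3})

# The right-hand machine of a vocabulary transfer pair: identical state
# geometry, arcs relabeled with a disjoint symbol set.
RELABEL = {'A': 'D', 'B': 'E', 'C': 'F'}
RIGHT = {(s, RELABEL[sym]): t for (s, sym), t in LEFT.items()}

def accepts(fsm, sentence, start=1, accepting=ACCEPTING):
    """True if the symbol sequence drives the FSM to an accepting state."""
    state = start
    for symbol in sentence:
        if (state, symbol) not in fsm:
            return False          # no legal arc: ungrammatical
        state = fsm[(state, symbol)]
    return state in accepting
```

Nothing in `accepts` changes under the relabeling, which is why a learner given only the new surface symbols must have factored out the state geometry in order to transfer.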
In principle, in this type of task, it is impossible for the subjects to use the symbol set as a basis for generalization without noting the patterns that are commensurate with the properties of the FSM.2 An example of this type of transfer (vocabulary transfer) is shown in Figure 1. In this task, the syntactic structure of the generated sentences is kept constant while the vocabularies are switched. This will be the first kind of simulation in this article. In the second simulation, we examine a type of syntax transfer effect where the vocabulary is kept the same but the syntax is altered (see Figure 2). The purpose of the second simulation is to examine the effect of syntactic similarities on the degree of knowledge transfer.

2 Related Work

Early neural network research in language learning (Hanson & Kegl, 1987) involved training networks to encode English sentences drawn randomly from a large text corpus (the Brown corpus). The network was able to learn a large number of English sentences and provided evidence that abstract knowledge could emerge from the statistical properties of a representative population of sentences and a simple neural network learning rule. Other early research (Elman, 1990, 1991) showed that networks could also be used to abstract general knowledge from structured input sentences.
2 Reber (1969) showed that humans would significantly transfer in such a task; however, his symbol sets allowed subjects to use similarity as a basis for transfer as they were composed of contiguous letters from the alphabet. Recent reviews of the literature indicate that this type of transfer is common even across modalities (Redington & Chater, 1996).
Figure 2: A syntax transfer task. In this task, the vocabulary sets are the same in both the left and the right FSMs, but the structure of the FSMs (the starting states, the accepting states, the allowed inputs at each state, and the resultant state transitions) differs for each FSM. As in the vocabulary transfer task, the network's first task is to learn the FSM on the left-hand side to criterion and then transfer to the second FSM on the right-hand side of this figure. In a syntax transfer task, the network may make use of prior knowledge associated with input symbols where the same input causes the same transition in both of the FSMs.
In Elman's work, a neural network model with feedback connections, called the simple recurrent network (SRN), was trained with sentences generated from a context-free grammar. After the SRN was shown the symbols in a sentence sequentially (one symbol at a time), it could predict the set of possible symbols that could appear next in the sentence. For instance, after reading a singular noun, the network predicted that the set of next possible symbols included a wh- word (who in Elman's grammar) and singular form verbs, which in this case exhausted the possibilities in the grammar used for the simulation. In the 1990 study, Elman conducted a cluster analysis (see also Hanson & Burr, 1990) of the hidden-layer node activities to show that syntactic categories of input symbols as well as semantic categories were formed during learning. Elman (1991) studied more complex syntactic forms (embedded sentences), also constructed from artificial grammars. In these studies, he conducted a principal component analysis (PCA) of the hidden-layer activities and showed that trajectories of these activities in multiply embedded clauses were represented by self-similar movements in the state-space and that the locations of these self-similar loops were slightly displaced from one another. These analyses indicated that the network appeared to learn an approximation to a phrase structure grammar. Elman's work and other early language-learning research lay the foundation for our research, in which we examine the productivity of these types of learned structures in supporting language acquisition and transfer. This article focuses on analyzing the effect of changing the vocabulary while keeping the syntax constant (vocabulary transfer) and the effect of changing the syntax while keeping the vocabulary constant (syntax transfer). We introduce a method of analysis that optimally searches for structure in the hidden space that is most consistent with how the representation of a grammar might be processed. Dienes, Altmann, and Gao (1999) showed that an SRN augmented with an additional hidden layer acquires abstract knowledge that is independent of input symbols.
During the training phase in their simulations, the network was given sentences of inputs in one vocabulary generated by an FSM and was trained on the symbol prediction paradigm as in Elman's work. During the test period, they presented the network with sentences that had the same structure but used a different vocabulary. During this period, some of the weights in the network were not modified, a strategy often called weight freezing. They showed that the network could learn sentences composed of symbols from the second vocabulary during the test phase significantly better if the network had been trained with the first vocabulary using the same syntactic rules, compared with the case where the network was trained with sentences that were generated by different syntactic rules. Thus, the network was shown to be able to extract abstract knowledge in one vocabulary and could transfer such knowledge to a new vocabulary (i.e., an independent group of input symbols whose mapping to the old ones was not known a priori). They showed that the network had many previously reported characteristics of human learners: it was able to transfer grammatical knowledge of a finite-state grammar to a new vocabulary; it showed sensitivity to both the grammaticality of the test sentences and the similarity between the test sentences and the training sentences; it learned better when trained on whole sentences than when trained on bigrams (when equated for the number of symbols); and it was able to transfer even if the sentences did not include repetition of a symbol within a sentence that creates noticeable patterns.3

Although Dienes et al. (1999) compared the behavior of their network with human data, they did not analyze how the abstract knowledge was represented in the network after training. We believe that one of the important goals of connectionist modeling (see Hanson & Burr, 1990) should be to examine the representational and mechanistic aspects of network processing postlearning. This should be concomitant with showing that the network has behavioral patterns similar to human performance in the same task. If we can find correlates of grammatical knowledge in terms of grammatical features, parsing states, and so on as the result of mathematical analysis of the network weights or activity, we can say that grammatical knowledge emerged in the network and think of mathematical entities (such as the patterns of weights in the networks and attractors4 associated with states in an FSM in the hidden unit activity space) as physical tokens that emerged through learning.

3 For instance, if repetitions of symbols are allowed, BDCCCB may be recognized as analogous to MXVVVM (Dienes et al., 1999).

4 Although attractors are often used to describe continuous time systems, the same notion can also be applied to discrete time systems. In the current network, attractors are defined as follows: for each point in the hidden state space at time t, there is a new position in the same space at t + 1 for each input symbol. Hence, there is a vector field, or a phase-space, for each input symbol (for instance, there is a phase-space for symbol A, another phase-space for symbol B, and so on). Attractors are the stable points in the phase-space that movements from surrounding areas are drawn toward. Note that phase-spaces can also be defined for sequences of symbols. For instance, if an input sequence ABC causes the hidden-layer activity to return to the original position (before the input ABC) in the state-space, such a cyclic movement through three different phase-spaces (corresponding to A, B, and C) reduces to a point—possibly an attractor—in the phase-space defined for the input sequence ABC.
Another possible disparity between network and human language processing is that Dienes et al. (1999) required freezing of some of the weights so that the network's existing vocabulary knowledge would not suffer interference from learning a new vocabulary. It is unclear whether such “weight freezing” is at work in humans at some cognitive level, nor is it clear that there is some physiological sense in which weight freezing might be implemented in a biological system. Given that adults show the vocabulary transfer effects, weight freezing cannot be attributed to a developmental factor. One could hypothesize a mechanism that controls the rate of learning in humans, but if a model can account for the transfer effect without such a mechanism, that model would be preferred because of the simplicity of the explanatory theory it embodies. It should also be noted that our network has one less hidden layer than Dienes et al.'s (1999), lacking what they call an encoding layer. Such a model would further be attractive because it could also be used to discover new mechanisms that might have a sensible biological plausibility. In this article, we describe a series of experiments with a recurrent neural network that creates abstract structure that is context sensitive, hierarchical, and extensible, without any special conditions such as freezing weights or otherwise creating any reference to past knowledge acquisition. Also, unlike Dienes et al.'s (1999) work, the focus of this article is on the analysis of the representation of both the resultant abstract knowledge and symbol-dependent knowledge. It is well known that neural networks can induce the structure of an FSM from exposure to strings drawn from that FSM (Cleeremans, 1993; Giles, Horne, & Lin, 1995). In fact, it has been shown that if a neural network can be reliably trained to recognize strings generated from an FSM (under simple perturbations due to nothing more than starting point variation, for example, from noise in data set order or gaussian noise injected into the weights; Hanson, 1990), then the underlying state-space partitions of the neural network have no choice but to be in a one-to-one correspondence with the states of the minimal FSM. This result is consistent with a mechanism containing the minimal number of internal states that correctly accepts only the sentences created by the FSM (theorem 3.1 in Casey, 1996). Also, all cycles in the minimal FSM induce attractors in the hidden unit state-space of the recurrent network (theorem 3.3, Casey, 1996). These surprising theorems seem to suggest that learning an FSM may provide the opportunity for a neural network to factor out the underlying rules of a state machine in its attractor space from the actual symbols that cause the state transitions. Without Casey's theorems, it is possible to imagine that one sort of representation
that a neural network could learn about an FSM would involve the input encoding of the symbols being bound with the state representation in the hidden unit attractor space. This type of representation of an FSM would be very limited in its productivity or ability to transfer to other symbol sets or FSMs, since each state would depend on a specific input symbol or set of symbols. Casey's theorems inspire us to investigate a network's general ability to transfer between surface symbols of the same FSM. We investigate this possibility in the following studies. In all of the connectionist modeling works cited above (Elman, 1990, 1991; Cleeremans, 1993; Giles et al., 1995; Dienes et al., 1999), the networks have feedback connections, such as connections from a hidden layer to an input layer. One obvious question that arises is whether a network memory created by recurrent connections is essential not only for accomplishing the task but for extracting abstract knowledge. Networks without feedback connections or slow decays do not have the means to maintain a short-term memory and thus merely map inputs to outputs. Feedback connections make the past history of the network available at the present time. Our initial studies in this area, which investigated similar tasks, including the Penzias problem, a simple counting type of FSM task learned by a feedforward network (Denker et al., 1987), implied that feedback connections could be important for extracting and transferring abstract knowledge. In this task, the network was trained to count the number of clusters of a particular symbol. For instance, if the network was trained to count clusters of 1s, the output corresponding to 11111 was 1, the output corresponding to 01011 was 2, the output corresponding to 10101 was 3, and so on. We trained both feedforward and recurrent networks on the Penzias task. A permutation task was defined in a similar way to the vocabulary transfer task; that is, a network that had been trained with 0s and 1s was then trained with As and Bs, for example. The recurrent network showed transfer effects (savings in the number of training trials required to reach a performance criterion), but the feedforward network showed only interference effects (an increase in the number of training trials required to reach a performance criterion), even with significant increases in the capacity of the network. In the Penzias task, the best (and most transferable) strategy is to find 0-to-1 transitions; the position of each transition is irrelevant. Such a strategy can be achieved by a memory that keeps track of whether the network has most recently seen a series of one or more 0 symbols or a series of one or more 1 symbols. Context dependence could also be accomplished in a feedforward network with shared weights, in which the weight pattern that is used to represent left and right context dependence at one position is copied to weight patterns that are used in other positions. However, in shared weight networks, all inputs (or at least the past several inputs) have to be presented to the network simultaneously, which is just a simple form of memory. Context dependence is also necessary for the efficient processing of regular grammars, which would be made possible by a network memory
that keeps track of the current FSM state. At least the results from our preliminary studies, including the Penzias task, suggest that memory in the network is an important component of the ability of a neural network to transfer its abstract knowledge. Note that the transfer of knowledge between tasks is possible without feedback connections if inputs are not completely novel. It has been shown that learning in a new task is significantly accelerated by transferring weights in neural networks without feedback connections if the same input domain is used (Pratt, Mostow, & Kamm, 1991; Pratt, 1993). Analogy is another process that is closely related to knowledge abstraction and transfer. Once enough evidence is accumulated to infer mappings from the old domain to the new domain, abstract knowledge in the old domain can be freely applied in the new domain. Most neural networks that have been used for analogical processing are embedded in symbol processing systems with complex predefined architectures and typically perform simple constraint satisfaction between the source and target domains (Thagard & Verbeurgt, 1998; Burns, Hummel, & Holyoak, 1993). In this article, a simple neural network with recurrent connections from the hidden layer to the input is examined for its ability to abstract knowledge, and the behavior of the hidden layer is analyzed in an effort to understand how the abstract knowledge is acquired. This type of network provides a simpler basis for studying the network's complexity and representational bias than the highly structured networks of previous works.

3 Simulations

3.1 The Training Paradigm. In this work, we train a recurrent neural network (RNN; see Figure 3) on symbol sequences generated from FSMs. The network is trained with randomly generated grammatical sentences until performance of the network meets a learning criterion.5 Each symbol in the sentences is presented to the network sequentially.
5 Specifically, after each training sentence, the network was tested with 1,000 randomly generated sentences. The criterion for termination of training was the correct prediction of all allowable end points in these sentences. A margin of error was chosen in order to reduce false positives. Two thresholds were used to bound network responses that were either too low to be certain of an end point or too high to be certain of a nonendpoint; falling within the margin was classified as an error. The thresholds were initialized to 0.20 (high-value threshold) and 0.17 (low-value threshold) and were adapted at the end of each test phase using output values generated while the network processed the test sentences. The high threshold was modified to the minimum value yielded at the actual end of the sentence, minus a margin (0.01). Note that actual sentence end points rather than potential end points were used; explicitly providing the network with information concerning potential end points was seen as providing too much information about the FSM in the external error signal. These modified thresholds were used for the next training and test sentences.
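A minimal sketch of the sentence-generation scheme described in section 3.1 and this footnote, reusing the hypothetical transition table from the sketch after Figure 1. The text specifies equiprobable transitions out of each state and a 20-symbol length cutoff; treating the option of ending a sentence at an accepting state as one more equiprobable choice is our assumption.

```python
import random

def generate_sentence(fsm, accepting, start=1, max_len=20):
    """Random walk over the FSM arcs, choosing among the transitions
    leaving each state with equal probability. At an accepting state the
    walk may also stop (treated here as one extra equiprobable option,
    an assumption). Walks longer than max_len are discarded (None).
    Assumes every state has at least one outgoing arc, as in these grammars."""
    state, sentence = start, []
    while len(sentence) <= max_len:
        options = [(sym, nxt) for (s, sym), nxt in fsm.items() if s == state]
        if state in accepting and random.randrange(len(options) + 1) == 0:
            return sentence               # sentence ends at an accepting state
        symbol, state = random.choice(options)
        sentence.append(symbol)
    return None                           # longer than 20 symbols: filtered out
```

Because shorter walks terminate more often, this scheme reproduces the skew toward short sentences noted in footnote 6 below.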
Figure 3: The recurrent network architecture used in the simulations. The feedback-layer activity is a copy of the hidden-layer activity at a previous time step and thus provides a simple form of memory. Weights from the hidden layer to the feedback layer are one to one and are of a fixed strength. All other weights are subject to adaptation during both source and target grammar and vocabulary learning.
Each symbol is represented as an activation value of 1.0 on a unique node in the input layer, while all other node activations are set to 0.0. The task for the network is to predict whether the sentence can end with the current word (in which case the output layer node activation was trained to become 0.85) or not (output should be 0.15). Note that when the FSM is at an accepting state, a sentence can either end or continue. Therefore, at this state, the network is sometimes taught to predict the end of a sentence and sometimes not. However, the network eventually learns to yield higher output node activation when the sentence can end, as will be shown. Input sentences are generated by an FSM, with equal probabilities of state transitions emanating from each state. Sentences longer than 20 symbols are filtered out.6 Because of this limitation in the input sentence length, we do not claim that the network has acquired full grammatical competence. Rather, we are interested in the formation of a representation of the grammatical knowledge that faithfully corresponds to the FSM through limited exposure to inputs.

6 Because of this generation scheme, sentences with shorter lengths are generated more often than longer ones during training and test. However, because the network is tested with 1,000 sentences, it is in practice impossible for the network to pass the test phase by being tested on only relatively short sentences, given the distribution of sentence lengths sampled.
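The end-point decision described in footnote 5 can be sketched as a three-way rule on the single output node; the threshold values shown are the initial ones given there, before adaptation.

```python
def classify_endpoint(output, high=0.20, low=0.17):
    """Three-way decision on the single output node (footnote 5): outputs
    at or above the high threshold predict an end point, outputs at or
    below the low threshold predict a non-end point, and anything falling
    within the margin between them is counted as an error."""
    if output >= high:
        return 'end'
    if output <= low:
        return 'continue'
    return 'error'        # falls within the margin
```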
3.2 The Network. The network architecture (see Figure 3) embodies the basic mechanism of any FSM: the combination of the current input (the input layer) and the current state (the feedback layer) determines the current output (the output layer) and the next state (the hidden layer). The network has an input layer with 13 nodes (three symbols times four vocabulary sets, plus the START symbol), an output layer with one node, one hidden layer with four nodes, and a feedback layer with four nodes. The network architecture is a special case of a second-order recurrent neural network (Giles et al., 1992) in which the output layer is not connected to the feedback layer. This modification enables us to analyze hidden-layer activities independent of the desired output. Both the output layer and the hidden layer receive first-order and second-order connections. Each first-order connection connects one source node to one destination node; the weight is multiplied by the source node activity before it is added to the destination node potential. In contrast, a second-order connection connects two source nodes (one node in the input layer and another in the feedback layer in the current network) to one destination node; the weight is multiplied by the product of the two source node activities before it is added to the destination node potential. (See the appendix for a complete set of activation and training equations.) At the beginning of each sentence, the hidden- and feedback-layer activities are reset to zero, and the inputs are processed one symbol at a time, beginning with the START symbol. The weights are modified using a standard learning algorithm developed by Williams and Zipser (1989), extended to second-order connections (Giles et al., 1992). The learning rate was set to 0.2 and the momentum coefficient to 0.8. Except for the “copy-back” weights from the hidden layer to the feedback layer, all weights in the RNN were candidates for adaptation; no weights were fixed prior to or during learning.

3.3 Simulation 1: The Vocabulary Transfer Task. In the first task, new symbols were cyclically assigned to the arcs in the FSM. The network was trained on three regular grammars (the source grammars) that had the same syntactic structure defined on three unique sets of symbols ({A,B,C}, {D,E,F}, and {G,H,I}), and the effect of this prior training was measured as the network was trained with yet another new set of symbols ({J,K,L}). This task explicitly examined the ability of the network to transfer to novel, unknown symbol sets given prior training on the same source grammar. The method of training and the stopping criteria are exactly as described in section 3.1. One indicator of the vocabulary transfer effect is the savings in the number of trials needed to meet the completion criterion. In the simulation, the three source grammars were used for three cycles, resulting in nine training sessions (or eight vocabulary switchings) before the sentences from the target grammar were presented. Figure 4 shows the number of trials for both the source grammar training (vocabulary switching = 0, 1, . . . , 8 in the figure) and the target grammar training (vocabulary switching = 9), averaged over 20 networks with different initial random weights.
Figure 4: The learning savings from subsequent relearnings of the vocabularies. Each data point represents the average of 20 networks trained to criterion on the same grammar. The relearning cycle shows an immediate transfer to the novel symbol set, which continues to improve to near-perfect transfer through the eighth vocabulary switch (the ninth training session: three cycles through the three symbol sets), until the ninth switch, where a completely novel symbol set is used with the same grammar. Over 60% of the original learning on the grammar, independent of symbol set, is saved.
The error bars show standard errors. The result of eight vocabulary switchings (50,000 sentences) is a complete accommodation of the new symbol sets, with 98% savings (as shown in the eighth cycle of Figure 4, where the number of trials for learning has dropped to 2% of the number of trials required for the first learning of the source FSM). This accommodation represents the network's ability to create a data structure that is consistent with a number of independent vocabularies. Although this may not seem surprising, it does mean that there was no catastrophic forgetting (McCloskey & Cohen, 1989). We think this is worth pointing out, considering that Dienes et al. (1999), who employed local encoding as we did, needed to freeze some weights in order to prevent catastrophic forgetting in the vocabulary transfer task (which they call a domain transfer task). One difference between our network and theirs is that our network has second-order connections in addition to first-order connections, whereas Dienes et al.'s network has only first-order connections.
This difference in the architecture may have an impact on the generalization among vocabularies and the forgetting of old vocabularies. Every second-order connection to the output layer and to the hidden layer is associated with one input-layer node. This means that while the network is trained on one vocabulary, second-order connections that are associated with other sets of vocabularies participate in neither the computation of activities nor the modification of weights. Thus, second-order weights do not contribute to generalization, but they resist forgetting. In contrast, first-order weights from the feedback layer are independent of input-layer activities and thus contribute to generalization but are prone to forgetting. We postulate that first-order weights are more protected from forgetting by having second-order weights, because during the learning of a new vocabulary, second-order weights can be modified in a way that is specific to the combination of the input symbol and the current hidden-layer state. Thus, on average, second-order weights, having a higher degree of freedom, accommodate much of the input- and state-dependent change. By contrast, first-order weights, which are affected by all input words in all vocabularies, undergo less vocabulary-specific change overall, although they are not perfectly protected from modification. More critically, however, there was a 63% reduction in the amount of required training for the new unseen vocabulary (again as compared to the first learning of the source FSM). At first glance, this savings may not look impressive, given that humans are known to be able to infer grammatical features of unknown symbols in a sentence even without learning (Berko, 1958). However, it is unclear how well humans do if all symbols in a sentence are novel. We are unaware of an experiment that demonstrated one-shot or very quick learning in language,7 but in an artificial grammar learning experiment reported by Brooks and Vokey (1991), subjects were able to label only 64% of new grammatical sentences as grammatical even when the vocabulary was known and the new sentence was similar to training sentences. For sentences that had the same syntactic structures as the training set but consisted of unknown symbols, they labeled only 59% of grammatical sentences as grammatical, showing that the learning was far from instantaneous. Apparently, the RNN is able to partially factor the transition rules from the constituent symbol assignments after mere exposure to a diversity of vocabularies. How is the neural network accomplishing these abstractions? Note that in the vocabulary transfer task, the network has no possible way to transfer based on the external symbol set. It follows that the network must abstract away from the input encoding.

7 Marcus, Vijayan, Bandi Rao, and Vishton's (1999) experiment, in which seven-month-old infants could transfer phonological rules to novel sounds, is pertinent to this problem. However, the infants may have been attending to phonological features that were not novel (Negishi, 1999).
In effect, the network must find a way to recode the input in order to defer binding of variables in rules to symbols until enough symbol sequences have been observed (and simply continue to produce errors on what appear at the surface to be new sentences). If the network extracted the common syntactic structure, the hidden-layer activities would be expected to represent the corresponding FSM states regardless of vocabulary. This is in fact shown by linear discriminant analysis (LDA).8 After the network learned the first vocabulary, the activity of the hidden layer was sensitive to FSM states (see Figure 5a). In the figure, the values of two linear discriminants after each input symbol presentation during the test phase are plotted. Different FSM states are plotted with different colors. The figure shows a clear separation among colors, indicating that the state-space is organized in terms of the FSM states. We observed that cyclic trajectories through these clusters represent attractors in the state-space. If two transitions corresponding to the same input symbol start at nearby points in the state-space within one cluster that corresponds to the current FSM state, the ending points after the transitions are also close to each other within a cluster that corresponds to the next FSM state. Furthermore, we observed that even if a transition starts at the periphery of a cluster (a state far from the center of the cluster), the state at the end of the transition will be closer to the center of the new cluster (these cases are not shown). Hence, this space possesses context sensitivity in that coordinate positions encode both state and trajectory information.9 After each of the three vocabularies was learned in three cycles, LDA of the hidden-layer node activities with respect to FSM states (see Figure 5b; colors signify FSM states and shapes signify vocabularies) was contrasted with LDA with respect to vocabulary sets (see Figure 6a; colors signify vocabularies and shapes signify FSM states). In Figures 5a, 5b, 6a, and 6b, LDA was applied to the hidden unit activations of one network to demonstrate the development of the phase-space structure. To show that such phase-space organization was not formed by chance, LDA was applied to 20 networks, and a measure of separability of the state-space with respect to FSM states was computed.
8 LDA of the hidden unit states does a complete search of linear projections that are optimally consistent with organizations based on FSM states or on vocabulary. Like PCA, Fisher's LDA extracts the first linear combination of variables (the first discriminant function) that maximizes the ratio of between-group to within-group variance (between and within states or vocabularies in this case) for all groups, then extracts the second combination with the next largest ratio, and so on. Evidence from the LDA for state representations would indicate that the RNN found a solution to the multiple vocabularies by referencing them hierarchically within each state, based on context sensitivity within each vocabulary cluster. The first two linear discriminants accounted for the bulk of the variance (more than 90%) and are used for reprojection of sample sentence hidden unit activity patterns.

9 Here the phrase context sensitivity is used as in dynamical systems theory and not as in linguistics—that is, to denote the effect of the attractor structure in a dynamical system, and not the effect of environments on a local syntactic structure.
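As a sketch of the analysis described in footnote 8, using scikit-learn's discriminant analysis in place of a hand-rolled Fisher projection; `hidden` is a hypothetical array of hidden-layer activity vectors collected after each input symbol during the test phase, and `states` and `vocabs` are the corresponding FSM-state and vocabulary labels (names assumed for illustration).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_separability(hidden, labels):
    """Project hidden activities onto the first two linear discriminants
    and report classification accuracy. With three classes, LDA has at
    most two discriminants, so the classifier's accuracy coincides with
    the two-discriminant separability measure used in the text."""
    lda = LinearDiscriminantAnalysis(n_components=2)
    projection = lda.fit_transform(hidden, labels)   # for plots like Figure 5
    return projection, lda.score(hidden, labels)

# hidden: (n_samples, 4) activities; states, vocabs: (n_samples,) labels.
# proj_s, acc_s = lda_separability(hidden, states)   # ~0.91 reported in text
# proj_v, acc_v = lda_separability(hidden, vocabs)   # ~0.73 reported in text
```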
Figure 5: LDA of hidden-layer activities. (a) LDA of hidden activities of a network that learned a single FSM-symbol set. The colors code state (1: red, 2: blue, 3: green), and the + sign codes for the single symbol set. It can be seen that FSM states are pairwise separable in the hidden-layer activity space. (b) LDA of the hidden units with respect to states after the RNN has learned three FSMs with three different symbol sets using state as the discriminant variable. Notice how the hidden unit space spreads out compared to a. The space is organized by clusters corresponding to states (coded by color: red = 1, blue = 2, green = 3), which are internally differentiated by symbol sets (+ = ABC, triangle = DEF, square = GHI).
Figure 6: LDA with respect to vocabularies and LDA after transfer to a new symbol set. (a) LDA of hidden unit activities with respect to vocabularies after the RNN has learned three FSMs and three different symbol sets. The LDA used the symbol set as the discriminant. The color codes the vocabularies (red: ABC, blue: DEF, green: GHI), and the shape codes the FSM states (+: 1, triangle: 2, square: 3). In this case, the discriminant function based on the symbol set produces no simple spatial classification compared to Figure 5b, which shows the same activations classified by the state of the FSM. (b) LDA of the hidden state-space after training on three independent symbol sets for three cycles and then transfer to a new untrained symbol set (coded by stars, circled for clarity). Note how the new symbol set slides into the gaps between symbol sets previously learned (see Figure 5b, which is essentially identical except for the new symbol set). Apparently, this provides initial context sensitivity for the new symbol set, creating the savings of over 60%.
One measure of such separability is the correct rate of discrimination using linear discriminants. One linear discriminant divides the state-space into two regions: one in which the linear discriminant is positive and another in which it is negative. Two linear discriminants divide the phase-space into four regions. Because there are only three FSM states, if the FSM states are completely (pairwise) linearly separable in the phase-space, two linear discriminants with respect to the FSM states should be sufficient to discriminate all FSM states correctly. Similarly, two linear discriminants with respect to vocabularies should be sufficient to discriminate the three vocabularies correctly. The correct rate of discrimination after eight vocabulary switchings clearly shows that the state-space is organized by FSM states rather than by vocabulary sets: FSM states could be correctly classified by the two linear discriminants with respect to the FSM states with an accuracy of 91% (SD = 7%, n = 20), whereas the vocabulary set could be correctly classified only 73% of the time (SD = 10%, n = 20). Note that LDA takes the number of classes into account when determining accuracy, although here there was an equal number of classes for vocabulary and state. This means that the hidden layer learned a mapping from the input and the current state-space, which is not pairwise separable by states, into the next state-space, which is pairwise separable by states, greatly facilitating the state transition and output computation at the next input. Notice in both Figures 5b and 6b, relative to 5a, that the symbol sets have spread out and occupy more of the hidden unit space, with significant gaps between clusters of the same and different symbol sets. Moreover, from Figure 5b, one can also see that in each state (coded by the colors), each vocabulary (coded by the plotting symbols) clusters together. Although vocabularies are not always pairwise separable in this plot, the separation is remarkable considering that the linear discriminants are optimized solely to discriminate the states. We can test the vocabulary separability within each state directly by doing an LDA within each state using vocabulary as the discriminant variable. In this case, vocabularies were correctly discriminated with an average accuracy of 95% (SD = 7%, n = 20). This is strong evidence that the hidden-layer activity states are hierarchically organized into FSM states and vocabularies. This hierarchical structure allows for the accommodation of the already-learned vocabularies and any new ones the RNN is asked to learn. LDA after the test vocabulary is learned once also shows that the network state is predominantly organized by FSM states (see Figure 6b), although the linear separation by FSM states of a small fraction of activities is compromised, as can be seen at the boundary of the red crosses, blue stars, and triangles in the figure. This interference by the new vocabulary is not surprising considering that the old vocabularies were not relearned after the new vocabulary was learned. What is more interesting is the spatial location of the new vocabulary (“stars”). The hidden unit activity clearly shows that the state discriminant structure is dominant and organizes the symbol sets.
It seems that the fourth vocabulary simply fills empty spots in the hidden unit space, coding it in a position relative to the existing state structure. This can be seen by comparison to Figure 5b, which is almost identical to the linear projections that were found without the new symbol set. This retention of the projections of the old space is not surprising, since after the extensive prior vocabulary training, the configuration should change little upon exposure to the new symbol set. Apparently, the precedence of the existing abstraction encourages use of the hierarchical state representation and its existing context sensitivity. The network seems to bootstrap from existing nearby vocabularies in a way that allows generalization to the same FSM that all symbol sets are using. Several factors may make such bootstrapping possible. First, initial connection weights from novel symbols are small random weights, creating quite different patterns of activation in the hidden layer than old symbols do through the more articulated learned weights (weight strengths with a wider dynamic range). Second, the probability distribution of FSM states is the same regardless of the vocabulary set, and this information is reflected in the first-order weights. Third, the gradient-descent algorithm finds the most effective modification of the weights for performing the prediction task. For these reasons, we can view what the network is doing as a kind of analogical learning process (e.g., D is to E as A is to B).

3.4 Simulation 2: The Syntax Transfer Task. Humans can apply abstract knowledge (such as the structure of symbol sequences) to solving problems even if the abstract structure required by the task differs slightly from the acquired one (in playing card games, for example, knowledge of strategy in one game often transfers to another). Does a neural network have the same flexibility? In the second simulation, we carried out a syntax transfer task in which the target syntactic structure was changed from the acquired ones while the vocabulary was kept constant. In the first syntax transfer task, only the initial and the final states in the FSM were changed, so the subsequence distributions are almost the same (see Figure 7a). In this case, 47% and 49% savings were observed in the two transfer directions. In the second syntax transfer task, the directions of all arcs in the FSM were reversed (see Figure 7b); therefore, the mirror image of a sentence accepted in one grammar is accepted in the other. Although the grammars were very different, there was a significant amount of overlap in the permissible subsequences, and there were 19% and 25% savings in training. In the third and fourth syntax transfer tasks, the source and the target grammars share fewer subsequences. In the third case (see Figure 7c), the subsequences were very different because the source grammar has two one-state loops (at states 1 and 3) with the same symbol A, whereas the two one-state loops in the target grammar consist of different symbols (A and B). Also, the two-state loops (states 2 and 3) consist of different symbols (BCBC··· in the source grammar and CCCC··· in the target grammar).
Figure 7: Syntax transfer results as a function of source-to-target difference in grammar. Numbers at the tails of the horizontal arrows signify the number of training sentences required for the source grammar, and numbers at the heads of the horizontal arrows signify the number of training sentences required for the target grammar. For instance, the left grammar in (a) required 20,456 training sentences when it was the source grammar and 10,760 training sentences when it was the target grammar. Thus, the effect of learning the grammar on the right-hand side resulted in a 47.4% reduction of required training sentences. Numbers in the parentheses are standard errors (N = 20).
In this case, there was an 18% reduction in the number of trials required in one transfer direction but a 29% increase in the other direction. In the fourth case (see Figure 7d), one grammar includes two one-state loops with the same symbol (AAA··· in states 1 and 3), whereas in the other grammar, they consist of different symbols (AAA··· in state 1 and BBB··· in state 3). In this case, there were 13% and 14% increases in the number of trials required. The fact that there is interference (an increase in the number of required training trials) as well as transfer indicates that finding a correct mapping is a hard problem for the network. From these observations, we speculate that if the acquired grammar allows many subsequences of symbols that are also allowed by the target grammar, the transfer is easier, and therefore there will be more savings.10 The simulation results are consistent with the finding in human artificial grammar learning that the transfer effect persists even when the new strings violate the syntactic rules slightly (Brooks & Vokey, 1991).

4 Conclusion

It has been shown that the amount of training needed to learn the end prediction task on a new grammar is reduced in the vocabulary transfer paradigm as well as in the syntax transfer paradigm (given that the source and the target grammars were similar). We have shown that previous experience with examples drawn from rules (governing an FSM) can shape and speed the acquisition of new, though similar, rules in simple associationist neural networks. The linear discriminant analysis of the hidden-layer activities showed that the activity space was hierarchically organized in terms of FSM states and vocabularies. The trajectories in the state-space showed context sensitivity, and the reorganization of the state-space showed a simple type of analogical process that would support the vocabulary transfer process. It has been argued that unstructured (minimal bias) neural networks with general learning mechanisms are incapable of representing, processing, and generalizing symbolic information (Pinker, 1994; Fodor & Pylyshyn, 1988; Marcus et al., 1999). Evolutionary psychologists argue that humans have an innate symbol processing mechanism that has been shaped by evolution. Pinker, for one, argues that there must be two distinct mechanisms for language, an associative mechanism and a rule-based mechanism, the latter being equipped with a nonassociative learning mechanism or arising from a genetic basis (Pinker, 1991). The other alternative, as demonstrated by our experiments, is that neural networks incorporating associative mechanisms can be both sensitive to the statistical substrate of the world and able to exploit data structures that have the property of following a deterministic rule.
10 To confirm this conjecture, we sought a measure of the overlap of subsequences between the source and the target grammar. Our efforts to produce such a numeric measure have met with only moderate success (see Negishi & Hanson, 2001, for more details).
As demonstrated, these new data structures can arise even when that explicit rule has been expressed only implicitly by examples and learned by mere exposure to regularities in the data.

Appendix: Network Equations

A.1 Output-Layer Node Activation.

$$X_i^O(t) = f\bigl(P_i^O(t)\bigr) = f\Bigl(\sum_{jk} w_{ijk}^O \, X_j^S(t-1)\, I_k(t)\Bigr)$$

In the equations, $t$ is a time step that indicates the number of input words that have been seen; for instance, $t = 1$ when the network is processing the first word. In the equation above, the product of a feedback-layer node $j$ activity (which is the state hidden-layer node $j$ activity at the previous time step), $X_j^S(t-1)$, and an input-layer node $k$ activity, $I_k(t)$, is weighted by a second-order weight $w_{ijk}^O$ and added to the output-layer node $i$ potential $P_i^O(t)$. A transfer function $f(\cdot)$ is applied to the potential and yields the output-layer node $i$ activity $X_i^O(t)$. In the current network, there is only one output-layer node ($i = 1$). $X_0^S(t-1)$ and $I_0(t)$ are special constant nodes whose output values are always 1.0. As a result, $w_{i0k}^O$ serves as a first-order connection from input node $k$ to output-layer node $i$, and $w_{ij0}^O$ serves as a first-order connection from feedback-layer node $j$ to output-layer node $i$. The weight $w_{i00}^O$ serves as a bias term for output-layer node $i$.
A.2 State Hidden-Layer Node Activation.

$$X_i^S(t) = f\bigl(P_i^S(t)\bigr) = f\Bigl(\sum_{jk} w_{ijk}^S \, X_j^S(t-1)\, I_k(t)\Bigr)$$

The product of a feedback-layer node $j$ activity, $X_j^S(t-1)$, and an input-layer node $k$ activity, $I_k(t)$, is weighted by a second-order weight $w_{ijk}^S$ and added to the state hidden-layer node $i$ potential $P_i^S(t)$. The transfer function $f(\cdot)$ is applied to the potential and yields the state hidden-layer node $i$ activity $X_i^S(t)$. Because of the special definitions of $X_0^S(t-1)$ and $I_0(t)$ described above, $w_{i0k}^S$ serves as a first-order connection from input node $k$ to state hidden-layer node $i$, and $w_{ij0}^S$ serves as a first-order connection from feedback-layer node $j$ to state hidden-layer node $i$. The weight $w_{i00}^S$ serves as a bias term for state hidden-layer node $i$.
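As a sketch, the two activation equations can be written directly as tensor contractions. The array shapes follow section 3.2 (13 input nodes, 4 hidden and feedback nodes, 1 output node), with the constant nodes $X_0^S = I_0 = 1$ folded into index 0 of each source axis so that first-order weights and biases are carried in the same tensors; the logistic transfer function is an assumption, since the text does not name $f$.

```python
import numpy as np

def sigmoid(p):
    # assumed logistic transfer function f
    return 1.0 / (1.0 + np.exp(-p))

def forward(w_out, w_state, x_fb, inp):
    """One step of the second-order network.
    w_out:   (1, 5, 14) output weights w^O_ijk (index 0 of the j and k axes
             is the constant node, so first-order weights and biases ride along)
    w_state: (4, 5, 14) state hidden weights w^S_ijk
    x_fb:    (4,)  feedback-layer activity X^S(t-1)
    inp:     (13,) one-hot input vector I(t)
    Returns (X^O(t), X^S(t))."""
    xj = np.concatenate(([1.0], x_fb))    # X_0^S = 1 constant node
    ik = np.concatenate(([1.0], inp))     # I_0 = 1 constant node
    # P_i = sum_jk w_ijk X_j^S(t-1) I_k(t)
    p_out = np.einsum('ijk,j,k->i', w_out, xj, ik)
    p_state = np.einsum('ijk,j,k->i', w_state, xj, ik)
    return sigmoid(p_out), sigmoid(p_state)
```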
A.3 Output-Layer Weight Update.

$$\Delta w_{lmn}^O = -\alpha \sum_t \frac{\partial J(t)}{\partial w_{lmn}^O} = -\alpha \sum_t \sum_i E_i(t)\, \frac{\partial X_i^O(t)}{\partial w_{lmn}^O}$$

$$\frac{\partial X_i^O(t)}{\partial w_{lmn}^O} = \frac{\partial}{\partial w_{lmn}^O}\, f\Bigl(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Bigr) = \bar{f}\bigl(P_i^O(t)\bigr)\, \delta_{il}\, X_m^S(t-1)\, I_n(t)$$

The change in a second-order weight $w_{lmn}^O$ (from feedback-layer node $m$ and input-layer node $n$ to output-layer node $l$) is the negative of the learning constant $\alpha$ times the partial derivative of the cost function $J(t)$, summed over the whole sentence. The partial derivative is equal to the error (output value minus desired output) of the output-layer nodes, $E_i(t)$, times the partial derivatives of the outputs $X_i^O(t)$, summed over all outputs $i$ (in the current network, there is only one output node). The partial derivative of the output node activity is computed using the chain rule ($dx/dy = (dx/dz)(dz/dy)$). See the equations for the output-layer node activation for variable descriptions. In the last equation above, $\bar{f}(\cdot)$ denotes the derivative of $f(\cdot)$, and $\delta_{il}$ is a Kronecker delta whose value is one only if $i = l$ and zero otherwise.
A.4 State Hidden-Layer Weight Update.

$$\Delta w_{lmn}^S = -\alpha \sum_t \frac{\partial J(t)}{\partial w_{lmn}^S} = -\alpha \sum_t \sum_i E_i(t)\, \frac{\partial X_i^O(t)}{\partial w_{lmn}^S}$$

$$\frac{\partial X_i^O(t)}{\partial w_{lmn}^S} = \frac{\partial}{\partial w_{lmn}^S}\, f\Bigl(\sum_{jk} w_{ijk}^O\, X_j^S(t-1)\, I_k(t)\Bigr) = \bar{f}\bigl(P_i^O(t)\bigr) \sum_{jk} w_{ijk}^O\, I_k(t)\, \frac{\partial X_j^S(t-1)}{\partial w_{lmn}^S}$$

$$\frac{\partial X_i^S(t)}{\partial w_{lmn}^S} = \bar{f}\bigl(P_i^S(t)\bigr)\Bigl(\delta_{il}\, X_m^S(t-1)\, I_n(t) + \sum_{jk} w_{ijk}^S\, I_k(t)\, \frac{\partial X_j^S(t-1)}{\partial w_{lmn}^S}\Bigr)$$

The change in a second-order weight $w_{lmn}^S$ (from feedback-layer node $m$ and input-layer node $n$ to state hidden-layer node $l$) is the negative of the learning constant $\alpha$ times the partial derivative of the cost function $J(t)$, summed over the whole sentence. The partial derivative is equal to the error (output value minus desired output) of the output nodes, $E_i(t)$, times the partial derivatives of the output-layer node activities $X_i^O(t)$, this time with respect to a state hidden-layer weight, summed over all outputs $i$. The partial derivative of the output node activity is computed using the chain rule. In this case, we also need the partial derivatives of the state hidden-layer node activities $X_i^S(t)$, which are again computed using the chain rule. The time index $t - 1$ on the state hidden-layer activations $X_j^S(t-1)$ in the last two equations indicates that the partial derivative of a hidden-layer node activity with respect to a state hidden-layer weight is computed recursively. The initial value of the partial derivative (at the beginning of the sentence, $t = 0$) is taken to be zero.
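A sketch of A.3 and A.4 as code, under the same conventions as the forward-pass sketch above (logistic $f$, constant nodes folded into index 0). The sensitivity tensor $\partial X_j^S(t-1)/\partial w_{lmn}^S$ is carried across time steps and reset to zero at each sentence start; the momentum term mentioned in section 3.2 is omitted for brevity.

```python
import numpy as np

def rtrl_step(w_out, w_state, x_fb, inp, x_out, x_state, dxs_dw, target, alpha=0.2):
    """One step of the second-order real-time recurrent learning update.
    x_fb:   (4,) feedback activity X^S(t-1);  inp: (13,) one-hot input
    x_out:  (1,) output X^O(t);  x_state: (4,) new hidden state X^S(t)
    dxs_dw: (4, 4, 5, 14) sensitivities dX_j^S(t-1)/dw^S_lmn, carried forward
    Returns (dw_out, dw_state, new_dxs_dw); the deltas are accumulated over
    a whole sentence by the caller, per the sum over t in the equations."""
    xj = np.concatenate(([1.0], x_fb))           # constant node X_0^S = 1
    ik = np.concatenate(([1.0], inp))            # constant node I_0 = 1
    err = x_out - target                         # E_i(t)
    fbar_o = x_out * (1.0 - x_out)               # f'(P^O) for logistic f
    fbar_s = x_state * (1.0 - x_state)           # f'(P^S)

    # A.3: dX_i^O/dw^O_lmn = f'(P_i^O) delta_il X_m^S(t-1) I_n(t)
    dw_out = -alpha * np.einsum('i,i,m,n->imn', err, fbar_o, xj, ik)

    # A.4: dX_i^O/dw^S = f'(P_i^O) sum_jk w^O_ijk I_k dX_j^S(t-1)/dw^S
    # (the j = 0 constant node has zero derivative, hence the [:, 1:, :] slice)
    dxo_dws = fbar_o[:, None, None, None] * np.einsum(
        'ijk,k,jlmn->ilmn', w_out[:, 1:, :], ik, dxs_dw)
    dw_state = -alpha * np.einsum('i,ilmn->lmn', err, dxo_dws)

    # A.4 recursion: sensitivities carried forward to the next time step
    direct = np.einsum('il,m,n->ilmn', np.eye(4), xj, ik)
    carried = np.einsum('ijk,k,jlmn->ilmn', w_state[:, 1:, :], ik, dxs_dw)
    new_dxs_dw = fbar_s[:, None, None, None] * (direct + carried)
    return dw_out, dw_state, new_dxs_dw
```

Note that with a one-hot input, the contraction with `ik` zeroes the gradient of every second-order weight tied to an inactive input node, which is the mechanism behind the resistance to forgetting discussed in section 3.3.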
Acknowledgments

We thank Michael Casey for contributions to an earlier version of this work and Ben Martin Bly for comments and editing of earlier versions of this article. We also acknowledge an anonymous reviewer for reading our manuscript and providing many useful comments and edits, and Gary Cottrell for many useful discussions and corrections to the article.
References

Berko, J. (1958). The child's learning of English morphology. Word, 14, 150–177.
Brooks, L. R., & Vokey, J. R. (1991). Abstract analogies and abstracted grammars: Comments on Reber (1989) and Mathews et al. (1990). Journal of Experimental Psychology: General, 120, 316–323.
Burns, B. D., Hummel, J. E., & Holyoak, K. J. (1993). Establishing analogical mappings by synchronizing oscillators. In Proceedings of the Fourth Australian Conference on Neural Networks. Sydney, Australia.
Casey, M. (1996). The dynamics of discrete-time computation, with applications to recurrent neural networks and finite state machine extraction. Neural Computation, 8, 1135–1178.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Cleeremans, A. (1993). Mechanisms of implicit learning. Cambridge, MA: MIT Press.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., & Hopfield, J. J. (1987). Automatic learning, rule extraction and generalization. Complex Systems, 1, 877–922.
Dienes, Z., Altmann, G., & Gao, S.-J. (1999). Mapping across domains without feedback: A neural network model of transfer of implicit knowledge. Cognitive Science, 23, 53–82.
Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Elman, J. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195–225.
Elman, J., Bates, E., Johnson, M., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness. Cambridge, MA: MIT Press.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. In S. Pinker & J. Mehler (Eds.), Connections and symbols. Cambridge, MA: MIT Press.
Giles, C. L., Horne, B. G., & Lin, T. (1995). Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8, 1359–1365.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405.
Hanson, S. J. (1990). A stochastic version of the delta rule. Physica D, 42, 265–272.
Hanson, S. J., & Burr, D. (1990). What connectionist models learn: Learning and representation in connectionist models. Behavioral and Brain Sciences, 13, 471–518.
Hanson, S. J., & Kegl, J. (1987). PARSNIP: A connectionist network that learns natural language grammar from exposure to natural language sentences. In Proceedings of the Ninth Annual Conference on Cognitive Science. Seattle, WA.
Harnad, S. (1990). The symbol grounding problem. Physica D, 42, 335–346.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old infants. Science, 283, 77–80.
McCloskey, M., & Cohen, N. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165.
Miller, G. A., & Stein, M. (1963). Grammarama I: Preliminary studies and analysis of protocols (Tech. Rep. No. CS-2). Cambridge, MA: Harvard University.
Negishi, M. (1999). A comment on G. F. Marcus, S. Vijayan, S. Bandi Rao, & P. M. Vishton, “Rule learning by seven-month-old infants.” Science, 284, 435.
Negishi, M., & Hanson, S. J. (2001). A study of grammar transfer effects in a second order recurrent network. Proceedings of the International Joint Conference on Neural Networks 2001, 1, 326–330.
Pinker, S. (1991). Rules of language. Science, 253, 530–535.
Pinker, S. (1994). The language instinct. New York: Morrow.
Pinker, S. (1997). How the mind works. New York: Norton.
Pratt, L. Y. (1993). Discriminability-based transfer between neural networks. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 204–211). San Mateo, CA: Morgan Kaufmann.
Pratt, L. Y., Mostow, J., & Kamm, C. A. (1991). Direct transfer of learned information among neural networks. Proceedings of the American Association for Artificial Intelligence, 2, 584–589.
Reber, A. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Reber, A. (1969). Transfer of syntactic structure in synthetic languages. Journal of Experimental Psychology, 81, 115–119.
Redington, J., & Chater, N. (1996). Transfer in artificial grammar learning: A reevaluation. Journal of Experimental Psychology: General, 125, 123–138.
Thagard, P., & Verbeurgt, K. (1998). Coherence as constraint satisfaction. Cognitive Science, 22, 1–24.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received June 13, 2000; accepted March 27, 2002.