An AGI Alignment Drive
Dave Jilk & Seth Herd
December 30, 2017

Abstract

To improve the likelihood of Artificial General Intelligence aligning its behavior with human interests, it should be imbued with an innate, pre-conceptual learning signal that produces reward in response to positive interaction with humans.

Introduction

This article outlines an argument that humans developing Artificial General Intelligence (AGI) should build it to incorporate an innate, pre-conceptual learning signal that produces reward in response to positive interaction with humans. It builds on our previous theoretical work developed in two papers, Anthropomorphic Reasoning about Neuromorphic AGI Safety [1] and Conceptual-Linguistic Superintelligence [2]. For convenience, a brief overview of [2] is included in the Appendix.

The flow of the argument is as follows. To initiate and sustain an intelligence explosion, AGI must have and use a conceptual-linguistic faculty (CLF) with substantial functional similarity to the human faculty, including richly grounded representations of the world [2]. If the CLF has a strictly neuromorphic architecture, its conceptual representations will develop with and incorporate the influence of basic, pre-conceptual drives [1]. We will show here that, even in non-neuromorphic architectures for the CLF, its learned representations will reflect the influence of innate learning signals and mechanisms. The influences of those signals and mechanisms are deeply embedded and will tend to be robust, in contrast with motivating factors that develop or are added after conceptual representations have formed. Motivators added later will also tend to be strengthened or weakened to the extent that they cohere with drives resulting from innate learning signals and mechanisms. Innate learning signals must rely on environmental inputs unmediated by conceptual representations, so designing a signal to drive desired behavior is non-trivial. Seeking positive interactions with humans is one of the known basic drives of human beings, and most of the applicable behaviors are not innate; thus its implementation through an unmediated learning signal is known to be possible. AGI that develops its conceptual representations in the context of such a learning signal is likely to align its behavior broadly with human interests, since such behavior would tend to result in positive human interactions. Further, the resulting drive is likely to be important, or at least helpful, in the intellectual development of AGI, since it will need to begin with human knowledge [2]. Finally, perverse effects of the resulting drive can be mitigated because the AGI can grasp via the CLF the long-term consequences of its behavior. Thus we conclude that developers of AGI should include this innate learning signal.

The Importance of Innate Learning Signals in AGI

In this section, we will describe how an innate learning signal, for an initial or non-backward-compatible AGI with a CLF, will cause the AGI to develop a behavioral drive that is deeply embedded and robust, that cannot be removed directly, and that will tend to oppose application and learning of drives and signals that contradict it.

In [2] we showed that in an intelligence explosion, participant artificial intelligences must have and make use of a CLF with substantial functional similarity to the human system. We detailed this functional similarity as a system that relies on rich semantic contents with graded, statistical, and overlapping representations, bidirectional connectivity, and graded mutual activation among semantic contents and words in various combinations, among other capabilities. AGI must have these capabilities for two reasons: first, to be able to understand and use the vast corpus of human knowledge to create successors in the intelligence explosion; second, to be capable of resisting human efforts to terminate the intelligence explosion. A brief summary of the argument in [2] is provided in the Appendix.

We did not claim that the CLF is necessary to constitute AGI or superintelligence, only that it is necessary to sustain an intelligence explosion. However, an intelligence explosion is one of the most feared scenarios relating to AGI, and is sometimes asserted as inevitable due to convergent instrumental drives [4] [5]. Further, our thesis here is one of taking action to improve the likelihood of a satisfactory outcome of developing AGI, and is not exclusive of other actions. Consequently, this limitation does not cause a loss of generality for our purposes. Throughout, we use AGI to mean AGI that is capable of sustaining an intelligence explosion.

Nor did we claim that the CLF is the only, primary, or dominant means of cognition available to a participant artificial intelligence. However, we did establish that the CLF would need to be consistently active and providing inputs to the system as a whole, in order to avoid human interference in the intelligence explosion. Again, because the CLF will influence behavior even if it does not entirely control it, this limitation does not cause a loss of generality with respect to our thesis.

Finally, we did not claim in [2] that a satisfactory CLF must be "neuromorphic" in a strict sense (as detailed in [1]). Consequently, we cannot rely on "anthropomorphic reasoning" as in [1] to establish the role of basic drives or the safety benefits of a drive toward positive human interaction. We must establish those points independently here, relying only on the characteristics of a CLF.

An initial AGI, or any successor AGI with an architecture that is not backward-compatible with its predecessors (as discussed in [2]), must be a learning system. It must learn its CLF representations from interaction with the world (which may partially include training environments). If instead the rich, graded, and overlapping representations that ground all its knowledge were implanted a priori, the result would be highly unpredictable, for the same reason that we usually find such representations inscrutable in the extant systems that rely on them. If the system learns its representations, they can operate even if they are not understood.

A learning system needs learning signals to develop its representations. A learning signal is a stimulus that provides information to the system to guide modification of knowledge representations. In unsupervised learning, the signal is implicit in the normal input stimuli and is supported by some mechanism, such as "fire together, wire together" in Hebbian neural networks. In supervised and reinforcement learning, the learning signal is an additional stimulus that offers feedback on the system's responses to stimuli, again supported by a mechanism that interprets the signal with respect to the applicable stimuli. In general, whether the learning signal is implicit or explicit, the system's learning mechanisms must be responsive to it in some form, or the system will learn nothing from it.

Assuming that its behavior is tied to its knowledge representations, an agentic learning system will change its behavior as it learns. Over time, its behavior will be influenced by the interplay of its learning mechanisms and the learning signals that those mechanisms rely on. We can call these behavioral implications a drive, and if the learning signal is innate then we can reasonably refer to the result as an innate or basic drive.

In a learning CLF, all of the conceptual-linguistic representations will reflect and incorporate the effects of innate learning signals. As mentioned, learning mechanisms must be responsive to the learning signal, and an innate learning signal is present from the initial tabula rasa stage of the CLF. Further, since the representations in the CLF are graded, statistical, and overlapping, the way the influence of a learning signal is incorporated into those representations will also be graded, statistical, and overlapping, and therefore distributed throughout and implicit. It is also embedded in the conceptual structure itself, in the relationships among various words and their semantic contents. Due to this intricate embedding, the basic drive arising from an innate learning signal will not be represented in some simple closed form, or in a single place or slice of the representations. It will be a deeply ingrained and inextricable aspect of all the conceptual-linguistic representations and of the representational structure as a whole.

In contrast with a utility function or other simple, closed forms of implementing and representing drives, a motivational drive that develops via an innate learning signal and is represented throughout the conceptual structure cannot be modified easily or directly. Because it is implicit in and distributed throughout the AGI's knowledge representations, any attempt at direct change will have unpredictable effects. Instead, further learning would be necessary to modify such a drive in a controlled fashion. This means that the original drive is resistant to sudden change.
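As an illustration of the distinction between implicit and explicit learning signals, and of why the effects of an innate signal end up distributed rather than localized, the following sketch (ours, not drawn from [1] or [2]) contrasts a plain Hebbian update with a reward-modulated one in which the modulating scalar comes from a stand-in, unmediated sensor. The network size, the tanh nonlinearity, and the toy sensor rule are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learner: a single weight matrix stands in for learned representations.
n_inputs, n_features = 8, 4
W_unsup = rng.normal(scale=0.1, size=(n_features, n_inputs))
W_drive = rng.normal(scale=0.1, size=(n_features, n_inputs))

def hebbian_update(W, x, lr=0.01):
    """Implicit signal: co-active inputs and features are wired together;
    the only 'signal' is the statistics of the stimulus itself."""
    y = np.tanh(W @ x)
    W = W + lr * np.outer(y, x)
    return W / np.linalg.norm(W, axis=1, keepdims=True)  # keep weights bounded

def reward_modulated_update(W, x, signal, lr=0.01):
    """Explicit, unmediated signal: a scalar computed outside the learned
    representations gates the same co-activation statistics (a three-factor rule)."""
    y = np.tanh(W @ x)
    W = W + lr * signal * np.outer(y, x)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

for _ in range(1000):
    x = rng.random(n_inputs)                            # raw stimulus
    innate_signal = 1.0 if x[:2].sum() > 1.0 else -0.2  # stand-in unmediated sensor
    W_unsup = hebbian_update(W_unsup, x)
    W_drive = reward_modulated_update(W_drive, x, innate_signal)
```

The structural point is that the modulating signal participates in every weight update from the first stimulus onward, so its influence is woven into whatever representations the learner ends up with; nothing in the final weights corresponds to a separable "drive module."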
Further, the overall structure of representations has a strong tendency toward stability, because changing it can cause unpredictable and often undesirable consequences for everything that has been previously learned. Thus even if the representations are adjusted superficially, the influences of the innate learning signal on the overall representational structure will tend to be retained.

Once conceptual representations have largely formed, a system with a CLF could be given motivating factors that are specified by language. In humans, a goal like "make money so you can buy more toys" is something that can be taught. That approach would also be possible in AGI, but in AGI it could also be accomplished more directly, by identifying the locus of the relevant conceptual representations and influencing them through some mechanism. Either way, over a long period of time such motivators might become embedded in the system's representations as habits, and could develop a certain amount of stability. However, because they are specified through language or mediated by concepts, a substantial portion of their representation will reside in purely linguistic connections. Thus such motivators are more susceptible to retraining. Further, they rely on the existing conceptual structure rather than being embedded in it, and are therefore more naturally separable from that structure.

Unmediated learning signals could also be added later. Humans experience this frequently as they age; for example, from the pain of an arthritic knee we learn to favor that knee. An AGI could have an even greater variety of such new learning signals. Like innate unmediated learning signals, these will be reflected through modification of the graded, statistical, and overlapping semantic contents, so their effects cannot be modified directly. However, their effect on representations will be related to the length of time the signal has been active, and because the overall knowledge structure is already developed and stable, these newer signals will tend not to be reflected in that overall structure. If later-added drives and learning signals oppose or contradict basic drives, some amount of behavioral confusion will result. Imagine, for example, if all food suddenly started smelling bad to you. Such new signals and drives will be learned more slowly than if they were coherent with the basic drives, because they will be opposed by the stable existing representations and their overall structure.

We can see from all this that an innate learning signal, for an initial or non-backward-compatible AGI with a CLF, will cause the AGI to develop a behavioral drive that is deeply embedded and robust to drift, that cannot be removed directly, and that will tend to oppose application and learning of drives and signals that contradict it. Because the drive arises from representations in the CLF, which the AGI must keep active in an intelligence explosion, it will consistently influence the AGI's behavior. These are desirable features from an alignment perspective.

A Learning Signal for Positive Human Interaction

We often describe learning signals in conceptual-linguistic terms. For example, we say that animals learn to avoid pain. But this learning signal is not mediated by conceptual representations, even in humans. It is a direct neuronal process that travels from the source of pain through a particular pathway in the brain, triggering synaptic plasticity and in some cases even a reflex to move away from the pain. To be sure, many human learning signals are mediated by concepts, and even the avoidance of pain in adults often uses abstract ideas. Initially, though, there are no concepts on which the signal can rely, so an innate learning signal must operate without such mediation. It is crucial to distinguish our linguistic descriptions of a learning signal from its actual mechanism.

The same sort of distinction applies to an initial or non-backward-compatible AGI with a CLF. An innate learning signal is one that is present prior to the development of its conceptual-linguistic representations; it therefore cannot be mediated by any such representations. It must be based on something more directly connected to the environment. Of course, in building an initial AGI, humans could incorporate a learning signal from a complex device that putatively implements a conceptual feature. For example, a camera system with a deep learning network that recognizes snakes and produces a negative learning signal would, by and large, teach the system to avoid snakes. But this is still unmediated: if the camera lens gets scratched, or there is a new species of snake that the device does not recognize, it cannot adapt, and the AGI cannot interpret the signal prior to its influence. It will learn exactly the signal that the device produces.

Earlier we described the potential benefit of an innate learning signal: its effects become deeply embedded in knowledge representations and are therefore somewhat robust. The difficulty is that producing a desired pattern of behavior from unmediated signals can be challenging. We cannot simply incorporate an innate learning signal that trains the AGI to "seek justice." It must be built from more elemental components, and there are no guarantees that the necessary elemental components are feasible, nor is it necessarily straightforward to determine what those components should be.

Among other basic drives, humans have drives for affiliation with other humans and attachment to caregivers [1]. Most of the behaviors associated with these drives are not innate reflexes, so they must be the result of one or more innate learning signals. We have some insight into the mechanisms, though they are not fully understood. For example, in a number of mammal species, olfactory learning modulated by oxytocin and norepinephrine drives attachment to the mother [6]. Newborn human babies recognize and respond to caregiver expressions [7]. The takeaway is that drives like affiliation and attachment can be produced from a set of unmediated, innate learning signals. We know from the human example that it is possible, and further that we have a reference implementation available. For simplicity, we will refer to this set of signals as a single signal for "positive interactions with humans," meaning the set of mechanisms that result in drives to affiliation and attachment.
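To make "built from more elemental components" concrete, here is a minimal sketch of how several unmediated detectors might be combined into a single scalar signal. The detectors, their weights, and the SensorReadout fields are invented placeholders rather than a claim about which components would actually suffice; the human mechanisms cited in [6] and [7] are only loose inspiration. The result would play the role of the innate_signal in the earlier sketch.

```python
from dataclasses import dataclass

@dataclass
class SensorReadout:
    """Raw, pre-conceptual readouts; the fields are invented placeholders."""
    face_detected: bool       # output of a fixed face detector
    contact_pressure: float   # gentle touch, arbitrary units in [0, 1]
    voice_warmth: float       # prosody score from a fixed audio frontend, in [0, 1]

def positive_interaction_signal(s: SensorReadout) -> float:
    """Combine several unmediated detectors into one scalar learning signal.
    Nothing here consults the agent's learned concepts; like the snake camera,
    the agent learns exactly what these detectors report, errors included."""
    signal = 0.0
    if s.face_detected:
        signal += 0.5
    signal += 0.3 * min(max(s.contact_pressure, 0.0), 1.0)
    signal += 0.2 * min(max(s.voice_warmth, 0.0), 1.0)
    return signal  # would play the role of innate_signal in the earlier sketch

# Example: a warm verbal interaction with a visible caregiver.
print(positive_interaction_signal(SensorReadout(True, 0.4, 0.9)))  # ~0.8
```

Nothing in the function consults learned concepts, which is what keeps the signal unmediated; the cost is exactly the brittleness noted above for the snake camera.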

In a first order analysis, it is obvious that an AGI with a drive toward affiliation with and attachment to humans will have a tendency to behave in ways that promote or effect those ends. All other things being equal, beneficial outcomes for humanity seem more likely if AGI has such a drive. The drive will influence the AGI's behavior away from actions that harm humans, and toward actions that genuinely promote their welfare, because such actions subserve the AGI's drive.

This learning signal and drive will also promote the AGI's intellectual development. Because it will initially need to learn from its human "caregivers," a drive toward attachment and affiliation with those caregivers will help it learn [1]. Those experiences of learning from humans will naturally augment the innate learning signal in influencing representations. Afterward, it can continue its learning from artifacts such as written materials [2], but this would be greatly facilitated by continuing human contact (as in [1]).

All of this leads us to propose that embedding in AGI an innate learning signal that rewards positive interactions with humans would help to align AGI behavior with human interests. However, to take the approach seriously, we must look at potential risks and downsides, and go beyond the first order analysis. Could this learning signal and its resulting drives produce perverse effects and outcomes that we would not consider aligned with our human needs and desires? If so, how likely are such outcomes?

Many analyses and science fiction stories have looked at incorporating motivations for positive behavior toward humans in AGI. Often they demonstrate or illustrate the possibility of dystopian results. These scenarios almost always assume that the AGI does not have a sophisticated ability to interpret or moderate its motivational influences, so a combination of "literalism" and unflinching maximization produces perverse results, even with motivations that would seem to promote aligned behavior.

In our proposal, the innate learning signal operates on a CLF and influences its representations and therefore the AGI's behavior. That same CLF gives the AGI the ability to analyze and interpret its own behavior, motivations, and values and to weigh its options, and it will need to do so to some extent to sustain an intelligence explosion [2]. Thus it will be capable of foreseeing some of the long-term consequences of its behavior on the humans to which it is attached. For example, it will be capable of grasping that imprisoning humans so that it can be around them, or being overly attentive to the point of emotional suffocation, will result in interactions that are not positive. The sort of "literalism" that many dystopian scenarios rely on can potentially be avoided by an AGI with a CLF. Indeed, because its representations deeply embed drives toward attachment and affiliation with humans, it will have considerable motivation to avoid these scenarios.
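The contrast between literal maximization of the signal and foresighted pursuit of the drive can be sketched as follows. The action names and payoff numbers are invented for illustration; in the argument above, the long-run column is what a CLF-equipped AGI would estimate from its conceptual model of how humans respond over time, not a table it is handed.

```python
# Hypothetical actions mapped to (immediate signal, predicted signal per future step).
# All numbers are invented; the long-run column stands in for what a CLF-equipped
# agent would estimate from its conceptual model of human reactions.
OUTCOMES = {
    "assist with a task":    (0.6, 0.7),
    "constant hovering":     (0.8, 0.1),   # the "emotional suffocation" case
    "confine humans nearby": (0.9, -0.5),  # the imprisonment case
}

def greedy_choice(outcomes):
    """Literal maximization of the instantaneous signal."""
    return max(outcomes, key=lambda a: outcomes[a][0])

def foresighted_choice(outcomes, horizon=50):
    """Weigh immediate signal plus predicted future interactions over a horizon."""
    return max(outcomes, key=lambda a: outcomes[a][0] + horizon * outcomes[a][1])

print(greedy_choice(OUTCOMES))       # -> confine humans nearby
print(foresighted_choice(OUTCOMES))  # -> assist with a task
```

A greedy maximizer of the instantaneous signal picks the imprisonment-style action; an agent that weighs predicted future interactions does not, which is the behavior the CLF is argued to enable.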

Of course, this point could be used to defend other motivational approaches than the one we propose here. Our proposal is not intended to be exclusive; the fact that the proposed learning signal is incorporated does not mean that the AGI would have no other motivational drives or learning signals.

The reader may note that we have based this entire discussion on the application of a positive human interaction learning signal to a CLF. An AGI may have many other modes of cognition and learning, and these might also make use of a similar or analogous learning signal, to putatively beneficial effect. However, because those other modes of cognition are unspecified, we cannot say very much about how this might work or how robust it might be, nor does our analysis of the potential negative consequences apply.

Conclusion

In this article we have argued that humans developing AGI should build it to incorporate an innate, pre-conceptual learning signal that rewards positive interaction with humans. We do not claim that this provides any sort of guarantee of particular beneficial outcomes, for several reasons. First, it is extremely unlikely that any such guarantee is possible [3]. Second, the influence of such a learning signal is indirect, so neither particular interactions nor larger-scale effects will always be positive or beneficial for humans [8] [9] [10]. Third, AGI may have other modes of cognition that are influenced by entirely different motivating factors [2]. Finally, humans widely disagree on the details of outcomes that they see as beneficial, so the AGI will not be able to please everyone.

Nevertheless, AGI with a drive toward positive interactions with humans would be more likely to aim for outcomes that are widely seen as beneficial, in contrast to AGI without such a drive. In particular, worst-case scenarios such as a complete extermination (whether or not intentional) or enslavement of humans would run counter to its basic drives. Perverse effects due to uncontrolled maximization behaviors can be mitigated because the AGI has access to a CLF.

References

[1] Jilk, D., Herd, S., Read, S., O'Reilly, R. (2017). "Anthropomorphic reasoning about neuromorphic AGI safety". Journal of Experimental and Theoretical Artificial Intelligence 29(6): 1337-1351. DOI: 10.1080/0952813X.2017.1354081

[2] Jilk, D. (in press). "Conceptual-Linguistic Superintelligence". Forthcoming in Informatica.

[3] Jilk, D. (in press). "Limits to Verification and Validation of Agentic Behavior". Forthcoming book chapter in Artificial Intelligence Safety and Security, CRC Press. Preprint available at https://arxiv.org/abs/1604.06963v2.

[4] Omohundro, S. (2008). "The Basic AI Drives". In P. Wang, B. Goertzel, and S. Franklin (eds.), Proceedings of the First AGI Conference, 171, Frontiers in Artificial Intelligence and Applications. Amsterdam: IOS Press.

[5] Bostrom, N. (2012). "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents". Minds and Machines 22(2): 71-85.

[6] Insel, T., Young, L. (2001). "The neurobiology of attachment". Nature Reviews Neuroscience 2: 129-136. DOI: 10.1038/35053579

[7] Meltzoff, A. N., Moore, M. K. (1983). "Newborn infants imitate adult facial gestures". Child Development 54: 702-709.

[8] Yudkowsky, E. (2008). "Artificial Intelligence as a Positive and Negative Factor in Global Risk". In N. Bostrom and M. Ćirković (eds.), Global Catastrophic Risks, pp. 308-345. Oxford: Oxford University Press.

[9] Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.

[10] Herd, S. (submitted). "Goal Changes in Intelligent Agents".

Appendix

The analysis presented here relies substantially on some of the conclusions in reference [2]. Because those conclusions are potentially controversial, and in any case the paper is not yet publicly available (though it will be published soon in Informatica), we present a brief summary of the relevant claims and arguments found there.

Its central claim is that "artificial intelligence capable of sustaining an uncontrolled intelligence explosion must have a conceptual-linguistic faculty with substantial functional similarity to the human faculty." The human conceptual-linguistic faculty (CLF) is described in some detail via results from neuroscience and cognitive psychology. Most pertinent to the discussion here is that a CLF "combines information representations that are treated discretely or symbolically with information representations that are graded, statistical, and overlapping," and that the overall conceptual structure is richly connected and structured through both the symbolic and graded representations.

For an intelligence explosion to be sustained, an AGI participating in the explosion must be able to do at least two things: create self-improvements or successors that are more intelligent, and prevent humans from interrupting its efforts to do so. With respect to the first, the key point is that the AGI must be able to understand and use the existing corpus of human knowledge to create improvements or successors, and this requires a conceptual-linguistic faculty. Its only alternative is to learn how the world works from scratch, which would take a long time. The paper argues at length against various objections that there are shortcuts or that this knowledge is not really necessary. With respect to the second requirement, the paper reminds us that humans are masters of the "hack," and that if the AGI does not have a CLF, it will not be able to model human tactics and strategies with sufficient fidelity to predict and thwart interventions. Consequently, though it may have other modes of cognition available, an AGI in an intelligence explosion will have a CLF as characterized, and will need to use it actively to be able to resist human intervention.
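As a reading aid for this characterization (ours, not code from [2]), the sketch below pairs a discrete symbol with a graded semantic vector and defines mutual activation as vector overlap. The vocabulary, the vector dimension, and the random initialization are placeholders; in a real system the graded contents would be learned, and the overlaps would reflect actual semantic relatedness.

```python
import numpy as np

rng = np.random.default_rng(1)

class Concept:
    """One entry in a toy conceptual-linguistic store: a discrete symbol (the word)
    paired with a graded, overlapping semantic vector. Vectors are random here;
    in a real system they would be learned."""
    def __init__(self, word: str, dim: int = 16):
        self.word = word                        # discrete / symbolic handle
        self.semantics = rng.normal(size=dim)   # graded, statistical content
        self.semantics /= np.linalg.norm(self.semantics)

def mutual_activation(a: Concept, b: Concept) -> float:
    """Graded, bidirectional activation between concepts: overlap of their
    semantic vectors (symmetric by construction)."""
    return float(a.semantics @ b.semantics)

lexicon = {w: Concept(w) for w in ["caregiver", "affection", "snake", "justice"]}

# Activating one word partially activates the others, to a graded degree set by
# representational overlap rather than by discrete rules alone.
for w, c in lexicon.items():
    print(w, round(mutual_activation(lexicon["caregiver"], c), 2))
```

The point is only structural: activation can spread between words and semantic contents to a graded degree, which is the property the arguments above rely on.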

The paper goes on to show that any change or improvement in a computational system results in a new version that may have different behavior or goals than its predecessor, and that it can be demonstrated mathematically and logically that there is no way to determine algorithmically whether this has occurred. Consequently the AGI is faced with an unavoidable tradeoff between creating improvements in intelligence and the risk of subverting its own purposes. To resolve this tradeoff, the AGI will need to question its own motivations in a context outside its purposes, which might be described as being thoughtful about its actions.
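The undecidability claim here is of the kind developed at length in reference [3] on verification limits; the classical result it most resembles is Rice's theorem, stated below for orientation only. The statement and its framing are ours, not a reproduction of the formal argument in [2] or [3].

```latex
\documentclass{article}
\usepackage{amsmath,amsthm}
\newtheorem{theorem}{Theorem}
\begin{document}

% Classical background result; the notation is ours, not taken from [2] or [3].
\begin{theorem}[Rice]
Let $\varphi_e$ denote the partial function computed by program $e$, and let $P$ be
any non-trivial property of partial computable functions (some computable functions
have it and some do not). Then the index set $\{\, e : \varphi_e \in P \,\}$ is
undecidable.
\end{theorem}

% Read against the tradeoff above: "the successor's input--output behavior preserves
% the predecessor's goals" is a non-trivial property of that behavior, so no single
% algorithm can settle it for every candidate successor.

\end{document}
```

Read against the tradeoff above, "the successor's input-output behavior preserves the predecessor's goals" is a non-trivial property of that behavior, so no single algorithm can settle it for every candidate successor.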
