University of California, Los Angeles

Heuristic Evaluation Functions for General Game Playing

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

James Edmond Clune III

2008

© Copyright by

James Edmond Clune III

2008

The dissertation of James Edmond Clune III is approved.

Adnan Y. Darwiche

Thomas S. Ferguson

Alan C. Kay

Todd D. Millstein

Richard E. Korf, Committee Chair

University of California, Los Angeles 2008


Table of Contents

1 Introduction
   1.1 General Game Playing: The Problem
   1.2 The Importance of General Game Playing
      1.2.1 Competitive Performance Metric
      1.2.2 Flexible Software Capabilities
      1.2.3 Game-Oriented Programming
      1.2.4 Generality as a Key Aspect of Intelligence
      1.2.5 A Timely Problem
   1.3 Overview of the Project
   1.4 Overview of the Dissertation

2 Philosophy and Vision
   2.1 The Ubiquity of Game Models
   2.2 Game-Oriented Programming

3 Literature Review
   3.1 General Game Playing
      3.1.1 Barney Pell's Metagamer
      3.1.2 AAAI General Game Playing Competition
      3.1.3 Other General Game Playing Systems
   3.2 Automated Planning
      3.2.1 Classical Planning
      3.2.2 International Planning Competition
      3.2.3 Planning Techniques
   3.3 Discovery Systems
      3.3.1 Feature Discovery
      3.3.2 AM and Eurisko
      3.3.3 Learning Heuristic Evaluation Functions

4 General Game Playing Framework
   4.1 Game Description Language (GDL)
   4.2 GGP Protocol
   4.3 AAAI GGP Competition

5 Abstract-Model Based Heuristic Evaluation Functions
   5.1 Overview
   5.2 Feature Identification
      5.2.1 Candidate Expressions
      5.2.2 Expression Interpretations
      5.2.3 Stability
   5.3 Abstract Model
      5.3.1 Payoff
      5.3.2 Mobility
      5.3.3 Termination
   5.4 Heuristic Evaluation Function
   5.5 Anytime Algorithm
   5.6 Use of Evaluation Function in Game-Play
   5.7 Evaluation Function Results
      5.7.1 Racetrack Corridor
      5.7.2 Othello
      5.7.3 Chess
      5.7.4 Chinese Checkers

6 Techniques Specific to Single-Player Games
   6.1 Motivation and Overview
   6.2 Heuristic Evaluation Function Construction
   6.3 Search Algorithms
      6.3.1 Uninformed Search
      6.3.2 Informed Search
   6.4 Algorithm Composition
   6.5 Summary

7 Rollout-Based Monte Carlo Methods
   7.1 Introduction
   7.2 Use of Heuristic Evaluation Functions
   7.3 Use of Action Heuristics
   7.4 Automatic Construction of Action Heuristics

8 Alpha-Beta Minimax versus Monte Carlo Methods
   8.1 Overview
   8.2 Randomly Generated Synthetic Games
   8.3 Experiments
   8.4 Results
   8.5 Real Games

9 Empirical Results: AAAI GGP Competitions
   9.1 First Annual GGP Competition
   9.2 Second Annual GGP Competition
   9.3 Third Annual GGP Competition
   9.4 Summary

10 Discussion and Conclusions
   10.1 Summary
   10.2 Discussion and Future Work
   10.3 Conclusion

A Interpreting Heuristic Evaluation Functions
   A.1 Chess
   A.2 Chinese Checkers
   A.3 Othello

B Engineering Considerations
   B.1 Reasoning Module
   B.2 Multi-Processor Utilization

References

List of Figures

4.1 A State in Tic-Tac-Toe
5.1 Overall Evaluation Function
5.2 Racetrack Corridor (initial position)
5.3 Racetrack Corridor (with some walls placed)
5.4 Chinese Checkers
6.1 Flow-Chart for Solving Single-Player Problems
8.1 Two-Player Games from AAAI Tournaments
8.2 Results by Branching Factor

List of Tables

5.1 Abstract Model Parameters
8.1 Zero-Sum Games
8.2 Varying Time Per Move
8.3 Nonzero-Sum Games
8.4 Games from AAAI Tournaments
9.1 AAAI 2006 GGP Competition Leaderboard
9.2 AAAI 2007 GGP Competition: Results of Preliminary Rounds

Acknowledgments

I am indebted to my adviser, Rich Korf, for his advice and encouragement. He was supportive from the project’s conception and remained so throughout. Many technical discussions with Rich contributed to this work by improving my understanding of central issues. He read multiple writeups and provided insightful comments and constructive criticism. He steered me in helpful directions, yet gave me freedom to pursue my own ideas. He also sponsored me as a research assistant, which helped me finish this dissertation. I would like to thank Michael Genesereth for sponsoring the AAAI general game playing competition, without which this work would not have been possible. Also, thanks to the Stanford General Game Playing group including Nat Love, Eric Schkufza, David Haley, and Tim Hinrichs. I’d like to also thank the other competitors in the AAAI general game playing tournaments for open discussions amidst friendly competition. Thanks to Alan Kay for encouraging discussions, especially in helping broaden my vision beyond “games” in the traditional sense. The philosophical issues explored in Chapter 2, particularly the view of game-oriented programming languages, came largely from interactions with Alan. I’m grateful to the past and present UCLA AI grads who have helped me through numerous discussions and proofreading, particularly Alex Dow and Alex Fukunaga. Thanks to Barney Pell for an encouraging and helpful discussion early in the project.


Abstract of the Dissertation

Heuristic Evaluation Functions for General Game Playing

by

James Edmond Clune III

Doctor of Philosophy in Computer Science

University of California, Los Angeles, 2008

Professor Richard E. Korf, Chair

A general game-playing program plays games that it has not previously encountered. A game manager program sends the game-playing programs a description of a game’s rules and objectives in a game description language. The game-playing programs compete by sending messages over a network indicating their moves until the game is completed. The class of games covered is intentionally broad, including games of one or more players with alternating or simultaneous moves, with arbitrary numeric payoffs.

This research explores the problem of constructing an effective general game-playing program, with an emphasis on techniques for automatically constructing effective heuristic evaluation functions from game descriptions. A technique based on abstract models of games is presented. The abstract model treats mobility, payoff, and termination as the most salient elements of a game. Each of these aspects is quantified in terms of stable features. Evidence is presented that the technique produces heuristic evaluation functions that are both comprehensible and effective. Empirical work includes a series of general game-playing programs that placed


first or second in three consecutive years of the AAAI General Game Playing Competition.


CHAPTER 1

Introduction

1.1 General Game Playing: The Problem

The idea of general game playing (GGP) is to create a computer program that effectively plays games that it has not previously encountered. A game manager program sends the game-playing programs a description of a game in a well-defined game description language. The description specifies the goal of the game, the legal moves, the initial game state, and the termination conditions. The game manager also sends information about what role the program will play (black or white, noughts or crosses, etc.), a start time (time allowed for pre-game analysis), and a move time (time allowed per move once game play begins). The game-playing programs compete by sending messages over a network indicating their moves until the game is completed. The class of games covered is intentionally broad, including games of one or more players with alternating or simultaneous moves, with arbitrary numeric payoffs.

The immediate goal of the research is to develop techniques that allow us to create GGP programs that win games. The techniques are implemented and their effectiveness is evaluated empirically by competing against programs embodying alternative techniques.

My work on general game playing techniques emphasizes heuristic evaluation functions. These are functions from game states to numbers used to assess the desirability of non-terminal states for particular players. In my opinion, automatic


construction of effective heuristic evaluation functions from game descriptions is the central challenge of general game playing. However, it is a subject that is best dealt with in the context of a complete general game player employing some type of game-tree search. In recognition of this, search techniques are covered in this dissertation as well.
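To make the role of such a function concrete, the sketch below (in Python, which is not necessarily the implementation language of my player; the game-interface names is_terminal, payoff, legal_joint_moves, next_state, and player_to_move are hypothetical, and alternating moves are assumed for simplicity) shows the standard way a heuristic evaluation function plugs into depth-limited minimax search: terminal states return exact payoffs, interior nodes back up values, and the heuristic supplies estimates once the depth limit is reached.

    def minimax_value(game, state, role, depth, evaluate):
        """Depth-limited minimax from the perspective of `role`.
        `evaluate` maps (state, role) to a number, e.g. in [0, 100]."""
        if game.is_terminal(state):
            return game.payoff(state, role)       # exact values at terminal states
        if depth == 0:
            return evaluate(state, role)          # heuristic estimate at the search frontier
        values = [minimax_value(game, game.next_state(state, move), role, depth - 1, evaluate)
                  for move in game.legal_joint_moves(state)]
        # Maximize when it is `role`'s turn to move; minimize otherwise.
        return max(values) if game.player_to_move(state) == role else min(values)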

1.2 The Importance of General Game Playing

Here I describe several reasons why GGP is an important research problem.

1.2.1 Competitive Performance Metric

In his doctoral dissertation [Pel93], Barney Pell argues that a primary reason for computer game playing research has been the competitive performance metric for intelligence, namely the presumed link between winning games and intelligent behavior. Pell describes a problem with this presumption:

    Unfortunately, the use of such a link has proved problematic: we have been able to produce strong programs for some games through specialized engineering methods, the extreme case being special-purpose hardware, and through analysis of the games by humans instead of by programs themselves. Consequently, it now appears that increased understanding and automation of intelligent processing is neither necessary nor sufficient for strong performance in game-playing. That is, it appears that we can construct strong game-playing programs without doing much of interest from an AI perspective, and conversely, we can make significant advances in AI that do not result in strong game-playing programs. [Pel93]


Pell goes on to propose that the solution to this is to change the problem so that the programs are forced to do the game analysis because the programmer is not given the specific game to be played ahead of time. He calls this new paradigm metagame, which describes the basic ideas of what is now known as GGP.

1.2.2 Flexible Software Capabilities

Machines in general and computers in particular are extremely brittle compared to the flexibility of behavior and abilities we see in the animal kingdom. Part of this brittleness is attributable to the extreme narrowness of domain that is associated with most software. The notion of domain breadth is difficult to characterize precisely. A single program can easily play top-quality chess, checkers, and Othello by simply combining the top programs for each game with a simple “switch” statement. However, this approach does not produce the ability to effectively play an infinite number of games that qualitatively differ from one another. GGP presents a challenge to exhibit flexibility in software that is more common to natural organisms than to computers.

1.2.3 Game-Oriented Programming

Another way to view GGP is to think of the game descriptions as computer programs and the general game player as a compiler [Kay06]. When viewed through this lens, the present research program can be seen as an exploratory step in the development of a game-oriented programming paradigm. The paradigm enables a programmer to specify a task that can be formulated in terms of goals, actions, states, payoffs, and adversaries without explicitly supplying the compiler


with any strategies or tactics for maximizing the payoff. This idea is developed more fully in Chapter 2.

1.2.4 Generality as a Key Aspect of Intelligence

A primary scientific problem of AI is to understand intelligence computationally. Our concern is with computational theories that circumscribe the design space of intelligent systems rather than psychological theories of human cognition. Nevertheless, because humans exhibit the most perspicuous example of intelligence, it is reasonable to strive for theories that capture features that have been identified as important to understanding human intelligence. Chief among these features is what psychologists call g, or the general factor of intelligence [Jen99]. GGP research contributes to the scientific goals of AI by explicitly addressing generality as an important aspect of intelligence.

1.2.5 A Timely Problem

Good research projects not only address problems that are interesting and important; they address them at the right time. Now is a good time for GGP research for a number of reasons. One is the introduction of the AAAI General Game Playing Competition organized by Michael Genesereth. Competitions were held in 2005, 2006, and 2007, and another is planned for 2008. To encourage participation, a $10,000 prize is awarded to the winner. The competitions have provided a venue for empirical comparisons of different approaches and have served to catalyze the creation of a small community of researchers investigating the problem of GGP. Further evidence that the current climate is amenable to GGP research is a recent increase in recognition that achieving the original goals of AI requires something more than continued progress in the specialized


sub-disciplines that have come to characterize the field of AI. Ron Brachman refers to this need as AI being more than the sum of its parts [Bra06]. Some have proposed the term “artificial general intelligence” to distinguish an emphasis on broad capabilities from “narrow AI” [GP07]. Regardless of the terms used, the explicit emphasis on generality enables the GGP research agenda to support the large-scale aspirations of AI.

1.3 Overview of the Project

The project associated with this dissertation is aimed at developing effective techniques for general game playing, emphasizing heuristic evaluation functions. My technique for heuristic evaluation functions involves the construction of an abstract model from a game description. The model is in terms of stable (incrementally varying) numeric features, and abstracts the game to core components, including payoff, mobility, and termination. The overall heuristic evaluation function seeks to pursue mobility and expected payoff, weighing payoff more heavily as the game approaches termination. An important aspect of the project is that these techniques are developed and evaluated in the context of a complete general game playing program.

This project has co-evolved with the introduction of the AAAI General Game Playing Competition and the associated Stanford General Game Playing Project. To a first approximation, my project can be understood as a series of entries in this competition. This does not mean that the value of my project is limited to the competition, but rather that my project represents one approach to the type of research that the competition is designed to promote. On a practical level, the Game Description Language (GDL) developed by the Stanford group has provided the language in which games are described in my own project, allowing


me to focus more on game-playing techniques instead of on how to describe games. The competition itself has created a small but international community of researchers as well as a venue for empirically evaluating various approaches. At the time of this writing, there have been three annual GGP competitions. My project has contributed an entrant in each competition and has consistently earned a spot in the championship round, placing first the first year and second the other two years. The level of game-play has increased significantly each year. At the time of this writing, my project is the only one to make it to the championship round more than once.

1.4 Overview of the Dissertation

The remainder of the dissertation is organized as follows.

• Chapter 2, Philosophy and Vision, presents a broad perspective on the philosophy of general game playing research and its relevance beyond “games” in the traditional sense.

• Chapter 3, Literature Review, reviews the literature most relevant to general game playing and the automatic construction of heuristic evaluation functions.

• Chapter 4, General Game Playing Framework, provides an overview of Genesereth’s general game playing framework, which I use in my project.

• Chapter 5, Abstract-Model Based Heuristic Evaluation Functions, presents a technique for automatically constructing heuristic evaluation functions. The presentation assumes some familiarity with the material in Chapter 4.


• Chapter 6, Techniques Specific to Single-Player Games, presents planning techniques that I have incorporated into my player that are specific to single-player games.

• Chapter 7, Rollout-Based Monte Carlo Methods, describes a version of my game playing program using rollout-based Monte Carlo methods.

• Chapter 8, Alpha-Beta Minimax versus Monte Carlo Methods, describes an investigation of the factors influencing the relative performance of alpha-beta minimax and Monte Carlo methods.

• Chapter 9, Empirical Results: AAAI GGP Competitions, describes the versions of the game-playing program submitted in each of the AAAI competitions and how they performed.

• Chapter 10, Discussion and Conclusions, provides a summary of the dissertation and some conclusions.

• Appendix A, Interpreting Heuristic Evaluation Functions, shows what my program outputs to describe its heuristic evaluation functions. I give examples of how an informed user can translate these descriptions into English.

• Appendix B, Engineering Considerations, discusses some implementation details of my general game playing program. Aspects described here are nonessential to the core dissertation, but are of potential interest to programmers.


CHAPTER 2

Philosophy and Vision

In this chapter, I present a “big picture” perspective on the philosophy and vision of my project. The central theme of this chapter is that the long-term goals of general game playing research extend far beyond “games” in the traditional sense.

2.1 The Ubiquity of Game Models

Although the term game in common usage refers to a form of play, the metaphor and mathematical model of a game is a powerful concept for understanding a wide variety of complex phenomena. The foundations of game theory were established in the 1940s by John von Neumann and Oskar Morgenstern in their book Theory of Games and Economic Behavior [vM53]. Classical game theory provides mathematical models for what happens when rational decision-makers with differing preferences interact. An important contribution to game theory was John Nash’s introduction of an equilibrium condition for game strategies, now known as Nash equilibrium [Nas50]. A strategy is a Nash equilibrium if it has the property that when two players utilize the strategy, neither can increase his or her payoff by unilaterally deviating from it. Evolutionary game theory was introduced in the 1970s by John Maynard Smith and his colleagues as a way of thinking about biological evolution in which natural selection is viewed as an evolutionary game [May76]. Unlike classical


game theory, there is no assumption of rational decision-makers. Instead, there are populations of individuals with fixed strategies as inheritable traits. Successful strategies are those that result in higher reproduction rates. Vincent and Brown describe the idea in this way:

    Evolution by natural selection is an evolutionary game in the sense that it has players, strategies, strategy sets, and payoffs. The players are the individual organisms. Strategies are heritable phenotypes. A player’s strategy set is the set of all evolutionarily feasible strategies. Payoffs in the evolutionary game are expressed in terms of fitness, where fitness is defined as the expected per capita growth rate for a given strategy and ecological circumstance. [VB05]

Maynard Smith introduced the notion of an evolutionarily stable strategy, which is closely related to a Nash equilibrium. A strategy is evolutionarily stable if, given a large population of players employing it, the population is resistant to invasion by mutant strategies differing from the original strategy. The concept of evolutionarily stable strategies has provided new insight into biological phenomena that were previously explained using more dubious group selection arguments. For example, a number of animal species engage in “conventional” fighting in which individuals fight only within certain limits, then back down instead of escalating [Now06]. Natural selection’s ability to produce this behavior is now understood in terms of evolutionarily stable strategies.

An interesting application of game theory that builds on the understanding of biological evolution and holds particular relevance for artificial intelligence is the science of evolutionary psychology. This field is concerned with the application of evolutionary theory to an understanding of human behavior.


One lesson from evolutionary psychology is that the human mind, far from being the blank slate proposed by the standard social science model, is in fact highly biased, with a human nature that transcends cultural differences [Pin99, Wri95]. This by no means implies that environment has no influence on human behavior, but rather that the mental substrate upon which the environment impinges is neither initially blank nor infinitely malleable. Furthermore, the process which has shaped this nature is an evolutionary one, the principles of which are illuminated through a game-theoretic perspective.

A subtopic of evolutionary psychology that has been particularly influenced by game theory is the evolution of cooperation. More broadly, the topic seeks to understand the process by which man has evolved from a social animal to a moral animal. A key piece of this understanding has been the recognition that the game of genetic proliferation is non-zero sum. Computer simulations exploring the non-zero sum game of the prisoner’s dilemma have shown that effective strategies are cooperative, retaliatory, and somewhat forgiving [Axe80].

Although the empirical work presented in this dissertation is of a much narrower scope than that of economics, biology, and evolutionary psychology, the work is approached from the standpoint that the underlying model of interactions as games is one of widespread importance.

2.2 Game-Oriented Programming

The widespread applicability of modeling complex phenomena as games suggests that as computer programs perform tasks in more complex, dynamic environments, it will be increasingly appropriate to view these programs as game playing agents. Consider a financial program designed for algorithmic trading. Such a


program is easily viewed as a game playing agent executing trading strategies in pursuit of financial objectives in a competitive market. The notion that it is advantageous to view computer programs operating in complex, dynamic environments as game-playing agents suggests the introduction of a programming paradigm that I will refer to as game-oriented programming. This paradigm conceptualizes computer programs in terms of game elements, that is, players, moves, transition rules, payoffs, and termination conditions. From this standpoint, GDL can be viewed as an early game-oriented programming language. The division of labor enabled by this paradigm is that the programmer, or modeler, describes the dynamics of the game and the available actions and objectives of the players, while the “compiler” (a general game playing program) is responsible for producing an effective strategy for pursuing the goals subject to the constraints and dynamics of the game. An important technical capability required for making this approach effective is that general game playing techniques be advanced enough that the strategies employed by the compiler are effective. Developing technical capabilities along these lines is one objective of this dissertation. The realization of a practical game-oriented programming system for general use may still be some time from now. However, it is not too early to envision domains in which game-oriented programming would likely be beneficial. Business software could be formulated in terms of goals of maximizing profit while competing with other businesses and complying with regulatory constraints. Military simulations are naturally expressed as adversarial games. Optimization and scheduling programs can be formulated as either single-player games or games against nature. These domains will likely require models that go beyond the current framework, particularly in the areas of hidden information, stochastic


elements, and richer numeric concepts. Nevertheless, I believe the framework explored here is a reasonable place to start. The advent of game-oriented programming is in some ways a natural step in the evolution of programming languages. Throughout the history of computers, an identifiable trend in programming languages has been the gradual transition from low-level, highly detailed, step-by-step procedures that describe algorithms in computer terminology, to higher-level, more abstract representations that describe algorithms more in the terminology of application domains. From this perspective, the introduction of a game-oriented abstraction promoted by game-oriented programming is consistent with the broad trends of the evolution of programming languages.


CHAPTER 3

Literature Review

In this literature review, I focus specifically on three areas of particular relevance to this dissertation: general game playing, automated planning, and discovery systems.

3.1 General Game Playing

3.1.1 Barney Pell’s Metagamer

The primary previous work is Barney Pell’s doctoral dissertation [Pel93]. The dissertation has three main parts. The first part identifies GGP as a research problem and argues for its importance. The primary motivation is to redress the competitive performance metric for intelligence, which is the presumed link between winning games and intelligent behavior. The second part is a detailed construction of a concrete class of games applicable to the advocated line of research. The class of games is symmetric chess-like games, which includes games such as chess, checkers, Tic-Tac-Toe, Chinese-chess, and Shogi. Pell defines a language for specifying symmetric chess-like games and also describes a program that automatically generates games within this class. The third part of Pell’s dissertation is a description of a program called Metagamer, which is a general game player for symmetric chess-like games.


Metagamer incorporates strategy based on an analysis of the class of symmetric chess-like games. The analysis includes features such as mobility, material, centrality, promotion, and threats. Some coarse-grained features, such as mobility, are further classified into finer-grained features such as immediate dynamic mobility, eventual mobility, static mobility, and constrained mobility. Metagamer incorporates these features in the form of advisers. Finally, Pell reports results of a tournament of multiple versions of Metagamer on randomly generated games. The results demonstrate that the various advisers identified indeed significantly improve the quality of play.

3.1.2 AAAI General Game Playing Competition

In 2005, Michael Genesereth organized the First Annual AAAI General Game Playing Competition [GLP05]. The organizers also defined the Game Description Language, or GDL [GL05], which covers a significantly broader class of games than symmetric chess-like games. Games expressible in GDL include two-player games, games involving more than two players, single-player puzzles, turn-taking games, games with simultaneous moves, games with numeric payoffs, zero-sum competitive games, and non-zero-sum cooperative games. Furthermore, whereas Pell’s game language has built-in notions of pieces, squares, board, movement, captures, and promotions, Genesereth’s version has none of these notions. Instead, such concepts are constructed as needed from first-order logic within the game descriptions. This approach forces program authors to implement extremely general game-playing agents. The first published work on generating heuristics for GDL is that of Kuhlmann, Dresner, and Stone [KDS06]. They describe a method of creating features from syntactic structures by recognizing and exploiting relational patterns such as suc-


successor relations and board-like grids. They then perform distributed search. Each machine uses a single heuristic based on either the maximization or the minimization of one of the detected features. Kuhlmann and Stone have also described a transfer learning technique that identifies games that are similar to games encountered previously [KS07]. The approach maps values from the known game to the similar game in order to speed up reinforcement learning. Schiffel and Thielscher describe another approach to constructing heuristic evaluation functions for GDL [ST07]. Their approach also recognizes structures such as successor relations and grids in game descriptions. Their method applies fuzzy logic and techniques for reasoning about actions to quantify the degree to which a state satisfies the conditions for winning. They also assess how close the game is to termination, seeking terminal states when goals are reached and avoiding terminal states when goals are not yet attained.

3.1.3 Other General Game Playing Systems

Robert Levinson describes a series of game-playing systems called Morph in [Lev95]. The original Morph was designed to learn to play chess and had a chess-specific state and game representation. The second version, Morph II, was generalized to a general game playing model. Levinson describes his game model as a “generic-hypergraph-game”. States are represented as vectors of Boolean properties. Operators have preconditions and postconditions in terms of the Boolean properties of a state. This is a very general framework, avoiding concepts such as “pieces” and “boards”. Perhaps ironically given Morph’s heritage as a chess system, the generic-hypergraph game representation of Morph II does not scale well to large combinatorial games such as chess.


Zillions of Games is a commercial software product for Microsoft Windows for playing strategy games, particularly chess-like games [Zil]. It uses a proprietary language for specifying games called the Zillions rules format, or ZRF, which uses a mixture of declarative and imperative constructs. It includes a high-quality built-in game playing engine that plays games specified in ZRF.

3.2 Automated Planning

An independent line of research predating GGP is work on the problem of automated planning. Planning is the explicit deliberation process of choosing and sequencing actions to achieve goals. Automated planning is an area of AI that studies this deliberation process computationally. This section provides a brief overview of the aspects of automated planning particularly relevant to general game playing. For a comprehensive overview of automated planning, see Ghallab, Nau, and Traverso’s Automated Planning: Theory and Practice [GNT04].

3.2.1 Classical Planning

Classical planning problems are deterministic, static, finite, fully-observable state-transition systems with designated goal states. The STRIPS representation of planning problems introduced by Fikes and Nilsson in an early work in classical planning has had a lasting influence on the formulation of planning problems [FN71]. In STRIPS-style planning, states are represented by a set of logical atoms that are true or false, expressed in a notation derived from first-order logic. Actions are represented as planning operators associated with preconditions and effects. An alternative representation for classical planning is the state-variable


representation, also called state-vector representation. In this representation each state is represented by a tuple of values of n state variables and each action is represented by a partial function that maps previous values to new values [Kor85].

3.2.2 International Planning Competition

The International Planning Competition (IPC) was organized by Drew McDermott in 1998 to empirically compare approaches to automated planning [HE05]. The competition, which is now a biennial event, is similar to the GGP competition, but for planning instead of game playing. The Planning Domain Definition Language (PDDL) [GL06] plays the same role in the IPC that GDL plays in GGP. In one sense, GGP subsumes planning. In fact, some of the single-player puzzles used in the 2006 GGP Competition were problems from the planning community translated into GDL. Despite this overlap in domain, there are nontrivial differences between the game playing and planning communities in how single-agent problems are formulated and how solutions are evaluated. One difference is that game playing imposes a constraint of commitment to moves in constant time. Game players thus interleave planning and execution, whereas planners typically do not commit to any moves until a complete plan is constructed. The planning community classifies problems as deterministic or probabilistic. This categorization is applicable to game playing as well, though the GGP community has not yet tackled probabilistic games, preferring to first gain a better understanding of deterministic games. Within deterministic planning, the planning community distinguishes between optimal planners, which produce optimal solutions, and satisficing planners, which produce plans that are correct but not necessarily optimal. The satisficing planners are closest in spirit to game playing


formulations. The IPC and GGP also differ in their evaluation criteria. In GGP, each game description provides rules for payoff values of the final game states. The payoff values are integers in the range [0, 100], and rules are predicated on arbitrary GDL expressions. The IPC has used three types of evaluation criteria: the speed with which plans are produced, the coverage of problems solved, and the quality of the solutions [LF06]. The IPC differentiates among different planning domains, so coverage refers to the proportion of problems solved and the balance of this proportion across multiple domains. The relative emphasis of these evaluation criteria has evolved over the competitions, with speed and coverage initially being the emphasis, and solution quality more recently gaining in importance. Originally, solution quality was measured by the makespan of the plan, which is the overall execution time of the plan. IPC5 allowed a flexible numeric optimization criterion to be specified in the problem description. This latter approach is similar to GDL in expressive capability, though the details of GDL and PDDL differ in how they express these notions.

3.2.3 Planning Techniques

Here I briefly review some of the techniques utilized in automated planning. The perspective of this review is that planning is fundamentally a search problem [Kor87]. There are two broad approaches to planning as search: as a single-agent path-finding problem and as a constraint-satisfaction problem.

3.2.3.1 Planning as a Single-Agent Path-Finding Problem

The most straightforward approach to planning is as a single-agent path-finding problem. This approach is also called state-space planning. State-space planning


approaches include forward search, backward search, and bi-directional search. The search algorithms can be uninformed brute-force search or informed search guided by a heuristic evaluation function. Search algorithms applicable to state-space planning include A* [HNR68] and depth-first branch-and-bound. Despite the simplicity of the basic idea of state-space planning, the approach is often competitive or superior to alternative approaches, particularly when effective heuristic evaluation functions are employed.
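As a concrete illustration of informed state-space search, the following is a minimal, generic A* sketch (in Python; it is not taken from any particular planner, and it assumes hashable states plus caller-supplied successor and heuristic functions).

    import heapq, itertools

    def a_star(start, is_goal, successors, h):
        """Generic A*: expand states in order of f = g + h, where g is the cost
        so far and h estimates the remaining cost. States must be hashable."""
        counter = itertools.count()          # tie-breaker so states are never compared directly
        frontier = [(h(start), 0, next(counter), start, [])]
        best_g = {start: 0}
        while frontier:
            f, g, _, state, plan = heapq.heappop(frontier)
            if is_goal(state):
                return plan                  # the sequence of actions reaching the goal
            for action, next_state, cost in successors(state):
                g2 = g + cost
                if g2 < best_g.get(next_state, float("inf")):
                    best_g[next_state] = g2
                    heapq.heappush(frontier, (g2 + h(next_state), g2, next(counter),
                                              next_state, plan + [action]))
        return None                          # the goal is unreachable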

3.2.3.2 Planning as a Constraint-Satisfaction Problem

Another broad approach to planning as search is to view plan construction as a constraint satisfaction problem (CSP). A constraint satisfaction problem is represented as a set of variables, a set of domains for the variables, and a set of constraints restricting the values of the variables. A solution to a CSP is a set of variable bindings satisfying the constraints. General techniques for constraint-satisfaction problems tend to fall into two broad categories. The first is to incrementally assign values to variables while maintaining the constraints of the problem. This approach is called backtracking. An alternative approach is to assign values to all the variables, then incrementally adjust the values to minimize the number of constraints that are violated. This approach is called heuristic repair [MJP92]. A variety of approaches to planning may be viewed as particular instances of constraint satisfaction problems, varying in the details of how plans are encoded as variables and constraints. In plan-space planning, a plan is represented as a set of actions to be performed and a set of ordering and binding constraints on the actions. The approach seeks to construct valid plans through a sequence of refinement operations on partially


specified plans, applying the least-commitment principle during each refinement. Plan-space planning is attributed to Sacerdoti [Sac74]. Planning-graph techniques construct a sequence of sets of actions. The approach emphasizes reachability analysis, which is concerned with the question of which states are reachable from which other states. Blum and Furst introduced and popularized this approach through their Graphplan planner [BF97]. Planning as satisfiability encodes the planning problem as a propositional formula and applies a satisfiability decision procedure. The strategy here is to apply effective satisfiability solvers to planning problems. This approach was pioneered by Kautz and Selman [KS92].
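To illustrate the backtracking formulation of constraint satisfaction described above, here is a minimal generic sketch (in Python; the encoding of plans as variables and constraints is left to the caller, and the variable and constraint representation used here is only an assumption for illustration).

    def backtrack(assignment, variables, domains, constraints):
        """Extend a partial assignment to a complete one, or return None.
        `constraints` is a list of (scope, predicate) pairs; a predicate is
        checked as soon as every variable in its scope has a value."""
        if len(assignment) == len(variables):
            return assignment
        var = next(v for v in variables if v not in assignment)
        for value in domains[var]:
            assignment[var] = value
            consistent = all(pred(*[assignment[v] for v in scope])
                             for scope, pred in constraints
                             if all(v in assignment for v in scope))
            if consistent:
                result = backtrack(assignment, variables, domains, constraints)
                if result is not None:
                    return result
            del assignment[var]
        return None

    # Toy usage: two steps that must not be assigned the same time slot.
    print(backtrack({}, ["step1", "step2"],
                    {"step1": [1, 2], "step2": [1, 2]},
                    [(("step1", "step2"), lambda a, b: a != b)]))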

3.3 Discovery Systems

An important aspect of the current research is discovering structure within games that is exploitable by heuristic evaluation functions. I review here previous work in systems that automatically discover features and heuristics.

3.3.1 Feature Discovery

The problem of feature discovery in game playing (also called feature generation or feature construction) is to determine the features or properties of a game state that are most relevant for inclusion in a heuristic evaluation function. The problem was identified by Samuel in the development of his checkers-playing program [Sam67]. Feature discovery methods can be roughly categorized into two classes: inductive learning methods that depend on a training signal but do not require an explicit task model, and knowledge-based methods, which do not require a


training signal, but do require a formal description of the task. The latter methods are of particular relevance to general game playing, because game descriptions are available but trainers are not. Note that the distinction between these approaches is not entirely crisp. In particular, reinforcement learning methods [SB98] blur the distinction by simulating games through random play according to the game description and using only the game outcome as a training signal. Utgoff characterizes a popular approach to constructing features automatically through inductive methods as parameter tuning [Utg01]. The approach is to create a layered, feed-forward artificial neural network with inputs connected to game states. The systems learn through parameter tuning that adjusts the weights of connections in the inner layers of the network based on training signals. Fawcett and Utgoff describe a knowledge-based approach in [FU92]. They describe a system called Zenith, which generates features given a domain theory for a task or game. The technique is based on a transformational theory of feature generation. Features are represented as a conjunction or disjunction of first-order terms created by a sequence of transformations on the domain theory. The transformations include decomposition, goal regression, abstraction, and specialization. Zenith derived useful features for the game of Othello and for a telecommunications network management domain.

3.3.2 AM and Eurisko

Doug Lenat describes two case studies into the nature of heuristics [Len82]. The first, AM (Automated Mathematician), was a computer program that was given 115 set theory concepts and 243 heuristic rules for proposing new concepts. The system applied heuristics to existing concepts and rediscovered interesting and


non-obvious mathematical concepts and theorems, including de Morgan’s laws, the fundamental theorem of arithmetic, Goldbach’s conjecture, and a conjecture about highly composite numbers first discovered by the mathematician Srinivasa Ramanujan.

In the second case study, a successor program to AM called Eurisko was developed with the goal of discovering new heuristics. Eurisko was applied to multiple domains, and its accomplishments included winning a national wargame tournament after being given the rules and constraints from Traveller: The Trillion Credit Squadron. The entry in the tournament was a combined effort between Lenat and Eurisko, with Eurisko applying heuristics and Lenat periodically culling the results, weeding out heuristics he deemed invalid or undesirable, and rewarding those he judged as especially promising. Lenat judges that the credit for the win should be split approximately 60% Lenat, 40% Eurisko, but that neither Lenat nor Eurisko could have won alone. It is also worth pointing out that during the tournament, in which most battles took 2-4 hours, most of Eurisko’s battles lasted only a few minutes before Eurisko was declared the victor. This was accomplished through a bizarre fleet configuration that was unanticipated by the tournament directors, and it prompted a rule change for future tournaments.

As the title “The Nature of Heuristics” implies, Lenat proposes some theories about the nature, source, and power of heuristics based on lessons from AM and Eurisko. The conceptualization of a heuristic in this context is an if-then rule that serves as an effective rule of thumb. This conceptualization differs from the heuristic evaluation functions that are the primary focus of the present work. I will use the term action heuristics to distinguish heuristics based on if-then action rules from heuristic evaluation functions when the distinction is not clear from the context.


In exploring the power of heuristics, Lenat uses the metaphor of a function Appropriateness(Action, Situation) as a way of thinking about heuristics. He proposes that effective heuristics are ones for which it is reasonable to act as if the corresponding Appropriateness function is time-invariant, continuous in both variables, and slowly varying. Among the characteristics of a candidate heuristic, he proposes, continuity and stability are prerequisites for heuristic effectiveness.

3.3.3 Learning Heuristic Evaluation Functions

A heuristic evaluation function is a mapping from a state of a problem to a number. Richard Korf describes how heuristic evaluation functions are used in three general classes of search problems: single-agent path-finding problems, two-player games, and constraint-satisfaction problems [Kor94]. Christensen and Korf present a unifying view of heuristic evaluation functions applying to single-agent path-finding problems and two-player games [CK86]. The view is that heuristic evaluation functions are mappings from a state of a problem to a number such that the value of the function is approximately invariant over optimal moves and returns exact values for terminal nodes. This interpretation is applicable to both single-agent path-finding problems and two-player games. The emphasis on the two properties of outcome determination and invariance over optimal moves suggests a method for learning heuristic evaluation functions: search for heuristics satisfying one of the properties and test how well the other property is satisfied. An early work utilizing this idea was Arthur Samuel’s checkers program, which automatically learned a heuristic evaluation function in terms of a number of board features [Sam67]. The idea was extended to use


linear regression and applied to learning material evaluation functions in chess in [CK86].
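As a rough illustration of the regression idea, and not the actual procedure of [CK86], the sketch below assumes each position is summarized by a vector of piece-count differences and labeled with a numeric training value; ordinary least squares then recovers one weight per piece type. The data here are synthetic and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    true_weights = np.array([1.0, 3.0, 3.0, 5.0, 9.0])       # illustrative "textbook" piece values
    # Synthetic positions: each row is a vector of piece-count differences
    # (pawns, knights, bishops, rooks, queens); y is a noisy training signal.
    X = rng.integers(-2, 3, size=(200, 5)).astype(float)
    y = X @ true_weights + rng.normal(scale=0.5, size=200)

    estimated, *_ = np.linalg.lstsq(X, y, rcond=None)         # one weight per piece type
    print(dict(zip(["pawn", "knight", "bishop", "rook", "queen"], estimated.round(2))))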


CHAPTER 4

General Game Playing Framework

Intrinsic to the notion of general game playing is that games be described in some machine-readable language and played through some well-defined protocol. The language and protocol together comprise the general game playing framework. The framework adopted in this dissertation is Genesereth’s Game Description Language and communication protocol. I provide here an overview of this framework. For a complete description, see the Game Description Language specification [GL05].

4.1 Game Description Language (GDL)

This section illustrates GDL by using it to describe the game of Tic-Tac-Toe. My intention is to give the reader a sense of the expressiveness of GDL, so I present GDL here in terms of logical statements, omitting the GDL syntax for readability. In GDL, states are represented by the set of facts (or fluents) true in the given state. Rules are logical statements composed of expressions consisting of fluents, logical connectives, a distinguished set of GDL relations, and game-specific relations.


The role relation lists the roles of the players in the game:

role(x)
role(o)

A natural representation of a state in Tic-Tac-Toe is an enumeration of the values of each of the nine cells. There is no cell relation in GDL, but we can describe states in terms of cells because GDL permits game-specific relations. For example, we can describe the state in Figure 4.1 by the fluents:

cell(1, 1, x) cell(1, 2, b) cell(1, 3, b)
cell(2, 1, b) cell(2, 2, o) cell(2, 3, b)
cell(3, 1, b) cell(3, 2, b) cell(3, 3, b)

Figure 4.1: A State in Tic-Tac-Toe (an X in cell (1, 1), an O in cell (2, 2), and all other cells blank)

The init relation enumerates the set of fluents true in the initial state of the game. We can indicate the initial state of Tic-Tac-Toe by declaring that x has control and each cell is blank:

init(control(x))
init(cell(1, 1, b))
init(cell(1, 2, b))
...

The legal relation indicates the legal moves for each role in terms of the fluents that are true in the current state. GDL models all games as having simultaneous


moves, but we can represent turn-taking through a game-specific control relation and a move indicating “no-operation”. The legal moves are for the controlling player to mark a blank cell and the non-controlling player to perform a noop.

legal(R, mark(M, N)) ⇐ true(cell(M, N, b)) and true(control(R))
legal(x, noop) ⇐ true(control(o))
legal(o, noop) ⇐ true(control(x))

The next and does relations are used to describe resulting fluents in terms of existing fluents and players’ actions. A cell takes on the mark of the player that marks it, and control alternates between x and o.

next(cell(M, N, R)) ⇐ does(R, mark(M, N))
next(control(x)) ⇐ true(control(o))
next(control(o)) ⇐ true(control(x))

A frame axiom is used to indicate the conditions under which cell values persist.

next(cell(M, N, Z)) ⇐ true(cell(M, N, Z)) and does(R, mark(J, K)) and (M ≠ J or N ≠ K)

The terminal relation is used to indicate termination conditions for the game.

terminal ⇐ line(x) or line(o) or not open

The goal relation is used to indicate payoff values in the range [0, 100] for each role for terminal states.

goal(R, 100) ⇐ role(R) and line(R)
goal(R, 50) ⇐ role(R) and not line(x) and not line(o)
goal(R, 0) ⇐ role(W) and line(W) and R ≠ W


Finally, game-specific relations are defined in terms of fluents.

line(R) ⇐ row(R) or column(R) or diagonal(R)
row(R) ⇐ true(cell(M, 1, R)) and true(cell(M, 2, R)) and true(cell(M, 3, R))
column(R) ⇐ true(cell(1, N, R)) and true(cell(2, N, R)) and true(cell(3, N, R))
diagonal(R) ⇐ true(cell(1, 1, R)) and true(cell(2, 2, R)) and true(cell(3, 3, R))
diagonal(R) ⇐ true(cell(1, 3, R)) and true(cell(2, 2, R)) and true(cell(3, 1, R))
open ⇐ true(cell(M, N, b))

This completes a description of Tic-Tac-Toe. More complex games are constructed in the same way.
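To suggest how a game-playing program might operate on such a description internally, the following sketch (in Python, a hand-translation for illustration rather than anything generated from the GDL itself) stores a state as a set of fluent tuples and evaluates the line relation and the legal moves directly against that set. It assumes it is x's turn in the state of Figure 4.1.

    # The state of Figure 4.1 as a set of fluent tuples, assuming x is in control.
    state = {("control", "x"),
             ("cell", 1, 1, "x"), ("cell", 1, 2, "b"), ("cell", 1, 3, "b"),
             ("cell", 2, 1, "b"), ("cell", 2, 2, "o"), ("cell", 2, 3, "b"),
             ("cell", 3, 1, "b"), ("cell", 3, 2, "b"), ("cell", 3, 3, "b")}

    def cell_value(state, m, n):
        """Return the mark recorded for cell (m, n)."""
        for fluent in state:
            if fluent[0] == "cell" and (fluent[1], fluent[2]) == (m, n):
                return fluent[3]

    def line(state, r):
        """Hand-translation of the line/row/column/diagonal rules above."""
        row = any(all(cell_value(state, m, n) == r for n in (1, 2, 3)) for m in (1, 2, 3))
        col = any(all(cell_value(state, m, n) == r for m in (1, 2, 3)) for n in (1, 2, 3))
        diag = (all(cell_value(state, i, i) == r for i in (1, 2, 3)) or
                all(cell_value(state, i, 4 - i) == r for i in (1, 2, 3)))
        return row or col or diag

    def legal_moves(state, role):
        """The controlling player marks a blank cell; the other player performs noop."""
        if ("control", role) in state:
            return [("mark", m, n) for m in (1, 2, 3) for n in (1, 2, 3)
                    if cell_value(state, m, n) == "b"]
        return [("noop",)]

    print(legal_moves(state, "x"))   # seven mark moves, one per blank cell
    print(legal_moves(state, "o"))   # [('noop',)]
    print(line(state, "x"))          # False: no completed line yet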

4.2 GGP Protocol

Whereas GDL defines a language for specifying games, the GGP protocol defines the interaction between the game manager and game-playing systems. A match begins with the game manager sending a message to the game-playing


programs with the following information: an identifier for the match, the complete description of the game to be played, the role that the receiving game player will play in the match, the start clock in seconds, and the play clock in seconds. After each step in the game, the game manager reports the moves made by each role, and the players respond with their next moves. Note that the complete state is not sent as part of the protocol after each move; rather, it is the responsibility of the individual game-playing programs to compute the next state based on the actions of the players. The syntax of the protocol is not germane to this dissertation, but is described in the GDL Specification [GL05].
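Schematically, a player's side of this interaction is a simple loop. The sketch below (Python) is only a skeleton: the connection, game, analyze, and choose_move collaborators are hypothetical placeholders supplied by the caller, since the actual wire syntax is defined in the GDL specification rather than reproduced here.

    def play_match(connection, game, role, start_clock, play_clock, analyze, choose_move):
        """Skeleton of the per-match control flow described above."""
        state = game.initial_state()
        analyze(game, role, start_clock)              # pre-game analysis within the start clock
        while not game.is_terminal(state):
            move = choose_move(game, state, role, play_clock)
            joint_moves = connection.exchange(move)   # send our move; receive every role's moves
            # The manager sends only the moves; each player computes the next state itself.
            state = game.next_state(state, joint_moves)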

4.3 AAAI GGP Competition

The details of the GGP competition have evolved since its conception, but the consistent objective has been to pit general game playing programs against each other in direct competition on a variety of games that are unknown to the program authors ahead of time. The first year, all matches were played at the AAAI conference. In acknowledgment of the time constraints of the conference, and in response to the desire of both the organizers and the participants to incorporate more matches in the tournament, subsequent competitions have included several rounds of matches played over the Internet prior to the AAAI conference. During the first few rounds, programs accumulate points. In each round, points are weighted more heavily than in the previous round, and participants are encouraged to continue working on their players between rounds. This gives newcomers a chance to work out problems in their players, and it provides the organizers an opportunity to calibrate the difficulty of the games according to the abilities of the entrants. The games are chosen by the competition organizers, thus far Michael Genesereth’s group at Stanford University. The total


accumulated, weighted points are used to rank the entrants, and this ranking is used to seed an elimination tournament at AAAI, culminating in a final championship round. The championship round was a single match the first two years and was expanded to three matches the third year. The winner of the championship round is declared the champion, and the program authors are awarded a $10,000 prize.

In order to discourage human intervention, particularly during the Internet rounds, the game descriptions are obfuscated in the following way. Typically, the games are either known strategy games or variations of known games and are described with terms that are intuitive to humans familiar with the game. For example, chess-like games tend to include terms like black, white, queen, king, rook, pawn, knight, capture, check, etc. In the obfuscated version of the game description, these terms are replaced with meaningless symbols, making the description significantly more difficult for humans to understand immediately. This also serves to highlight for spectators that the only symbols in the rules with semantic significance not deriving from the structure of the rules themselves are the GDL relations described earlier.


CHAPTER 5

Abstract-Model Based Heuristic Evaluation Functions

5.1 Overview

In this chapter, I present the technique I have developed for automatically constructing heuristic evaluation functions from game descriptions. The material here is an updated version of the work presented in [Clu07]. The core ideas governing the approach are the following. One source of a heuristic evaluation function is an exact solution to a simplified version of the original problem [Pea84]. Applying this notion to GGP, my approach is to abstract the game to its core aspects and compute the exact value of the simplified game. Here, the value of the game is the minimax value as described by von Neumann and Morgenstern [vM53]. Because this simplified game preserves key characteristics of the original game, the exact value of the simplified game is used as an approximation of the value of the original game. The core aspects of the game that are modeled are the expected payoff, the relative mobility, and the expected game termination (or game longevity). These core game aspects are modeled as functions of game-state features. The features must be automatically selected, and the key idea guiding this selection is stability, an intuitive concept for which I will provide a more technical meaning. My approach strives


to evaluate states based on only the most stable features.

5.2 Feature Identification

When a human first learns the game of checkers, there are some obvious metrics that appear relevant to assessing states: the relative numbers of regular pieces of each color and kings of each color. The more sophisticated player may consider other factors as well, such as the number of openings in the back row, degree of center control, or distance to promotion. I will use the term feature to refer to a function from states to numbers that has some relevance to assessing game states. A feature is not necessarily a full evaluation function, but features are building blocks from which evaluation functions can be constructed. Consider the cues that human checkers players exploit when learning to evaluate positions. Pieces are represented as physical objects, which have properties of persistence and mobility with which humans have extensive experience. The coloring of the pieces, the spatial representation of the board, and the symmetry of the initial state each leverage strengths of the human visual system. The GDL game description, although isomorphic to the physical game, has none of these properties, so feature identification is not so simple. I use a three-part approach to feature identification:

1. Identify candidate expressions to use as the basis of features.

2. Impose various interpretations on these expressions in order to interpret them as numeric functions of game states.

3. Identify the features most likely to be relevant based on their stability, that is, the degree to which the features vary incrementally.


I will discuss each of these aspects in turn.

5.2.1 Candidate Expressions

The approach taken here is to extract expressions that appear in the game description and impose interpretations on these expressions to construct candidate features. There are two aspects to this: identifying a set of potentially interesting candidate expressions and imposing interpretations on these expressions. To identify the set of candidate expressions, the analyzer starts by simply scanning the game description to see what expressions appear. The theory behind this seemingly naive approach is that a game that is succinctly described will tend to be described in terms of its most salient features. In the case of checkers, this includes expressions representing red pieces, black pieces, red kings, and black kings. In addition to the expressions appearing directly in the game description, the feature-generating routine has some built-in rules for generating additional expressions. Chief among these is a routine which performs domain analysis to determine what constants can appear in what positions for expressions involving variables. These constants are then substituted into expressions involving variables to generate new, more specific candidate expressions. When the domain of a particular variable has only a few constants, they are each substituted into the expression. In checkers, variables corresponding to pieces are found to permit symbols for red pieces, black pieces, red kings, and black kings. When the domain has more constants, then additional structure is sought to determine if some of the constant values are distinct from the rest. In checkers, the first and last rows are found to be distinct from the rest because the first and last rows have a single adjacent row, while the other rows have two adjacent rows.


5.2.2 Expression Interpretations

When the expressions are identified, the set of candidate features is generated by imposing various interpretations on these expressions. The analyzer has three primary interpretations, each defining a different type of feature. I call the first interpretation the solution cardinality interpretation. The idea is to take the given expression and find the number of distinct solutions to this expression in the given state. In the case of the expression cell(X, Y, black king), the number of solutions corresponds to the number of black kings on the board. The second interpretation is the symbol distance interpretation, which exploits binary relations among symbols, that is, game-specific relations with arity two. For example, the checkers description utilizes a game-specific next rank relation to establish ordering of rows. The program constructs a graph of game-specific symbols appearing in the description, where each constant symbol is a vertex in the graph. Edges are placed between vertices that appear together in the context of a binary relation in the game description. For example, rank3 and rank4 have an edge between them due to the rule next rank(rank3, rank4). Once this graph is constructed, the symbol distance between two symbols is the length of the shortest path between the two symbols in this graph. To impose the symbol distance interpretation on an expression such as cell(X, rank8, red piece), the program first identifies constants within the expression that are in the domain of a binary relation. In this case, the only such constant is rank8, which appears in the next rank relation. The program substitutes a variable for this symbol, obtaining an abstract expression cell(X, Y, red piece). It finds all the solutions to the abstract expression. For each solution, it finds the distance to the original expression based on the binding of the newly introduced variable. For example, if the solutions are cell(a, rank2, red piece) and cell(f,


rank5, red piece), then the distances are 6 and 3, respectively, because according to the next rank rules, the distance from rank2 to rank8 is 6 and from rank5 to rank8 is 3. Finally, the overall symbol distance is the minimum of the distances for the individual solutions to the abstract expression. In this example, the value would be 3 and could represent the distance to promote a piece to a king. The third interpretation is the partial solution interpretation. This interpretation only applies to compound expressions involving multiple conjuncts or disjuncts. For example, in Connect-4, the rule for winning involves a conjunction of four pieces of the same color in a row. The partial solution interpretation of the conjunction results in a number that is proportional to the fraction of conjuncts satisfied (0.75 for three in a row). This is similar to Schiffel and Thielscher’s degree of goal attainment [ST07]. For each candidate expression, the program creates corresponding features by applying each of the possible interpretations to the expression. Finally, some additional features are generated by observing and exploiting symmetry in the game description. Utilizing the domain information extracted earlier, the program observes that red piece and black piece appear in the same number of relations the same number of times and that each have an initial cardinality of twelve. From this, it hypothesizes that the solution cardinality interpretations of cell(X, Y, red piece) and cell(X, Y, black piece) are symmetric to each other and introduces a relative feature for the difference between the two. In this case, the feature represents the material piece advantage of red over black.
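To make the symbol distance interpretation concrete, the following sketch (mine, not the dissertation's code; the adjacency map and the relation name are assumptions for illustration) computes shortest-path distances over the graph of constants induced by a binary relation such as next rank.

from collections import deque

def symbol_distance(graph, start, goal):
    # Breadth-first search over the graph whose vertices are constant symbols
    # and whose edges link symbols co-occurring in a binary relation.
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        symbol, dist = frontier.popleft()
        for neighbor in graph.get(symbol, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None  # the two symbols are not connected

# Undirected edges derived from next_rank(rank1, rank2), ..., next_rank(rank7, rank8).
graph = {}
for i in range(1, 8):
    a, b = f"rank{i}", f"rank{i + 1}"
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

assert symbol_distance(graph, "rank2", "rank8") == 6
assert symbol_distance(graph, "rank5", "rank8") == 3   # the minimum over solutions would be 3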

5.2.3 Stability

Once the program has generated a set of candidate features, it needs some way of determining which features are most likely to be relevant to state evaluation.


The intuition behind the criterion is that quantities which wildly oscillate do not provide as good a basis for assessing the value of a state as quantities that vary only incrementally. This idea can be quantified by introducing a measure called stability. To compute the stability of a feature, the program first generates a set of sample states in the game through random exploration of the game tree. It computes the value of the feature for each sample state. Next, it computes the variance of the feature's values over the sample states. This is the total variance. Two states are adjacent if one is the immediate successor of the other in some path through the game tree. It calculates another quantity which I will call the adjacent variance by summing the squares of the difference in feature values for adjacent sample states and dividing by the number of adjacent state pairs. Let the stability quotient denote the ratio of the total variance to the adjacent variance. I also refer to the stability quotient as simply the stability. If the feature wildly oscillates from state to state, the stability will be low (≈ 1), whereas if the feature value changes only incrementally, the stability will be significantly greater than one. Although, to my knowledge, this particular notion of stability is unique to my approach, it is similar to previous ideas in the heuristics literature. One similarity is with Lenat's description of effective heuristics as ones in which a function of Appropriateness(Action, Situation) is approximately time-invariant, and both variables are continuous and slowly varying [Len82]. Another similarity is with Christensen and Korf's notion that effective heuristic evaluation functions are approximately invariant over optimal moves [CK86].
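As an illustration only (not the actual analyzer code), the stability quotient of a candidate feature could be computed from randomly sampled paths roughly as follows; `paths` (a list of state sequences) and `feature` (a function from states to numbers) are assumed inputs.

def stability_quotient(paths, feature):
    # Total variance of the feature over all sampled states.
    values = [feature(state) for path in paths for state in path]
    mean = sum(values) / len(values)
    total_variance = sum((v - mean) ** 2 for v in values) / len(values)

    # Adjacent variance: mean squared difference between feature values of
    # consecutive (adjacent) states along each sampled path.
    diffs = [feature(b) - feature(a)
             for path in paths
             for a, b in zip(path, path[1:])]
    adjacent_variance = sum(d * d for d in diffs) / len(diffs)

    if adjacent_variance == 0:
        return float("inf")  # the feature never changes between adjacent states
    # Near 1 for wildly oscillating features; much greater than 1 for features
    # that vary only incrementally.
    return total_variance / adjacent_variance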


Table 5.1: Abstract Model Parameters

Parameter            Meaning
P : Ω → [0, 100]     Approximates payoff function.
M : Ω → [−1, 1]      Relative mobility.
T : Ω → [0, 1]       Proximity to termination.
SP : [0, 1]          Relative stability of P.
SM : [0, 1]          Relative stability of M.

5.3 Abstract Model

After stable features are identified, the next stage in my technique is to construct an abstract model of the game. The abstract model reduces the game to five parameters: P (payoff), M (mobility), T (termination), SP (payoff stability), and SM (mobility stability). The ranges and intuitive meanings of the various game parameters are summarized in Table 5.1, where Ω denotes the set of legal game states.

5.3.1 Payoff

The intention of the payoff function is to provide a function from states to numbers that evaluates to the role's payoff values for the terminal states and is approximately continuous over the topology of the game tree. To construct this, the program starts with the list of stable features. It eliminates some features through a dependency analysis of the rules in the game description. It excludes features based on expressions involving relations that do not influence goal expressions. Next, it categorizes the remaining stable features by their correlation with the payoff function. A feature is either positively correlated, negatively correlated, or uncorrelated with payoff. To determine the correlation, two pseudo-states are constructed that


are identical to each other except that one has the expression associated with the feature and the other does not. The payoff values of the pseudo-states are computed based on the game's payoff rules as if the pseudo-states were terminal states. If the payoff values are the same, the feature is considered uncorrelated with payoff. If the payoff is higher on the state with the higher feature value, then the correlation is positive, and if the payoff is lower on the state with the higher feature value, then the correlation is negative. Of the features having nonzero correlation, the program excludes absolute features that are subsumed by relative features. In the case of Othello, the features correlated with payoff are the white piece count, the black piece count, and the difference in piece counts, so the difference in piece counts is retained. When there are multiple features that are correlated with payoff and not subsumed by other features, the features are weighted according to their stability, with the coefficient being positive for positively correlated features and negative for negatively correlated features. Finally, the overall coefficient and offset are set such that the values of the resulting payoff function for all the sample states fall in the range of [0, 100].
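The final weighting and scaling step might look like the following sketch, where each retained feature carries a correlation sign (+1 or −1) and a stability value; the exact normalization used by my program may differ.

def build_payoff_function(features, sample_states):
    # features: list of (feature_fn, correlation_sign, stability) triples,
    # already filtered by the dependency and subsumption analysis.
    def weighted_sum(state):
        return sum(sign * stability * f(state)
                   for f, sign, stability in features)

    # Choose an overall coefficient and offset so that the values over the
    # sample states fall in the range [0, 100].
    samples = [weighted_sum(s) for s in sample_states]
    lo, hi = min(samples), max(samples)
    scale = 100.0 / (hi - lo) if hi > lo else 0.0

    def P(state):
        return (weighted_sum(state) - lo) * scale
    return P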

5.3.2 Mobility

The mobility function is intended to quantify differences in the number of moves available to each role. Let the moves available to role k at state ω ∈ Ω be denoted m_k(ω). Define the relative mobility at state ω to be:

(m_red(ω) − m_black(ω)) / max_{ω′ ∈ Ω} (m_red(ω′) + m_black(ω′))        (5.1)

Positive numbers indicate red has more moves, negative numbers indicate black has more moves. In games with more than two roles, m_black is replaced with the sum of the number of moves of the adversaries. The denominator cannot be measured directly because the state space is too large, so it is approximated by


taking the maximum quantity over the sample states. To compute the mobility function, the program begins with the stable features. It eliminates features associated with expressions that do not influence legal expressions, as these features do not impact the moves available to the various roles. It removes absolute features subsumed by relative features. In checkers, it ends up with two features: the number of red pieces minus the number of black pieces and the number of red kings minus the number of black kings. To quantify the relative contribution of these features, the program takes a statistical approach. It generates a collection of sample states by simulating game play with random moves. It performs least-squares linear regression with the mobility as the dependent variable and the features as the independent variables to find the best fit for mobility values in terms of the features. Because the mobility function is constructed in terms of stable features, the function is an abstraction of mobility rather than a direct measure of mobility. In games such as checkers and chess, where each player controls a set of movable pieces, the mobility function tends to correspond to material, that is, the relative number of each player’s pieces multiplied by a coefficient indicating the strength of that type of piece. However, the concept of a mobility function expressed in terms of stable features is more general than the concept of mobility. This is because the former is generally applicable to any game in which the number of possible actions varies even if the games do not involve movable pieces.
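A rough sketch of how relative mobility (Equation 5.1) can be estimated over the sample states; `legal_moves(state, role)` and the role names are assumptions for illustration.

def make_mobility_function(sample_states, legal_moves, my_role, opponents):
    def move_counts(state):
        mine = len(legal_moves(state, my_role))
        theirs = sum(len(legal_moves(state, r)) for r in opponents)
        return mine, theirs

    # The true maximum over the whole state space is unavailable, so the
    # denominator is approximated by the maximum over the sample states.
    denominator = max(sum(move_counts(s)) for s in sample_states)

    def M(state):
        mine, theirs = move_counts(state)
        return (mine - theirs) / denominator
    return M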

5.3.3 Termination

The third function used in constructing the heuristic evaluation function is termination. This function is an estimate of how close the game is to termination (0 for initial state, 1 for terminal states, 0.5 for half-way through, etc.). The


termination function is computed statistically by least squares regression, with the target values 1 for terminal states and 0 for non-terminal states.

5.4 Heuristic Evaluation Function

The description thus far gives us an abstract model of a game in terms of mobility, payoff, and termination. For example, the result of letting the program analyze checkers for red for 5 minutes is as follows. Although the analysis program prints the features in terms of GDL syntax, I present the features here in their English translations for improved readability. For more information on the interpretations, see Appendix A. Here "early payoff" means the game description's payoff function applied to non-terminal states.

Payoff = 50 + 0.454 (# total red - # total black)
            + 0.292 (early payoff for red - early payoff for black)
Mobility = 0.040 (# red kings - # black kings)
         + 0.026 (# red pieces - # black pieces)
Terminal = 0.0006 (current step #) + 0.00044
Payoff Stability = 0.237, Mobility Stability = 0.763

The remaining piece is to construct the overall evaluation function from these elements. The overall heuristic evaluation function is taken as a weighted sum of the mobility function and the payoff function. The termination function is used to determine the relative weights of the payoff and mobility, as depicted in Figure 5.1. The intuition behind this approach is that in the opening it may make sense to attempt to gain more control of the game's trajectory through greater mobility. In the endgame, however, maximizing the payoff is paramount.


The exact coefficients of the payoff function and the mobility function in the weighted sum for the overall heuristic evaluation function are determined as follows. First, the mobility function is translated from the range of [-1, 1] to the range [0, 100] by multiplying by 50 and adding 50. Secondly, as the game approaches termination, the payoff weight should approach one and the mobility weight should approach zero. Finally, I set the relative weights of payoff and mobility at the initial state to the relative stabilities of the two functions. Here the stability of the mobility and payoff functions is computed in the same way as the stabilities of the individual features were computed earlier. However, they are normalized by dividing each by the sum of the two stabilities. This ensures that SP + SM = 1, so they can be used as the initial weights for the payoff and the mobility functions. In other words, the relative stabilities determine the intersection of the functions with the Y-axis in Figure 5.1. The result of this is that the overall value v of the heuristic evaluation for a particular state ω is:

v = T(ω) * P(ω) + (1 − T(ω)) * ((50 + 50 M(ω)) * SM + SP * P(ω))        (5.2)
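A minimal sketch of Equation 5.2 in code, with the abstract-model functions of Table 5.1 passed in as parameters:

def evaluate(state, P, M, T, SP, SM):
    # P, M, T are the abstract-model functions; SP and SM are the normalized
    # stabilities, with SP + SM = 1.
    t = T(state)                      # proximity to termination, in [0, 1]
    mobility = 50 + 50 * M(state)     # mobility rescaled from [-1, 1] to [0, 100]
    return t * P(state) + (1 - t) * (SM * mobility + SP * P(state))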

[Figure 5.1: Overall Evaluation Function. The horizontal axis is proximity to termination, from the initial state to a terminal state; as termination approaches, the weight of the payoff component in the overall evaluation increases while the weight of the mobility component decreases.]


5.5 Anytime Algorithm

A key parameter influencing both the quality of the results and the computational cost of the analysis is the number of sample states generated. More samples generally yield better models at the expense of longer analysis time. It is difficult to predict how long the model construction will take for a given number of samples for a given game, so the analyzer implements an anytime algorithm that produces better-quality game models given more time. The anytime algorithm simply starts with a small number of sample states (25) and computes the model. It then doubles the number of sample states and recomputes the model. It repeats this until there is no analysis time remaining, at which point it returns the most recently completed model.
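A minimal sketch of the anytime loop, assuming a hypothetical build_model(num_samples) routine and a wall-clock deadline:

import time

def anytime_analysis(build_model, deadline, initial_samples=25):
    num_samples = initial_samples
    model = build_model(num_samples)        # always complete at least one model
    while time.time() < deadline:
        num_samples *= 2                    # double the sample size each pass
        model = build_model(num_samples)    # keep the most recently completed model
    # (a fuller implementation would also bound or abort the in-progress build)
    return model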

5.6 Use of Evaluation Function in Game-Play

To utilize these functions to play games, the program performs an iterative-deepening minimax search with alpha-beta pruning, hereafter referred to as alpha-beta minimax [KM75]. The program utilizes well-known enhancements, including node ordering based on the previous iteration and transposition tables. When evaluating nodes at the search frontier, terminal nodes are valued according to their actual payoff values as determined by the game description. To evaluate non-terminal nodes, it first computes values for P, T, M, SP , SM by evaluating the functions constructed in the analysis phase. It then uses these values to compute the outcome v according to Equation 5.2. These values are propagated according to the minimax algorithm and are used as the basis for move selection. Technically, alpha-beta minimax is only applicable to two-player, turn-taking,


zero-sum games. The class of games supported for general game playing includes games violating each of these restrictions. To handle these differences, the program uses a set of paranoid assumptions. The original paranoid assumption, due to Sturtevant and Korf [SK00], is applicable to games with more than two players. By assuming that all my opponents play as if they were colluding against me, I can treat them as a single adversary, thus reducing a multi-player game to a two-player game. A different type of paranoid assumption enables games with simultaneous moves to be treated as turn-taking games. This is the assumption that the other players can observe my move before committing to their own moves. Finally, non-zero-sum games can be treated as zero-sum games by assuming that the opponent seeks to minimize my payoff instead of maximizing his own payoff.

5.7 Evaluation Function Results

The operations for automatically constructing heuristic evaluation functions have been incorporated into my player. This section reports results from running the heuristic constructor on a sampling of games. The results were obtained by running the program on a MacBook Pro with a 2.16 GHz Intel Core Duo and 1 GB of RAM. Game descriptions for each of the games were provided by the Stanford General Game Playing group. Different analysis times were used on the various games because larger games take longer for the analysis to converge to a consistent evaluation function.


5.7.1 Racetrack Corridor

Racetrack corridor was introduced in the championship match of the First Annual General Game Playing Competition. It was inspired by the game of Quoridor, which involves moving a piece across a board and placing walls to impede the opponent's progress. It is played on 3 x 5 grids as shown in Figure 5.2.

Figure 5.2: Racetrack Corridor (initial position)

Figure 5.3: Racetrack Corridor (with some walls placed)

Each player controls a piece that starts at the top of one of the grids and must race lengthwise to the other side. In each move, a player may choose to move their piece forward or sideways or may place a horizontal wall that blocks two-thirds of the opponent's track (Figure 5.3). Moves are made simultaneously. Each player begins the game with four walls. The game ends when either player reaches the bottom or after 20 moves, whichever comes first. Though the board looks small, the initial branching factor is one hundred because each player simultaneously


chooses from 10 legal moves. The following results were obtained for white in 2 minutes of analysis:

Payoff = 50 + 7.5 (goal distance: black - white)
Mobility = 0.040 (# walls on left side of white lane - # walls on left side of black lane)
         + 0.039 (# walls on right side of white lane - # walls on right side of black lane)
Terminal = 0.016 (current step #) - 0.102
Payoff Stability = 0.732, Mobility Stability = 0.268

The payoff function values proximity to the goal, an intuitively obvious heuristic in this game. The likelihood of termination increases with each step of the game. The mobility function is non-intuitive, but its lower stability dictates that it will be weighed less heavily than the payoff function.

5.7.2 Othello

An interesting challenge in constructing heuristic evaluation functions for Othello is that the intuitive strategy of greedily maximizing one's disks throughout the game results in poor performance. Machine learning techniques have been effectively utilized to construct heuristic evaluation functions for Othello, such as in [Bur02]. However, these techniques have relied on self-play on the order of hundreds of thousands of games. One hour of analysis on Othello produced the following result for white:

Payoff = +2.246 (lower right corner: # white - # black)


        + 1.756 (upper left corner: # white - # black)
        + 1.561 (upper right corner: # white - # black)
        + 1.418 (lower left corner: # white - # black)
        + 1.372 (lower edge: # white - # black)
        + 1.308 (right edge: # white - # black)
        + 0.898 (upper edge: # white - # black)
        + 0.605 (lower edge: # white - # black)
        + 50
Mobility = 0.0027 (# black pieces - # white pieces)
Terminal = 0.061 - 0.0015 (# empty cells)
Payoff Stability = 0.837, Mobility Stability = 0.163

The payoff function seeks to occupy corners and edges, with the corners weighted more heavily than the edges. (Note that corner pieces are also counted as edge pieces.) The mobility function is actually a reverse material function, though the payoff stability is significantly greater than the mobility stability, so the reverse mobility is given a relatively small weight initially and this weight is decreased further as the number of empty squares decreases.

5.7.3 Chess

Chess is a challenging game with a long history of AI work, most of which has utilized highly tuned heuristic evaluation functions formulated by chess experts. The GDL game description for chess holds fairly closely to the official game rules, including special moves such as castling and en passant. It does not, however, contain the rules for a draw after repeating a state three times. Instead, it calls a game a draw if no checkmate or stalemate is reached after 200 moves.


One hour of analysis yielded the following result:

Mobility = 0.060 (# white queens - # black queens)
         + 0.035 (# white rooks - # black rooks)
         + 0.027 (# white bishops - # black bishops)
         + 0.017 (# white knights - # black knights)
         + 0.0031 (# white pawns - # black pawns)

It did not find any stable features correlated with payoff, which is unsurprising given the conditions for checkmate. The heuristic evaluation function reduces to the mobility function, which in this case corresponds to a valuation of material. Among material features, it valued queens highest, followed by rooks, then bishops, then knights, then pawns. The relative ordering of the pieces is consistent with the classical valuation of chess pieces (9 for queens, 5 for rooks, 3 for bishops, 3 for knights, and 1 for pawns).

5.7.4 Chinese Checkers

Chinese Checkers is a multi-player game played with marbles on a star-shaped board. The program analyzed a GDL description of a small version of Chinese Checkers for six players, each of which controls three marbles on a star-shaped board as shown in Figure 5.4. The analysis yielded the following for red in two minutes:

Payoff = 0.792 (fraction of red marbles in goal)
       + 0.792 (early payoff for red)
       - 2.546 (total distance red marbles need to travel)
       + 70.123


Figure 5.4: Chinese Checkers

No stable features were found to correlate with mobility, so the evaluation function is strictly the payoff function. In this six-player game, the analyzer was unable to deduce relative features, so only features relevant to the player's own goals appear. The function values states in which more marbles are in place and the distance to the goal of the remaining marbles is minimized.


CHAPTER 6
Techniques Specific to Single-Player Games

The domain of general game playing encompasses games with one or more players. Single-player games are also known as planning problems or puzzles. Although the techniques described in Chapter 5 are applicable to planning problems as well as adversarial games, the absence of adversaries in planning problems enables the use of more specialized techniques. As reviewed in Chapter 3, there is a significant body of previous work in automated planning, but the application of this work to GGP is nontrivial due to differences in how planning problems are formulated and how solutions are evaluated. The most fundamental differences are that GGP requires each move to be committed within a fixed time limit and expresses payoffs as a function from terminal states to numeric values. Given these differences, this chapter describes additional techniques that I have incorporated into my player specifically for single-player games.

6.1 Motivation and Overview

To motivate the use of specialized techniques for planning problems, consider a plan consisting of a sequence of actions resulting in a known payoff. Although such plans exist for single-player games, they generally do not exist for games involving more than one player because both the legal moves and the final payoffs generally depend on the actions of all the players in the game. A motivation for


the use of specialized techniques for single-player games is to capitalize on this simplification. The general approach is to search for plans in the deliberation phase, recording the best plan discovered, then during game play, execute the plan while continuing to look for better plans from the current state. Optimal plans are identifiable because they have GDL’s maximum payoff of 100. If and when an optimal plan is found, the program executes the plan unconditionally. This approach can be refined by observing that in some cases, an uninformed search may find a solution more quickly than a lengthy process of constructing a heuristic evaluation function to guide an informed search. The program does not have a good way of knowing whether or not this is the case for a given game but it can hedge its bets by spawning a child process to perform an uninformed search while the primary process constructs the heuristic evaluation function. This is particularly effective on a dual-processor machine because the two searches are trivially parallelizable. My program uses this technique. A flow-chart summarizing the order of operations for my program when playing a single-player game is shown in Figure 6.1.

6.2 Heuristic Evaluation Function Construction

When constructing heuristic evaluation functions for single-player games, the following minor modifications are made to the abstract-model based technique outlined in Chapter 5. First, whereas the full model consists of a mobility function, a payoff function, and a termination function, the model for a planning problem omits the mobility function and the termination function, constructing only the payoff function.


[Figure 6.1: Flow-Chart for Solving Single-Player Problems. At the start of a single-player game the player forks: one branch constructs the heuristic evaluation function and performs informed search for a plan while the other performs uninformed search. The best plan found is sent, and moves from it are executed while the search for better plans continues in parallel until the game is over.]


The rationale for this is that relative mobility is not applicable unless there are at least two players, and the termination function is only used to modulate the relative weights of the mobility and payoff functions. Another modification of the technique for constructing the abstract model is that as the sample states are generated through random walks down the game tree, the best plan resulting from the random walks is recorded. This enables the random walks to provide value even if the game has insufficient structure for effective heuristics. A final modification is that instead of using the full deliberation time to improve the heuristic evaluation function, it uses only one third of this time to construct the abstract model and corresponding evaluation function, then it begins searching for plans using this evaluation function.

6.3 Search Algorithms

It should come as no surprise that I have found that the most effective search algorithm for single-agent games is highly dependent on the particular game. In response to this observation, my player has evolved to take somewhat of a “shotgun” approach to searching for plans: it tries several different algorithms during deliberation and uses whichever plan has the highest payoff. None of the single-agent search algorithms used here are original to my work, so rather than discuss them in detail, I simply highlight aspects that are particular to their use in general game playing. One constraint imposed by GDL that is relevant to this discussion is that every game must terminate in a finite number of steps. A consequence of this constraint is that a slight modification must be made to problems with repeatable


states, such as sliding tile puzzles or Towers of Hanoi. The modification is to incorporate into the game state some type of step counter and a rule that the game terminates after a certain number of steps.

6.3.1 Uninformed Search

The uninformed search algorithms utilized are breadth-first search, depth-first search, and a heuristic-repair algorithm for constraint-satisfaction problems.

6.3.1.1 Breadth-First Search

The breadth-first implementation is straightforward. An aspect worth noting is that the program attempts to identify step-counters and remove them from the game. Step counters are sometimes introduced in GDL games to make games finite that would otherwise be infinite. For example, sliding tile puzzles and towers of Hanoi problems have reversible operators so states can be revisited indefinitely. By introducing a step counter, the game can be terminated after a fixed number of steps. Removing the step counter can significantly improve the efficiency of breadth-first search in games where such step counters can be identified. However, there is some risk in doing this. It may be that what my program identifies as a step counter actually serves a more subtle role in the game. In such cases, a solution to the modified game may not be a solution to the original game. The player guards against this by verifying each plan with the original game description and rejecting it if it turns out to be invalid.


6.3.1.2 Depth-First Search

The property of depth-first search that is worth noting here is that it is an anytime algorithm. This means it tends to find a plan quickly and gradually find better and better plans. This is helpful in games with a large number of distinct payoff values. Also, the problem of depth-first search not terminating does not arise in general game playing because of the restriction that all games must terminate in a finite number of steps.

6.3.1.3 Heuristic Repair

Heuristic repair is an algorithm for solving constraint satisfaction problems [MJP92]. Although in one sense any single-player game can be represented as a constraint satisfaction problem, the usage here refers specifically to games in which the order in which actions are taken does not affect the resulting state. For example, an N-Queens problem can be encoded as a game in which N queens are placed on an N x N chess board. The payoff value is maximal if none of the queens attack each other, and it is decreased for each queen that is attacked. The order in which these queens are placed is irrelevant to the state of the board and to the final payoff. Similarly, a Sudoku puzzle can be encoded such that each move involves filling in a square and there is a penalty for each row, column, or box in which a number appears more than once. My player attempts to identify constraint satisfaction problems by empirically testing whether the game has a fixed number of moves and a given sequence of legal actions can be reordered without changing the legality of the actions or the resulting payoff. If the game appears to be a constraint-satisfaction problem, it uses heuristic repair to find a solution. If, during the course of heuristic repair, it discovers that a resulting plan is invalid with respect to the original


game description, it abandons heuristic repair and switches to one of the other techniques.
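The empirical test for constraint-satisfaction structure can be sketched as follows; simulate(actions) is a hypothetical helper that replays an action sequence from the initial state and reports whether every action was legal and what the final payoff was.

import random

def looks_like_csp(sample_plans, simulate, trials_per_plan=10):
    # A game "looks like" a CSP if reordering a legal action sequence neither
    # breaks legality nor changes the resulting payoff.
    for actions in sample_plans:
        legal, payoff = simulate(actions)
        if not legal:
            continue
        for _ in range(trials_per_plan):
            shuffled = random.sample(actions, len(actions))
            shuffled_legal, shuffled_payoff = simulate(shuffled)
            if not shuffled_legal or shuffled_payoff != payoff:
                return False
    return True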

6.3.2 Informed Search

Once a heuristic evaluation function is constructed, the program uses this function to engage in informed search for effective plans. Like the uninformed case, multiple algorithms are used with the philosophy that it is more effective to try multiple approaches than to try to automatically determine which algorithm will be most effective for a particular game. The informed search algorithms utilized are as follows.

6.3.2.1 A*

A* is a well-known search algorithm for single-agent path-finding problems [HNR68]. It utilizes a heuristic evaluation function that returns an estimate of the distance to the goal state, which differs from a game-playing evaluation function that estimates the payoff value. Let h_d denote a distance heuristic evaluation function and h_p denote a payoff heuristic evaluation function. The program constructs h_d from h_p using the following relationship:

h_d = (100 − h_p) × k

Here k is a normalization factor equal to the maximum number of steps in the game divided by 100. The heuristic evaluation function constructed in this manner is not guaranteed to be a lower bound on the distance to a state with a maximum payoff, so the resulting A* algorithm is not guaranteed to return optimal solutions. Nevertheless, the approach can be very effective when the game involves finding the


shortest path to a goal and the heuristic evaluation function is proportional to a distance to the goal.
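The conversion from the payoff heuristic h_p to the distance heuristic h_d is simple enough to state directly; a minimal sketch, where max_steps is the game's maximum number of steps:

def make_distance_heuristic(h_p, max_steps):
    k = max_steps / 100.0            # normalization factor
    def h_d(state):
        return (100.0 - h_p(state)) * k
    return h_d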

6.3.2.2 Minimin Lookahead Search

The minimin lookahead algorithm involves a limited lookahead search and the application of a heuristic evaluation function [Kor90]. Traditionally, this is used with a heuristic evaluation function that estimates the distance to a goal, so the node chosen is one that minimizes this estimate. My program uses the payoff-oriented evaluation function, so it instead maximizes the value, but the idea is the same. The program looks ahead for a fixed period of time, starting with one tenth of a second per move and increasing the amount of time with each simulated game. This is similar to iterative deepening, but is based on time instead of search depth.

6.3.2.3 Depth-First Search

The depth-first search is like the uninformed version, but the heuristic evaluation function is used to order the nodes.

6.4 Algorithm Composition

During the deliberation phase of a single-player game, the program performs informed search and uninformed search in parallel as depicted in Figure 6.1. For each type of search, it first determines which of the algorithms just described are applicable. For example, heuristic repair is only applicable to constraint satisfaction problems. The program then divides its remaining deliberation time among the searches and executes them in sequence.


During game-play, two informed search algorithms are used in parallel. Normally, one is minimin lookahead search and the other is depth-first search. For constraint satisfaction problems, heuristic repair is used instead of minimin lookahead search.

6.5 Summary

This chapter has described techniques that I have implemented to exploit particular characteristics of single-player games. The program attempts to construct plans using several different search algorithms, each of which is effective for different types of problems, then it executes the best plan it can find while continuing to search for better plans. This “shotgun” approach of trying several search algorithms that are effective on more limited classes of games is designed to complement the general methods discussed elsewhere in this dissertation.


CHAPTER 7
Rollout-Based Monte Carlo Methods

7.1 Introduction

My approach to general game playing discussed thus far has been based on alpha-beta minimax search. This algorithm has its roots in classical two-player turn-taking games such as chess and checkers, particularly in games where there are well-established effective heuristic evaluation functions. We now turn our attention to a different approach to game-tree search known as Monte Carlo methods. The term Monte Carlo methods typically refers to estimation methods involving the averaging of results of random simulations. The use of random simulations makes Monte Carlo methods a natural fit for games with stochastic elements or imperfect information, so it should come as no surprise that game-playing programs have successfully employed Monte Carlo methods in backgammon [TG97], bridge [Gin01], and Scrabble [She02]. However, the use of Monte Carlo methods is not limited to domains with stochastic elements or hidden information. Its ability to exploit sampling as a means of performing sparse lookahead search has made Monte Carlo methods a technique of choice in the deterministic, perfect information game of Go [Bru93, BH03, GW06]. Kocsis and Szepesvári describe a specific use of Monte Carlo methods called rollout-based Monte Carlo planning [KS06]. The idea is to estimate the value of an action from a given state by sampling selectively, treating each state in a game


tree as a multi-armed bandit problem [ACF02]. Applied to games, the approach works as follows. During game-play, the values of possible actions are evaluated by simulating game-play and averaging returns. The simulation is initially random, but as simulations proceed, moves are biased towards those that have thus far produced a higher payoff. The biasing is performed using a data structure in which the average returns for each role are recorded for each visited state. The data structure is a game-tree in which each node contains a map of actions to average payoff values and the number of simulations performed with that action from the corresponding state. One component of rollout-based Monte Carlo methods is the policy used for selecting partially explored actions. Effective policies balance the competing goals of exploiting knowledge of which action has had the biggest payoff so far and exploring other actions to gain a more accurate estimate of their values. A simple but effective class of policies is the ε-greedy policies, which pick a random move with probability ε and pick the move with the highest average payoff with probability 1 − ε. This is a parameterized policy in that it can be used for any ε ∈ (0, 1). Typically, ε is a function of the number of simulations such that ε decreases with the number of simulations, producing the behavior that the algorithm is initially exploratory and becomes increasingly exploitative the longer it runs. In the 2007 AAAI General Game Playing Competition, at least two entrants successfully utilized some form of rollout-based Monte Carlo methods, including the winning entrant by Yngvi Björnsson and Hilmar Finnsson [BF07] as well as the entrant by Jean Mehat, which performed very well in the qualifying round [Meh07]. The success of these programs was the catalyst for my own more recent investigation of this approach.
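As a small illustration (not taken from any particular player), an ε-greedy selection over a node's per-action statistics might look like this, where stats maps each action to its accumulated payoff and visit count:

import random

def epsilon_greedy(stats, epsilon):
    # stats: {action: (total_payoff, visit_count)}
    if random.random() < epsilon:
        return random.choice(list(stats))   # explore
    # Exploit: the action with the highest average payoff so far.
    return max(stats, key=lambda a: stats[a][0] / max(stats[a][1], 1))

# Typically epsilon decays with the number of simulations, for example
# epsilon = 1.0 / (1.0 + num_simulations / 100.0), so the search is
# exploratory at first and increasingly exploitative later.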


7.2 Use of Heuristic Evaluation Functions

Rollout-based Monte Carlo methods comprise a tree-search component, in which the actions are chosen based on the policy (such as an ε-greedy policy), and a simulation component, in which actions are chosen randomly. In the simplest version of rollout-based Monte Carlo methods, the tree search proceeds until a leaf node is reached, then random simulation proceeds until a terminal state is reached, at which point the payoff values are computed and propagated back to the tree nodes. This process repeats until it is time to make a move, at which point the move with the highest expected value is chosen. When an action is taken, the child node corresponding to that action becomes the new root node and the process repeats. When a heuristic evaluation function is available, an alternative to simulating play to the end of the game is to simulate to a fixed depth and then evaluate the non-terminal state using the heuristic evaluation function as an estimate of the payoffs. The heuristic evaluation functions developed earlier provide an estimate of the final payoff value and can be used in this manner. A straightforward extension of this idea is to perform iterative deepening, where the simulation portion is depth-limited starting at depth one and incrementally increasing the depth of the simulations.

7.3 Use of Action Heuristics

The simplest version of the random simulation component of Monte Carlo methods is to randomly pick an action from the uniform distribution of all legal moves for the given player from the given state. This approach is both simple and efficient, but in highly structured games, it does not typically result in simulations


that are very representative of intelligent game-play. Another approach is to bias the move selection of the random simulations based on heuristics about which moves are more likely to be "good". I will use the term action heuristics to refer to heuristics that provide recommended actions from a given state, as opposed to heuristic evaluation functions, which provide a numeric assessment of the states themselves. Action heuristics are more like the heuristics investigated by Lenat [Len82] than the heuristic evaluation functions that are the primary focus of this dissertation. Like heuristic evaluation functions, action heuristics tend to be game-specific. An example of a game-specific action heuristic that has been successfully used in Go-playing programs is the locality heuristic. The idea is that in the game of Go, a good move is often in the near vicinity of a move that was just placed. Action heuristics based on locality can exploit this idea by choosing moves close to the previous move with a higher probability than those in other regions of the board.

7.4 Automatic Construction of Action Heuristics

The use of action heuristics in general game playing requires a technique for automatically constructing action heuristics from a game description. This section describes such a technique that builds upon the method for automatically constructing heuristic evaluation functions described earlier. The representation, interpretation and use of the heuristics are described first, followed by their method of construction. Each action heuristic is represented by a data structure containing an action expression and, optionally, a condition expression. The action expression is a GDL expression representing an action. The optional condition expression is a


GDL expression that must be true in the current state for the action heuristic to apply. Variables serve to abstract details from the actions and conditions, and can also express relations between the actions and the conditions. For example, in Chess, the action heuristic: move(X1, Y1, X2, Y2) if true(cell(X2, Y2, ?piece)) can be interpreted as a heuristic that recommends the action of moving a piece to a position where a piece already exists: in other words, it recommends capturing pieces. For a given game, heuristics are constructed in two sets: action heuristics and anti-heuristics. The interpretation of the latter is that they represent actions to avoid rather than actions to take. During simulation with action heuristics, the simulator generates the legal moves from the given state. Each move is checked to see if it matches any of the action heuristics. If a match is found, it is considered a preferred move. If there is at least one preferred move, a move is chosen randomly from the uniform distribution of preferred moves. If there are no preferred moves, then it checks the anti-heuristics and chooses randomly from legal moves excluding those that the anti-heuristics suggest avoiding. Finally, if all the moves are to be avoided, it chooses randomly among all the legal moves. The automatic construction of action heuristics and anti-heuristics is performed during the deliberation phase along with the construction of the heuristic evaluation function. Immediately after constructing the heuristic evaluation function, the program takes the sample states used during the heuristic evaluation function construction and arranges them in pairs, where each pair of states consists of a sample state and an immediate successor state. Associated with each pair is also the set of actions taken at the first state that led to the subsequent state. The heuristic evaluation function is applied to both states and the difference is taken. Next, a set of candidate action heuristics are generated. For each


candidate action heuristic, the state pairs are partitioned into those in which the candidate applies and those in which it does not apply. The average difference in evaluation function values is computed for both the applicable pairs and the inapplicable pairs. If the average of the applicable pairs is significantly more than the average of the inapplicable pairs, the corresponding candidate heuristic is considered an action heuristic. If the average of the applicable pairs is significantly less than the average of the inapplicable pairs, the corresponding candidate heuristic is considered an anti-heuristic. If the average of the applicable pairs and the inapplicable pairs is roughly comparable, then the candidate heuristic is considered neither an action heuristic nor an anti-heuristic. In the current implementation, "significantly more" and "significantly less" mean at least two standard deviations. Monte Carlo methods using both heuristic evaluation functions described earlier and action heuristics as described here have been implemented in the general game player. Results of how this player performs relative to other techniques are presented in Chapter 8.
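As a concrete but hypothetical illustration of the classification step just described, the sketch below takes, for one candidate heuristic, the change in evaluation between each sample state and its successor together with a flag saying whether the candidate matched the action taken; measuring the two-standard-deviation threshold against the inapplicable pairs is my own choice for the example.

import statistics

def classify_candidate(pairs):
    # pairs: list of (delta_eval, candidate_applies) tuples
    applicable = [d for d, applies in pairs if applies]
    inapplicable = [d for d, applies in pairs if not applies]
    if len(applicable) < 2 or len(inapplicable) < 2:
        return "neither"

    gap = statistics.mean(applicable) - statistics.mean(inapplicable)
    threshold = 2 * statistics.stdev(inapplicable)

    if gap > threshold:
        return "action heuristic"   # matching moves tend to raise the evaluation
    if gap < -threshold:
        return "anti-heuristic"     # matching moves tend to lower the evaluation
    return "neither"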


CHAPTER 8
Alpha-Beta Minimax versus Monte Carlo Methods

8.1 Overview

Although much of the emphasis of this dissertation is on general techniques, in practice every technique will tend to be more effective in some types of games and less effective in others. It is thus desirable to be able to characterize for a given technique the factors influencing its effectiveness. In the case of competing techniques, such as alpha-beta minimax search and Monte Carlo methods, it is beneficial to be able to characterize the conditions under which one technique outperforms the other. An ideal situation would be to identify an efficient procedure for accurately predicting which algorithm will be superior given the game description, start time, and time per move. A game playing program could then automatically determine, for example, that on a particular game alpha-beta minimax will be more effective, whereas in a different game Monte Carlo methods would be more effective. In practice, such a crisp, tractable characterization seems unattainable. Instead, I pursue here the less ambitious goal of identifying simple principles that roughly characterize the trade-offs between these techniques. I will focus on games that are large enough that they cannot be solved completely.


Consider the following conditions for a game and corresponding heuristic evaluation function:

1. The heuristic evaluation function is both stable and accurate.

2. The game is two-player.

3. The game is turn-taking.

4. The game is zero-sum.

5. The branching factor is relatively low.

The conjecture proposed here is that when the above conditions are largely met, alpha-beta minimax generally outperforms Monte Carlo methods. (There are other conditions relevant to alpha-beta minimax's effectiveness, such as stochastic elements and hidden information, but they are outside the scope of the class of games considered in this dissertation.) Conversely, the further one deviates from the above conditions, the weaker alpha-beta minimax becomes relative to Monte Carlo methods. When the game departs far enough from the above set of conditions, the scale tips in favor of Monte Carlo methods.

A few words of explanation are perhaps in order. Strictly speaking, the classical alpha-beta minimax algorithm only applies to two-player, turn-taking, zero-sum games, so it is not meaningful to talk about its effectiveness outside of its domain of application. What I mean here is the algorithm that results from applying the various paranoid assumptions described in Chapter 5. The assumptions for each class are:

multi-player games: that all opponents can be modeled as a collective operating as a single adversary.

simultaneous moves: that my opponents can observe my move before committing to their own moves.

non-zero-sum: that my opponents seek to minimize my payoff instead of maximizing their own payoffs.

The remainder of this chapter describes an empirical investigation of the factors influencing the relative effectiveness of alpha-beta minimax and rollout-based Monte Carlo methods for general game playing.

8.2 Randomly Generated Synthetic Games

The method of investigation centers on running tests of the player with different features enabled or disabled on randomly generated games. The approach loosely follows the development of Pearl’s P-games [Pea84] and Nau’s N-games [Nau83]. P-games are two-person perfect information games randomly constructed as finite game trees with a given uniform branching factor b and depth d, in which leaf nodes are independently assigned either a win or a loss according to the same probability for all leaf nodes. N-games have a similar structure, but the win/loss probabilities are handled differently in order to model incremental variation in node strength. In N-games, each node has a strength and each child node has a strength that is either one more or one less than the strength of its parent node, randomly chosen with 50% probability each. A leaf node is given payoff values of one if the node has positive strength and negative one otherwise. The model used here is based on N-games, but with changes to accommodate the following design constraints: 1. Avoid explicitly enumerating the payoff values of the bd terminal states in


order to generate and play larger games. 2. Support simultaneous move games. 3. Support games of arbitrary numeric payoffs. To accommodate these constraints, the N-game model was modified in the following way. Each node has a strength, but the strength is per-player. In zerosum games, the strength of one player is incremented by the same amount that the strength of the opponent is decremented, but in non-zero-sum games, the strengths may be incremented and decremented independently. Each node in the game tree is represented by the strength numbers, the depth of the node, and a set of propositional fluents. The number of propositional fluents is a parameter that is input as part of the game generation. Using p fluents enables states with 2p possible values for the propositional fluents. Alternatively, this can be viewed as a bit-string of length p. Writing the transition rules and the final payoffs in terms of these fluents enables a complete game specification without the explicit enumeration of the terminal states, enabling large games to be specified relatively succinctly.

2

The initial values of the propositional fluents are chosen randomly with uniform probability. The initial strength for each player is 50 on a scale of [0, 100]. The transition rules describe the new values of the propositions in terms of the previous propositional values and the actions of the players. To do this, each propositional variable is associated with a randomly constructed Boolean circuit. The incrementing or decrementing of the strength values are also described with similar Boolean circuits. The terminal states are those with depth d and the payoff values are the final strength values of players. 2

For an alternative solution to the problem of efficiently describing randomly generated game trees, see [KC96].


Finally, this model lends itself to a straightforward GDL representation, so the game generator emits the generated games in GDL.
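The following sketch illustrates the state model and a successor function in the spirit described above; the particular Boolean update (a parity of a few previous fluents, flipped by some joint actions) and the integer encoding of joint actions are stand-ins of mine, not the generator's actual circuits, which are emitted as GDL rules.

import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    depth: int
    strengths: tuple   # one value per player, on a [0, 100] scale
    fluents: tuple     # booleans; 2**p possible assignments for p fluents

def make_successor_fn(num_fluents, num_joint_actions, delta=5, seed=0):
    rng = random.Random(seed)
    # For each fluent, pick a few input fluents and the joint actions that flip it.
    inputs = [rng.sample(range(num_fluents), k=min(3, num_fluents))
              for _ in range(num_fluents)]
    flips = [set(rng.sample(range(num_joint_actions), k=num_joint_actions // 2))
             for _ in range(num_fluents)]
    gain_actions = set(rng.sample(range(num_joint_actions), k=num_joint_actions // 2))

    def successor(node, joint_action):
        # Each new fluent is a parity of a random subset of the old fluents,
        # optionally flipped by the joint action (a stand-in for a Boolean circuit).
        new_fluents = tuple(
            ((sum(node.fluents[j] for j in inputs[i]) % 2) == 1) ^ (joint_action in flips[i])
            for i in range(num_fluents))
        # Zero-sum strength update: one player's gain is the other's loss.
        gain = delta if joint_action in gain_actions else -delta
        s0 = max(0, min(100, node.strengths[0] + gain))
        s1 = max(0, min(100, node.strengths[1] - gain))
        return Node(node.depth + 1, (s0, s1), new_fluents)

    return successor

# Example start state (20 fluents, both strengths at 50):
# start = Node(0, (50, 50), tuple(random.choice([True, False]) for _ in range(20)))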

8.3 Experiments

In order to investigate the effects of various game parameters on the relative effectiveness of alpha-beta minimax and Monte Carlo methods, the game generator was used to randomly generate games according to various configurations. Common aspects of the configurations were a game length of 50 moves and states containing 20 propositions (plus the step counter and strength values). In turn-taking games, the state also contained a fluent indicating whose turn it was. One hundred games were randomly generated for each configuration tested. Each game was played twice, in head-to-head matches between the alpha-beta minimax player and the Monte Carlo player, once with alpha-beta minimax as first player and once with Monte Carlo as first player. The players were given thirty seconds to analyze the game and two seconds per move. These times are relatively short compared to games played in AAAI competitions, but the simple structure of the generated games lends itself to quicker game-play.

8.4 Results

The results are shown in Table 8.1. With one exception, the Monte Carlo algorithm improved relative to alpha-beta minimax as the branching factor increased for both turn-taking games and simultaneous move games. The exception was from two moves to four moves in the turn-taking games. The difference here was within one standard deviation, so it is not clear whether this exception is significant or not.


Table 8.1: Zero-Sum Games

Move Model      Number     Branching    Average Score            Standard
                of Moves   Factor       α-β      Monte Carlo     Deviation
Turn-taking     2          2            54.5     45.5            2.3
Turn-taking     4          4            56.2     43.8            3.2
Turn-taking     8          8            55.9     44.1            3.8
Turn-taking     16         16           47.9     52.1            6.1
Turn-taking     32         32           42.2     57.8            7.9
Turn-taking     64         64           37.5     62.5            5.7
Simultaneous    2          4            57.1     42.9            4.0
Simultaneous    4          16           53.7     46.3            5.3
Simultaneous    8          64           43.2     56.8            6.6

A possible explanation for the relationship between branching factor and the relative performance of alpha-beta minimax and Monte Carlo methods is that alpha-beta minimax is more sensitive to the amount of time per move. This would mean that increasing the branching factor is especially disadvantageous to alpha-beta minimax unless the time per move is also increased. To test this explanation, I ran three of the turn-taking configurations varying the time per move. In each case, one hundred games were generated randomly and the same games were run with both role-assignments for each move time. The results are shown in Table 8.2. The results suggest that although increasing the branching factor while holding the time per move constant tends to favor Monte Carlo methods, increasing the time per move while holding the branching factor constant can favor alpha-beta minimax.


Table 8.2: Varying Time Per Move

                                    Average Score
Number      Time             α-β     Monte Carlo    Standard
of Moves    (Sec/Move)                              Deviation
    4           2            56.2       43.8           3.2
    4          16            56.4       43.6           3.6
   16           2            47.9       52.1           6.1
   16          16            59.0       41.0           4.8
   64           2            37.5       62.5           5.7
   64          16            45.8       54.2          11.3

The results we have discussed so far have been for zero-sum games. (Technically, these are constant-sum games because they are played on a scale of [0, 100]. This distinction is not important for the present discussion, so I use the more common term zero-sum.) The situation is more complicated with non-zero-sum games. I ran a series of tests on randomly generated non-zero-sum games. The potential for cooperation between players in non-zero-sum games makes tests that simply pit alpha-beta minimax against Monte Carlo insufficient. In addition to this pairing, I also ran tests of each algorithm paired with another instance of itself and averaged the scores for each configuration. It is important to keep in mind that the players did not know what the other players were, nor was there any communication between the players. So there was no coordination of action, merely the possibility for players to choose moves that were mutually beneficial, mutually detrimental, or beneficial to one player and detrimental to the other. I did this for both turn-taking games and simultaneous move games and varied the number of moves per step. The results are shown in Table 8.3.

Table 8.3: Nonzero-Sum Games

                                                              Average Score
Move Model      Number     Players                          α-β    Monte Carlo
                of Moves
Turn-taking        16      α-β vs. Monte Carlo             54.3       50.6
Turn-taking        16      α-β vs. α-β                     51.0        -
Turn-taking        16      Monte Carlo vs. Monte Carlo       -        53.4
Turn-taking        64      α-β vs. Monte Carlo             42.1       60.0
Turn-taking        64      α-β vs. α-β                     50.1        -
Turn-taking        64      Monte Carlo vs. Monte Carlo       -        51.2
Simultaneous        4      α-β vs. Monte Carlo             59.4       46.4
Simultaneous        4      α-β vs. α-β                     49.5        -
Simultaneous        4      Monte Carlo vs. Monte Carlo       -        57.3
Simultaneous        8      α-β vs. Monte Carlo             49.3       53.6
Simultaneous        8      α-β vs. α-β                     50.2        -
Simultaneous        8      Monte Carlo vs. Monte Carlo       -        54.1

Here are my observations. First, neither alpha-beta minimax nor Monte Carlo consistently outplayed the other when playing head-to-head. Instead, for both turn-taking games and simultaneous move games, increasing the branching factor tilted the balance more in favor of Monte Carlo methods when the two algorithms played against each other. This is consistent with the observations in the zero-sum games. However, for the non-zero-sum games, the average payoff of an instance of Monte Carlo playing against another instance of Monte Carlo is greater than the average payoff of an instance of alpha-beta minimax playing against another instance of alpha-beta minimax. This result seems reasonable because the alpha-beta minimax implementation uses the paranoid assumption that the opponent will always try to minimize my score, so it doesn't capitalize on opportunities for mutual benefit.

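To make the role of the paranoid assumption concrete, the sketch below contrasts a paranoid backup, in which the opponent is assumed to minimize my payoff, with a best-response backup that credits the opponent with maximizing its own payoff. The two-player game-tree type is an illustrative assumption and is not how my program represents games.

(* Each leaf carries a pair of payoffs: (my payoff, opponent payoff). *)
type node =
  | Leaf of float * float
  | Node of bool * node list             (* (is it my move?, children) *)

(* Paranoid backup: on the opponent's move, assume it minimizes my payoff. *)
let rec paranoid = function
  | Leaf (mine, _) -> mine
  | Node (my_move, children) ->
      let values = List.map paranoid children in
      if my_move then List.fold_left max neg_infinity values
      else List.fold_left min infinity values

(* Best-response backup: the opponent maximizes its own payoff, which in a
   non-zero-sum game can preserve mutually beneficial lines that the
   paranoid model discards.  Children are assumed non-empty. *)
let rec best_response = function
  | Leaf (mine, theirs) -> (mine, theirs)
  | Node (my_move, children) ->
      let candidates = List.map best_response children in
      let better (m1, t1) (m2, t2) =
        if my_move then (if m2 > m1 then (m2, t2) else (m1, t1))
        else if t2 > t1 then (m2, t2) else (m1, t1)
      in
      List.fold_left better (List.hd candidates) candidates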

8.5

Real Games

To see how the methods compare on real games (as opposed to randomly generated games), I also ran trials of a collection of two-player games from past AAAI GGP tournaments. Each game was played six times (three with one role/player pairing and three with the other role/player pairing). Each game was played with 30 seconds per move. The start times were in the range of those used in AAAI competitions. The results are shown by game in Table 8.4 and by branching factor in Figure 8.2. The average branching factors were estimated by averaging the branching factor for each state in 100 games generated by random moves chosen from the uniform distribution of legal moves from each state.

A couple of cautionary comments are in order when interpreting the results. First, the games differ from each other in multiple ways, so the contributing reasons why one method did better than another on a particular game are not readily attributable to a single factor. Second, the games used were ones I had seen before in the context of competitions, and in some cases I have refined my program as a result of its performance on them. In particular, I have refined the feature detection algorithms as I have encountered games with obvious features not originally detected. This means that the set of games is biased towards games for which my program constructs better heuristic evaluation functions than it would for games I have never seen before. I would expect this bias towards better heuristic evaluation functions to translate to a bias favoring alpha-beta minimax, because Monte Carlo methods are more robust to poor heuristic evaluation functions than alpha-beta minimax.

With these caveats in mind, the results are shown in tabular form in Table 8.4 and graphically in Figure 8.1 and Figure 8.2. The results shown in Figure 8.1 are based on percentage of combined payoff for both players, regardless of whether the game was strictly adversarial or not. For several of the games, alpha-beta minimax and Monte Carlo methods had exactly the same average payoff. In some of these games, particularly the Tic-Tac-Toe variants, this is because the game was small enough that both versions were able to play optimally. In other cases, such as Amazons, the algorithms appeared evenly matched. In several games, alpha-beta minimax dominated Monte Carlo methods. I suspect that in several cases, such as breakthrough (regular and suicide) and chess, a significant factor in the dominance of alpha-beta minimax is that the heuristic evaluation functions constructed for these games are particularly good. This is consistent with the notion that a high-quality heuristic evaluation function benefits alpha-beta minimax more than Monte Carlo methods. In other games, Monte Carlo methods outperformed alpha-beta minimax. In some of these, such as Blocker Parallel, a large branching factor is a likely explanation. However, as Figure 8.2 shows, branching factor alone is insufficient to determine whether alpha-beta minimax is likely to outperform Monte Carlo methods. This is consistent with the supposition that other factors (such as the quality of the heuristic evaluation function) play a significant role.
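The branching-factor estimate described above amounts to the following sketch; the labeled arguments stand in for the game interface and are assumptions rather than my program's actual API.

(* Average the number of legal moves over every state visited in n_games
   random playouts (moves drawn uniformly from the legal moves). *)
let estimate_branching_factor ~initial ~legal_moves ~next ~terminal ~n_games =
  let total_moves = ref 0 and total_states = ref 0 in
  for _i = 1 to n_games do
    let state = ref initial in
    while not (terminal !state) do
      let moves = legal_moves !state in
      total_moves := !total_moves + List.length moves;
      incr total_states;
      state := next !state (List.nth moves (Random.int (List.length moves)))
    done
  done;
  float_of_int !total_moves /. float_of_int !total_states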

Table 8.4: Games from AAAI Tournaments

                                Start Time   Branching      Average Score
Game                            (Minutes)    Factor        α-β    Monte Carlo
Amazons                              4          357        50.0      50.0
Amazons Suicide                      4          360        33.3      66.7
Blocker                              1          106        50.0      50.0
Blocker Parallel                     1       20,417         8.3      75.0
Blocker Serial                       1          106        25.0      75.0
Bomberman                            4           17        41.7      58.3
Breakthrough                         4           26       100.0       0.0
Breakthrough Suicide                 4           26       100.0       0.0
Checkers                             4            6        48.3      51.7
Checkers (cylinder)                  4            6        53.3      46.7
Chess                               60           28        91.7       8.3
Chess (skirmish)                    10           34        87.2      64.5
Chess (skirmish suicide)            10           12        13.0      15.2
Chinese Checkers                     1            6        58.3      58.3
Connect-Four (7x6)                   1            7       100.0       0.0
Connect-Four (7x6 suicide)           1            7        66.7      33.3
Connect-Four (8x6)                   1            8        66.7      33.3
Connect-Four (8x6 suicide)           1            8        16.7      83.3
Ghost Maze                           1            4        75.0      25.0
Kahala                               4            5        63.3      36.7
Mummy Maze                           1            5       100.0       0.0
Nothello                            30            5        83.3      16.7
Othello                             60            8        83.3      16.7
Pentago                              4           15        50.0      50.0
Pentago Suicide                      4           15        66.7      33.3
Racetrack Corridor                   4           17       100.0      10.0
Tic-Tac-Toe                          1            6        50.0      50.0
Tic-Tac-Toe Simultaneous             1           38        50.0      50.0
Tic-Tac-Toe Parallel                 1           40        58.3      33.3
Tic-Tac-Toe Serial                   1            6        50.0      50.0
Tic-Tac-Toe 5x5                      1           14        50.0      50.0
Tic-Tac-Toe 5x5 Suicide              1           10        50.0      50.0
Quarto                               2           10        83.3      16.7
Quarto Suicide                       2           10        66.7      33.3

[Figure 8.1: Two-Player Games from AAAI Tournaments. Bar chart of the percentage of combined payoff obtained per game by the Alpha-Beta and Monte Carlo players.]

[Figure 8.2: Results by Branching Factor. Payoff plotted against branching factor (log scale) for the Alpha-Beta and Monte Carlo players.]

CHAPTER 9

Empirical Results: AAAI GGP Competitions

Successive versions of my program have participated in the AAAI GGP Competition since the competition's introduction. This chapter describes the versions of the program submitted and the results achieved. The overall achievement was that my entrants advanced to the championship round each year, once winning the championship and twice placing second.

9.1

First Annual GGP Competition

The first version of my program used four heuristics: partial goal attainment, material, distance, and mobility. The partial goal attainment heuristic applies to games in which goals are expressed as a conjunction of subgoals. It attempts to maximize the number of subgoals attained. The material heuristic attempts to maximize material based on modeling assumptions about how movable, capturable pieces are expressed. The distance heuristic identifies goal expressions containing relations representing linear coordinates. It compares similar expressions in the current state to compute a distance between the current state and the goal state and favors states that minimize this distance. The mobility heuristic attempts to maximize the number of moves available to the player relative to the moves available to its opponents. The overall heuristic evaluation function is simply a linear combination of these components, where the weights are fixed.
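In outline, the evaluation function of this first version had the following shape. The component structure follows the description above, but the particular weights shown are invented for illustration and are not the values the program used.

(* A heuristic component pairs a fixed weight with a state evaluator. *)
type 'state component = { name : string; weight : float; eval : 'state -> float }

(* The overall evaluation is a fixed-weight linear combination. *)
let evaluate components state =
  List.fold_left (fun acc c -> acc +. c.weight *. c.eval state) 0.0 components

(* Wiring up the four heuristics named in the text (weights illustrative). *)
let first_version_eval ~partial_goal ~material ~distance ~mobility =
  evaluate
    [ { name = "partial goal attainment"; weight = 0.4; eval = partial_goal };
      { name = "material";                weight = 0.3; eval = material };
      { name = "distance";                weight = 0.2; eval = distance };
      { name = "mobility";                weight = 0.1; eval = mobility } ]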


Table 9.1: AAAI 2006 GGP Competition Leaderboard

Player              Score      Programmers
FLUXPLAYER          2690.75    Stephan Schiffel, Michael Thielscher (Dresden University)
CLUNEPLAYER         2573.75    Jim Clune (UCLA)
UT-AUSTIN-LARG      2370.50    Greg Kuhlmann, Peter Stone (UT-Austin)
OGRE                1948.25    David Kaiser (FIU)
LUCKYLEMMING        1575.00    S. Disetelhorst, E. Wagner, M. Gunther (Dresden University)
NNRG.HAZEL          1543.00    Joe Reisinger, Igor Karpov, Erkin Bahceci (UT-Austin)
APPLE-RT             512.50    Xinxin Sheng, Paul Breimyer (NCSU)
THE                  375.00    Michael Marlen (KSU)
ENTROPY              250.00    Richard Holcombe
GGP REMOTE AGENT       0.00    Mohammad Shafiei (Sharif Inst. Tech., Iran)
AIDRIVEN.LEARNER       0.00    Zhenzong Xu
JIGSAWBOT              0.00    Raghuvar Nadig (IITB)

The first competition consisted of nine participants and was held at AAAI 2005 in Pittsburgh. The games were played in a sequence of five rounds, with eliminations after each round. The games were all unknown to the participants prior to the match, and consisted of games such as peg solitaire, multiple Chinese Checkers variants, an Othello variant, and a game called Racetrack Corridor, described in Chapter 5. My player won the overall competition and I was awarded the $10,000 prize.

There were a few lessons learned during the first competition. First, GGP is a very hard problem. The informal consensus among participants was that the quality of play in the first competition was not very high. Also, it was noted that it would be desirable to play significantly more games in order to better evaluate different approaches, preferably more games than would be logistically feasible at an AAAI conference. The majority of the participants expressed continued interest in the project and a desire to participate again the next year.


9.2

Second Annual GGP Competition

The second version of my program was built on the first version but introduced more heuristics, generalized existing heuristics, and included a provision for dynamically determining which heuristics to use in the deliberation phase. The player ran a mini-tournament between different versions of itself with different heuristics enabled. Whichever heuristics were empirically more successful during the mini-tournament were used during actual game-play. My intention was to also use this mini-tournament to determine the weights of the various heuristics used, but time restrictions in the deliberation phase made this impractical, so although the specific components used were determined dynamically, their relative weights were again fixed.

In response to lessons learned the previous year, the second competition was organized somewhat differently than the first. The competition took place in a sequence of four rounds plus a final match. Each round consisted of several types of games played over a period of one to three days. The first three rounds were played over the Internet. The fourth round and the final match were played at AAAI 2006 in Boston on a local network. There were between one and two weeks between rounds, and players were encouraged to improve their players between rounds. Points were accumulated with round one weighted 0.25, round two weighted 0.5, round three weighted 0.75, and round four weighted 1.0. There were twelve participants in the second competition, including revised or rewritten programs from each of the participants that placed in the top four in the first competition. The winner of the competition was a program by Stephan Schiffel and Michael Thielscher of Dresden University. My program placed second overall. The final leaderboard is shown in Table 9.1.

Although my program played well enough to place second overall, its weaknesses were revealed perhaps most clearly in the final match of the tournament. The game was cylinder checkers. The primary difference from traditional checkers is that the playing board has a topology in which the sides wrap around to form a cylinder. Also unlike traditional checkers, where the goal is to make one's opponent unable to move, the goal in cylinder checkers is to maximize the number of one's own pieces and minimize the number of the opponent's pieces. (Kings and regular pieces are weighted equally.) The payoff values range from 0 to 100 and are proportional to the difference in piece count. As in traditional checkers, jumps are forced. The game ends when either player cannot move or after fifty moves, whichever comes first.

In the deliberation phase, my player ran a self-tournament which resulted in enabling several heuristics, including both mobility and material. However, the lack of game-specific weight adjustment caused the mobility heuristic to dominate the evaluation function. Because jumps are forced, an easy way to increase your immediate mobility relative to your opponent is to sacrifice pieces that your opponent is forced to capture. The program pursued this flawed tactic, choosing moves such that the opponent would capture and my player would not have a responding capture. Needless to say, this resulted in extremely poor moves and a loss for my player.

There are a few lessons illustrated by this match that have influenced my current approach. An obvious one is that fixed weights for heuristics can result in a single bad heuristic dominating move selection with disastrous results. Also, the limited deliberation time makes the high-speed mini-tournament an unreliable source of advice about which heuristics to use. An even more fundamental issue, however, is how to characterize criteria that make a heuristic effective, ineffective, or even counter-productive. One observation is that the forced jumps rule in checkers makes mobility extremely unstable, in the sense that the number of moves available tends to oscillate wildly depending on whether or not a jump is available. My intuition is that unstable features like this form a poor basis for evaluation functions.

Table 9.2: AAAI 2007 GGP Competition: Results of Preliminary Rounds

Player           Score      Programmers
CADIA-PLAYER     2723.50    Yngvi Björnsson, Hilmar Finnsson (Reykjavik University)
FLUXPLAYER       2355.50    Stephan Schiffel, Michael Thielscher (Dresden University)
ARY              2252.75    Jean Mehat (University of Paris)
CLUNEPLAYER      2122.25    Jim Clune (UCLA)
UTEXAS LARG      1798.00    Greg Kuhlmann, Peter Stone (UT-Austin)
JIGSAWBOT        1524.00    Raghuvar Nadig (IITB)
LUCKYLEMMING     1250.00    S. Disetelhorst, E. Wagner, M. Gunther (Dresden University)
W-WOLFE           821.25    Ben Handy

9.3

Third Annual GGP Competition

The third version of my program incorporated the abstract-model technique described in Chapter 5 and the single-agent provisions described in Chapter 6. I entered the latest version of my program, based on these techniques, in the third competition. The competition format was similar to the second competition, with preliminary rounds played over the Internet. Some players dropped out during the preliminary rounds. The scores of the remaining players at the end of the preliminary rounds were as shown in Table 9.2. The final games of the third competition were played at AAAI 2007 in Vancouver. My player advanced to the championship round, where it was pitted against the player from Yngvi Björnsson and Hilmar Finnsson of Reykjavik University.


The round consisted of three matches of variants of a chess-like game called Skirmish. The game was played on a chess board with some missing squares. The locations of the missing squares varied for the three matches, but the common element was that pieces could neither land on nor move through the "holes" in the board. Traditional chess pieces were used, but points were awarded based on the number of pieces captured. There was no notion of checkmate, and the king could be captured like any other piece.

In the first match, my player played white. It played well and won the match. In the second match, my player played black and immediately started making very bad moves. It looked as if it was trying to give up pieces. Later examination revealed that there was a bug that I had inadvertently introduced between the previous round and the finals that caused a sign error in the heuristic evaluation function in a certain case. This case applied to the Skirmish game, but only if my player played black. The result was that my player pursued white's heuristic evaluation function even though it should have been pursuing black's heuristic evaluation function. Thus, the poor move selection was actually consistent with the (erroneous) heuristic evaluation function, but the result was a loss of the match. In the final match, my player played white and won the match. However, the margin of the loss for the second match was so large that my player lost the round.

Although the circumstances of the championship round were unfortunate, I should hasten to point out that the winning program by Björnsson and Finnsson also scored the most points in the preliminary rounds and had demonstrated excellent performance over a broad range of games.


9.4

Summary

The Annual AAAI GGP Competition provides an objective measure of general game playing performance in a competitive setting, drawing entrants from researchers worldwide. The sequence of players I have created and entered in the competition has consistently demonstrated world-class performance. Although my entrants have not always won, they have always advanced to the championship round, and even my entrants' losses have been illuminating. The consistently high performance is evidence that the techniques forming the basis of the program are indeed effective in practice.


CHAPTER 10

Discussion and Conclusions

10.1

Summary

In this dissertation, I have presented the task of general game playing and techniques for this task, emphasizing techniques for automatically constructing heuristic evaluation functions. I have argued that general game playing is a promising area for advancing long-term research goals of artificial intelligence. Furthermore, the mathematical models of games are useful and appropriate for modeling a broad array of complex phenomena, suggesting that the relevance of general game playing techniques extends beyond "games" in the traditional sense. I have presented a technique for automatically constructing heuristic evaluation functions from an abstract game model based on payoff, control, and termination functions, each of which is expressed in terms of stable numeric features. I have presented evidence that this technique is effective in practice through AAAI competition results, with my entrants reaching the championship round in all three years of participation. I have described how the heuristic evaluation function can be used in the context of Monte Carlo methods as well as alpha-beta minimax search, and presented some results suggesting that the larger the branching factor of a game, the more effective Monte Carlo methods tend to be relative to alpha-beta minimax.


10.2

Discussion and Future Work

The technique for constructing heuristic evaluation functions described in Chapter 5 is designed to be quite general, but it is better suited to some types of games than others. The games most amenable to this approach are those that are described in terms of expressions that can be interpreted numerically such that the resulting numeric features tend to vary incrementally and correlate with either mobility or final payoff or both. This suggests a few potential areas for future work. One is the construction of composite features that vary incrementally from primitive features that are unstable in isolation. For example, the notion of "stable disks" in Othello is a more specific notion than the notion of stability presented in this dissertation, and the automatic construction of such a feature from the Othello game description seems to require a more sophisticated technique. Another area of future work is to automatically compress game descriptions in order to mitigate any sensitivity to how a game is described.

One area of future work is to automatically determine when to use alpha-beta minimax and when to use rollout-based Monte Carlo methods. Results so far suggest that this determination should be based on criteria such as the game's average branching factor and an estimated quality of the heuristic evaluation function. This suggests the need for a computationally tractable means of quantifying the estimated quality of a heuristic evaluation function, which is another area of future work. Another area of investigation is how to effectively utilize additional computational resources, such as a network of computers.


10.3

Conclusion

Evolutionary biology suggests that in a very real sense, all of life is a game. General game playing provides a framework for exploring techniques for playing a very broad class of games and this dissertation has presented effective techniques for this domain. Scientific endeavor itself is a type of game in which the payoffs to society are the practical benefits of technological innovation that result from scientific advances. We rely on limited look-ahead capabilities and heuristics to assess the merit of scientific contributions that have yet to mature into concrete technological innovations. At this stage, the prospects of general game playing techniques look promising, so I look forward to seeing the game unfold.


APPENDIX A

Interpreting Heuristic Evaluation Functions

One characteristic of the technique for constructing heuristic evaluation functions described in Chapter 5 is that the resulting evaluation functions, when described in English, are comprehensible to humans. A practical benefit of this comprehensibility is that it helps me to analyze and diagnose bugs in my program. Comprehensibility facilitates debugging by enabling me to interpret what the heuristic evaluation functions mean for known games and compare these functions with my own intuitions about how states should be evaluated. An additional potential benefit of comprehensible evaluation functions is the potential for the development of "mixed-initiative" systems, that is, collaborative efforts in which humans and machines work together toward common goals. Comprehensible evaluation functions may facilitate collaborations leveraging the complementary capabilities of humans and automated agents more easily than techniques that are more opaque to humans, such as large multi-layered neural networks.

Although I have presented heuristic evaluation functions generated by my program in human-readable terms, my program does not translate its heuristic evaluation functions to English descriptions automatically. Instead, the program outputs some expressions which I interpret and paraphrase in English. This appendix describes what my program outputs and how I interpret it through a few examples.


A.1

Chess

The abstract model for Chess as reported in Chapter 5 was output from the program as follows:

Mobility for white =
    0.060   (cell ?x ?y wq) - (cell ?x ?y bq)
  + 0.035   (cell ?x ?y wr) - (cell ?x ?y br)
  + 0.027   (cell ?x ?y wb) - (cell ?x ?y bb)
  + 0.017   (cell ?x ?y wn) - (cell ?x ?y bn)
  + 0.0031  (cell ?x ?y wp) - (cell ?x ?y bp)

The output includes only the mobility function, indicating that nothing was found for the payoff function. The coefficients of the linear combination are self-evident, whereas the features associated with each coefficient require some interpretation. The expressions within parentheses are GDL expressions. Each feature is associated with a numeric interpretation of a GDL expression. The default interpretation is the solution-cardinality interpretation. Each of the above expressions implicitly uses a solution-cardinality interpretation because the output doesn't indicate otherwise. So the interpretation of (cell ?x ?y wq) is the number of unique bindings for the variables ?x and ?y such that the expression is true in the current state.

Another aspect of interpreting the output is understanding what the game-specific symbols mean. In this case, some familiarity with standard chess notation makes the meanings of the symbols fairly obvious. These meanings are confirmed by inspection of the game description:


wq: white queen

bq: black queen

wr: white rook

br: black rook

wb: white bishop

bb: black bishop

wn: white knight

bn: black knight

wp: white pawn

bp: black pawn

Note that if the author of the chess description had used non-standard symbols, or if the game description was obfuscated as descriptions are in competition, then the meanings of the symbols would not be clear. I have no technique for dealing with this. If I want to know what an evaluation function means, I run it on a non-obfuscated game description.

The cell relation corresponds to individual squares within the chess board, so an expression such as (cell e 4 wp) means there is a white pawn on E4. Given the solution-cardinality interpretation, it follows that (cell ?x ?y wp) refers to the number of white pawns on the board. The minus sign between the expressions indicates that the program has identified a symmetry and has formed a compound feature of the difference between the two individual features. Thus, the first term of the mobility function refers to the number of white queens minus the number of black queens. The overall heuristic evaluation function is a material function with piece values ordered queens, rooks, bishops, knights, then pawns.
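As an illustration of the solution-cardinality interpretation, the sketch below computes the first term of the mobility function over a state represented as a list of ground facts. The representation is an assumption made for the example; my program's internal state and query structures differ.

(* A query term is either a constant or a variable; a state is a list of
   ground facts such as ("cell", ["e"; "4"; "wp"]). *)
type term = Const of string | Var of string

let matches query args =
  List.length query = List.length args
  && List.for_all2
       (fun q a -> match q with Const c -> c = a | Var _ -> true)
       query args

(* For a pattern whose variable positions are distinct variables, the number
   of unique bindings equals the number of matching facts in the state. *)
let solution_cardinality state (rel, query) =
  List.length (List.filter (fun (r, args) -> r = rel && matches query args) state)

(* First term of the chess mobility function: white queens minus black queens. *)
let queen_difference state =
  solution_cardinality state ("cell", [ Var "x"; Var "y"; Const "wq" ])
  - solution_cardinality state ("cell", [ Var "x"; Var "y"; Const "bq" ])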

A.2

Chinese Checkers

The program outputs the following for Chinese Checkers:


Payoff for red =
    0.792   goal for red, partial soln
  + 0.792   early payoff for red
  - 2.546   distance to red goal
  + 70.129

This output has no mobility function, but does have a payoff function. None of the features here have the default solution-cardinality interpretation. The "early payoff for red" is just what it appears to be: the payoff values for red evaluated in non-terminal states. Seeing what this actually means requires looking at the game description. (It turns out that in this case, the payoffs were written to give 0, 25, 50, or 100 points depending on how many marbles were in the goal position.) The "goal for red, partial soln" feature refers to the partial-solution interpretation of the expression corresponding to red's goal. The partial-solution interpretation applies to conjunctive expressions and is the percentage of conjuncts true in the given state. In this case, the conjuncts correspond to marbles, so the interpretation of this feature is the percentage of marbles in the goal position. Note that this is what the early payoff is also based on, so this feature is similar to early payoff for this game, but this is not necessarily the case in general. The final feature is distance to red goal. This is the distance interpretation applied to the red goal. Intuitively, this measures the distance of the marbles to the goal. In this case, it turns out to correspond to the sum of the distances the marbles need to travel to reach the goal, but this is not readily apparent from the output; rather, it can be deduced by inserting diagnostics into the distance feature function.
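The partial-solution interpretation itself is straightforward to state. The sketch below assumes a placeholder holds function supplied by the reasoner to test a single conjunct in a state.

(* Fraction of conjuncts of a conjunctive goal that hold in the given state.
   For Chinese Checkers the conjuncts correspond to individual marbles being
   in the goal region, so the value is the percentage of marbles home. *)
let partial_solution ~holds state conjuncts =
  let satisfied = List.filter (holds state) conjuncts in
  float_of_int (List.length satisfied) /. float_of_int (List.length conjuncts)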


A.3

Othello

The program outputs the following for Othello:

Payoff for white =
    2.246   (cell 8 1 white) - (cell 8 1 black)
  + 1.756   (cell 1 8 white) - (cell 1 8 black)
  + 1.561   (cell 8 8 white) - (cell 8 8 black)
  + 1.418   (cell 1 1 white) - (cell 1 1 black)
  + 1.372   (cell ?m 1 white) - (cell ?m 1 black)
  + 1.308   (cell 8 ?n white) - (cell 8 ?n black)
  + 0.898   (cell ?row 8 white) - (cell ?row 8 black)
  + 0.605   (cell 1 ?n white) - (cell 1 ?n black)
  + 50.000

Mobility for white =
    0.0027  (cell ?m ?n black) - (cell ?m ?n white)

Terminal =
  - 0.0015  (cell ?m ?n green)
  + 0.061

Stability for white: Payoff = 0.837, Mobility = 0.163

This time the result is a full model, including payoff, mobility, and termination. There is no mention of distances or partial goals, so each feature has a solution-cardinality interpretation. The writer of the game description has used a cell relation similar to the one used in chess, so the expression (cell 8 1 white) refers to a white disk in a particular corner. The orientation is arbitrary, so we can consider this the lower right corner, the (cell 1 8 white) a disk in the upper left corner, and so on. The expressions with variables mean that the coordinate can be anything, so (cell ?m 1 white) - (cell ?m 1 black) refers to the number of white disks on the bottom edge minus the number of black disks on the bottom edge. The mobility function is a simple relative disk count: the mobility function for white maximizes black disks and minimizes white disks, which is an inverse material function. The probability of termination increases as the number of green (empty) cells decreases.


APPENDIX B

Engineering Considerations

This appendix discusses some engineering considerations and implementation details of my general game playing program of potential interest to programmers. The program is implemented in the OCaml programming language [RV98]. The code base has grown in size with each revision, with the version that ran in the third competition containing on the order of 10,000 lines of code.

B.1

Reasoning Module

One component utilized by a game player supporting GDL is a reasoning module that at a minimum supports the following operations:

legal moves: computes the legal moves for a given role from a given state.

terminal: takes a state and returns a Boolean value indicating whether or not the state is terminal.

next state: computes the next state given a non-terminal state and the actions performed by each player.

payoff: takes a terminal state and a role and returns the payoff value.

In my program, the reasoning module is a back-chaining algorithm implemented in OCaml along with the rest of the program. Discussions with other participants in the AAAI GGP Competitions have revealed that a more common approach is to translate the game descriptions into Prolog and to perform the reasoning operations as Prolog queries. An obvious advantage of the Prolog-based approach is that a high-performance back-chaining reasoner can be utilized without needing to write one from scratch. An advantage of the custom reasoner is that it more readily facilitates tighter integration between the reasoner and the core game-playing facilities, though in practice it is not clear that this benefit outweighs the cost.
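Expressed as an OCaml module signature, the minimal interface listed above looks roughly like the following. This is a sketch of the shape of the interface, not the actual signature used in my program.

module type REASONER = sig
  type state
  type role
  type move

  (* Legal moves for a role in a given non-terminal state. *)
  val legal_moves : state -> role -> move list

  (* Whether a state is terminal. *)
  val terminal : state -> bool

  (* Successor state, given a non-terminal state and one move per role. *)
  val next_state : state -> (role * move) list -> state

  (* Payoff for a role in a terminal state, on the GDL scale of 0 to 100. *)
  val payoff : state -> role -> int
end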

B.2

Multi-Processor Utilization

At the time of this writing, the OCaml runtime is not safe with respect to concurrent execution, so multiple execution threads are scheduled on a single processor. Multi-threading in OCaml is supported and can be useful for overlapping I/O with computation and for coding convenience, but the limitation of thread-scheduling to a single processor results in ineffective utilization of multi-processor machines. The laptop that I used in the third competition was a MacBook Pro with an Intel Core Duo processor, so for the third version of my program, I utilized the following technique to overcome OCaml's thread-scheduling limitation. Wherever I wanted the program to perform two computationally intensive operations concurrently, I had the program do the following:

1. Create a UNIX pipe.

2. Fork the process.

3. The child process and the parent process each perform their computations concurrently.

4. The child writes back the serialized result of the computation over the pipe.

5. The parent process reads the result and closes the pipe.

In this way, both processors were utilized, one by each process. Parallelized operations utilizing this technique include:

• Sample states were generated concurrently during construction of abstract-model based heuristic evaluation functions as described in Chapter 5.

• Informed and uninformed search procedures utilized to find plans in single-player games were performed concurrently as described in Chapter 6.

• In multi-player games, during game-play, the program spawned a child process to search without a heuristic evaluation function. The rationale was that the heuristic evaluation function is important for making decisions when the end of the game cannot be reached, but as the end-game approaches, the tree can sometimes be searched completely more quickly without using a heuristic evaluation function. When the non-heuristic search obtained exact values, a move was chosen based on the exact values. Otherwise, the heuristic values formed the basis of move selection.
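The fork-and-pipe pattern in steps 1 through 5 can be written with the standard Unix and Marshal modules roughly as follows. This is a minimal sketch rather than the utility functions my program actually uses, and because Marshal is untyped, the caller must know the type of the child's result.

(* Run f and g concurrently in separate processes and return both results.
   The child computes g and sends its serialized result back over a pipe. *)
let run_concurrently f g =
  let read_fd, write_fd = Unix.pipe () in
  match Unix.fork () with
  | 0 ->
      (* Child: close the unused read end, compute, marshal the result, exit. *)
      Unix.close read_fd;
      let oc = Unix.out_channel_of_descr write_fd in
      Marshal.to_channel oc (g ()) [];
      close_out oc;
      exit 0
  | child_pid ->
      (* Parent: close the unused write end, compute f while the child runs,
         then read the child's result and reap the child process. *)
      Unix.close write_fd;
      let f_result = f () in
      let ic = Unix.in_channel_of_descr read_fd in
      let g_result = Marshal.from_channel ic in
      close_in ic;
      ignore (Unix.waitpid [] child_pid);
      (f_result, g_result)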


References

[ACF02] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. “Finite-time Analysis of the Multiarmed Bandit Problem.” Machine Learning, 47(2–3):235–256, 2002.
[Axe80] Robert Axelrod. “Effective Choice in the Prisoner's Dilemma.” The Journal of Conflict Resolution, 24(1):3–25, 1980.
[BF97] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis.” Artificial Intelligence, 90(1–2):279–298, 1997.
[BF07] Yngvi Björnsson and Hilmar Finnsson. Personal communication, 2007.
[BH03] Bruno Bouzy and Bernard Helmstetter. “Monte-Carlo Go Developments.” In H. Jaap van den Herik, Hiroyuki Iida, and Ernst A. Heinz, editors, ACG, volume 263 of IFIP, pp. 159–174. Kluwer, 2003.
[Bra06] Ronald J. Brachman. “(AA)AI More than the Sum of Its Parts.” AI Magazine, 27(4):19–34, 2006.
[Bru93] Bernd Brügmann. “Monte Carlo Go.” Unpublished manuscript, 1993. http://www.ideanest.com/vegos/MonteCarloGo.pdf.
[Bur02] Michael Buro. “The evolution of strong Othello programs.” In IWEC, pp. 81–88, 2002.
[CK86] Jens Christensen and Richard E. Korf. “A Unified Theory of Heuristic Evaluation Functions and its Application to Learning.” In AAAI, pp. 148–152, 1986.
[Clu07] James Clune. “Heuristic Evaluation Functions for General Game Playing.” In Proceedings of the Twenty-Second National Conference on Artificial Intelligence. AAAI Press, July 2007.
[FN71] Richard E. Fikes and Nils J. Nilsson. “New Approach to the Application of Theorem Proving to Problem Solving.” Artificial Intelligence, 2(3–4):189–208, 1971.
[FU92] Tom Elliott Fawcett and Paul E. Utgoff. “Automatic Feature Generation for Problem Solving Systems.” In D. Sleeman and P. Edwards, editors, Proceedings of the 9th International Conference on Machine Learning, pp. 144–153. Morgan Kaufmann, 1992.
[Gin01] Matthew L. Ginsberg. “GIB: Imperfect Information in a Computationally Challenging Game.” Journal of Artificial Intelligence Research, 14:303–358, 2001.
[GL05] Michael Genesereth and Nathaniel Love. “General Game Playing: Game Description Language Specification.” Technical report, Computer Science Department, Stanford University, Stanford, CA, USA, March 2005. http://games.stanford.edu/gdl spec.pdf.
[GL06] Alfonso Gerevini and Derek Long. “Preferences and Soft Constraints in PDDL3.” In Proceedings of the ICAPS-2006 Workshop on Preferences and Soft Constraints in Planning, pp. 46–53, 2006.
[GLP05] Michael Genesereth, Nathaniel Love, and Barney Pell. “General Game Playing: Overview of the AAAI Competition.” AI Magazine, 26(2), 2005.
[GNT04] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and Practice. Morgan Kaufmann Publishers, San Francisco, CA, 2004.
[GP07] Ben Goertzel and Cassio Pennachin, editors. Artificial General Intelligence. Springer, 2007.
[GW06] Sylvain Gelly and Yizao Wang. “Exploration exploitation in Go: UCT for Monte-Carlo Go.” December 2006. http://eprints.pascal-network.org/archive/00002713/.
[HE05] J. Hoffmann and S. Edelkamp. “The Deterministic Part of IPC-4: An Overview.” Journal of Artificial Intelligence Research, 24:519–579, 2005.
[HNR68] P. E. Hart, N. J. Nilsson, and B. Raphael. “A Formal Basis for the Heuristic Determination of Minimum Cost Paths.” IEEE Trans. Syst. and Cybernetics, SSC-4(2):100–107, 1968.
[Jen99] Arthur R. Jensen. “The G Factor: the Science of Mental Ability.” Psycoloquy, 10(23), 1999.
[Kay06] Alan Kay. Personal communication, 2006.
[KC96] R. E. Korf and D. M. Chickering. “Best-first minimax search.” Artificial Intelligence, 84:299–337, 1996.
[KDS06] Gregory Kuhlmann, Kurt Dresner, and Peter Stone. “Automatic Heuristic Construction in a Complete General Game Player.” In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pp. 1457–62, July 2006.
[KM75] D. E. Knuth and R. W. Moore. “An analysis of alpha-beta pruning.” Artificial Intelligence, 6:293–326, 1975.
[Kor85] Richard E. Korf. “Macro-operators: a weak method for learning.” Artificial Intelligence, 26:35–77, 1985.
[Kor87] Richard E. Korf. “Planning as Search: A Quantitative Approach.” Artificial Intelligence, 33(1):65–88, September 1987.
[Kor90] Richard E. Korf. “Real-time heuristic search.” Artificial Intelligence, 42(3):189–212, 1990.
[Kor94] Richard E. Korf. “Heuristic evaluation functions in artificial intelligence search algorithms.” Minds and Machines, 5(4):489–498, 1994.
[KS92] Henry A. Kautz and Bart Selman. “Planning as Satisfiability.” In ECAI, pp. 359–363, 1992.
[KS06] Levente Kocsis and Csaba Szepesvári. “Bandit Based Monte-Carlo Planning.” In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, ECML, volume 4212 of Lecture Notes in Computer Science, pp. 282–293. Springer, 2006.
[KS07] Gregory Kuhlmann and Peter Stone. “Graph-Based Domain Mapping for Transfer Learning in General Games.” In Joost N. Kok, Jacek Koronacki, Ramon López de Mántaras, Stan Matwin, Dunja Mladenic, and Andrzej Skowron, editors, ECML, volume 4701 of Lecture Notes in Computer Science, pp. 188–200. Springer, 2007.
[Len82] Douglas B. Lenat. “The Nature of Heuristics.” Artificial Intelligence, 19(2):189–249, 1982.
[Lev95] Robert Levinson. “General Game-Playing and Reinforcement Learning.” Technical Report UCSC-CRL-95-06, Department of Computer Science, University of California, Santa Cruz, 1995.
[LF06] D. Long and M. Fox. “The International Planning Competition Series and Empirical Evaluation of AI Planning Systems.” In L. Paquete, M. Chiarandini, and D. Basso, editors, Proceedings of the Workshop on Empirical Methods for the Analysis of Algorithms, 2006.
[May76] J. Maynard Smith. “Evolution and the Theory of Games.” American Scientist, 64:41–45, January 1976.
[Meh07] Jean Mehat. Personal communication, 2007.
[MJP92] S. Minton, M. D. Johnston, A. B. Philips, and P. Laird. “Minimizing Conflicts: a Heuristic Repair Method for Constraint Satisfaction and Scheduling Problems.” Artificial Intelligence, 58:161–205, 1992.
[Nas50] John F. Nash. Non-Cooperative Games. PhD thesis, Princeton University, Mathematics Department, 1950.
[Nau83] D. S. Nau. “Pathology on Game Trees revisited and an alternative to minimaxing.” Artificial Intelligence, 21(1–2):221–244, 1983.
[Now06] Martin A. Nowak. Evolutionary Dynamics: Exploring the Equations of Life. Harvard University Press, 2006.
[Pea84] Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, Massachusetts, 1984.
[Pel93] Barney D. Pell. Strategy Generation and Evaluation for Meta-Game Playing. PhD thesis, University of Cambridge, 1993.
[Pin99] Steven Pinker. How the Mind Works. W. W. Norton & Company, January 1999.
[RV98] D. Remy and J. Vouillon. “Objective ML: An effective object-oriented extension to ML.” Theory and Practice of Object Systems, 4(1):27–50, 1998.
[Sac74] Earl D. Sacerdoti. “Planning in a Hierarchy of Abstraction Spaces.” Artificial Intelligence, 5:115–135, 1974.
[Sam67] A. L. Samuel. “Some studies in machine learning using the game of checkers. II - recent progress.” IBM Journal, pp. 601–617, November 1967.
[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[She02] Brian Sheppard. “World-Championship-Caliber Scrabble.” Artificial Intelligence, 134(1–2):241–275, 2002.
[SK00] Nathan R. Sturtevant and Richard E. Korf. “On Pruning Techniques for Multi-Player Games.” In AAAI/IAAI, pp. 201–207. AAAI Press / The MIT Press, 2000.
[ST07] Stephan Schiffel and Michael Thielscher. “Automatic Construction of a Heuristic Search Function for General Game Playing.” In Seventh IJCAI International Workshop on Nonmonotonic Reasoning, Action and Change (NRAC07), Hyderabad, India, 2007.
[TG97] Gerald Tesauro and Gregory R. Galperin. “On-line Policy Improvement using Monte-Carlo Search.” In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems, volume 9, p. 1068. The MIT Press, 1997.
[Utg01] Paul E. Utgoff. “Feature Construction for Game Playing.” In Johannes Fürnkranz and Miroslav Kubat, editors, Machines that Learn to Play Games, chapter 7, pp. 131–152. Nova Science Publishers, Huntington, NY, 2001.
[VB05] Thomas L. Vincent and Joel S. Brown. Evolutionary Game Theory, Natural Selection, and Darwinian Dynamics. Cambridge University Press, 2005.
[vM53] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, 1953.
[Wri95] Robert Wright. The Moral Animal: Why We Are the Way We Are: The New Science of Evolutionary Psychology. Vintage, 1995.
[Zil] Zillions Development Corporation. “Zillions of Games.” http://www.zillions-of-games.com.
