DECISION-THEORETIC PLANNING

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ENGINEERING-ECONOMIC SYSTEMS AND OPERATIONS RESEARCH AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Mark Alan Peot June 1998


Copyright © by Mark Alan Peot 1998 All Rights Reserved


I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_______________________________ Ross D. Shachter (Principal Adviser)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_______________________________ David E. Smith

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_______________________________ Edison T. S. Tse

Approved for the University Committee on Graduate Studies:

________________________________


Abstract

In recent years, researchers in AI planning have been trying to extend the classic planning paradigm to handle uncertainty and partial goal satisfaction. This dissertation focuses on the development of two partial-order planners: DTPOP, a contingent decision-theoretic planner, and UDTPOP, its non-contingent sibling. Both planners assemble actions with conditional, uncertain effects into plans of high overall utility.

A number of innovations are claimed for UDTPOP. UDTPOP uses a new criterion for identifying actions that are relevant to utility. Steps are constrained to be effective, that is, they are guaranteed to change the world in ways that can result in high-utility outcomes. In nearly every other classical partial-order planner, an action can support an open precondition only if it has an outcome that can directly support that precondition. In UDTPOP, support can be indirect as long as the added step supports desirable utility outcomes and doesn’t cause any other step to lose effectiveness. This new action selection criterion allows us to abandon the multiple support paradigm of the probabilistic planner Buridan. This, in turn, allows us to design a nearly systematic planner. The same innovation also allows us to develop a strict (and tight) upper bound on the utility of the plan that can be used as an admissible cost function for best-first search. UDTPOP is significantly more effective than previous methods. An empirical study compares a variant of UDTPOP against Buridan. In nearly every domain, the performance of UDTPOP is superior, sometimes by 2-3 orders of magnitude.

The second planner, DTPOP, is the first sound and complete contingent partial-order planner. Efficient contingency plans satisfy a “no augury” condition: information used to decide between plan branches must be relevant (in the sense of D-separation) to the utility of the branches under consideration. DTPOP uses this criterion directly to develop tests that can reveal the distribution over unknown, but relevant, events. A test is a subplan that has observable outcomes that are dependent on these hidden events. During planning, DTPOP reverses its direction of search, using regression to establish preconditions and progression to identify observations.


Acknowledgements

This work was funded in part through DARPA contract F30602-91-C-0031, the Rockwell International Science Center, and by the National Science Foundation through a Graduate Fellowship.

The ideas in this dissertation benefited considerably from discussions with my advisor Ross Shachter, Jack Breese, Denise Draper, Ken Fertig, Nir Friedman, Steve Hanks, David (DE2) Smith, Phil Stubblefield, Edison Tse, Dan Weld & Michael Wellman. Many of these ideas arose through joint work performed with DE2 Smith and Jack Breese. Jim Martin and the Rockwell Palo Alto Laboratory provided the perfect working environment for my research.

Finally, I want to thank my friends, Brian Gregory, Michael Lippitz, Enrique Romero, David Smith, and Tim Stanley, for putting up with a lot of dissertation-related flakiness over the past years. Thanks also to my parents, Hans and Kathleen Peot, for their considerable moral support over the years.

My true love Lydia Chang managed my dissertation progress from Michigan.


Table of Contents

Table of Contents
List of Tables
List of Illustrations
Notation

1.0  Introduction
     1.1  The Problem
     1.2  The Solution
          1.2.1  Non-contingent Solutions: UDTPOP
          1.2.2  DTPOP
     1.3  Roadmap of Regression Planner Development
     1.4  Contributions
     1.5  Outline of this dissertation

2.0  Action and Plan Representation
     2.1  Variables and Functions
     2.2  Steps
          2.2.1  Conditional Effects
          2.2.2  Conditional Cost Models
          2.2.3  Belief Network Representation for Steps
          2.2.4  Example: Stress World
          2.2.5  Frame Assumptions
     2.3  Goals and Utility
     2.4  Assembling Steps into Plans
     2.5  Example
     2.6  Related Work
          2.6.1  Attributes of Probabilistic Steps
          2.6.2  Attribute Cardinality
          2.6.3  World State Representation
          2.6.4  Distribution Symmetry
          2.6.5  Distribution Completeness

3.0  UDTPOP: Noncontingent Planning
     3.1  Overview
     3.2  The Basic Ideas
          3.2.1  Multiple Support
          3.2.2  Single Support
          3.2.3  Effectiveness
               3.2.3.1  Possibility
               3.2.3.2  Pertinence
               3.2.3.3  Effectiveness
     3.3  UDTPOP
          3.3.1  Plan Flaws: Open Conditions
          3.3.2  Plan Flaws: Threats
          3.3.3  Adding Support: Add-Step and Add-Link
          3.3.4  Resolving Threats: Promote and Demote
          3.3.5  Resolving Threats: Persist-Support
     3.4  Example
     3.5  Approximating Effectiveness
          3.5.1  Possibility
          3.5.2  Pertinence
          3.5.3  Effectiveness
     3.6  Evaluating Plans
          3.6.1  Search
          3.6.2  Model Construction in Complete Plans
          3.6.3  Model Construction in Partial Plans
               3.6.3.1  Modeling Open Conditions
               3.6.3.2  Modeling Open Conditions Using LPE
               3.6.3.3  Modeling Threats
               3.6.3.4  Model Construction and Evaluation for Partial Plans
               3.6.3.5  Persist-Support
          3.6.4  The Implementation of the Evaluator used in UDTPOP-B
     3.7  Formal Properties
          3.7.1  Soundness
               3.7.1.1  Markov Model
               3.7.1.2  Soundness
          3.7.2  Completeness
     3.8  Empirical Results
     3.9  Discussion
          3.9.1  Mutual Exclusion
          3.9.2  Comparison with Buridan
               3.9.2.1  Link Establishment Order
               3.9.2.2  Backtracking on Open Conditions
               3.9.2.3  Plan Evaluation and Mutual Exclusion
               3.9.2.4  Persist-Support vs. Confrontation
               3.9.2.5  Soundness
               3.9.2.6  The Buridan Heuristic
     3.10 Extensions
          3.10.1  Extending Relevance
          3.10.2  Extending Effectiveness
          3.10.3  Tightening the Upper Bound on Evaluation
          3.10.4  Systematicity
          3.10.5  Multiple Support and Ordering Constraints
               3.10.5.1  Multiple Support = Fewer Ordering Constraints (sometimes...)
               3.10.5.2  Commutivity
     3.11 Contributions

4.0  Relevance and Independence
     4.1  Notation and Definitions
          4.1.1  Belief Network Notation
          4.1.2  Influence Diagrams
     4.2  Observation Relevance
     4.3  Identifying Relevant Nodes in Belief Networks
          4.3.1  The Ne, Np, Nr and Ni sets
          4.3.2  The Bayes Ball Algorithm
          4.3.3  Ne, Np, Ni, and Nr Examples
               4.3.3.1  Collect-Requisite
               4.3.3.2  Collect Relevant
     4.4  Dynamic Influence Diagrams

5.0  DTPOP: Contingent Planning
     5.1  Overview
     5.2  Basic Ideas
          5.2.1  Constructing Contingent Plans
               5.2.1.1  Warplan-C
               5.2.1.2  CNLP
               5.2.1.3  C-Buridan
               5.2.1.4  Cassandra
               5.2.1.5  DTPOP
          5.2.2  Identifying Observations
     5.3  DTPOP
          5.3.1  DTPOP Plan Elements
          5.3.2  Mutual Exclusion Constraints and Execution Policies
          5.3.3  Plan Flaws
               5.3.3.1  Plan Flaws: Threats
               5.3.3.2  Open Uncertainties
          5.3.4  UDTPOP Modifications
          5.3.5  Threat Resolution: Branch
          5.3.6  Discovering Alternative Plans: Add-Branch
          5.3.7  Identifying Observations: Add-Link-Forward and Add-Step-Forward
          5.3.8  Remove-Open-Uncertainty
     5.4  Plan Optimization
          5.4.1  Contingent Plans and Asymmetric Influence Diagrams
          5.4.2  Plan Optimization
               5.4.2.1  Plan Model Construction
               5.4.2.2  Pass 1
               5.4.2.3  Pass 2
               5.4.2.4  Plan Optimization
               5.4.2.5  Simplifying the Decision Problem
          5.4.3  Approximating Plan Optimization
     5.5  Recognizing Open Uncertainties
          5.5.1  Background
          5.5.2  The Information Relevance Network
     5.6  Heuristics
          5.6.1  Inter-branch threats
          5.6.2  Closing and Constraining Open-Uncertainties
     5.7  Example
          5.7.1  The First Plan Branch
          5.7.2  Making the Initial Plan More Robust
          5.7.3  Inter-Branch Threats
          5.7.4  Searching for Observations
          5.7.5  Constructing the Decision Tree
          5.7.6  Solving the Decision Tree
     5.8  Formal Properties
          5.8.1  Soundness
          5.8.2  Completeness
     5.9  Discussion
          5.9.1  Related Work
               5.9.1.1  Causal Link Planners
               5.9.1.2  Markov Decision Processes
               5.9.1.3  MAXPLAN
               5.9.1.4  Knowledge-Based Model Construction
          5.9.2  Knowledge Preconditions
          5.9.3  Observation Actions
          5.9.4  Classes of Plans and Independence
     5.10 Extensions and Conclusions
          5.10.1  Experimental Validation
          5.10.2  Evaluation of Partial Plans
          5.10.3  Multiplicative Growth in Search Space with Plan Branches
     5.11 Contributions

6.0  Bibliography

A.   UDTPOP Proofs
     A.1  Effective Support
     A.2  Pertinence
     A.3  Soundness
     A.4  Completeness
          A.4.1  Identifying Causal Structure
          A.4.2  The Clairvoyant Decision Policy and PertinentC
          A.4.3  Completeness
     A.5  Admissible Upper Bound

B.   DTPOP Proofs
     B.1  Notation
     B.2  Eliminating Decisions
     B.3  Observation Relevance
     B.4  Soundness
     B.5  Completeness
          B.5.1  Identifying the Contingent Plan
          B.5.2  No Fusion
          B.5.3  Completeness

List of Tables

TABLE 2.  The drink-coffee step
TABLE 3.  The effect of the write step on total number of pages written
TABLE 4.  The effect of the Write step on Harvey’s state of alertness
TABLE 12. UDTPOP-B vs. Buridan-R and Buridan-F
TABLE 13. Empirical Comparison in a Navigation Domain

List of Illustrations

FIGURE 1.   Roadmap of Regression Planner Development
FIGURE 2.   Relative performance of Buridan and UDTPOP-B
FIGURE 3.   Step Model
FIGURE 4.   Steps in Stress World
FIGURE 5.   Causal Links
FIGURE 6.   A short plan
FIGURE 7.   A chain of causal links protecting an attribute variable
FIGURE 8.   Asymmetric conditional effects
FIGURE 9.   A Buridan step
FIGURE 10.  Increasing the probability of a precondition with a causal link
FIGURE 11.  Increasing the probability of a precondition with multiple links from the same step
FIGURE 12.  Increasing the probability of a precondition using a causal link from a different step
FIGURE 13.  Single Support
FIGURE 14.  Increasing the probability of a precondition using a causal link from a different step
FIGURE 15.  Choices in Single and Multiple Support
FIGURE 16.  Proliferation of Structure in Multiple Support Planners
FIGURE 17.  A very simple navigation domain
FIGURE 18.  A “state transition” diagram for the “Go North” step
FIGURE 19.  Constraints
FIGURE 20.  Effectiveness
FIGURE 21.  Threats
FIGURE 22.  Promotion
FIGURE 23.  Demotion
FIGURE 24.  Before Persist-Support
FIGURE 25.  After Persist-Support
FIGURE 26.  Initial Plan
FIGURE 27.  The example after adding SW1
FIGURE 28.  The example after adding SW2
FIGURE 29.  The example after adding a link from SIC to SW2
FIGURE 30.  The example after adding a link from SIC to SW1
FIGURE 31.  The example after persist-support
FIGURE 32.  The final complete plan
FIGURE 33.  Best First Search
FIGURE 34.  Model_CE
FIGURE 35.  A Model for a Complete Plan
FIGURE 36.  Model_CE(RGoal, SGoal)
FIGURE 37.  Model_CE(Pages(SW1+), SW1)
FIGURE 38.  Trace of Model_CE
FIGURE 39.  Trace of Model_CE continued
FIGURE 40.  Trace for Model_CE completed
FIGURE 41.  A fallacious argument for using decisions for representing open conditions
FIGURE 42.  A simple partial plan that breaks the ‘straw’ model construction algorithm
FIGURE 43.  Using SA to support both preconditions maximizes utility
FIGURE 44.  LPE Example I
FIGURE 45.  LPE Example II
FIGURE 46.  Active Set 1
FIGURE 47.  Active Set 2
FIGURE 48.  Active Set 3
FIGURE 49.  Modeling every completion of a partial plan
FIGURE 50.  EvaluateUB
FIGURE 51.  Model_CE2
FIGURE 52.  UB
FIGURE 53.  Persist Support can cause dual support
FIGURE 54.  The contradiction in the proof for Theorem 8
FIGURE 55.  The cluster tree constructed implicitly by the UDTPOP evaluator
FIGURE 56.  Markov Model
FIGURE 57.  The Markov model for a 2 step plan
FIGURE 58.  Relative Performance of UDTPOP-B and Buridan-R
FIGURE 59.  Relative Performance of UDTPOP-B and Buridan-F
FIGURE 60.  The branching factor of Buridan and UDTPOP-B as a function of search space depth in the Mocha Blocks World 0.899 domain
FIGURE 61.  The branching factor of Buridan and UDTPOP-B as a function of search space depth in Diamond World
FIGURE 62.  The relative search space sizes for Buridan-R and UDTPOP-B in Mocha Blocks World 0.899
FIGURE 63.  The relative search space sizes for Buridan-R and UDTPOP-B in Diamond World
FIGURE 64.  Simple Link Domain
FIGURE 65.  The network of roads used for the navigation domain
FIGURE 66.  The Move Operator
FIGURE 67.  Multiple support can result in fewer ordering constraints
FIGURE 68.  The Dry-With-Wet-Towel Action from Wet Towel World
FIGURE 69.  Influence Diagram
FIGURE 70.  A simple decision problem with one decision and no observations
FIGURE 71.  A simple decision problem with one observation
FIGURE 72.  Irrelevant observations
FIGURE 73.  Bayes_Ball
FIGURE 74.  Collect-Requisite
FIGURE 75.  Collect-Relevant
FIGURE 76.  Collect Requisite Cases
FIGURE 77.  Relevance Examples
FIGURE 78.  Planning in Warplan-C
FIGURE 79.  Planning in CNLP
FIGURE 80.  Outcomes for a gluing operation
FIGURE 81.  Observation Plan Construction in DTPOP
FIGURE 82.  The top level of the DTPOP planning algorithm
FIGURE 83.  Threats
FIGURE 84.  Voltage Measurement Example
FIGURE 85.  Branch
FIGURE 86.  Inefficient Plans
FIGURE 87.  Remove-Open-Uncertainty Cases 1 and 2
FIGURE 88.  Remove-Open-Uncertainty Cases 3 and 4
FIGURE 89.  Plan Optimization
FIGURE 90.  PM
FIGURE 91.  PM_1
FIGURE 92.  Model_Step
FIGURE 93.  Model_CE
FIGURE 94.  Observability and Model_CE
FIGURE 95.  Model_CL
FIGURE 96.  An algorithm for systematically generating partial orders with equivalent value topological sorts
FIGURE 97.  A Decision Tree
FIGURE 98.  Build-Tree
FIGURE 99.  Prune-Mutex
FIGURE 100. Build-Action
FIGURE 101. Merge
FIGURE 102. Active Subgraph Example, Part I
FIGURE 103. Active Subgraph Example, Part II
FIGURE 104. The VOM calibration example
FIGURE 105. There are possibly many active paths for a single relevance relation
FIGURE 106. Relevance For Observable Nodes
FIGURE 107. Collect-Relevant
FIGURE 108. Collect-Requisite
FIGURE 109. Bayes_Ball_IRN
FIGURE 110. A non-contingent plan for partying outdoors
FIGURE 111. Expected Value of Party-Outdoors
FIGURE 112. The plan after starting another plan branch
FIGURE 113. The plan after completing the second plan branch
FIGURE 114. Expected Value for the party alternatives
FIGURE 115. Inter-branch threats
FIGURE 116. Mutex Sets After Branching
FIGURE 117. The Information Relevance Network
FIGURE 118. The Information Relevance Network after Add-Step-Forward
FIGURE 119. One possible topological sort for the contingent plan
FIGURE 120. The Full Decision Tree Constructed by Build-Tree
FIGURE 121. Decision Tree after Marginalizing Out the Unobserved Variables
FIGURE 122. Decision Maximization
FIGURE 123. Marginalization of Observable Variables
FIGURE 124. Value of the contingent plan
FIGURE 125. Simple Forecast Example
FIGURE 126. Multiple Forecasts
FIGURE 127. The transformation from Mm to Mm’
FIGURE 128. Discover Links
FIGURE 129. Action schemata for SA and SB
FIGURE 130. Constructing Q
FIGURE 131. Belief networks for Case I
FIGURE 132. Gadget for Representing a Disjunct in 3SAT
FIGURE 133. Representation for 3SAT
FIGURE 134. Discover_Contingent_Links

Notation

A, B, Variable :            Attribute variables (generalized propositions).
A, B, Variables (overlined) :  Sets of attribute variables.
a, b, value (a, b, values) :   Attribute values (sets of values).
S1, SA, SGo :               Actions.
Si- (Si+) :                 The time immediately before (after) action Si executes.
S1 < S2 :                   An ordering constraint, “S1 completes before S2 commences.”
SE →V SC :                  A causal link from SE to SC protecting attribute variable V.
ST ⊗ L :                    ST threatens causal link L.
<e | c> :                   The conditional outcome “e given c.”
P{.} (upper bar) :          The upper bound on an interval probability distribution.
P{.} (lower bar) :          The lower bound on an interval probability distribution.
[S1, …, Sn] :               A sequence.
{SA, SB}⊥ :                 A mutual exclusion constraint.

1.0 Introduction

One of the hallmarks of intelligent behavior is the ability to proactively prepare a detailed course of action that can be applied in order to accomplish one or more desirable objectives. This activity is called planning. The objective of this dissertation is to explore a number of issues surrounding the development of partial-order plans in uncertain domains. In this chapter, we will define the basic problem addressed by this dissertation and provide a road map of related work in this subfield of planning. The final sections of this chapter provide a brief summary of the remainder of the dissertation, including a brief synopsis of the contributions claimed.

1.1

The Problem

A plan is a sequence of actions that can be executed in order to achieve some objective. A planner assembles a set of primitive actions into an overall program of behavior that addresses the objective. For example, in a travel domain, we might have primitive actions for individual travel legs, such as flying or taking a cab. One objective in this domain might be to travel from Palo Alto, California to New Carlisle, Ohio. The objective of a planner is to assemble the primitive travel elements into a program of activity that accomplishes this objective. In this situation, such a plan might look like:
• Take a cab from Palo Alto to San Francisco International Airport.
• Fly from San Francisco to Minneapolis.
• Fly from Minneapolis to Dayton.

• Take a cab from Dayton International Airport to New Carlisle.

Most of the past work in planning is aimed at developing planners that solve the following problem: Given a set of deterministic primitive actions, a set of deterministic initial conditions and a goal, find a sequence of these primitive actions that accomplishes this goal with certainty. This problem is the classic planning problem. The objective of this dissertation is to discuss the design of two planners that generalize this classic planning paradigm in three ways:
• Actions may have multiple, uncertain outcomes.
• Goals have value.
• The outcomes of the primitive actions and initial conditions are observable.

Uncertain Actions

In the classic planning paradigm, primitive actions are deterministic. Application of the same primitive action under the same conditions always results in an identical outcome. We will relax this restriction, allowing uncertain primitive actions: actions that have multiple possible uncertain outcomes. This allows us to model domains, such as medicine, in which primitive actions can have wildly varying consequences. For example, a surgical procedure might be highly uncertain, with possible outcomes ranging from a perfect cure to a painful death. Several outcomes might result from the prescription of an antibiotic: a patient may take a full course of the drug, possibly curing the infection; the patient may stop taking the drug early, resulting in a higher probability of reinfection; or the patient might have an allergic reaction to the drug.

Utility

In classical planning, planning objectives are captured using goals. A plan is not acceptable unless the plan achieves the goal with certainty. The objective for the planners described in this dissertation, UDTPOP and DTPOP, on the other hand, is specified using an atemporal utility function. This utility function is composed of two components:
• a reward utility function that rates the world states resulting from the execution of a plan, and
• additive cost functions for each of the actions used in the plan.

The use of a utility function to capture planning objectives allows the planner to trade the achievement of various objectives against each other and against the cost of a plan. If the actions required to achieve an objective are too expensive, then the planner may return an empty plan, indicating that the optimal policy is to do nothing. If the planner cannot simultaneously achieve a set of objectives, it may still be able to identify a plan that balances partial satisfaction of some subset of the overall objectives against the cost of the plan. For example, during the treatment of a patient with a fatal disease, we need to balance the patient’s quality of life against the cost or disutility of measures designed to extend the patient’s life. Achieving total success, both in curing the disease and in minimizing the impact of treatment on the patient, may not be possible.

Observations

In the classical planning paradigm, both the initial conditions (the state of the world prior to plan execution) and the individual primitive actions are deterministic. Given the initial conditions, it is possible to predict exactly the world state that results from the application of one or more actions. In the planning domains addressed by this dissertation, both the initial conditions and the individual primitive actions are no longer deterministic. It is not possible in all situations to project a unique world state that results from the application of one or more primitive actions.
The world state is not completely observable. The plan execution agent can only learn about the state of the world through actions with observable outcomes. The observable outcomes of an action may not map exactly into the full set of outcomes of the action. In C-Buridan [Draper et al, 1994], the outcomes of actions are grouped into “distinguishable equivalence classes”, each consisting of a set of outcomes that cannot be distinguished after execution of the action. For example, there might be three outcomes from a “glue part” operation: “no bond”, “weak bond” and “strong bond.” It may be possible to distinguish the first outcome (no bond) from the second and third outcomes, but it may be impossible to distinguish between a weak and a strong bond without further testing.

1.2 The Solution

This dissertation discusses two partial-order planners: UDTPOP and DTPOP.1 Both of these planners assemble actions drawn from a library of action schemata into plans that maximize some atemporal utility function.
1. pronounced “you-dee-tee pop” and “dee-tee pop,” respectively.

1.2.1 Non-contingent Solutions: UDTPOP

UDTPOP finds non-contingent solutions to planning problems. “Non-contingent” means that action execution is not a function of observations made during plan execution. Although the outcomes of individual actions can be a function of the outcomes of other actions, action execution is not. For example, say that our objective is to dry some object. UDTPOP might synthesize a plan that contains 3 drying operations in order to increase the probability that the item is dry. It cannot synthesize action sequences in which the dryness of the object is checked and used to condition the execution of future drying actions.

1.2.2 DTPOP

DTPOP can synthesize contingent plans. A DTPOP plan is a partial-order graph of contingent actions. The execution of each contingent action is conditioned on a function over previously made observations. When this function is true, the action can execute. For example, we might synthesize a plan in which the dry operation is only performed if we observe that the target object is wet.
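To make the distinction concrete, the following sketch (in Python, with illustrative names and probabilities; this is not UDTPOP's or DTPOP's actual interface) contrasts open-loop execution of a fixed sequence of drying steps with contingent execution in which each step is guarded by a function of an earlier observation.

```python
import random

def dry(state):
    # Illustrative uncertain action: drying succeeds with probability 0.5.
    if state["wet"] and random.random() < 0.5:
        state["wet"] = False

def observe_wet(state):
    # Hypothetical observation action with a perfectly reliable outcome.
    return state["wet"]

def execute_noncontingent(state):
    # UDTPOP-style plan: three dry steps executed unconditionally (open loop).
    for _ in range(3):
        dry(state)

def execute_contingent(state):
    # DTPOP-style plan: each dry step executes only when its execution
    # condition, a function of previously made observations, is true.
    for _ in range(3):
        if observe_wet(state):
            dry(state)
```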

1.3 Roadmap of Regression Planner Development

Figure 1 shows the relationship between the work performed in this thesis and the contingent and probabilistic planning algorithms that are most closely related to UDTPOP or DTPOP. Section 5.9.1 describes work that is less closely related to the content of this dissertation, including the relationship between this dissertation and work on refinement planning and Markov decision processes. Underlined planners are probabilistic. Planners in bold are contingent.


[Figure 1 appears here: a timeline from 1976 to 1997 relating WARPLAN-C (Warren), SNLP (McAllester & Rosenblitt), CNLP (Peot & Smith), SENSp (Etzioni, et al.), Cassandra (Pryor & Collins), UCPOP (Penberthy & Weld), Plinth (Goldman & Boddy), Buridan (Kushmerick, et al.), C-Buridan (Draper, et al.), ε-safe CNLP and ε-safe Plinth (Goldman & Boddy), UDTPOP (Peot), Puccini (Golden), and DTPOP (Peot).]

FIGURE 1. Roadmap of Regression Planner Development.

UDTPOP was inspired by examining the structure and performance characteristics of Buridan [Kushmerick, et al.], the first sound and complete probabilistic planner for actions with uncertain and unobservable outcomes. UDTPOP uses a variant of Buridan’s representation for conditional effects. In addition, UDTPOP’s probabilistic threat-resolution operation, persist-support, is based on the probabilistic confrontation mechanism of Buridan.
DTPOP has its roots in a planner (named CNLP) developed by David Smith and myself in 1992 [Peot+Smith, 1992]. CNLP is a partial-order planner inspired by the contingent total-order planner WARPLAN-C [Warren, 1976]. CNLP plans for goals using regression. If CNLP needs to use an uncertain, but observable, outcome in order to achieve its goal, it derives a contingency plan for the alternative outcomes. CNLP introduced a new mechanism for threat resolution: conditioning (now referred to as branching [Draper, et al, 1994]) in order to resolve resource conflicts between contingent plan branches. Branching forces the planner to commit to one action or the other based on the results of an observable action that occurred earlier in the plan. This plan construction operation resolves resource conflicts between contingency plans by forcing the actions involved in the resource conflicts into different plan branches. The conflict disappears because the conflicting steps can never both be executed during any particular execution of the plan; the steps are contingent on mutually exclusive outcomes of some earlier observation action.
CNLP inspired the contingent version of Buridan: C-Buridan [Draper, et al, 1994]. In addition to extending Buridan, C-Buridan fixed flaws in CNLP’s action representation language by using the conditional effect mechanism of Buridan to capture observation/observation dependencies and by introducing the notion of discernible equivalence classes to describe the outcomes of an action that can be distinguished from each other.

1.4 Contributions

UDTPOP

This dissertation outlines the design for a new probabilistic planner UDTPOP. Specific contributions include:


• UDTPOP is a “relatively efficient” partial-order planner that can construct utility-maximizing plans from actions with probabilistic effects.
• UDTPOP is shown to be sound in the sense that the expected utility of a UDTPOP plan is identical to the utility of any Markov model of a topological sort of the plan.
• UDTPOP is shown to be complete in the sense that it is always guaranteed to find the plan of highest expected utility.
• An admissible evaluation function is developed that can be used with A* to identify the plan of highest overall utility.
• The causal link mechanism of Buridan [Kushmerick, et al. 1993] is re-engineered to improve planning efficiency. UDTPOP uses a ‘fat’ causal link that serves the same purpose as multiple Buridan links. This reduces the number of choices that must be made during planning, which in turn reduces the depth, d, of the overall search space. Since the time complexity for identifying a solution varies as O(b^d) given comparable branching factors b, the UDTPOP search space can be exponentially smaller than that of Buridan.
• A mechanism is developed that allows UDTPOP to ‘close’ open conditions without compromising completeness. UDTPOP only needs to consider one source of support for any precondition. If every precondition in the plan is closed (there is one source of support for that precondition) and there are no threats in the plan, then that plan is complete. A Buridan plan, on the other hand, is never complete: Buridan can always attempt to increase the probability of a desirable precondition by adding additional causal support. After Buridan identifies one source of support for each precondition in the plan, Buridan attempts to increase the probability of the plan by choosing to add a causal link to some already-supported precondition within the plan. This choice increases the branching factor, b, of the Buridan search space, increasing the time and memory required to identify a solution.
• A new ‘relevance’ criterion is introduced. UDTPOP uses this criterion to identify actions that can increase the utility of a plan. The criterion is also used to constrain partial plans: if it is no longer possible for an action in a partial plan to contribute constructively to the objective, that partial plan is pruned.
• Empirical results are presented that convincingly demonstrate the performance gains claimed in the last two bullets. The summary of these results is shown in Figure 2. This log-scale plot graphs the number of plans created by UDTPOP and Buridan when solving problems from a variety of domains.

[Figure 2 appears here: a log-scale plot of Plans Created for UDTPOP-B and Buridan-R on each test problem (UDT, Simple Lens, Bomb&Toilet, Wet Towel, Bite Bullet, Slippery Blocks, Chocolate, Waste Time, IC5 Lens, IC6, Single Link, the Mocha Blocks World variants, Bomb&Toilet2, Diamond World, and P1 through P6).]

FIGURE 2. Relative performance of Buridan and UDTPOP-B. The vertical axis is the log of the number of plans created while solving the problem instances listed along the horizontal axis. On several of these examples, UDTPOP outperforms Buridan by 3-4 orders of magnitude.

DTPOP

This dissertation also outlines the design of a new contingent planner, DTPOP.
• DTPOP is demonstrated to be sound.
• DTPOP is complete in the sense that it identifies the best contingent plan of a given size.
• A mechanism is proposed to add relevant observation actions to the plan.
• A new threat resolution mechanism is introduced based on branching.
• Two techniques are proposed for evaluating and optimizing contingent plans.

1.5 Outline of this dissertation

Chapter 2: Action and Plan Representation
This chapter discusses the knowledge representation used for UDTPOP, including:
• Representation of events;
• Representation of actions, including conditional effects and conditional cost models;
• Plan representation; and
• The influence diagram induced by a partial-order plan.

Chapter 3: UDTPOP
This chapter outlines the design of UDTPOP, a non-contingent decision-theoretic partial-order planner, including:
• A description of the UDTPOP planning problem and algorithm,
• ‘Fat’ causal links,
• The persist-support operation for threat resolution,
• Evaluation techniques for partial plans, and
• Empirical results comparing the performance of UDTPOP with that of Buridan.

Chapter 4: Relevance and Independence
This chapter reviews past research on influence diagrams and probabilistic relevance. Topics discussed include:
• Definitions for probabilistic relevance;
• Definitions for relevant and requisite sets: sets of nodes that are relevant to a probability query or are required for answering that query;
• A description of the Bayes Ball algorithm [Shachter, 98] for identifying the relevant and requisite nodes in belief networks or influence diagrams; and
• Conclusions and future work.

Chapter 5: DTPOP
This chapter outlines the design for DTPOP, a contingent decision-theoretic partial-order planner. Topics covered in this chapter include:
• A detailed example of planning using DTPOP;
• A description of the plan construction and plan optimization algorithms;
• Theorems for soundness and completeness;
• Techniques for identifying open uncertainties; and
• Conclusions and future work.
Appendix A: UDTPOP Proofs
Appendix B: DTPOP Proofs
These appendices contain the proofs for both UDTPOP and DTPOP.


2.0 Action and Plan Representation

This chapter defines the representation used for world states, steps, plans, and objectives in UDTPOP and DTPOP. This representation differs from that of classic planners in four respects:
• The initial world state and the outcomes of actions are uncertain;
• Utility and cost functions, rather than goals, drive the selection of actions during planning;
• Generalized propositions (variables with 2 or more possible states) rather than binary propositions are used to capture salient properties of the world; and
• Step execution is not necessarily contingent on the value for the preconditions of the action (although the effects of the action may be).
In this chapter, I describe
1. The representation used by both UDTPOP and DTPOP for representing actions and the world.
2. The representation used for plans, which are partially-ordered sets of actions.
3. How planning objectives are specified.
4. The difference between this representation and that used by other probabilistic planners.
Plan and action representation features that are unique to DTPOP (observable outcomes and contingent plans) are discussed in Chapter 5.0.


2.1 Variables and Functions

DTPOP and UDTPOP capture the state of the world using a set of discrete domain variables, X = {X1, …, Xn}, that each represent some salient attribute of the world that we wish to model. These attribute variables can each assume one of a finite set (called the domain of the variable) of mutually exclusive values. The state of the world at any one point in time is the conjunction of the values corresponding to these attribute variables. Boutilier [96] calls these variables generalized propositions. The analogue to a literal1 is equality between the variables in generalized propositions and their values (e.g. A = true or B = b).

Boolean functions can be defined on generalized literals.

The state of the world changes as a function of the execution of steps in a plan. In UDTPOP, we use generalized fluents to represent the evolution of attribute variables over time. The generalized fluent Robot-Location(t) maps time into variable values: Robot-Location(t): ℜ → domain(Robot-Location).

All changes in the world state in UDTPOP occur due to step execution; thus the only times of interest are the times at which individual steps are executed. The time immediately after (before) the execution of step S is S+ (S-). Step execution is modeled as atomic; no other step S2 can execute in the interval [S1-, S1+]. Sans serif capital roman letters or words will denote fluents and variables or propositions, e.g. A(t), Battery-Charge, Robot-Location, etc. The values for these functions or variables will be represented using lowercase (e.g. a, charged, home). Sets of functions, variables, or values will be indicated using overlines (e.g. Robot-Locations(t) and places). A capital X will denote the set of all variables used to model the world. The trajectory of world states over time is X(t).

1. Recall that a literal is either the proposition, p, or its negation, ¬p.
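As a concrete illustration of this representation, here is a minimal Python sketch (not the dissertation's implementation; the "office", "lab", and "discharged" values are made up, while "home" and "charged" appear in the text) of attribute variables with finite domains and a world state as one assignment per variable.

```python
# A sketch, not the dissertation's implementation: attribute variables
# (generalized propositions) with finite, mutually exclusive domains.
DOMAINS = {
    "Robot-Location": {"home", "office", "lab"},   # "home" appears in the text;
    "Battery-Charge": {"charged", "discharged"},   # the other values are invented
}

def is_valid_state(state):
    """A world state assigns exactly one domain value to every variable."""
    return (state.keys() == DOMAINS.keys()
            and all(value in DOMAINS[var] for var, value in state.items()))

# The world state at one point in time is the conjunction of these assignments.
X_t = {"Robot-Location": "home", "Battery-Charge": "charged"}
assert is_valid_state(X_t)
```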


2.2 Steps

A plan contains a partially-ordered set of steps S. The description for each step Si in this set consists of a set of conditional effects, Eff(Si), and a set of step cost functions, Cost(Si). Each conditional effect is a conditional probability distribution over a set of variable values given the world state prior to the execution of Si. Each cost function describes one component of the total cost of executing Si. The cost of executing Si is the sum of the individual cost functions in Cost(Si). In UDTPOP, the conditional effects of each step are unobservable. Although the effects of steps may be conditioned on the results of earlier steps in the plan, the set of steps executed during plan execution may not be contingent on the results of steps that were executed earlier in the plan.2 In Chapter 3.0, we will describe UDTPOP, an open-loop planner that can find optimal open-loop plans. In Chapter 5.0, we will extend the representation of S to include observable effects, which will, in turn, allow contingent execution.

2.2.1 Conditional Effects

The effect of a step is a conditional probability distribution over a set of effect variables given a set of values for precondition variables. The effect variables, E(S+), are those variables that can actually change as a function of step execution, at the time immediately after the execution of S. The precondition variables, C, are the variables that affect the outcome distribution over E. Each conditional effect distribution PS{E(S+) | C(S-)} is a distribution over values for a set of effect variables E given the values for a set of precondition variables C.

2. In control theory terms, plan execution is open loop.


I will often refer to the individual nonzero components of the conditional effect distributions. These components are called the conditional outcomes of the conditional effect distribution. Each conditional outcome has the form

<(E1 = e1) ∧ … ∧ (Em = em) | (C1 = c1) ∧ … ∧ (Cn = cn)> = p

and captures the probability that a particular outcome E = e will result given the trigger or precondition C = c. The set of all of the conditional outcomes in a conditional effect distribution CE is COs(CE).
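One way to hold such a distribution in code is sketched below (a sketch under assumed data structures, not UDTPOP's representation); the probabilities are borrowed from the slippery-gripper example discussed in the next section, and the nonzero entries are exactly the conditional outcomes <e | c> = p.

```python
# A conditional effect distribution P_S{E(S+) | C(S-)} as a nested mapping.
# Outer keys are precondition assignments c; inner keys are effect
# assignments e; values are probabilities p.
conditional_effect = {
    (("Gripper", "wet"),): {
        (("Gripper", "dry"),): 0.5,     # <Gripper=dry | Gripper=wet> = 0.5
        (("Gripper", "wet"),): 0.5,
    },
    (("Gripper", "dry"),): {
        (("Gripper", "dry"),): 1.0,
    },
}

def conditional_outcomes(ce):
    """Enumerate the conditional outcomes <e | c> = p with p > 0, i.e. COs(CE)."""
    return [(e, c, p)
            for c, dist in ce.items()
            for e, p in dist.items()
            if p > 0]
```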

2.2.2 Conditional Cost Models

The set of step cost functions Cost(Si) describes how much it costs to execute each step Si under different circumstances. Each conditional cost function maps a subset of the world state to the reals. Multiple cost functions on a step (or across several steps) combine additively.3 If it is possible for E{Cost(S)} ≤ 0, then the optimal plan might be infinitely long. To see why, consider the “slippery gripper” problem presented in the Buridan paper [Kushmerick, et al, 94]. The dry-gripper step dries a robot gripper with probability 0.5 if the gripper is wet and does nothing otherwise. The probability that the robot gripper is dry can be increased to be arbitrarily close to 1.0 by concatenating a sufficiently large number of dry-gripper steps. If the cost of dry-gripper is negative or zero, then the optimal UDTPOP plan will have an infinite number of dry-gripper steps.

Theorem 1 (Sufficient Conditions for a Finite Solution): The non-contingent plan with the maximum expected value has a finite number of steps if there are a finite number of possible action schemata and, for each possible action schema S, ∀Costi ∈ Cost(S), Costi(X) > 0.4

3. They are additive subvalue nodes [Tatman+Shachter, 90].


We will restrict the domains of UDTPOP to contain only steps with positive costs in order to guarantee that we can find an optimal plan. Unfortunately, this “UDTPOP Domain Restriction” will not prevent the discovery of infinite contingent plans in DTPOP for reasons that I will discuss in Section 5.0.
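The following back-of-the-envelope calculation (illustrative only; the goal reward K and per-step cost c are assumed values, not from the text) shows why the restriction matters: with a strictly positive step cost, the value of repeating dry-gripper peaks at a finite number of steps, whereas with zero cost it never stops improving.

```python
def plan_value(n, K=1.0, p=0.5, c=0.01):
    # Expected value of a plan with n dry-gripper steps: probability that the
    # gripper ends up dry (success probability p per attempt, as in the
    # slippery-gripper example) times an assumed goal reward K, minus the
    # total cost of the n steps.
    return K * (1.0 - (1.0 - p) ** n) - n * c

best_n = max(range(51), key=plan_value)   # finite optimum when c > 0
print(best_n, round(plan_value(best_n), 4))
```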

2.2.3 Belief Network Representation for Steps

[Figure 3 appears here: a belief network fragment with precondition variables C1, …, Cm feeding a conditional effect distribution node P{E|C}, effect variables E1, …, En, and a subvalue node Cost{B} holding the step cost function.]

FIGURE 3. Step Model. A belief network fragment representing a single step. A single step can have more than one cost function or conditional effect distribution; only one of each is shown here.

A simple generic belief network representation can be used to represent steps. One such representation is shown in Figure 3. The central node models the joint distribution over all of the effects of the step. The deterministic nodes (double ovals) pull distributions over individual variables out of the joint conditional effect distribution so that they can be used to condition other conditional effect distributions. The diamond is an additive subvalue node [Tatman+Shachter, 90] and contains the cost function for the step. The expected cost for all of the steps in the plan is just the sum of the expected costs for each of the steps in the plan.
4. Williamson [96] allows steps with negative cost or resource use but solves the plan existence problem by adding the restriction that no step can increase the net amount of resources available to later steps. For example, we might sell one resource (say gasoline) to increase our wealth, but we cannot execute steps that increase the total value of all of the resources (gas and wealth) available to us.

2.2.4 Example: Stress World

We will use a simple example to illustrate the step and state representation used in UDTPOP.

[Figure 4 appears here: belief network fragments for the drink-coffee and write steps, each built from Alert and Pages variables, conditional effect distributions P{E|C}, and cost nodes C{B}.]

FIGURE 4. Steps in Stress World.

Harvey Hacker5 is a typical caffeine-powered graduate student. In order to maximize his productive output (measured in pages written), he needs to carefully modulate his coffee intake. When he has quaffed exactly the right amount of caffeine, he is “in the zone” and is maximally productive. If he has quaffed too much or too little coffee, he is working outside “the zone” and his productivity suffers. The generalized propositions used to model Harvey’s state in Stress World might include:
Alert : Harvey’s state of “alertness,” one of {asleep, conscious, in-the-zone, wired}.
Pages : The number of pages that Harvey has written. One of {0, …, n}.
5. Harry’s brother for those of you that care about that sort of thing...


Harvey’s state is completely determined at any point in time by values for the variables X = {Alert, Pages}. At T1, his state might be X(T1) = {asleep, 0}; at another time, his state might be X(T2) = {wired, 10}, etc. Stress World contains two steps: drink-coffee and write.
drink-coffee: Coffee has an uncertain effect on Harvey’s level of alertness that is dependent on Harvey’s level of alertness prior to drinking coffee. The conditional outcomes describing the effect of coffee on Harvey are shown in Table 2. Alert is both the precondition and effect variable of drink-coffee.

Preconditions        Outcomes             Probability
Alert(Sjava-)        Alert(Sjava+)        P{Alert(Sjava+) | Alert(Sjava-)}
asleep               asleep               0.2
                     conscious            0.8
conscious            conscious            0.2
                     in-the-zone          0.8
in-the-zone          in-the-zone          0.2
                     wired                0.8
wired                wired                1.0

TABLE 2. The drink-coffee step.



The entries of Table 2 with nonzero probability are conditional outcomes of this step, because the probability of each of these conditional outcomes is greater than zero. <in-the-zone | asleep> is not a conditional outcome, because P{in-the-zone | asleep} = 0.0.

write: Writing “burns up” the caffeine in Harvey’s blood stream (lowering his level of alertness) and produces pages of written material. write contains two independent conditional effect distributions:
• one describing the effect of Harvey’s state of alertness and the amount written thus far on the total number of pages written, and
• one describing Harvey’s state of alertness as a function of his previous state of alertness.
The precondition variables for the first conditional effect are {Alert, Pages}. The outcome variable is Pages. Harvey’s output is highest when he has quaffed exactly the right amount of coffee and his state of alertness is “in-the-zone.” His productivity drops to zero if he has quaffed too much caffeine (“wired”) or if he is not very alert (“asleep”). The conditional probability distribution describing this effect is shown in Table 3.

Preconditions                          Outcomes             Probability
Alert(Swrite-)     Pages(Swrite-)      Pages(Swrite+)
asleep             n                   n                    1.0
conscious          n                   n                    0.2
                                       n+1                  0.8
in-the-zone        n                   n+1                  1.0
wired              n                   n                    1.0

TABLE 3. The effect of the write step on total number of pages written.

The precondition and effect variable for this second conditional effect is Alert. The conditional probability distribution describing this effect is shown in Table 4.

Preconditions        Outcomes             Probability
Alert(Swrite-)       Alert(Swrite+)       P{Alert(Swrite+) | Alert(Swrite-)}
asleep               asleep               1.0
conscious            asleep               0.1
                     conscious            0.9
in-the-zone          conscious            0.1
                     in-the-zone          0.9
wired                in-the-zone          0.1
                     wired                0.9

TABLE 4. The effect of the Write step on Harvey’s state of alertness.
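To show how a step like drink-coffee can be encoded, here is a small Python sketch (assuming the nested-dictionary form used earlier, which is not the dissertation's implementation) that tabulates the conditional effect distribution from Table 2 and samples a new value for the Alert variable.

```python
import random

# The drink-coffee conditional effect distribution, transcribed from Table 2:
# P{Alert(Sjava+) | Alert(Sjava-)}.
DRINK_COFFEE = {
    "asleep":      {"asleep": 0.2, "conscious": 0.8},
    "conscious":   {"conscious": 0.2, "in-the-zone": 0.8},
    "in-the-zone": {"in-the-zone": 0.2, "wired": 0.8},
    "wired":       {"wired": 1.0},
}

def apply_drink_coffee(state):
    """Sample Harvey's new alertness; variables not mentioned (e.g. Pages) persist."""
    dist = DRINK_COFFEE[state["Alert"]]
    outcomes, probs = zip(*dist.items())
    new_state = dict(state)                 # copy, then change only Alert
    new_state["Alert"] = random.choices(outcomes, weights=probs)[0]
    return new_state

print(apply_drink_coffee({"Alert": "asleep", "Pages": 0}))
```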

2.2.5 Frame Assumptions

A conditional effect describes the explicit effect of a step on a variable or set of variables. If a variable, A, is not mentioned in a conditional effect of step S, then the value for the function corresponding to that variable does not change when S is executed, e.g. A(S+) = A(S-).

This frame assumption6 [McCarthy+Hayes, 69] is shared with most other

causal link planners. For example, drink-coffee does not explicitly influence the number of pages written. Thus, the number of pages written thus far is the same immediately before and immediately after the drink-coffee step is executed. I will also assume no spontaneous action. If no step is executed between time t 1 and time t2 ,

then the state of the world, X , does not change, e.g. X(t 1) = X(t 2) . The values of func-

tions representing the world state may only change due to the execution of steps.7 All of the direct and indirect effects of a step must be captured explicitly in the conditional effect distribution of the step. One of the ramifications [Ginsberg+Smith, 88a; 88b] of inverting a container is that the contents of the container might drain out. If we desire to accurately model all of the direct and indirect ramifications of this step, then we need to 6.

Called the Law of Persistence [Georgeff, 86].

7.

It is possible to model the effect of possibly relevant exogenous events by inserting them as dummy actions into the initial plan. See Blythe [96].

21

explicitly represent the result of inverting the container when it is full of sugar, water, concrete, acid, etcetera.
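The sketch below (hypothetical function names and a made-up one-variable step, not from the dissertation) shows how the frame assumption is typically realized when a step's conditional effects are applied to a world state: every variable the step does not mention is simply copied forward, so A(S+) = A(S-).

import random

def apply_step(state, conditional_effects):
    """Sample the world state after a step. conditional_effects maps each
    outcome variable to a function from the pre-step state to a
    {value: probability} distribution. Any variable the step does not mention
    keeps its old value, i.e. A(S+) = A(S-)."""
    post = dict(state)                      # frame assumption: copy everything
    for var, dist_of in conditional_effects.items():
        dist = dist_of(state)
        post[var] = random.choices(list(dist), weights=list(dist.values()))[0]
    return post

# A hypothetical step that only touches Alert; Pages persists untouched.
wake_up = {"Alert": lambda s: {"conscious": 1.0}}
after = apply_step({"Alert": "asleep", "Pages": 2}, wake_up)
assert after["Pages"] == 2 and after["Alert"] == "conscious"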

2.3 Goals and Utility

The objective of both UDTPOP and DTPOP is to identify the plan with maximum expected value.8 The utility of a plan in UDTPOP is composed of two components. The first component is the reward. The reward, denoted R ( X(P+) ) , is a function on a subset of the variables describing the state of the world after all of the actions in plan P have been executed. The second component of the value function is a set of cost functions, each denoted Cost S, i(X(S-)) , that penalizes each step in the plan.9 The expected value of the plan is the expected reward minus the sum of the expectations of the cost functions in the plan:

V(P) = Σ_{x ∈ X(P+)} R(X(P+) = x) P{ X(P+) = x }
       − Σ_{S_j ∈ P} Σ_{Cost_{S,i} ∈ Cost(S_j)} Σ_{x ∈ X(S_j-)} Cost_{S,i}(X(S_j-) = x) P{ X(S_j-) = x }     (1)

One popular kind of objective used in AI planning is the goal. A goal is a reward that provides a fixed reward iff the final world state is one of a set of goal world states. That is,

R ( X(P+) ) = { K , if X(P+) ∈ goal ; 0, otherwise }     (2)

where goal is the set of desired world states. With this definition, goals are provisional; if a goal is too expensive to achieve, the planner will abandon that goal.10

8. Synonymous with utility when the decision-maker is risk neutral.
9. It is easy to generalize the cost function to be a function of both the outcomes and the preconditions of an operator.
10. Contrast this definition of goal with that of Wellman & Doyle [1992]: (paraphrased) A goal is a strict preference relation on world states. Any world in which the goal is achieved is superior to any world in which the goal has not been achieved.

UDTPOP can be used to emulate a probabilistic planner like Buridan by setting all of the step costs to zero and setting the reward function to a goal function with K set to 1.0; e.g. ∀S ∈ P, Cost S(X(S-)) = 0 and R ( X(P+) ) = { 1, if X(P+) ∈ goal ; 0, otherwise } . The value of this plan is exactly equal to the probability that the goal is achieved. We will use this trick in order to compare the performance of the utility-based planner UDTPOP with the goal-based Buridan planner in Section 3.7.
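As a worked sketch of Equation (1) (hypothetical function and variable names; the distributions are invented for illustration), the value of a plan is the expected reward over the final-state distribution minus the expected step costs; with zero costs and the goal reward above, V(P) collapses to the goal probability.

def expected_value(reward, p_final, step_costs):
    """Equation (1): expected reward over the distribution of the final world
    state, minus the sum of the expected step costs, each taken over the
    distribution of the world state just before that step executes.
    reward:     maps a final world state to a real reward
    p_final:    {final_state: probability}
    step_costs: list of (cost_fn, p_before) pairs, one per cost function"""
    v = sum(reward(x) * p for x, p in p_final.items())
    for cost_fn, p_before in step_costs:
        v -= sum(cost_fn(x) * p for x, p in p_before.items())
    return v

# Goal-style reward with K = 1 and zero step costs: V(P) equals the goal probability.
goal = {("Pages", 4)}
reward = lambda x: 1.0 if x in goal else 0.0
print(expected_value(reward, {("Pages", 4): 0.72, ("Pages", 3): 0.28}, []))   # 0.72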

2.4 Assembling Steps into Plans

A plan is a directed acyclic graph of actions augmented with bookkeeping information that summarizes commitments made during the planning process. A UDTPOP plan is a tuple, <S, O, L, K>, where S represents the steps in the plan (the nodes of the directed graph); O denotes ordering constraints (the arcs of the directed graph); L represents the causal links between steps in the plan; and K represents a set of probabilistic constraints. The steps and ordering constraints define the actual plan. The causal links, L , commit to and protect the relationships between the effect of one step and the precondition of another step. UDTPOP constructs plans by identifying deficiencies in this causal structure. The probabilistic constraints, K , restrict the planner search space either to reduce search space redundancy or to prevent discovery of provably nonoptimal plans.

Goal and Initial Condition Steps

There are two distinguished dummy steps in every plan: a goal step, S Goal , and an initial conditions step, S IC . These steps are added to the plan to simplify the design of the planner11; the same planner mechanisms used to handle links between the steps in the plan itself are used to construct links to the initial conditions and to the preconditions in the goal step. The initial conditions step, S IC , is ordered before every other step in the plan and has no preconditions or cost functions. The conditional effect distributions of S IC contain distributions over all of the attribute variables in the domain. Every variable that appears as a precondition of any other step must also be an outcome variable of the initial conditions step. The goal step, S Goal , is constrained to occur after every other step in the plan and has no conditional effects. The cost function for S Goal is the reward function capturing the objectives for the planner. The UDTPOP domain restriction does not apply to S Goal ; the reward function may return any bounded, real value.

11. This is the standard trick for handling initial conditions and goals in planners [Weld, 94].

Ordering Constraints

The temporal-ordering constraints O restrict the time of execution of some steps to occur before or after the time of execution of other steps. The constraint S 1 < S 2 indicates that S1

must complete its execution before the execution of S 2 begins.

Causal Links

The plan’s causal links, L , record commitments made by the planner in order to establish a source of support for preconditions. The causal link S E →X S C indicates that the planner has committed to using an outcome variable of S E in order to establish a distribution over precondition variable X for step S C . If there exists a causal link S E →X S C connecting two steps S E and S C , we will say that S E (the establisher of the causal link) establishes precondition variable X for step S C (the consumer of the causal link). A causal link represents the following commitment:

Definition 1 (Causal Link Commitment): If a complete plan P has a causal link S E →X S C and steps S , then
1. X is an outcome variable of S E and a precondition variable of S C , and
2. S E < S C , and
3. For all S T ∈ S , if it is possible for S T to execute between the times that S E and S C execute, then X cannot be an effect variable of S T .
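A minimal sketch of this commitment check, under the assumption that a causal link is stored as an (establisher, consumer, protected variable) record (the class and step names are hypothetical, not the dissertation's): for a totally ordered completion, no step strictly between the establisher and the consumer may have the protected variable among its effect variables.

from dataclasses import dataclass

@dataclass(frozen=True)
class CausalLink:
    establisher: str   # S_E
    consumer: str      # S_C
    variable: str      # X, the protected precondition variable

def commitment_holds(link, total_order, effect_vars):
    """Check Definition 1 against one completion (a total order of step names)."""
    i, j = total_order.index(link.establisher), total_order.index(link.consumer)
    if i >= j:
        return False                       # S_E must execute before S_C
    between = total_order[i + 1 : j]
    return all(link.variable not in effect_vars[s] for s in between)

link = CausalLink("S_W1", "S_Goal", "Pages")
order = ["S_IC", "S_W1", "S_W2", "S_Goal"]
effects = {"S_IC": {"Pages", "Alert"}, "S_W1": {"Pages", "Alert"},
           "S_W2": {"Pages", "Alert"}, "S_Goal": set()}
print(commitment_holds(link, order, effects))   # False: S_W2 intervenes and touches Pages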

Note that this definition of causal link is slightly different than that used in Buridan and other causal link planners [Kushmerick, et al, 94; 95; McAllester+Rosenblitt, 91; Penberthy+Weld, 92; Peot+Smith, 92]. Rather than protecting an individual variable value, this causal link protects the distribution over the set of mutually-exclusive values denoted by a single variable. The causal link commitment means that X i(S E+) = X i(S C-) in every completion of a partial plan.12 Each causal link corresponds to an arc (or set of arcs) from the node representing the effect X i(S E+) to the conditional effects of S C that have X i in their preconditions. This relationship is shown below in Figure 5. In this diagram, a single causal link corresponds to two arcs in the corresponding influence diagram.

12. Actually, this commitment is partially rescindable. See Section 3.3.5.

The causal link commitment on a particular link S E →X S C is threatened if there exists a step S T that can execute between S E and S C that has the same effect variable X protected by the causal link. In order to preserve the commitment denoted by S E →X S C , UDTPOP will resolve this threat using one of a number of threat resolution techniques.

FIGURE 5. Causal Links. Influence diagram arcs induced by a causal link. If the plan has a causal link between SC and SE protecting Xi, then the plan model will contain an arc from the probability node representing Xi in SE to all of the conditional effect distributions in SC that are conditioned by Xi.

The plan is called complete if there are no flaws in the causal link structure of the plan, that is:
• each relevant precondition variable is supported by a causal link, and
• the causal link commitment for each causal link holds for each topological sort of the plan (e.g. there are no threats).

2.5 Example

A complete UDTPOP plan is illustrated below. This plan models the effect of two write steps. In this plan, the reward utility function is a function only of the number of pages written. During planning, the planner commits to using S W2 to establish the number of pages for the Pages precondition of the goal step. The single causal link from S W2 to S Goal

captures this commitment.

Both of the write steps rely on the use of other steps to establish support for their preconditions, Pages and Alert. These commitments are captured by the causal links between the initial conditions step and S W1 and the causal links between S W1 and S W2 .

FIGURE 6. A short plan.

2.6 Related Work

The step and variable representation scheme proposed in this chapter is similar to schemes proposed by several authors. In this section, we will describe how these different representation schemes compare and justify some of the choices made for the representation used in UDTPOP.

2.6.1 Attributes of Probabilistic Steps

The knowledge representation schemes proposed for probabilistic planners can be characterized in terms of four attributes:
• Attribute cardinality: 2 or n.
• World state representation: factored vs. flat.
• Distribution symmetry: Symmetric vs. Asymmetric.
• Distribution completeness: Complete vs. Incomplete.

2.6.2 Attribute Cardinality

The attribute cardinality is the number of states that each attribute variable can assume. The alternatives are binary (propositions) or non-binary (the variable/state or generalized proposition [Boutilier, et al, 96] approach discussed in this chapter). Most probabilistic planners use propositions, including [Blythe, 94, 96; Draper, et al, 93, 94; Goldman+Boddy, 94a, 94b, 96; Haddawy, 91; Hanks 90; Kushmerick, et al, 95; Milani, 94; Pryor+Collins, 93] although a few others use variables [Boutilier, 94; Doan, 96; Doan+Haddawy, 95] or make no particular commitment on a technique for representing components of the world state [Dean, et al., 93]. UDTPOP uses generalized propositions for a number of reasons:
• The concept of mutual exclusion is very natural when describing the properties

of objects within a domain. For example, objects might have only one location at one time. Things can be living or dead, but not both. When a propositional representation is used, only the mutual exclusion of Fact and ¬Fact is guaranteed. Other mutual-exclusion relationships have to be discovered through inference. • In classical planners, mutual exclusion relationships between propositions are

enforced in a relatively unnatural way by carefully engineering each action schema to maintain the mutual-exclusion relationship. For example, a movement operator that asserts at(object, new-location) must also assert ¬at(object, oldlocation). If this mutual exclusion relationship is violated in any one of the action schema, then the plans generated by the planner are no longer guaranteed to be sound. For example, if we accidentally write that the only effect of a movement operator is at(object, new-location), then it may be possible that both of the fluents, at(object, new-location) and at(object, old-location), will be true at the same time. • The mutual exclusion of variable bindings also allows us to infer whether it

might be possible to add a step or resolve a threat to a link by examining the history of bindings for a particular variable. Suppose that there is a continuous chain of causal links protecting the distribution over a specific variable, X , that connects the initial conditions step with the goal step (see Figure 7). If there is a threat to any one of the causal links protecting X , then we know that that threat cannot be removed through promotion or demotion.

FIGURE 7. A chain of causal links protecting attribute variable X ( S IC →X S 1 →X S 2 →X S 3 →X S Goal ).

• The ‘easy’ probabilistic representation of an attribute variable is much more

compact than the most obvious probabilistic representation for the equivalent set of propositions (unless extra machinery is used in the propositional planner to detect mutual exclusion). For example, in Stress World, there are 4 mutually-exclusive levels of alertness. If we use belief networks to model uncertainty in this domain, only one discrete variable with 4 possible states is required to model Harvey’s level of alertness at any one point in time. In a naive propositional approach, we might use 4 binary variables, each representing one possible level of alertness. The joint distribution of these variables has 16 components ( 2^4 ). However, only 4 of these components can be non-zero. In order to derive this fact, the evaluator needs to consider all of the steps in the plan in order to derive the proper mutual-exclusion relationships or rely on the user to explicitly declare the propositions to be mutually exclusive [Breese, 92]. If the number of possible states in the alertness variable was increased to 10, the joint distribution over the propositions representing the different states of this variable would contain 1024 components; only 10 of which are non-zero.

The shift of representation from propositions to attribute variables is not completely without cost. The variable representation makes it easy to model mutual exclusion of attributes when they vary across one dimension, but may make it more difficult to model more complicated mutual exclusion relationships. Say that we are attempting to model trains in a railroad domain. The variable representation of UDTPOP makes it easy to capture the mutual exclusion of possible locations for each train13, but makes it more difficult to write step descriptions that encode the mutual exclusion of trains per each location14 without the use of a second dependent variable. In these situations, it is still possible to fall back to a purely propositional representation, by making each proposition a binary attribute variable.

13. Each train can only be in one location.
14. Each location can host only one train.

2.6.3 World State Representation

Almost all probabilistic planners (including UDTPOP and DTPOP) reason with a propositional or factored representation based upon some variant of STRIPS [Fikes+Nilsson, 71]. All of these planners represent probabilistic steps as a (possibly factored) conditional probability distribution over a set of outcome variables given a set of precondition variables. An alternative to this representation is that of a traditional Markov Decision Process (MDP) [Dean, et al., 93]. An MDP uses a step representation based on a transition matrix between all of the possible world states. The factored or propositional representation is typically exponentially smaller than the equivalent MDP representation.

Even though the representation is exponentially smaller, a planner that uses a factored representation is not necessarily more efficient than an MDP. The plan evaluation and plan existence problems for the MDP representation are PL-complete and NP-complete, respectively [Goldsmith, et al, 97].15 The plan evaluation and plan existence problems for UDTPOP, on the other hand, are PP-complete and NP^PP-complete, respectively, for polynomially bounded plans.16 The complexity of partial-order probabilistic planning is greater due to the relative sizes of the representations: the size of the problem specification for an MDP is exponential in the number of state variables used.

15. The class PL is the set of problems that can be solved on a Turing machine in polynomial time and logarithmic space.

2.6.4 Distribution Symmetry

In a symmetric representation of conditional effects, the same precondition and outcome variables appear in each conditional effect. Many planners, notably Buridan [Kushmerick, et al., 94], DRIPS [Haddawy+Doan, 94], and the MDP planners developed by Boutilier [Boutilier, et al, 96] allow conditional effects to have the structure of an unbalanced tree. Such a tree is shown in Figure 8. The set of triggers C i are collectively exhaustive and mutually-exclusive boolean expressions on the set of all preconditions of the step.

FIGURE 8. Asymmetric conditional effects.

UDTPOP uses a symmetric representation where each of the C i in the figure above are conjunctions over the same set of precondition variables, but an asymmetric representation is clearly a good idea. An asymmetric distribution is potentially exponentially smaller than one that is symmetric.

16. The notation NP^C-complete means that the problem would be NP-complete if there is an oracle for problems in class C. The canonical PP-complete problem is MAJSAT: “do the majority of assignments satisfy a 3CNF?” [Papadimitriou, 94] Theorists think that PP is really hard because every problem in the polynomial hierarchy can be reduced to P^PP [Toda, 89], the class of problems solvable on a polynomial time Turing machine using a PP oracle. The canonical NP^PP-complete problem is E-MAJSAT: “given a 3CNF with proposition sets A and B, is there an assignment for A such that the 3CNF is satisfied for the majority of assignments for B?” [Littman, 97]

2.6.5 Distribution Completeness

Buridan uses a single tree to represent the outcomes of a step. This tree does not specify a complete distribution over an effect variable, but rather specifies how the step can change the state of the variable. This is illustrated by the “slippery gripper” example from Kushmerick, et al [94]. dry-gripper dries a robot gripper so that it can pick up a block successfully with high probability. The robot gripper’s state, Dryness(t), can either be wet or dry. dry-gripper dries the gripper with probability 0.5 if the gripper is wet and has no effect on the gripper if the gripper is dry. The effect of this step is modeled as a single (Buridan-style) conditional effect <∅, dry, 0.5> . This representation is incomplete because the probabilities of the conditional outcomes affecting Dryness do not sum to 1.0 . In UDTPOP, we explicitly model the persistence of dryness. For this example, the equivalent complete conditional effect for dry-gripper would be:

P { Dryness(S+) = dry | Dryness(S-) = dry } = 1.0
P { Dryness(S+) = wet | Dryness(S-) = wet } = 0.5
P { Dryness(S+) = dry | Dryness(S-) = wet } = 0.5

The only effective outcome in this conditional effect is 〈 Dryness(S+) = dry | Dryness(S-) = wet 〉 . The rest of the conditional outcomes in this step

are designed to passively persist their preconditions. UDTPOP uses a symmetric representation for conditional effects because this representation makes explicit that the probability of dryness after dry-gripper is executed is dependent on how dry the gripper was before the step was executed. This allows UDTPOP to increase the probability of success for dry-gripper by finding the appropriate support for the implicit precondition Dryness . In addition, most influence diagram and belief network algorithms (barring those listed in [Hanks, 90] and [Kushmerick, et al, 94]) are designed to take advantage of the Markov property: the distribution over any conditional event is independent of its non-descendants given the values of its predecessors. The distributions for

incomplete nodes must be “completed” by the inference algorithm before the inference algorithm can reason with them. Obviously, any incomplete distribution can be turned into a complete distribution by adding passive conditional outcomes. The complete form of a conditional effect distribution is, in the worst case, exponentially larger than the incomplete form of the conditional effect (growth is exponential in the number of outcome variables).
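The completion step described above is mechanical. The sketch below (hypothetical function name; the representation is an assumption, not the dissertation's data structure) turns a change-style, incomplete description into a complete conditional distribution by adding passive outcomes that persist the prior value, reproducing the completed dry-gripper distribution.

def complete_distribution(change_outcomes, values):
    """Add passive conditional outcomes so each row of the conditional
    distribution sums to 1.0 (persisting the prior value of the variable)."""
    table = {}
    for prior in values:
        dist = dict(change_outcomes.get(prior, {}))
        dist[prior] = dist.get(prior, 0.0) + (1.0 - sum(dist.values()))   # passive remainder
        table[prior] = dist
    return table

# dry-gripper, stated only as "dries a wet gripper with probability 0.5":
incomplete = {"wet": {"dry": 0.5}}
print(complete_distribution(incomplete, ["wet", "dry"]))
# {'wet': {'dry': 0.5, 'wet': 0.5}, 'dry': {'dry': 1.0}}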


3.0 UDTPOP: Noncontingent Planning

UDTPOP is an open-loop1 decision-theoretic partial-order planner. Although individual steps may have effects that are dependent on the effect of previous steps, the sequence of steps that is actually executed is not dependent on information observed during runtime. The same step sequence is executed regardless of the state of the world.

In this sense,

UDTPOP plans are similar to the plans generated by Buridan [Kushmerick, et al, 1994]. This kind of planning is called conformant planning [Goldman+Boddy, 1994b]. UDTPOP constructs the plan that maximizes the value of a multi-attribute value function. This value function is the sum of a reward value function and step cost functions. UDTPOP can trade planning objectives against each other or against the overall cost of the plan. UDTPOP is not required to identify a plan that accomplishes the objective; if the objective is too expensive, the planner will abandon that objective.

3.1 Overview

This chapter includes:
• an introduction to the UDTPOP algorithm;
• an extended example;
• a description of the formal properties of UDTPOP;
• an empirical comparison of UDTPOP with Buridan; and
• a long discussion on design issues and extensions of UDTPOP.

1. This is a term from control theory: “Open loop” means that you cannot observe the state of the system while you are controlling it. The opposite is “closed loop” meaning that you can observe the system while controlling it and compensate for errors in control.

In order to simplify the presentation, many of the proofs have been moved to Appendix A.

3.2 The Basic Ideas

This section will focus on the key insights that led to the development of UDTPOP. These insights center around 1) the nature of the commitment implied by each causal link and 2) a simple technique for recognizing pertinent steps when steps are deterministic or near deterministic. In sections 3.2.1 and 3.2.2, I will discuss the first of these insights. I will contrast two alternative strategies for adding causal links to probabilistic plans: multiple support, used by Buridan [Kushmerick, et al, 94], and single support, used by UDTPOP. We will argue that a single-support planner can be much more efficient than a multiple-support planner. In order to realize the potential efficiency gain of single-support planners in domains that contain steps with a strong degree of determinacy, we need to be able to quickly prune the list of steps that are not relevant to an open precondition variable. Section 3.2.3 describes a simple technique that allows the planner to determine whether a step is effective, that is, it contributes in a positive way to the overall mission of the plan.

3.2.1 Multiple Support

Multiple-support was pioneered in the Buridan planner [Kushmerick, et al, 94]. In multiple support, a single precondition can be established by multiple steps. The multiple-support strategy can be characterized by the following three properties:
1. The step representation for a multiple-support planner describes how each step changes the world (recall 2.6.5).
2. This change-based representation allows a planner to use multiple causal links to establish a single precondition proposition.
3. No a priori restriction is made on the order of execution between multiple steps supporting the same precondition (ordering constraints may be added later to resolve threats).

Over the next few pages, I will explain what these properties mean and what their implications are for probabilistic planning. In Buridan, the effects of a step are described using a set of conditional outcomes <c i, e i, p i> . Each trigger, c i , of each conditional outcome is a set of propositions. The triggers for the set of conditional outcomes are mutually exclusive ( ∀c i ≠ c j , c i ∧ c j = ⊥ ) and collectively exhaustive ( c 1 ∨ … ∨ c n = T ). Each effect e i is a set of propositions denoting the set of changes that are made to the state of the world if that outcome is the result of step execution. Any e k can be empty, indicating that conditional outcome has no effect on the world state. The set of conditional effects defining a step can be represented as a probability tree. The leaves of this tree indicate the individual outcomes ( e i ). The path from the root of the tree to each leaf encodes the trigger, c i , of each conditional outcome. Figure 9 illustrates a hypothetical drink-coffee step. If the coffee is caffeinated, drink-coffee changes the drinker’s level of alertness to conscious with an 80% probability (Leaf 1 in Figure 9) and has no effect at all with a 20% probability (Leaf 2). If the coffee is decaffeinated, drink-coffee results in consciousness with a probability of only 5% and has no effect 95% of the time.

FIGURE 9. A Buridan step.

Causal links in Buridan record the planner’s commitment to use the contribution of individual leaves of conditional effect distributions to establish support for precondition prop-


ositions. A “Buridan-style” or “thin” causal link links a single conditional outcome O of one step S E to a single precondition value p of another step S C . This thin causal link captures the commitment that p is true at the time that S C is executed because effect O of S E makes it true, at least some of the time. Buridan can increase the probability of p by adding more causal links to p , each originating in a distinct leaf from the same step or from a different step. The basic idea: the original thin causal link only causes p to be true some of the time–if we support p with additional causal links, the establishing steps for these additional causal links might make p true in case the original establishing step fails. Buridan can also increase the probability of an effect protected by a thin causal link by adding support to the trigger propositions of the conditional outcome that establishes the link. I define a multiple-support planner to be a planner that uses a change-based step representation with thin causal links. Consider the following example from Stress World. Harvey desires to be conscious so that he can drive home safely. Harvey can increase the probability of a successful drive by drinking coffee before departing, because if the coffee is caffeinated, he will be conscious with a probability of at least 80%. The causal link capturing this planning commitment runs from the effect Alert = conscious of the leftmost branch of drink-coffee’s probability tree to the precondition Alert = conscious of drive-home (see Figure 10).

FIGURE 10. Increasing the probability of a precondition with a causal link.

The probability of driving home successfully might be increased still further by committing to use the other conscious effect of drink-coffee (Figure 11) or, better still, by drinking two cups of coffee (Figure 12). The reasoning: If either cup of coffee wakes Harvey up, then Harvey is guaranteed to get home.2

FIGURE 11. Increasing the probability of a precondition with multiple links from the same step.

2. In the latter case, Buridan does not need to commit to the order of the two drink-coffee steps. We will see later that a single-support planner, such as UDTPOP, must impose “artificial” ordering constraints on steps that all support the same precondition. Because of this, Buridan can find more general plans than UDTPOP. In Section 3.10.5.2, we show how some of this generality can be recovered.

FIGURE 12. Increasing the probability of a precondition using a causal link from a different step. Adding a second causal link increases the probability of consciousness from at least 0.8 to at least 0.96 if both cups of coffee are caffeinated.

If Buridan wanted to increase the probability of the partial plan in Figure 12 still further, it might attempt to add support to any conditional outcome that contributes a causal link. In this case, the “open preconditions” would be the trigger CoffeeType = caf on each of the drink-coffee steps. In addition, the Alert = conscious precondition on drive-home remains open, because Buridan could continue to add additional links (such as more drink-coffee) to support it. Buridan would not attempt to support CoffeeType = decaf until it uses one of the conditional outcomes that depend on this trigger proposition.

3.2.2 Single Support

UDTPOP and DTPOP are based around a competing notion, that of single support. Single support differs from multiple-support in three respects:
1. The step representation describes a complete distribution over all of the variable values that can result after the execution of the step. If a variable appears in one conditional outcome, it must appear in all of them.
2. Only one causal link establishes support for each open precondition.
3. If multiple steps can support the same precondition, only one supports that precondition directly. The rest must add their contribution indirectly through the passive conditional outcomes of intervening steps.

In a single-support planner, the distribution over a particular variable of interest, V , is established by the last step that can affect the probability distribution over V . A single fat causal link is used to capture the distribution over the precondition variable.3 Once this primary causal link has been added to the plan, the formerly open precondition may only be supported indirectly by satisfying the preconditions of the step that established the causal link. There is a philosophical argument for the single support approach – if two actions both affect a single proposition, it is usually the case that the actions cannot be independent and cannot be combined independently as they are in Buridan. When this somewhat unrealistic independence assumption does not hold, the performance of Buridan suffers. In the coffee example used in the previous section, the drink-coffee step either increases the level of alertness or persists the level of alertness that existed prior to drinking the coffee. Harvey’s level of alertness, which was not a precondition of the Buridan-style step, is a precondition of drink-coffee in a single support planner. In order to increase the probability that Harvey is alert, we add support that changes the probability of the preconditions of drink-coffee: we can either increase the probability that the coffee is caffeinated, or increase the probability that the driver was awake prior to drinking this particular cup of coffee (Figure 13).

3. This causal link is “fat” because it captures the same protection and commitment implied by several “thin” causal links.

FIGURE 13. Single Support. In single support, only one causal link is used to satisfy a precondition. In this case, Drink-Coffee establishes a distribution over Alert for Drive-Home.

Figure 14 illustrates the UDTPOP equivalent of multiple support. Rather than adding a second causal link to the Alert precondition variable of drive-home, a single-support planner attempts to increase the probability of Alert = conscious by increasing the probability of the Alert = conscious precondition of Drink-Coffee.

FIGURE 14. Increasing the probability of a precondition using a causal link from a different step.

In a single-support planner new steps are always added to the leaves of the causal link tree rather than at arbitrary points as they are in a multiple-support planner. This ‘artificial’


restriction on step order is one of the key distinctions between a multiple-support planner like Buridan and a single-support planner like UDTPOP. The preconditions for a step are the variables4 that condition the conditional effect distribution used to establish that step’s “outbound” causal links. In deterministic or near deterministic domains, the techniques discussed in Section 3.2.3 are used to restrict the set of steps that are relevant to any particular precondition. This mechanism is the key to the efficiency of UDTPOP on deterministic domains. In summary, the single support approach that I outline differs from the multiple support planner Buridan in three respects: 1. Single Support for Preconditions: A single source of support is used for each precondition. 2. Variables: Variables are used instead of propositions. 3. Fat Causal Links: If there is a causal link protecting a given effect variable, it protects all of the generalized propositions concerning that variable that are asserted by the establisher for the causal link. The empirical section (Section 3.8) demonstrates that a single support planner can be much more efficient than a multiple support planner. We list (and briefly explain) several of the reasons below. We will revisit many of these issues in more detail in Section 3.9.

Decreased search space breadth:

When adding a causal link to a plan, the single-support planner has to choose between one of several possible steps that can influence a given precondition variable. Because a Fat Causal Link is used to capture all of the effects on the given precondition variable, there is no need to select between the individual conditional outcomes of the step. A multiple-support planner, on the other hand, must not only decide between these steps, but must also choose which of the conditional outcomes of the step to use in order to support the precondition (Figure 15). This additional choice tends to increase the breadth of the search tree, increasing the overall size of the search space.

4. Not just a single value of the variable.

FIGURE 15. Choices in Single and Multiple Support. A single support planner only needs to decide between steps that can influence a given precondition variable (A total of 2 options in the figure above.). A multiple support planner needs to also decide which of the specific effects should be used to establish a given precondition (A choice between 4 possible options).

Decreased search space depth:

The use of Fat Causal Links reduces the depth of the single-support planner’s search space. A single-support planner only needs to add a single causal link to capture all of the influences that one step S A has on a variable Xi that is pertinent to another step S B . In order to capture the same set of influences, several causal links need to be added: one from each conditional outcome on S A that influences a desirable value of variable Xi.5 Each additional bit of structure forces the multiple-support planner to make additional plan construction decisions, increasing the depth of the search space.6

5. The behavior of Buridan depends on the evaluator used. The FORWARD evaluator [Kushmerick, et al, 93, 95] uses only the steps in the plan and the ordering constraints to determine the minimum probability of the goal, thus it is insensitive to the number of links between steps as long as enough links are present to force Buridan to add enough preconditions to the plan. The REVERSE evaluator, on the other hand, only uses the causal links that are actually in the partial plan to calculate the goal probability. When the goal probability is set sufficiently high, Buridan may have to add enough links between two steps to capture all of the influences on a single proposition.

FIGURE 16. Proliferation of Structure in Multiple Support Planners. A single-support planner needs to add only one fat causal link in order to capture all of the influences that an establishing step has on a single precondition variable. A multiple-support planner may need to add a causal link for each leaf of the conditional effect tree in order to model the full spectrum of effects of the establisher on the precondition variable. Adding each ‘thin’ causal link increases the depth of the solution in the search space.

Nonsystematic search:

The space is said to be systematic if each partial plan appears in only one place in the search space. Search is highly nonsystematic in multiple-support planners. Imagine that step S A has a precondition Xi. Imagine further that step S B has several (say 10) conditional outcomes that all influence proposition Xi. When the multiple-support planner first considers linking S B to S A , it must choose one of 10 possible ways to add the first causal link. If the planner attempts to increase the probability of Xi by adding further links from S B , it can find additional support in one of the 9 unused conditional outcomes of S B . The problem: in one part of the search space, the planner might add a link from the first leaf of S B and then add a link from the second leaf of S B . In another part of the search space, the planner might add the links in the opposite order, duplicating the plan. Thus a multiple-support planner not only searches over all of the possible combinations of causal links between S A and S B , but also searches over all of the possible orders for adding these links to the plan. In this example, a multiple support planner might identify as many as 10! = 3628800 plans that differ only in the order that the links were added from S B to S A .7

6. The “confrontation” threat resolution mechanism in Buridan also greatly increases the number of planning decisions. When a threat is resolved via confrontation, additional safety (pre)conditions are added to the consumer of the threatened causal link and additional postconditions are added to the nonthreatening effects of the threatening step. Buridan decreases the probability of the threatening effects of the threat by adding causal links to safety conditions and to the preconditions that tend to increase the probability of those safety conditions on the threat. This additional structure increases the number of decisions that Buridan must make during the planning process, increasing the size of the search space.

A single-support planner can add links in a more systematic fashion because of Single Support for Preconditions. Each precondition can only have one source of support – once a link is added to the open precondition, it is illegal to add another link to that same precondition. Support for open preconditions can be identified systematically without backtracking.

Redundant support:

It may be difficult to determine when additional support is redundant in a multiple-link planner. For example, say that the step Flip-Coin has two mutually exclusive and collectively exhaustive effects: Coin-Face = heads and Coin-Face = tails , each occurring with probability 0.5. The Buridan planner might add links from two separate instances of the Flip-Coin step in order to attempt to increase the probability that Coin-Face = heads is true. However, unless the execution of the second Flip-Coin step is made contingent on the first (cheating in a non-contingent planner), the second Flip-Coin step will erase any contribution from the first Flip-Coin step. The last Flip-Coin step executed always establishes the face of the coin that is showing.8

7. This is definitely true for the current Buridan implementation. It may be possible to impose some discipline on the order that links are added to a plan in order to increase the systematicity of the search space.


A single-support planner will never make this mistake. Since the result of the Flip-Coin step does not depend on the previous state of the coin, the step does not have Coin-Face as a precondition and only one Flip-Coin step can be used to establish the goal.9

3.2.3 Effectiveness

The second insight underlying the design of UDTPOP is the design of a mechanism for pruning irrelevant steps when conditional distributions are deterministic or nearly deterministic. Most steps comprising a probabilistic domain contain conditional effect distributions that are relatively sparse. Many steps are ‘designed’ to do nothing if their preconditions are not satisfied. For example, if we wish to command a robot to stack one block on top of another, we might require that 1) we are holding one of the blocks, 2) the second block is clear, 3) the robot is powered up, etcetera. If any of these conditions are false, the robot does nothing and passively preserves the conditions that were true before the step was “executed.” In UDTPOP, I require that each step in a plan be effective – there should be some non-zero probability that each step can change the state of the world in some way that contributes to the overall goal of the plan. There are three ingredients for effectiveness:
1. Each step must have a conditional outcome (called an effective conditional outcome) that can change the state of the world.
2. This conditional outcome must be possible, that is, the joint probability of the precondition values in the trigger for this conditional outcome must be greater than zero.
3. The value of the conditional outcome should either help the plan accomplish the goal or influence the value of a step cost function (pertinence).

8. It is easy to modify Buridan so that it recognizes when distributions are complete. This, unfortunately, does not solve the problem. Dependencies in the preconditions between multiple supporting steps can introduce correlations that make it impossible to accomplish the desired goal using some of the causal links in the plan. The general problem of determining whether a thin causal link can support a precondition in any situation is NP-complete.
9. It doesn’t need to have Coin-Face as a precondition since there is no possibility that the step will preserve the previous state of the coin.

3.2.3.1 Possibility

A conditional outcome is possible when there is some completion of the partial plan in which the probability of its trigger, c , is greater than zero.

Definition 2 (Possible): A precondition or effect value is possible if its probability is greater than 0.0.

3.2.3.2 Pertinence

A step is pertinent10 to an open precondition if that step has an effect that makes it possible for the plan to achieve its objective. For example, say that our goal is to write a four page paper. So far, we have assembled a partial plan consisting of one write step and wish to determine which of the preconditions of write should be satisfied in order to construct an efficient plan. Recall that a single write step can add 0 or 1 pages of written material to a paper and that the preconditions for write are Pages and Alert . In order for the final write step to achieve the objective of the plan, it must be the case that we have three or four pages of written material immediately before this final write step executes. A step S A is pertinent to the open precondition Pages of the final write step if Pages = 3 or Pages = 4 are possible effects of S A .

10. “Pertinence” is used to describe this relationship rather than the more obvious term “relevance.” Relevance can be confused with probabilistic relevance which will be used extensively in Chapters 4.0 and 5.0.

The set of conditional outcomes that are pertinent is contingent on the set of conditional outcomes that are possible. Say that it is known that the step immediately before the final write step in the example above results in three pages of written material with certainty. Then it is no longer possible to achieve the final objective (4 pages) by using a passive conditional outcome of write; the write step must produce 1 page of written material.

This, in turn, implies that the writer (Harvey) must either be “conscious” (he produces 1 page of writing with probability 0.8) or “in-the-zone” (he produces 1 page of writing with probability 1.0). The set of pertinent values for Alert is a function of the set of possible values for Pages . A precondition value is pertinent if the precondition supports a value function and adding the step that supports that precondition value makes it possible to achieve better than the worst possible value for the goal. A plan should make it possible to at least partially achieve the goal, otherwise there is no purpose in pursuing that plan.
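The regression described here can be sketched mechanically. The code below is a hypothetical encoding (function names and the explicit page range are mine) of the Table 3 effect of write; it computes which (Alert, Pages) precondition values are pertinent to the four-page goal.

def write_effect(alert, n):
    """Table 3, restated: distribution over Pages(S write+) given the preconditions."""
    if alert in ("asleep", "wired"):
        return {n: 1.0}
    if alert == "conscious":
        return {n: 0.2, n + 1: 0.8}
    return {n + 1: 1.0}                      # in-the-zone

def pertinent_preconditions(goal_pages, alerts, page_range):
    """A precondition value is pertinent if some conditional outcome with
    nonzero probability reaches a goal-satisfying number of pages."""
    pert = set()
    for alert in alerts:
        for n in page_range:
            if any(p > 0 and pages in goal_pages
                   for pages, p in write_effect(alert, n).items()):
                pert.add((alert, n))
    return pert

alerts = ["asleep", "conscious", "in-the-zone", "wired"]
print(pertinent_preconditions({4}, alerts, range(5)))
# Pages = 4 is pertinent for every alertness level (a passive outcome suffices);
# Pages = 3 is pertinent only when Harvey is conscious or in-the-zone.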

Definition 3 (Pertinence): A precondition value x is pertinent if the precondition supports a value function in the plan and setting the precondition to x makes it possible to achieve a value that is at least as high as the worst possible value in the reward utility function.

Definition 4 (Pertinent Step): A step is pertinent if it has a conditional outcome that can support a pertinent precondition.

3.2.3.3 Effectiveness

The conditional outcomes for each step are divided into two classes: conditional outcomes which change the state of the world (effective conditional outcomes) and conditional outcomes that persist the state of the world (persistent conditional outcomes).

Definition 5 (Effective Conditional Outcome): An outcome <C = c, E = e, p> is effective with respect to variable A if either
• A ⊂ E but A ⊄ C , or
• ∃i, j such that A = E i = C j and e i ≠ c j .

Definition 6 (Passive Conditional Outcome): The conditional outcome <C = c, E = e, p> is passive with respect to variable A if ∃i, j such that A = E i = C j and e i = c j .
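The following minimal sketch (a hypothetical representation of a conditional outcome as a (trigger, effect) pair of variable-to-value mappings; not the dissertation's data structure) checks Definitions 5 and 6 for a single variable.

def is_effective(outcome, var):
    """Effective w.r.t. var if var is an effect but not a trigger variable of the
    conditional outcome, or if its effect value differs from its trigger value."""
    trigger, effect = outcome
    if var not in effect:
        return False          # the outcome does not touch var at all
    if var not in trigger:
        return True           # A in E but not in C
    return effect[var] != trigger[var]

# dry-gripper outcomes from Section 2.6.5:
print(is_effective(({"Dryness": "wet"}, {"Dryness": "dry"}), "Dryness"))   # True  (effective)
print(is_effective(({"Dryness": "dry"}, {"Dryness": "dry"}), "Dryness"))   # False (passive)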

It is possible during planning to add steps that make subsequent steps superfluous. For example, in the example above, the final write step is superfluous if the step executed immediately before write can guarantee 4 pages of writing. In order to prevent UDTPOP from exploring or completing one of these provably inferior plans, UDTPOP adds a constraint that each step be effective: each step must have at least one effective conditional outcome that is both possible and pertinent.

Definition 7 (Effective Support): In order for a step S E to provide effective support to another step S C , there must exist some possible effective conditional outcome of S E that supports a pertinent precondition of S C .11

completion of a partial plan. I will use the notation “ effective(S E →S C) ” to denote the conV

straint “support for causal link S E →S C should be effective.” The opposite of effective support is passive support. During threat resolution, UDTPOP may require that a step provide passive support to persist desired values of a threatened causal link. In order to enforce this constraint, UDTPOP uses another kind of constraint V

called a persistence constraint, denoted persist(S E →S C) .

11. Kambhampati proposes a similar constraint on effectiveness for his multi-contributor planner [Kambhampati, 94b]. A step is pruned if it cannot provide ‘effective’ support for at least one of the plan completions implied by the multi-contributor structure. Effectiveness in a UDTPOP plan, on the other hand, is a function of the multiple possible world ‘trajectories’ that might result from the execution of a plan. UDTPOP prunes a step if it cannot provide effective support for at least one of these ‘possible world trajectories.’

50

Definition 8 (Passive Support): In order for a step S E to provide passive support to

another step S C , there must exist some possible passive conditional outcome of S E that supports a pertinent precondition of S C . A simple constraint engine enforces effective and persistence constraints.

EXAMPLE

The effectiveness and persistence constraints can restrict planning options considerably in domains with goal-like value functions and steps with relatively few effective conditional outcomes. Figures 17 through 20 illustrate the import of effectiveness constraints on step selection in a simple robot navigation domain (Figure 17). The nodes and arcs on this graph denote cities and roads. Our objective is to derive a sequence of “Go-XXX” actions that move the robot to city a. The reward function is 10 if the robot ends up at a and is 0 otherwise. N

a (Goal) b c d

FIGURE 17. A very simple navigation domain. a , b , c , and d are cities connected by a single road.

One of the operations that our robot can execute is Go-North. This step moves the robot north one city at a time until it can go no further. Figure 18 illustrates the effect of this step; the robot’s location is advanced one step north unless the robot is already in City a .

FIGURE 18. A “state transition” diagram for the “Go North” step. Go-North moves a robot north until it can’t go any farther. In this domain, Go-North moves the robot to City c if it starts in d , to City b if it starts in c and to City a if it starts out in b . If the robot is already in a , it stays in a .

A simple partial plan comprised of three Go-North steps is shown in Figure 19. The large triangular region in this figure captures the constraint that the steps be pertinent to the goal Loc = a .

The barred parallelograms denote the states that have to be possible in order for

Go-North 3 and Go-North 2 to be effective. The intersection of these three regions is the set of states that must be possible in order for these steps to be effective.

FIGURE 19. Constraints.

The intersection of these constraints implies that go-north 1 must be able to achieve Loc = b

or Loc = c . If there were to be a step prior to go-north 1, then that step should

have Loc = c or Loc = d in its effects. This implies, for example, that we cannot add a fourth go-north operation because d is not one of that step’s effects.

This makes sense:

if we can start no more than three cities south of A , it never makes sense to execute four

go-north steps. In such a plan, one of the go-north steps (the last one) would be ineffective, persisting the location of the robot rather than moving the robot north. In fact, the effectiveness criterion alone provides pruning even if the robot’s goal is to end up in any city (see Figure 20).

FIGURE 20. Effectiveness. Even if our goal is to get to any city, the effectiveness constraints still prevent the planner from adding a step that doesn’t support d .
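The pruning argument above can be sketched as a simple regression. The code below is a deliberately simplified illustration (it tracks only the tightest chain of start locations, under assumed names and the Figure 18 transitions); it still reproduces the conclusion that a fourth go-north step can never be effective, because no Go-North outcome produces Loc = d.

GO_NORTH = {"d": "c", "c": "b", "b": "a", "a": "a"}   # Figure 18 transitions

def effective_preconditions(targets):
    """Locations from which Go-North has an effective outcome (it actually
    moves the robot) that lands in the target set."""
    return {loc for loc, dest in GO_NORTH.items() if dest != loc and dest in targets}

targets = {"a"}                               # pertinent to the goal Loc = a
for label in ("go-north 3", "go-north 2", "go-north 1", "a step before go-north 1"):
    targets = effective_preconditions(targets)
    print(f"{label} effective from: {sorted(targets)}")
# go-north 3 effective from: ['b']
# go-north 2 effective from: ['c']
# go-north 1 effective from: ['d']
# a step before go-north 1 effective from: []   <- a fourth go-north is pruned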

The Appendix shows that every step must provide effective support in the optimal plan:

Theorem 5 (Effective Support is Necessary for Plan Optimality): If plan P is optimal, then every step in P (except for S IC and S Goal ) provides effective support.

Proof: See Appendix A.1.

Definition 9 (Effective Plan): If every step in a plan provides effective support to another step, then we will say that the plan is an effective plan.

3.3 UDTPOP

In the next 6 sections, we will describe the UDTPOP algorithm.
• This section will outline the top level design of the planning algorithm.
• Section 3.4 illustrates the planning algorithm using an example from Stress World.
• Section 3.5 describes the details of effectiveness, pertinence, and possibility calculations.
• Section 3.6 describes how to calculate the expected value for complete plans and how to calculate bounds on the expected value of partial plans.
• Section 3.7 describes the formal properties of UDTPOP.
• Finally, section 3.8 benchmarks the performance of UDTPOP against the performance of Buridan on a variety of domains.

UDTPOP solves the following problem: given a planning problem <R, A, S IC> , where R is the reward utility function, A is the set of allowable steps, and S IC is the distribution over all of the variables in the domain immediately before the execution of any plan steps, UDTPOP finds a partial plan, assembled from the steps in A , that maximizes the expected value of the reward utility function R . UDTPOP returns no plan if there is no plan that can result in an outcome better than the worst outcome in the reward utility function.

At the top level, the design of UDTPOP is similar to other causal link planners. complete-plan starts with an empty plan consisting of only an initial conditions step and goal step. UDTPOP incrementally completes this plan by repairing flaws (open conditions and threats) in the plan’s causal structure. When UDTPOP can find no further flaws in the plan, UDTPOP returns both the plan (a partially ordered sequence of steps) and a set of causal links that capture the set of cause-and-effect relationships required to compute the plan’s expected value. A sketch of the algorithm is shown below.

Complete-Plan ( P , Flaws )
  if ( constraint_violated(P) ) return ∅
  else if ( Flaws = ∅ ) return P
  else
    Choose a flaw, f , in Flaws .
    1. if ( f is an open condition) either
       1.a. P′ = Add-Step( f , P , Flaws – f )
       1.b. P′ = Add-Link( f , P , Flaws – f )
    2. if ( f is a threat) either
       2.a. P′ = Promote( f , P , Flaws – f )
       2.b. P′ = Demote( f , P , Flaws – f )
       2.c. P′ = Persist-Support( f , P , Flaws – f )
    return P′

All of the flaws in the plan are recorded in a flaw set Flaws . When this set is empty, there are no flaws left in the causal structure of the plan and the plan is complete. constraint_violated

checks the effectiveness constraints in the plan. If any constraint is vio-

lated, the plan is pruned.

3.3.1 Plan Flaws: Open Conditions

UDTPOP establishes explicit causal support for every pertinent precondition in a plan. A variable is a precondition if it conditions a cost function, conditions the reward utility function, or conditions a conditional effect distribution that is used to establish a causal link. A precondition V of step S is open if there is no causal link S E →V S that establishes V. An open condition is denoted by ( →Open S ) , where Open is the open precondition variable and S is the step containing the open precondition variable.

Open conditions can be repaired by adding causal links to the plan, either through add-link or add-step.

3.3.2 Plan Flaws: Threats

FIGURE 21. Threats. S T threatens S E →V S C . If a step (other than the establisher) can modify the variable protected by a causal link, then that step threatens the commitment denoted by that causal link and invalidates the underlying causal model.

Definition 10 (Threat): A step S T is said to threaten a causal link S E →V S C if
1. S T can possibly occur between S E and S C , and
2. V is an effect variable of S T .

The notation S ⊗ L is used to represent a threat to causal link L from step S . Threats are resolved using promotion, demotion, and a variant on Buridan’s confrontation operator, persist-support.
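A minimal sketch of this test (hypothetical representation: ordering constraints as a set of (earlier, later) pairs and a link as an (establisher, consumer, variable) triple; step names are assumed): a step threatens a link if the orderings do not force it outside the establisher-consumer window and it has the protected variable as an effect.

def precedes(x, y, orderings):
    """True if the ordering constraints force x before y (transitive closure)."""
    frontier, seen = [x], set()
    while frontier:
        cur = frontier.pop()
        for a, b in orderings:
            if a == cur and b not in seen:
                if b == y:
                    return True
                seen.add(b)
                frontier.append(b)
    return False

def threatens(step, link, effect_vars, orderings):
    """Definition 10: step can possibly execute between S_E and S_C and has V as an effect."""
    s_e, s_c, var = link
    possibly_between = (not precedes(step, s_e, orderings)
                        and not precedes(s_c, step, orderings))
    return step not in (s_e, s_c) and var in effect_vars[step] and possibly_between

orderings = {("S_IC", "S_W1"), ("S_W1", "S_Goal"), ("S_IC", "S_W2"), ("S_W2", "S_Goal")}
print(threatens("S_W2", ("S_W1", "S_Goal", "Pages"),
                {"S_W2": {"Pages", "Alert"}}, orderings))   # True: S_W2 is unordered w.r.t. S_W1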

3.3.3 Adding Support: Add-Step and Add-Link

Add-Step and Add-Link add new causal links to plans in order to repair open conditions. Add-Step selects a step (an action schema) from the set A of possible actions and copies it into the plan in order to establish support for an open precondition. Add-Link uses an effect variable from a step that is already in the plan in order to establish support. Both Add-Step and Add-Link use Add-Support to do all of the work.

Add-Step( oc = ( →Open S C ) , P = <S, O, L, K> , Flaws )
  if there exists a step S E in the domain description that has Open in its effect variables,
    return Add-Support( oc , S E , <S + S E, O, L, K> , Flaws )
  else return ∅ .

Add-Link( oc = ( →Open S C ) , P = <S, O, L, K> , Flaws )
  if there exists a step S E in S such that S E is possibly before S C and Open is an effect variable of S E ,
    return Add-Support( oc , S E , P , Flaws )
  else return ∅

Add-Support( oc = ( →Open S C ) , S E , P = <S, O, L, K> , Flaws )
  if (there is a conditional outcome ce = <C = c, E = e, p> of step S E such that
      1. ∃j such that Open = E j ,
      2. e j is pertinent for precondition Open of step S C ,
      3. c is possible, and
      4. ce is effective for Open.)
  then {
    let K new := effective(S E →Open S C )
    let P′ := <S, O + ( S E < S C ), L + ( S E →Open S C ), K + K new>
    let Flaws′ := Prune(Flaws) ∪ newOCs(S E, Open, P′) ∪ newThreats(P′)
    return Complete-Plan( P′ , Flaws′ ) }
  else return ∅

Each call to add-support adds:
• a causal link, S E →Open S C ;
• an ordering constraint, S E < S C , that forces the establisher of the causal link to occur before the consumer; and
• a constraint, effective(S E →Open S C ) , that forces the step to provide effective support for the causal link.

The function newOCs adds new open conditions. A precondition variable is added to the open conditions list if it has not been established by any causal link and either: 1) is the precondition of a conditional effect distribution that is used to establish a causal link, 2) is the precondition of a step cost function, or 3) is a precondition of the reward utility function. The function newThreats discovers all of the new threats in the plan. Threats arise from two sources. When a new step is added, it can threaten existing causal links. When a new causal link is added, existing steps might threaten that new link. newThreats uses the threat definition to identify these new threats. When ordering constraints are added to the plan, it may no longer be possible for a step to threaten a causal link. The function prune removes threats that are resolved by the addition of ordering constraints.

3.3.4  Resolving Threats: Promote and Demote

If S_T threatens causal link S_E →V S_C, then it is possible for S_T to be executed between the times that S_E and S_C are executed. One standard method for resolving this threat is to require that S_T be ordered to execute either after S_C (promotion) or before S_E (demotion).


FIGURE 22. Promotion. Promotion resolves a threat by forcing the threat to execute after the causal link.

Promotion( T = S_T ⊗ ( S_E →V S_C ), P, Flaws )
  If ( S_C < S_T is possible ) {
    O′ := O + ( S_C < S_T )
    return Complete-Plan( P with ordering constraints O′, Prune(Flaws) )
  } else return ∅.

FIGURE 23. Demotion. Demotion resolves a threat by forcing the threat to execute before the causal link.


Demotion( T = S_T ⊗ ( S_E →V S_C ), P, Flaws )
  If ( S_T < S_E is possible ) {
    O′ := O + ( S_T < S_E )
    return Complete-Plan( P with ordering constraints O′, Prune(Flaws) )
  } else return ∅.

As in add-support, any time that ordering constraints are added to the plan, additional threat flaws may disappear.
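Continuing the sketch above (same invented representations), promotion and demotion reduce to adding a single ordering constraint when doing so is consistent, followed by pruning any threats that the new constraint resolves. This is only a sketch of the control flow, not UDTPOP's code.

# Sketch continued: promotion and demotion as ordering-constraint additions.
def consistent(orderings, a, b):
    """Adding a < b is possible iff b < a is not already forced."""
    return not must_precede(orderings, b, a)

def promote(threat, orderings, threats):
    """Force the threatening step after the consumer of the threatened link."""
    s, (est, var, con) = threat
    if not consistent(orderings, con, s):
        return None                       # promotion impossible for this threat
    new_orderings = orderings | {(con, s)}
    return new_orderings, prune(threats, new_orderings)

def demote(threat, orderings, threats):
    """Force the threatening step before the establisher of the threatened link."""
    s, (est, var, con) = threat
    if not consistent(orderings, s, est):
        return None                       # demotion impossible for this threat
    new_orderings = orderings | {(s, est)}
    return new_orderings, prune(threats, new_orderings)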

3.3.5  Resolving Threats: Persist-Support

Persist-Support is similar in intent to the confrontation operation used in Buridan and UCPOP [Penberthy+Weld, 92]. Persist-support resolves the threat by making it impossible for the threat to change the state of the protected variable. It does this by redrawing the plan's causal link structure so that the effect protected by the link passes through the passive conditional outcomes of the threatening step.12 Causal support can only be persisted by a threat if the threat has passive conditional outcomes that can serve as a "tunnel" to carry the desired effect of the establisher of the threatened link to the consumer via the threat. Since these passive conditional outcomes are now pertinent, UDTPOP will take steps to increase the probability of these conditional outcomes, effectively "widening the tunnel" (Figures 24 and 25).

12. In this respect, UDTPOP's persist-support operation is more similar to the confrontation operation of UCPOP. UCPOP adds ordering constraints that force the threat to occur between the establisher and consumer of the threatened link. Buridan's confrontation, on the other hand, does not constrain the order of the establisher with respect to the threat.


FIGURE 24. Before Persist-Support. The threatening step S_T must have passive conditional outcomes for V.

FIGURE 25. After Persist-Support. The threatened causal link is 'threaded' through the 'tunnel' formed by the passive conditional outcomes of S_T. Constraints ensure that the tunnel can possibly persist the state of V for pertinent values of V. Optional constraints ensure that S_T has no effective conditional outcome that can influence a pertinent value of V for the new causal link S_T →V S_C.


Persist-Support( T = S_T ⊗ ( S_E →V S_C ), P, Flaws )
  if (there exists a conditional outcome ce = ⟨c, e⟩ in S_T such that
      ∃j such that E_j = V,
      ce is a passive conditional outcome for V,
      c is possible, and
      e_j is a pertinent value for V)
  then {
    Let L′ := L – ( S_E →V S_C ) + ( S_E →V S_T ) + ( S_T →V S_C )
    O′ := O + ( S_E < S_T ) + ( S_T < S_C )
    K_new := persist(S_T →V S_C)13
    let P′ := P with L′, O′, and K_new
    Flaws′ := Prune(Flaws) ∪ NewOCs(S_T, V, P′)
    return Complete-Plan( P′, Flaws′ )
  } else return ∅

It is possible for persist-support to redraw the structure of the plan so that two causal links are providing support to the same precondition variable of S T . This is a temporary state: if the causal links supporting S T originate in two different steps,14 each step will threaten the other’s support of S T . Resolving these threats resolves the dual support problem: only one causal link will support S T .

3.4  Example

In this section, we will illustrate some of the principles of UDTPOP using a simple example drawn from Stress World. Harvey's objective is to write a short paper (1 to 2 pages). One written page is worth $10, 2 pages are worth $20, and 0 pages are worth $0. In the initial state, Alert = conscious and Pages = 0. The initial plan consists of an initial-conditions step with two conditional effects capturing the initial conditions, Alert = conscious and Pages = 0, and a goal step capturing the reward utility function.

13. The systematicity of UDTPOP can be improved by adding a constraint that S_T not provide effective support for S_C. Currently, UDTPOP's constraint engine cannot enforce the negative constraint ¬effective(S_T →V S_C) (see Section 3.5).

14. If they originate in the same step, both causal links are actually the same link.

FIGURE 26. Initial Plan: The initial plan for the example has only two steps, S_Goal and S_IC.

Step 1: Use Add-Step to repair open condition ( →Pages S_Goal )

The only flaw to repair in the initial partial plan is the open condition ( →Pages S_Goal ). In order to achieve one of the desirable effects of S_Goal with some probability, we will need to support Pages using a new step (the effect of S_IC would result in a payoff of $0). Add-Step inspects the steps available to the planner and discovers that write has at least one conditional outcome that can change the state of the world to a state that is rewarded by the reward function. Add-Step adds

• a new write step, S_W1, to the plan;
• a causal link, S_W1 →Pages S_Goal;
• a constraint that guarantees that write provides effective support for the goal; and
• two new open conditions, ( →Pages S_W1 ) and ( →Alert S_W1 ).


FIGURE 27. The example after adding SW1.

Step 2: Use Add-Step to repair open condition ( →Pages S_W1 )

We will choose to use add-step to add another write step to establish support for ( →Pages S_W1 ) (we could have used add-link to establish support from the initial conditions). In order for S_W1 to be effective, the new write step must be able to establish that Pages = 0 or Pages = 1. There are effective conditional outcomes in write that can accomplish at least one of these goals, so write is a pertinent step for ( →Pages S_W1 ).

FIGURE 28. The example after adding SW2.


Step 3: Use Add-Link to repair ( →Pages S_W2 )

The next open condition that we will attack is ( →Pages S_W2 ). If we link S_IC to S_W2, then it is still possible for
• Pages to be 1 or 2 in the final reward function, and
• Pages to be 0 or 1 for the precondition of S_W1.
We can, therefore, add the causal link S_IC →Pages S_W2.

FIGURE 29. The example after adding a link from SIC to SW2.

Step 5: Use Add-Link to repair ( →Alert S_W1 )

In a similar fashion, we resolve the open condition ( →Alert S_W1 ) by adding the causal link S_IC →Alert S_W1.

FIGURE 30. The example after adding a link from SIC to SW1.


Step 6: Resolve the threat S_W2 ⊗ ( S_IC →Alert S_W1 ) using Persist-Support

S_W2 threatens S_IC →Alert S_W1. This threat cannot be resolved by promotion or demotion because S_W2 is forced to occur between S_IC and S_W1. The only threat resolution operation is persist-support. We can use persist-support if it is possible for S_W2 to execute without changing the value of Alert. One of the conditional outcomes of write is <( Alert = conscious ), ( Alert = conscious )>, so this is possible. Persist-support removes the threat to S_IC →Alert S_W1 by splicing S_W2 into the middle of this causal link, resulting in two new causal links: S_IC →Alert S_W2 and S_W2 →Alert S_W1. The old causal link, S_IC →Alert S_W1 (with its effectiveness constraint), is removed from the plan. Notice that persist-support also repairs the final open condition.

FIGURE 31. The example after persist-support.

At this point, there are no further flaws in the plan–every pertinent precondition is supported by a causal link and there are no further threats. The plan is complete.

FIGURE 32. The final complete plan.


Note that we should be able to derive a better plan if we are allowed to use contingent steps–steps whose execution is a function of the number of pages written thus far and Harvey’s level of alertness. Contingent steps and observations will be discussed in Chapter 5.0.
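For concreteness, the expected reward of the final two-write plan can be computed under a purely illustrative model of the write action. The dissertation does not give write's success probability here, so the probability p below, and the assumption that each write independently adds one page with probability p while Harvey stays conscious, are invented for the example; step costs are ignored.

# Illustrative only: expected reward of the two-write plan, assuming each
# write adds one page with probability p and leaves Pages unchanged otherwise.
def expected_reward(p):
    # Pages after two independent writes: 0, 1, or 2.
    p_pages = {
        0: (1 - p) ** 2,
        1: 2 * p * (1 - p),
        2: p ** 2,
    }
    reward = {0: 0.0, 1: 10.0, 2: 20.0}   # $10 per written page, as in the example
    return sum(p_pages[n] * reward[n] for n in p_pages)

print(expected_reward(0.8))   # ≈ 16.0 when p = 0.8 (step costs ignored)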

3.5  Approximating Effectiveness

Recall (Section 3.2.3) that UDTPOP uses effectiveness constraints to improve the efficiency of planning when plan steps have conditional effect distributions that are deterministic or near-deterministic. The definition of effectiveness relies on two concepts: possibility and pertinence. Possibility and pertinence are defined using probability queries on the model for a complete plan. In this section, I present a tractable technique for computing pertinence and possibility in partial plans that preserves completeness. This definition of pertinence and possibility is used by the constraint engine in both UDTPOP and DTPOP as well as in the completeness proofs for both planners. The approximation does not require the construction and evaluation of a belief network. Instead, pertinence and possibility are estimated by tracing through the nonzero probabilities in the conditional effect distributions.

3.5.1  Possibility

Every value for a precondition variable is assumed to be possible if that precondition is open. Otherwise, a precondition is possible if any of its establishing causal links make it possible. A conditional outcome of a step is possible if all of the precondition values in its trigger are possible.

Definition 11 (PossibleP Precondition): A precondition value V = v in step S is possibleP if:
• V is open, or
• there is a causal link S_E →V S and V = v is a possibleP effect of S_E.15

Definition 12 (PossibleP Effect): An outcome O = o of step S is possibleP if there exists a conditional outcome ⟨c, e⟩ of S containing O = o such that ∀c_j ∈ c, possibleP(c_j).

3.5.2  Pertinence

Definition 13 (PertinentP Effect): An outcome V = v of step S is pertinentP when ∀( S →V S_C ), V = v is a pertinentP precondition value of S_C.

Definition 14 (PertinentP Precondition): A precondition value, V = v, is pertinentP if either:
• V = v is a precondition of a cost function,
• V = v is a precondition of the reward function and there exists a utility outcome conditioned on V = v such that R > R_min, or
• V = v is a precondition of a conditional effect ⟨c, e⟩ (with V = v ∈ c) and e is pertinentP.
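Definitions 11-14 can be read as a pair of recursions over the plan graph: possibility flows forward along causal links from establishers, and pertinence flows backward from the cost and reward functions. The Python fragment below sketches the possibleP recursion only (pertinentP is the symmetric backward pass); the plan encoding, with an outcomes table of (trigger, effect) dictionaries and a links map, is invented for the illustration.

# Sketch (invented representation): the possible_p recursion of Definitions 11-12.
# A conditional outcome is (trigger, effect), both {var: value} dicts restricted
# to nonzero-probability entries; links maps (consumer, var) to its establisher.

def possible_p_precondition(plan, step, var, value):
    """Every value of an open precondition is possible; otherwise the value
    must be a possible_p effect of the establishing step."""
    est = plan["links"].get((step, var))
    if est is None:                                   # open precondition
        return True
    return possible_p_effect(plan, est, var, value)

def possible_p_effect(plan, step, var, value):
    """An effect value is possible_p if some conditional outcome produces it
    and every value in that outcome's trigger is itself possible_p."""
    for trigger, effect in plan["outcomes"][step]:
        if effect.get(var) == value and all(
                possible_p_precondition(plan, step, v, val)
                for v, val in trigger.items()):
            return True
    return False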

We prove in the Appendix (Section 1.2) that if an effect is possible, then it is certainly possibleP. Likewise, if a step is pertinent, it is also pertinentP.

Theorem 6 (PertinentP and Pertinence): In a complete plan, pertinent(e) ⇒ pertinentP(e) and possible(e) ⇒ possibleP(e).

15. V can be established by more than one link before threat resolution.


3.5.3  Effectiveness

Definition 15 (EffectiveP): In order for a step S_E to provide effectiveP support to another step S_C, there must exist some effective conditional outcome of S_E that is possibleP and supports a pertinentP precondition of S_C.

The completeness proof (Sections 3.7.2 and A.4.3) demonstrates that pertinentP and possibleP are sound for determining step effectiveness even when there are threats in the plan. This proof relies on a trick that will be used throughout this chapter: we can place an extra constraint on the persist-support step without losing completeness. The constraint that we will place on threat resolution is the following:

Corollary 6 (Persist-Support Constraint): Say that we wish to resolve the threat S_T ⊗ ( S_E →V S_C ). We can add the following constraint after persist-support without compromising completeness: if ⟨c, V = v⟩ is an effective conditional outcome of S_T and c is possible, then V = v cannot be a pertinentP precondition of S_C.

This is a restatement of the clairvoyant decision policy lemma (Lemma 3) used by the completeness proof. The basic idea behind the proof (the actual proof is quite a bit more complicated than this): suppose that S_T threatens S_E →V S_C. If S_T had an effective conditional outcome that was relevant to precondition variable V of S_C in some completion of the plan, then it would have been possible to use either add-link or add-step to establish the causal link between S_T and S_C. If we had drawn the links in this order when constructing the plan, there would be no threat to resolve. Without losing completeness, we can restrict persist-support so that the threatening step S_T will never contribute active support to S_C.


If this persist-support constraint is used, we can also guarantee that threat resolution will never increase the utility of a partial plan. This fact will allow us to derive an evaluation function (see the next section) for A* search that guarantees that we can always find the best plan in the search space.

3.6  Evaluating Plans

This section discusses three topics: 1) the basic UDTPOP search algorithm, 2) evaluation and model construction of complete plans, and 3) model construction and evaluation of partial plans.

• Section 3.6.1 describes the basic UDTPOP search algorithm.
• Section 3.6.2 presents an algorithm for constructing a belief network for a complete plan and demonstrates that the expected value computed from this belief network is exactly the same as the expected value of the plan.
• Section 3.6.3 presents a model construction algorithm for partial plans that uses interval probabilities to represent open conditions. The upper bound on expected value calculated from this model is an upper bound on the expected value of any essential completion of the partial plan.

3.6.1  Search

UDTPOP uses best-first search (with pruning16) over the space of partial plans in order to identify the plan of highest possible utility (Fig. 33). This routine selects and prunes partial plans according to an upper bound ( UB ) and a lower bound ( LB 17) on the expectation of the best completion of each partial plan. The upper bound is used to select plans to expand: if the upper bound is guaranteed to be at least as large as the expected value of the best completion of that partial plan, then the search algorithm is guaranteed to identify the plan of maximum expected value. One of the central objectives of this section is to identify an evaluation function for partial plans that is a tight upper bound on the utility of the best completion of the partial plan. The optional lower bound is used to prune plans from the best-first search queue. If we can prove that there is at least one plan with an expected value of LB, we can eliminate from consideration all partial plans that have an upper bound on value, UB, that is less than LB.

16. This is the search strategy proposed for Pyrrhus [Williamson, 96].
17. This is not the expected value of the worst plan. LB is the best expected value that can be guaranteed for some completion of the partial plan.

A lower bound can be computed by finding any completion of any partial plan. The upper bound is trickier and will be discussed in Section 3.6.3.

Let GLB := V( P_∅ )                                   (Optional)
Let Open := { P_Initial }
Loop {
  If Open = ∅ then return ∅
  Remove P_best from Open, s.t. P_best has the largest UB.
  If P_best is complete, stop and return P_best.
  If LB(P_best) > GLB then GLB := LB(P_best)          (Optional)
  Let P′ be the set of plans that result from fixing one flaw in P_best.
  Let Open := Open ∪ P′.
  Prune the plans in Open that have an upper bound that is less than or equal to GLB.   (Optional)
}

FIGURE 33. Best First Search. P_∅ is the empty plan consisting of S_IC, S_Goal, and causal links for all of the preconditions of S_Goal. GLB is the largest lower bound on utility.


If pruning is critical (because of memory limitations, for example), it may make sense to set GLB to a higher value to force more pruning (this sacrifices completeness).
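A skeletal rendering of this loop in Python is shown below. The Plan interface (upper_bound, lower_bound, is_complete, refinements) is a placeholder for the UB/LB evaluation and flaw-repair machinery described in this chapter, so this is a sketch of the control structure only, not UDTPOP's implementation.

# Sketch of the best-first search of Figure 33 (Plan interface invented here).
import heapq

def best_first_search(initial_plan, use_pruning=True):
    glb = float("-inf")                      # GLB: best lower bound found so far
    heap = [(-initial_plan.upper_bound(), 0, initial_plan)]
    counter = 1                              # tie-breaker for the heap
    while heap:
        neg_ub, _, plan = heapq.heappop(heap)
        if plan.is_complete():
            return plan                      # UB is admissible, so this is optimal
        if use_pruning and -neg_ub <= glb:
            continue                         # cannot beat a known completion
        if use_pruning:
            glb = max(glb, plan.lower_bound())
        for child in plan.refinements():     # fix one flaw in every possible way
            heapq.heappush(heap, (-child.upper_bound(), counter, child))
            counter += 1
    return None                              # no completion exists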

3.6.2  Model Construction in Complete Plans

Belief networks are used for plan evaluation [Pearl, 87]. The causal links, cost functions and conditional effect distributions in the plan imply a belief network that models all of the essential distributions in the plan. This belief network is comprised of probability nodes representing conditional effect distributions, deterministic nodes18 representing distributions over individual effect variables, and additive subvalue nodes19 [Tatman+Shachter, 90] representing the cost and reward functions. This network is constructed in a fairly straightforward fashion by tracing causal links back from the final reward function and step cost functions in the plan. If there is a causal link protecting X_i between two steps S_Establisher and S_Consumer, then there are one or more arcs from variable X_i(S_Establisher+) to conditional effect distributions or cost functions in S_Consumer.

An algorithm that constructs a belief network from a complete plan is illustrated in Figure 34. There are two global variables: the plan, P, and a belief network M = ⟨N, A⟩ consisting of distributions N and arcs A. The model construction algorithm is started by calling Model_CE on the plan's goal step and reward function (Model_CE( R, S_Goal )). The algorithm includes a conditional effect distribution only if it is a requisite element for calculating the expectation of any of the value functions in the plan [Shachter, 88; 98] (see also Section 4.2).

18.A deterministic node is a probability node that has a probability distribution that consists entirely of 1’s and 0’s. A double oval (two concentric ovals) will be used to denote the deterministic node in a belief network. 19.Value nodes will be denoted by diamonds in belief networks.


Model_CE( CE, Step ) {
  if CE ∉ N then {
    N := N + CE
    for all V ∈ PreVars(CE) {
      let S_E →V Step ∈ L be the causal link protecting V
      for all Cost_E ∈ Cost(S_E), Model_CE( Cost_E, S_E )
      find CE_E ∈ Eff(S_E) such that V ∈ OutVars(CE_E)
      Model_CE( CE_E, S_E )
*     if V(S_E+) ∉ N then {
*       E := EffVars(CE_E)
*       V(S_E+) := the distribution P{ V(S_E+) = v | V = v, E\V } = 1.0
*       N := N + V(S_E+)
*       A := A + ( CE_E, V(S_E+) )
      }
      A := A + ( V(S_E+), CE )
    }
  }
}

FIGURE 34. Model_CE. An algorithm for constructing a belief network from a complete UDTPOP plan.

Each conditional effect distribution of a step may contain more than one effect variable. If a conditional effect distribution contains more than one variable, then the starred lines of Figure 34 establish intermediate deterministic nodes that each represent a single effect variable of the conditional effect distribution. For example, if a distribution over variable V is needed and the belief network contains only a conditional effect distribution over U and V, then the starred lines of Model_CE will add a belief network fragment UV → V that extracts variable V from the joint distribution UV.

Figure 35 illustrates the belief network that Model_CE constructs for the final plan in the example of Section 3.4. Since every conditional effect distribution in StressWorld has a single effect variable, Model_CE does not need to add any deterministic variables to the plan model.
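The extraction node added by the starred lines is simply a deterministic conditional table that copies one coordinate of the joint outcome. A small sketch, using an invented dictionary encoding of the table:

# Sketch: a deterministic node that extracts variable V from a joint outcome UV.
# The joint states are (u, v) pairs; the table assigns probability 1 to the
# matching value of V and 0 to everything else.
def extraction_node(joint_states, v_values):
    table = {}
    for (u, v) in joint_states:
        for v_prime in v_values:
            table[((u, v), v_prime)] = 1.0 if v_prime == v else 0.0
    return table

# Example: UV ranges over {(0, 'a'), (0, 'b'), (1, 'a')}; V ranges over {'a', 'b'}.
print(extraction_node({(0, 'a'), (0, 'b'), (1, 'a')}, {'a', 'b'}))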


FIGURE 35. A Model for a Complete Plan.

Let's step through Model_CE to see how it constructs this plan model.

FIGURE 36. Model_CE(RGoal, SGoal).

Initial Call: Model_CE(RGoal, SGoal), B = <∅, ∅>. B is empty, so Model_CE adds the distribution RGoal to N. Model_CE loops over all of the precondition variables in RGoal looking for causal links. There is only one such variable (Pages), which is established from step SW1 by a distribution over Pages(SW1+). Model_CE recurses, calling Model_CE(Pages(SW1+), SW1) in order to construct a node for Pages(SW1+) so that Model_CE can add an arc between Pages(SW1+) and RGoal.

Model_CE(Pages(SW1+), SW1): Pages(SW1+) is not in the network, so Model_CE adds it. Model_CE then constructs a model for the first precondition of SW1, Pages.


FIGURE 37. Model_CE(Pages(SW1+), SW1).

Model construction continues in a similar, depth-first fashion until Model_CE encounters a node with no preconditions. At this point, the stack begins to unwind: as each call to Model_CE exits, Model_CE adds an arc between the newly constructed node and the node in the argument to Model_CE. If precondition Pages is always before Alert in the preconditions list, the sequence of model construction actions will evolve as shown in Figures 37 and 38.


FIGURE 38. Trace of Model_CE. (Panels: Model_CE(Pages(SW2+), SW2); Model_CE(Pages(SIC+), SIC); Model_CE(Alert(SIC+), SIC); return from Model_CE(Pages(SW2+), SW2).)


FIGURE 39. Trace of Model_CE continued. (Panels: Model_CE(Alert(SW2+), SW2); Model_CE(Alert(SIC+), SIC); return from Model_CE(Alert(SW2+), SW2); return from Model_CE(Pages(SW1+), SW1).)


FIGURE 40. Trace for Model_CE completed. (Final panel: return from Model_CE(RGoal, SGoal).)

3.6.3  Model Construction in Partial Plans

The goal of model construction in partial plans is to derive an upper bound on the utility of the partial plan so that any completion of the partial plan will have utility that is less than or equal to this bound. When the upper bound has this property, we can guarantee that the best possible plan can be found by using this upper bound to guide best-first search. In Section 3.5, we described a technique for discarding sections of the search space without compromising completeness. If we can use this persist-support constraint to discard a partial plan, we will say that the completions of that partial plan are not essential. An upper bound is admissible if the upper bound on the utility of each partial plan is at least as large as the expected value of any of the essential completions of that partial plan. In order to derive this admissible upper bound on expected value, we will need to carefully study the effect of threats and open conditions on an evolving model of the plan.

The next three sections show how UDTPOP estimates the upper bound on expected value for a partial plan. In order to do this, we will need to be able to model the effect of threats and open conditions on the expected utility of the partial plan. Modeling open conditions appears to be easy: we can replace all of the open conditions with decisions and identify the best possible combination of decision values to maximize the expected value of the completed portions of the plan. Section 3.6.3.1 demonstrates that this intuitive approach is wrong! Section 3.6.3.2 reviews a probability interval calculus developed by Draper [96] and demonstrates that probability intervals are, in fact, the correct method to use for modeling open conditions. Section 3.6.3.3 describes how to properly model and evaluate partial plans with threats.

3.6.3.1 Modeling Open Conditions

One plausible way to model open conditions is to model them as decisions. This seems reasonable because, after all, if we could choose the optimal value for each of these open condition decisions, then we should be able to realize an expected value that is higher than if these decision variables were replaced by any other joint probability distribution. This (incorrect) argument is illustrated in Figure 41. Say that the set of open conditions is O and that the expected utility of the plan given any value for O is V{O}. It is always the case that max_O V{O} ≥ ∑_O P{O} V{O}.20 Furthermore, the contribution due to the cost functions of each new step can only reduce the expected utility: only the UDTPOP reward function can have a utility that is greater than zero. It seems plausible, then, that the expected value of every completion of the partial plan must be less than or equal to max_O V{O}. This argument, however, is incorrect. What is the flaw with this argument?

20. The second expression is a convex combination of several conditional utilities, the largest of which is equal to max_O V{O}; therefore the sum cannot be larger.


FIGURE 41. A fallacious argument for using decisions for representing open conditions. The figure on the left represents the model for a partial plan. Open conditions are all replaced by decisions. By picking appropriate values for these decisions, we can always achieve a utility that is at least that of replacing the decisions by any joint distribution (including that represented by a plan) over the open conditions.

The problem with this argument is this: the add-link operation can be used to link the preconditions of steps to other steps that already exist in the plan. This, in effect, can allow the value for the precondition decision, O_i, to be conditioned on the outcomes of existing steps in the plan. A simple example will illustrate why the naive decision approach fails. Suppose that our goal is to find a plan that sets binary variable A equal to the value of binary variable B. The reward function is 1 if binary variable A is equal to binary variable B and is zero otherwise. There is one action schema in this problem domain, S_A. S_A has no precondition and has two conditional outcomes, each with probability 0.5; in both outcomes A and B receive the same value (A = B = 0 in one, A = B = 1 in the other). The cost for S_A is 0.


FIGURE 42. A simple partial plan that breaks the 'straw' model construction algorithm.

The partial plan in Figure 42 illustrates why the decision approach will not work. If we substitute a decision D for the open condition B on SGoal, then we compute an expected value of 0.5 regardless of the value for D (because of the uncertainty in the outcome of S_A). But note that add-link can link precondition variable B of SGoal to the B outcome variable of S_A. This completion of the plan has an expected value of 1.0, double that computed by replacing B with a decision (Figure 43).

FIGURE 43. Using SA to support both preconditions maximizes utility. Since A is always equal to B in SA, using SA to support both preconditions of SGoal results in a plan that has an expected utility of 1.0.
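The gap between the two models can be checked directly: under the straw model, B is chosen before the coin flip that sets A, so the best achievable expectation is 0.5, while the add-link completion makes A and B perfectly correlated and achieves 1.0. A small sketch (the dictionary encoding is invented here):

# Sketch: why substituting a decision for the open condition is not an upper bound.
# S_A flips a fair coin and sets A (and, in the completed plan, also supplies B).
P_A = {0: 0.5, 1: 0.5}

# Straw model: B is a decision chosen without seeing the outcome of S_A.
straw_value = max(sum(P_A[a] * (1.0 if a == b else 0.0) for a in P_A) for b in (0, 1))

# Completed plan: the same S_A outcome supports both preconditions, so A == B always.
completed_value = sum(P_A[a] * 1.0 for a in P_A)

print(straw_value, completed_value)   # 0.5 1.0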

3.6.3.2 Modeling Open Conditions Using LPE

The right way to model open conditions21 is to use interval probabilities to model the plan. Localized partial evaluation (or LPE) [Draper, 96] is a family of techniques for approximating multiply-connected belief networks. Using interval arithmetic, LPE computes bounds on the joint and conditional probability distributions for subgraphs (called active sets) of belief networks. The bounds achieved using only the active sets are guaranteed to contain the joint or conditional distribution that would result from the evaluation of the complete belief network.

Figure 44 and Figure 45 illustrate the application of LPE to a simple inference problem. The belief network on the left in each figure represents the original multiply-connected belief network. The middle figure highlights the active set that will be used to approximate the belief network. The right figure illustrates the interval belief network that the LPE algorithm constructs to bound the probability distributions. The contribution that the "nonactive" set makes to the active set is summarized using a set of vacuous probability distributions (distributions that contain probabilities that are bounded only to be anywhere between zero and one). For every arc that leaves the active set, LPE adds a vacuous likelihood distribution that is conditioned on the node at the tail of the deleted arc. For every arc that enters the active set, LPE adds a vacuous prior distribution with the same number of states as the deleted node. Given any belief network and any active set, the LPE algorithm computes a set of bounds on any probability query that is guaranteed to contain the value of the query that would be computed using the full belief network.

In Figure 44, we illustrate the effect of approximating the belief network on the left with the active set in the middle. This active set consists only of the nodes R and B. The space of all possible contributions from the belief network outside the active set is summarized by two vacuous prior distributions representing the missing predecessor nodes, A and C. Say that we want to find a bound on P_left{ R = r }.22 The probability of this query is bounded above and below by the LPE bound computed from the interval belief network at the right of this diagram, that is, P_left{ R = r } ∈ P_right{ R = r }.

21. Actually, given a topological sort of the plan, it is possible to draw informational arcs from each of the effects in the partial plan to each of the open condition decisions that can possibly occur after that effect. In order to compute an upper bound, we would need to evaluate one of these influence diagrams for every topological sort of the steps that either contain an open condition or are possibly before a step containing an open condition.


FIGURE 44. LPE Example I.

Figure 45 is an example using a different active set. This active set is missing only one arc (the arc between A and C). The contributions from A to C and from C to A are summarized using a vacuous prior and a vacuous likelihood distribution, respectively.

22. In this section, we will use a subscript on P to denote the belief network (not the active set) that is used to compute the query; for example, we might write P_A{ X | Y } to denote the conditional probability distribution over X given Y using belief network A. When the active set is a subset of the full belief network, P_A{ X | Y } is a probability interval. We will use subset notation to compare probability intervals of interval or exact belief networks. P_1 ∈ P_2 means that exact distribution P_1 is contained inside the interval distribution P_2. P_1 ⊆ P_2 means that the interval distribution for P_1 is contained inside P_2. When we need to refer to the upper bound of a probability interval for a query x, we will write it with an overbar (P̄_A{ x }); the corresponding lower bound is written with an underbar (P̲_A{ x }).


FIGURE 45. LPE Example II.

It is not necessary to know the details of LPE in order to understand the remainder of the chapter. The following points, though, are important:

• If an arc leaves the active set in the LPE approximation, we replace that arc with an arc to a vacuous likelihood distribution.
• If an arc enters the active set, we replace the arc with an arc from a vacuous prior distribution.
• Adding a vacuous likelihood λ′( . | X ) = [0, 1] to node X in interval belief network α results in a set of bounds over probabilities that is at least as large as the bounds computed from the original belief network: P_α{ X | Y } ⊆ P_[α + λ′ + (X → λ′)]{ X | Y }.
• If there is a likelihood λ′( . | X ) = [0, 1] in belief network α, then P{ X } = [0, 1] (a vacuous distribution). [Draper, 96]
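To convey the flavor of these bounds without reproducing Draper's algorithm, consider the simplest case of a single node R whose only parent A carries a vacuous prior. The expectation of any query on R then ranges over the convex hull of the rows of R's conditional table, so the bound is just a min/max over the parent's states. The sketch below handles only this one-vacuous-parent special case and is not LPE itself.

# Sketch: bound on P{R = r} when the prior over parent A is vacuous ([0, 1] on
# every state).  P{R = r} then lies between the smallest and largest row values.
def vacuous_parent_bound(cpt, r):
    """cpt[a][r] = P{R = r | A = a}; returns (lower, upper) on P{R = r}."""
    values = [cpt[a][r] for a in cpt]
    return min(values), max(values)

cpt = {"a0": {"r0": 0.9, "r1": 0.1},
       "a1": {"r0": 0.2, "r1": 0.8}}
print(vacuous_parent_bound(cpt, "r0"))   # (0.2, 0.9)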

Now, let's get down to the business of using this machinery to model open conditions. Our strategy will be to assume that if a precondition is open, then any step that can possibly precede the precondition might conceivably form the root of a chain of steps that eventually will support the open precondition. The partial plan can be modeled as an active set of the complete plan. Since there are possibly several (unknown) completions for each partial plan, the model for the partial plan needs to be the union of the active sets for all possible completions. We are, in essence, pretending that the partial plan model is an active set of a larger belief network that contains arcs and distributions that correspond to plan construction operations that occur deeper in the planner's search space.23

In figures 46 through 48, we illustrate that the model for a partial plan is an active set of one of the completions of that plan (neglecting threats). An interval belief network can be constructed for each of these active sets that consists of

• probability nodes for each of the conditional effect distributions in the partial plan that is connected to a cost or reward function via a set of causal links,
• deterministic nodes for each of the utility and cost functions in the partial plan,
• vacuous probability distributions for each of the open conditions, and
• vacuous likelihood functions for each probability distribution that contributes a causal link in the complete plan but not in the partial plan.

In the following figures, we will simplify the plan model so that the relationship between the plan model (a belief network) and the plan itself (a partial-order graph of steps) is more obvious. We will assume, for example, that each step has only one conditional effect distribution with one outcome and that there is only one utility subvalue function (the reward) in each plan. The names used for each conditional effect distribution will be the same as the step that defines it.

23.Actually, this is not quite accurate – persist-support can delete arcs from the belief network used for the partial plan. The admissibility proof (Section 1.5) demonstrates that we can ignore threats when we are computing an upper bound on the plan’s utility.


FIGURE 46. Active Set 1. Diagram A illustrates the original partial plan. This partial plan has a single flaw: an open condition on S_B. A single arc from S_IC to S_B is used to complete the plan (B). The full model for B is shown in diagram C. The dotted region denotes the active set of the complete model that corresponds to the partial plan shown in diagram A. An interval belief network (D) can be constructed from the active set in C. Construct this belief network by replacing every arc that crosses the boundary of the active set with either:
• a vacuous likelihood distribution P{ . | X } = [0, 1], if the arc leaves the active set; or
• a vacuous prior probability distribution P{ X } = [0, 1], if the arc enters the active set.
The "plaid" nodes in this diagram denote vacuous distributions. The bound on the expected utility of the reward node computed using network D contains the expected utility calculated from the full belief network (C).

Figures 46-48 each contain a partial plan, a completion of that plan, the model for the completion, and the LPE approximation of the model for the completion using only the nodes in the model for the partial plan. Figures 46.a-48.a all illustrate the same partial plan. Figures 46.b-48.b each illustrate one possible completion of this partial plan. A simplified version of the plan model computed by Model_CE is shown in each of Figures 46.c-48.c. The lightly shaded region illustrates the active set that corresponds to the distributions and arcs that are contained in the original partial plan. Figures 46.d-48.d depict the interval belief networks that represent the active sets shown in 46.c-48.c.

FIGURE 47. Active Set 2. In this figure, the open condition in A is repaired by inserting a step S_D that, in turn, links to S_A.


FIGURE 48. Active Set 3. Any number of steps might be added to complete the partial plan. These steps can link to any number of conditional effect distributions scattered throughout the steps of the plan that can possibly precede the steps with the open preconditions. In this example, a 2-step subplan is used to satisfy the open condition. This subplan is conditioned on the outcome variable of both S_A and S_IC.

In each case, note that a vacuous prior distribution replaces the open condition, and one or more of the nodes that can possibly precede the open condition have vacuous likelihoods. This observation leads to our strategy:
1. Replace every open condition with a vacuous prior probability distribution.
2. If a step S_A possibly occurs before any step containing an open condition, attach a vacuous likelihood distribution to each of the relevant conditional effect distributions in S_A.
This strategy is illustrated in Figure 49.


FIGURE 49. Modeling every completion of a partial plan.

3.6.3.3 Modeling Threats

In order to properly model unresolved threats, we will take advantage of the "clairvoyant persist-support constraint". We will use this constraint to argue that threat resolution steps can only reduce the upper bound on the utility of the essential completions of the partial plan.

Imagine that a plan is complete except for some number of unresolved threats. If these threats are resolved via promotion or demotion, then there will be no impact on expected utility. The only way to change the expected utility of the plan is to change the causal link structure. The only threat resolution operation that can change the causal structure of the plan is persist-support. The clairvoyant persist-support constraint says that persist-support will never cause a threatening step S_T to provide effective support to the condition protected by the threatened link S_E →V S_C. This means that if S_T contains an effective conditional outcome ⟨c, V = v⟩ that is possible, then V = v cannot be a relevant precondition value for S_C. This has two implications:

• First of all, passive conditional outcomes have no effect on the variable V and, therefore, have no direct effect on the utility or cost functions in the plan.
• Say that S_T has effective conditional outcomes over variable V that are possible. It cannot be the case that all of the values for variable V are relevant, since otherwise S_T would be providing effective support to V. Since all of the variable values for cost functions are relevant, this implies that the causal link only impacts the reward function. All of the possible effective conditional outcomes of S_T support precondition values v that are not pertinentP and, therefore, the outcomes are not pertinent. This, in turn, implies that these outcomes can only "transfer" probability mass from the superior alternatives of the reward function to the worst possible outcome.

Thus, if we assume the persist-support constraint of Section 3.5.3, threat resolution can have no effect on the expected values of the cost functions that are present in the plan prior to threat resolution. Furthermore, threat resolution cannot increase the expected value of the reward function. Persist-support may force the planner to search for support for the preconditions that become relevant because they condition the passive conditional outcomes used by persist-support. The planner finds support for these open conditions either by using steps whose only purpose is to support these "secondary preconditions" or by using steps that have other purposes in the plan. In the first case, the utility of the plan can only decrease because every cost function is strictly positive. In the second case, using the outcomes of steps that are already in the plan cannot decrease the cost of the plan. Thus threat resolution either has no impact on the utility of the plan or reduces the utility of the plan (a detailed proof is provided in Appendix A).


3.6.3.4 Model Construction and Evaluation for Partial Plans

The discussion in the last two sections suggests a rather surprising model construction and evaluation algorithm for providing an upper bound on the utility of the essential completions of any partial plan. This algorithm, EvaluateUB, consists of two phases. In the first phase, Model_CE2 constructs a single model for the entire plan. In the second phase, UB crafts a subgraph of this model in order to evaluate an upper bound for the expectation of the reward function and the expectation of each of the cost functions in the plan. The LPE-based model construction algorithm is illustrated in Figures 50 through 52.24 Model_CE2 (Fig. 51) is identical to Model_CE (Fig. 34) except for the starred lines. The starred lines do the following:

• If a precondition is open, the distribution over the precondition variable is modeled using a vacuous probability interval. This vacuous probability interval indicates that the precondition can have any prior probability in the interval [0, 1].

EvaluateUB calls UB to individually evaluate each utility function in the plan. UB adds vacuous likelihood nodes to the network to model possible sources for causal links to open conditions in the plan. UB prunes the network constructed by Model_CE2 to the smallest network that can be used to evaluate a specific reward or cost function. Every vacuous distribution in an interval belief network tends to increase the width of all of the interval probability queries that can be computed from the belief network. If we can prune extra vacuous distributions from the belief network, we can often find an interval around the expectation of the subvalue node or cost function that is significantly tighter than if we had not pruned the network.

24.UDTPOP never actually needs to construct the plan model. The model construction algorithm describes how the model is “hidden” inside of the plan–the plan evaluator can compute the utility of the plan directly from the structure of the plan itself rather than by constructing an independent belief network.


EvaluateUB( Plan ) {
  Let M := <∅, ∅>                            // the model (a global variable)
  Util := R ∪ ( ∪_{S_i ∈ S} –Cost(S_i) )
  Model_CE2( R, S_Goal, Plan )               // modifies M
  return Σ_{U_i ∈ Util} UB( U_i, M )
}

FIGURE 50. EvaluateUB.


Model_CE2( CE, Step, Plan ) {
  if CE ∉ N then {
    N := N + CE
    for all V ∈ PreVars(CE) {
      if ∃( S_E →V Step ∈ L ) then {
        for all Cost_E ∈ Cost(S_E), Model_CE2( Cost_E, S_E, Plan )
        find CE_E ∈ Eff(S_E) such that V ∈ OutVars(CE_E)
        Model_CE2( CE_E, S_E, Plan )
        if V(S_E+) ∉ N then {
          E := EffVars(CE_E)
          V(S_E+) := the distribution P{ V(S_E+) = v | V = v, E\V } = 1.0
          N := N + V(S_E+)
          A := A + ( CE_E, V(S_E+) )
        }
        A := A + ( V(S_E+), CE )
      } else {                                // V is an open condition
*       if ∃ a probability node for precondition variable V of Step
*       then { let O_V := that node }
*       else {
*         let O_V := a probability node with states Domain(V)
*         P{ O_V } = [ 0, 1 ]                 // a vacuous distribution
*         N := N + O_V
*       }
        A := A + ( O_V, CE )
      }
    }
  }
}

FIGURE 51. Model_CE2. Model_CE2 is identical to the Model_CE algorithm except for the starred lines. These lines add a vacuous distribution for each open precondition.


UB( U_i, M ) {
  Let M′ = ⟨N′, A′⟩ be the subgraph of M that is in the N_p set of U_i [Shachter 98].
  For all S_i ∈ S {
    if ( ∃ some S_j such that S_j is possibly after S_i and S_j has an open precondition ) then {
      Let N_S be the set of nodes in N′ that correspond to conditional effect distributions of S_i.
      For all N_k ∈ N_S {
        Let X be a vacuous likelihood function P{ . | N_k } = [ 0, 1 ]
        N′ := N′ + X
        A′ := A′ + ( N_k, X )
      }
    }
  }
  Use LPE to compute the posterior bounds P_M′{ U_i }.
  Let u_Max := max_{u ∈ Domain(U_i)} u
  return max( u_Max, Σ_{u ∈ Domain(U_i)} u ⋅ P̄_M′{ U_i = u } )
}

FIGURE 52. UB. UB first extracts the subset of the belief network that is relevant to U_i (remember that if U_i corresponds to a step cost function, then U_i = –C_i). After extracting this subset, UB adds a vacuous likelihood to each distribution that might possibly condition one of the open precondition variables, either directly or indirectly. LPE uses this interval belief network to compute an upper bound on the expectation for U_i.

Theorem 7 (Upper Bound): If there is only one source of support for every precondition in plan P, EvaluateUB( P ) can compute an upper bound, UB(P), on expected utility and this bound is admissible. When the plan is complete, the expected value of the plan V(P) and the upper bound UB(P) are equal.

Proof: See Appendix A.5.


3.6.3.5 Persist-Support

Appendix A proves that the evaluation algorithm illustrated in Figures 50-52 computes an admissible upper bound for any partial plan as long as it is the case that each open condition is established by at most one causal link. Unfortunately, persist-support can sometimes cause a precondition to have multiple sources of support (let's call this problem the dual support problem). Say that S_T threatens S_E1 →V S_C and there is already a causal link S_E2 →V S_T (see Fig. 53). After persist-support resolves the threat, there will be TWO causal links supporting the same precondition.

FIGURE 53. Persist Support can cause dual support.

This is a temporary condition because the establishers for the dual links threaten each other's links; future threat resolution activity will eventually resolve these flaws. Unfortunately, even temporary dual support introduces a difficult utility evaluation problem. Fortunately, it is possible to avoid dual support by selecting an appropriate sequence for resolving the threats. Theorem 8 proves that it is always possible to find a threat to resolve that will not introduce dual support.

Theorem 8 (Guaranteeing Single Support): It is always possible to order threat resolution operations so that no precondition will ever be established by more than one causal link.


Proof: We will show that there exists a threat S_T ⊗ ( S_E →V S_C ) such that either:
1. the V precondition of S_T is open, or
2. there is a causal link S_E →V S_T.
In either case, resolving the threat does not introduce dual support.

Find any topological sort of the steps in the plan. One or more of these steps will threaten a causal link. Pick the earliest of these threatening steps in the topological sort. This step may threaten one or more causal links. We will select T = S_First ⊗ ( S_E1 →V S_C ) such that the establisher S_E1 is at least as early in the topological sort as the establishers of the other causal links threatened by S_First. There are three ways to resolve this threat. Promote and demote do not affect the causal links in the plan, so they cannot introduce dual support. Persist-support can only cause dual support when there exists a causal link S_E2 →V S_First and S_E1 ≠ S_E2, so we only need to consider this case.25 Now we know that S_E2 cannot threaten S_E1 →V S_C because S_E1 is before S_First in any topological sort and S_First is, by construction, the earliest threatening step. We know that S_E1 must threaten S_E2 →V S_First because
1. S_E1 > S_E2 (otherwise S_E2 ⊗ ( S_E1 →V S_C ), a contradiction), and
2. it must be possible for S_First > S_E1 because S_First ⊗ ( S_E1 →V S_C ).
We can resolve S_E1 ⊗ ( S_E2 →V S_First ) by any means because there can be no causal link S →V S_E1. If this causal link did exist, then it would also be the case that S_E1 < S_First (because otherwise S_First would threaten S →V S_E1, and S is before the earliest establisher, S_E1). This, in turn, leads to a contradiction: S_E1 ⊗ ( S_E2 →V S_First ) and S_E1 is before S_First in any topological sort of the steps. (This situation is illustrated in Figure 54.)

25. UDTPOP treats two causal links as being instances of the same link if all of the particulars of the causal link (the establisher, consumer, and variable) are the same.

FIGURE 54. The contradiction in the proof for Theorem 8.

3.6.4  The Implementation of the Evaluator used in UDTPOP-B

Any belief network algorithm can be used to evaluate a complete plan model. The algorithm used by UDTPOP-B (the variant of UDTPOP used in Section 3.8) constructs a simple join tree [Jensen, et al; 90a, 90b] directly from a topological sort of the partial plan. Given a topological sort Q = { S_i }_{i=0}^{n} of the plan, we can define a function α(k) that represents the set of variables that are protected by causal links that run from steps in { S_i }_{i=0}^{k} to steps in { S_i }_{i=k+1}^{n}. Let β(k) denote the union of the cost functions for S_k and the conditional effects that S_k uses to establish causal links in the plan. Let Pre(N) be a function that extracts all of the preconditions of all of the conditional effects in N. Likewise, Eff(N) returns all of the effect variables in N. The UDTPOP-B evaluator implicitly constructs a cluster tree that is much like the cluster tree illustrated in Figure 55, below.26

FIGURE 55. The cluster tree constructed implicitly by the UDTPOP evaluator. (Clusters Out(β(0)), Pre(β(1)) ∪ Out(β(1)), …, Pre(β(k)) ∪ Out(β(k)), …, Pre(R), connected by separators α(0), α(1) ∪ α(2), …, α(k) ∪ α(k + 1).)

This cluster tree is far from optimal in terms of total cluster size, but UDTPOP can construct and evaluate this cluster tree in milliseconds for small plans. Compressing the zeros out of the potentials would also significantly improve performance [Jensen+Andersen, 90; Kushmerick, 95].
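A rough sketch of the α/β bookkeeping behind this cluster tree appears below. The encodings (a list-valued topological sort, (establisher, variable, consumer) link triples, and dictionaries of cost functions and conditional effects) are invented for the illustration and are not UDTPOP-B's data structures.

# Sketch (invented representation): the alpha/beta functions used above.
def alpha(steps, links, k):
    """Variables protected by links that cross the boundary after step k."""
    before, after = set(steps[:k + 1]), set(steps[k + 1:])
    return {var for (est, var, con) in links if est in before and con in after}

def beta(step, links, costs, effects):
    """Cost functions of a step plus the conditional effects it uses to
    establish causal links."""
    used = {var for (est, var, con) in links if est == step}
    return costs.get(step, []) + [eff for eff in effects[step]
                                  if set(eff["vars"]) & used]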

3.7  Formal Properties

Section 3.7 explores some of the formal properties of UDTPOP. Section 3.7.1 argues that the algorithm is sound by demonstrating that:
1. If a plan is complete, the plan model constructed by UDTPOP is identical to the Markov model implied by the individual step distributions.
2. UDTPOP only returns complete plans; therefore UDTPOP is sound.
Section 3.7.2 argues that the algorithm is complete by showing that:
1. Optimal plans can't include ineffective steps.
2. Every linear series of effective steps can be represented as a UDTPOP plan graph.
3. Every optimal plan is in the search space of UDTPOP, except for possibly the empty plan.

26. The details of the algorithm are similar to the bucket-elimination algorithm [Dechter, 96].

3.7.1  Soundness

We define UDTPOP to be sound if every plan that it returns is complete and the underlying plan model is correct: the model always computes the correct utility for the plan. In order to prove soundness, we will show that all of the “essential parts” of the Markov model representing a particular step sequence are captured in the plan model generated for the equivalent UDTPOP plan. The “essential parts” of the Markov model are the distributions required for evaluating the expectation of each of the additive subvalue nodes. Shachter [88, 90, 98] defines a function, N p ,27 that determines the set of distributions required for answering any conditional probability query on a belief network. We will show that the belief network constructed by UDTPOP is equivalent to the N p set of the Markov Model implied by the individual step distributions.

Definition 16 ( N_p ): [Shachter, 88; 98] N_p(J | E, M) is the set of requisite probability elements for J given E in belief network M. This set contains all of the conditional probability distributions that are required for computing a joint probability distribution for the variables in J given the evidence variables E.

Theorem 9 ( N_p(J | ∅, M) ): [Shachter, 88; 98] When no observations have been made ( E is empty), the set of requisite probability elements required for computing the joint probability over J is the union of J and all of the ancestors of the nodes in J. We will abbreviate this function N_p(J, M).

27.The set of relevant probability elements N p for a query used to be called N π .


3.7.1.1 Markov Model

Definition 17 (Action Model): The effect of each step is a conditional probability distribution,

P_S{ X(S+) | X(S-) } = ∏_{P_{S,q} ∈ CEs(S)} P_{S,q}{ E_{S_q}(S+) | C_{S_q}(S-) } · ∏_{X_k ∈ X \ E_S} Δ{ X_k(S+), X_k(S-) }

where P_{S,q}{ E_{S_q}(S+) | C_{S_q}(S-) } are the conditional effect distributions of S. Δ(X, Y), a delta function, is 1.0 when X = Y and 0 otherwise. The model for an individual step is shown below. The nodes labeled C_S denote utility subvalue nodes.

FIGURE 56. Markov Model. The Markov model for the distribution over all the variables in X(S+) given X(S-). The conditional effect distributions P_{S,q} and cost function C_S are depicted toward the top of the diagram. The "persistence" distributions Δ( X_i(S+), X_i(S-) ) are shown at the bottom. The persistence distributions ensure that X_i(S+) = X_i(S-) when X_i is not an effect of the step. The layers of the full Markov model are separated by outcome variable distributions, X, representing all of the variables modeled in the domain.

Definition 18 (Markov Model): The distribution over effect variables that results from executing the sequence of steps { S_i }_{i=1}^{n} is ∏_{i=1}^{n} P_{S_i}{ X(S_i+) | X(S_{i-1}+) }, where X(S_i-) = X(S_{i-1}+), and P_{S_0}{ S_0+ } is the distribution over the initial states. The plan model is the concatenation of the belief network fragments shown in Figure 56 with an initial distribution P_{S_0}{ S_0+ } and a final reward distribution R(X_n+). Whenever there is an arc from X(S_i-) to P_{S_i,q}( I_{S_i,j} ) in the model for each individual step, we will replace this with an arc from X(S_{i-1}+) to P_{S_i,q}( I_{S_i,j} ). Call this belief network M_m = ⟨N_m, A_m⟩, where N_m denotes the distributions in the Markov model and A_m denotes the arcs between these distributions.

FIGURE 57. The Markov model for a 2 step plan.
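Definition 18 is just the usual composition of step transition distributions. The following sketch pushes a state distribution through a sequence of steps; the encoding (a state as a tuple of variable values, and a step as a function from a state to a distribution over successor states) is invented for the illustration.

# Sketch: composing step transition distributions as in Definition 18.
from collections import defaultdict

def run_sequence(initial_dist, steps):
    """Push the state distribution through each step's transition in order."""
    dist = dict(initial_dist)
    for step in steps:
        nxt = defaultdict(float)
        for state, p in dist.items():
            for new_state, q in step(state).items():
                nxt[new_state] += p * q
        dist = dict(nxt)
    return dist

# Example: one binary variable; a step that flips it with probability 0.3.
def flip(state):
    (x,) = state
    return {(x,): 0.7, (1 - x,): 0.3}

print(run_sequence({(0,): 1.0}, [flip, flip]))   # roughly {(0,): 0.58, (1,): 0.42}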


3.7.1.2 Soundness

Lemma 1 (The Plan Model and Np): Say that one of the topological sorts of the steps in the complete UDTPOP plan P is Q = { S_i }_{i=1}^{n}. Let M_m be the Markov model of Q. The plan model of P, M_p, is equivalent to N_p( R ∪ ( ∪_{i=1}^{n} C_i ), M_m' ), where M_m' is a modification of the Markov model that has the same variables and same joint distribution as M_m. Thus the utility calculated from M_p is identical to the utility calculated from M_m.

Proof: The proof of this theorem appears in Appendix A.3.

Definition 19 (Correct Causal Structure): A UDTPOP plan is said to have correct causal structure if the utility of the plan model is identical to the utility derived from the Markov model of every topological sort of the original plan.

Theorem 10 (Soundness): Every plan returned by UDTPOP is complete, effective, and has correct causal structure.

Proof: There are only two ways to exit from UDTPOP. If UDTPOP either cannot complete a plan step or discovers a constraint violation, then complete-plan returns ∅, indicating that no plan exists in this branch of the search space. The only other point where UDTPOP can return is when the goal agenda is empty in complete-plan. Such a plan has no threats or open conditions and, therefore, must be complete. All complete plans have correct causal structure by the last Lemma; therefore UDTPOP is sound.

3.7.2  Completeness

We will prove that UDTPOP is complete by using a “clairvoyant proof”. A clairvoyant proof [McDermott, 91] uses a clairvoyantly-known exemplar plan to provide search-control. This allows the planner to make all of the ‘right’ choices when duplicating the exemplar.

Since the exemplar is any plan satisfying the UDTPOP optimality criterion,

102

UDTPOP is complete. Recall that a planning problem is a tuple, , where R is the reward function, A is the set of step schema that can be used to construct a plan and IC is the set of distributions over all of the variables used in the domain. Theorem 11 (Completeness): Let Q be a non-empty sequence of steps that is an

optimal solution to the planning problem D = . The cost functions for the possible steps in A are all greater than zero. UDTPOP with the appropriate search control strategy can identify a partial-order plan P' that has a topological sort that is identical to Q . Proof

See Appendix A.4.
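For concreteness, the problem tuple <R, A, IC> recalled above can be pictured as a small record; the field names below are illustrative placeholders rather than the dissertation's data structures.

    from dataclasses import dataclass
    from typing import Any, Callable, Dict, List

    @dataclass
    class PlanningProblem:
        reward: Callable[[Dict[str, Any]], float]        # R: reward over the final-state variables
        step_schemata: List[Any]                         # A: step schemata available to the planner
        initial_conditions: Dict[str, Dict[Any, float]]  # IC: a distribution for each domain variable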

3.8 Empirical Results

In this section, we compare the performance of a variant of UDTPOP with the performance of Buridan. UDTPOP and Buridan construct similar plans but have dissimilar goals: UDTPOP is designed to optimize utility, while Buridan is designed (roughly) to identify the minimum-complexity plan satisfying a probability threshold. In order to perform a direct comparison, we change the termination and evaluation functions in UDTPOP to produce a new planner, UDTPOP-B. The changes:

• Termination test: A normal UDTPOP plan is complete if it has no flaws. UDTPOP-B declares that a plan is complete if it has no flaws and the probability of success is above the specified threshold.

• Cost function: Rather than using the upper bound on utility to guide best-first search, we use the default Buridan cost function: #Open Conditions + #Steps + #Causal_Links (see the sketch below).28
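As a minimal, illustrative sketch of the two guidance functions being swapped (the attribute names open_conditions, steps, causal_links, and utility_upper_bound are assumed for illustration, not the planners' actual interfaces):

    def buridan_cost(plan):
        # Default Buridan cost function adopted for UDTPOP-B in these experiments:
        # #Open Conditions + #Steps + #Causal_Links (lower cost is expanded first).
        return len(plan.open_conditions) + len(plan.steps) + len(plan.causal_links)

    def udtpop_cost(plan):
        # Normal UDTPOP guidance: best-first search on a strict upper bound on plan
        # utility; negated here so that a lower cost means a more promising plan.
        return -plan.utility_upper_bound()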

These modifications may penalize UDTPOP-B more than Buridan. If a plan has no flaws and the success probability is below the termination threshold, Buridan can continue to improve the plan; UDTPOP-B, on the other hand, is forced to discard the plan, because no further refinement is possible.

Table 12 and Figures 58-63 illustrate the performance of two versions of Buridan against UDTPOP-B. Buridan-R uses the REVERSE algorithm [Kushmerick, et al, 95] for evaluating the success probability. The REVERSE algorithm calculates a lower bound on the probability of success using only the causal effect distributions, causal links, and threats in the partial plan; it is analogous to the assessment algorithm used in UDTPOP-B, in that both algorithms use only the explicit structure of the plan for evaluation. Buridan-F uses the FORWARD algorithm [Kushmerick, et al, 95] for computing the probability of goal satisfaction. The FORWARD algorithm simulates the application of a topological sort of the steps in a partial plan to find the probability of goal satisfaction. In order to find a lower bound on the probability of success, it evaluates the success probability for every topological sort of the plan that is consistent with the ordering constraints and returns the smallest of these probabilities.29 Buridan-F will return any partial plan as a solution to a problem if it can prove that the plan always reaches the threshold probability under any topological sort of its steps.

28. For the initial tests, I selected #OCs + #Steps for two reasons: (1) Gerevini & Schubert [95, 96] present an analysis of cost functions for causal-link planners that seems to indicate that this cost function will be more efficient, and our own experience bears this out; (2) Buridan and UDTPOP-B use very different linking strategies (with UDTPOP-B generating several links corresponding to a single UDTPOP-B causal link). This cost function provided very good results for both planners on a subset of the problems. Unfortunately, it caused Buridan to loop infinitely on several problems: Buridan can add links ad infinitum without increasing the number of open conditions or plan steps. In order to factor out the effect of the cost function on search-space size, we also present representative plots that illustrate the estimated size of the entire search space. These plots are not sensitive to the cost function (though they are sensitive to flaw-selection strategies).

29. FORWARD can exactly evaluate any topological sort of an underspecified plan (threats between operators, etcetera). Different topological sorts of an incomplete plan can have different utilities.
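A minimal, illustrative sketch of this minimum-over-sorts lower bound follows (the helper simulate_success_probability, the plan attributes steps and ordering_constraints, and the brute-force enumeration are assumptions, not the FORWARD implementation of Kushmerick et al.):

    from itertools import permutations

    def forward_lower_bound(plan, simulate_success_probability):
        # Return the smallest simulated probability of goal satisfaction over all
        # step orderings consistent with the plan's ordering constraints.
        def consistent(order):
            position = {step: i for i, step in enumerate(order)}
            return all(position[a] < position[b] for a, b in plan.ordering_constraints)

        return min(
            simulate_success_probability(order)   # forward-simulate this total order
            for order in permutations(plan.steps)
            if consistent(order)
        )

Filtering all permutations is only for clarity; a practical implementation would enumerate the topological sorts of the ordering constraints directly.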


Most of the domains used in the experiment were derived from the domain descriptions distributed with the publicly-available source code for Buridan. Four statistics are reported for each of these algorithms:

• Nodes-Visited: the number of partial plans that were refined (interior nodes). This is the number of plans that were popped off the top of the search queue in the best-first search algorithm.

• Nodes-Generated: the total number of partial plans constructed by the algorithm.

• Time: the total amount of time used for search.30 These times are included to demonstrate that the time per refinement is comparable for both planners. Neither planner has been seriously optimized, so timing information is suspect.

• Branching Factor Reduction: the percent reduction in branching factor due to the use of effectiveness and relevance constraints.

A successful plan is one that exceeds the termination threshold τ. The first column of the table lists both the domain and the termination threshold used in each experiment.

TABLE 12. Performance of Buridan-R, Buridan-F, and UDTPOP-B. Each cell lists nodes-visited/nodes-generated and the total search time; the UDTPOP-B column also lists the percent reduction in branching factor. The first column gives the domain and the termination threshold τ; the second gives the number of steps in the solution.

DOMAIN (τ) | STEPS IN SOLUTION | BURIDAN-R | BURIDAN-F | UDTPOP-B
DETERMINISTIC DOMAINS
Simple Lens World, τ = 1.0 | 3 | 4/6, 0.476 s | 4/6, 0.433 s | 3/3, 67%, 0.066 s
Lens World, τ = 1.0 | 10 | 586/733, 20.0 s | 581/721, 115 s | 110/128, 55%, 30.4 s
Chocolate Blocks World, τ = 1.0 | 3 | 44/74, 1.59 s | 43/70, 1.60 s | 14/14, 44%, 0.500 s
UNCERTAIN DOMAINS
Bite Bullet World, τ = 0.8 | 3 | 26/35, 1.24 s | 19/21, 1.45 s | 11/17, 30%, 1.43 s
Bomb and Toilet World, τ = 1.0 | 2 | 11/16, 0.945 s | 7/11, 0.676 s | 8/15, 6%, 0.614 s
Bomb and Clogging Toilet World, τ = 0.9 | 2 | >50000, several hours | 236/932, 13.5 s | 11/18, 5%, 1.62 s
Waste Time World, τ = 0.9 | 2 | 109/261, 1.25 s | 37/85, 2.21 s | 9/15, 0%, 0.693 s
Single Link World, τ = 0.81 | 1 | 720/2944, 44.5 s | 8/32, 0.92 s | 3/4, 20%, 0.094 s
IC5, τ = 0.999 | 0 | 207/327, 6.65 s | 1/1, 0.266 s | 1/1, 0%, 0.021 s
IC6, τ = 0.999 | 0 | 1238/1958, 86.0 s | 1/1, 0.171 s | 1/1, 0%, 0.021 s
Wet Towel World, τ = 0.65625 (a) | 3 | 19/29, 1.49 s | 16/24, 1.36 s | 4/7, 13%, 0.203 s
Slippery Blocks World, τ = 0.9 | 3 | 17/42 (b), 0.992 s | 6/14, 0.71 s | 4/7, 13%, 0.176 s
Diamond World, τ = 0.6 | 3 | >6200/>50000, hours | 154/858, 14.0 s | 7/10, 9%, 0.339 s
Mocha Blocks World, τ = 0.8 | 2 | 4753/26628, 441 s | 116/333, 6.37 s | 9/12, 31%, 0.522 s
Mocha Blocks World, τ = 0.85 | 3 | 5925/30004, 595 s | 789/3333, 42.0 s | 10/14, 28%, 0.725 s
Mocha Blocks World, τ = 0.88 | 4 | (c) | 4003/24193, 420 s | 16/26, 20%, 1.46 s
Mocha Blocks World, (τ = 0.89) | 5 | (c) | (c) | 23/41, 14%, 2.80 s
Mocha Blocks World, (τ = 0.899) | 7 | >72.9K/>500K, >9 hrs 10 min (d) | (c) | 116/316, 3%, 32.4 s
P1, τ = 1 | 3 | 27/79 | 10/27 | 26/150, 73%
P2, τ = 1 | 3 | 38/156 | 20/64 | 48/307, 70%
P3, τ = 1 | 4 | 625/3135 | 67/349 | 88/624, 66%
P4, τ = 1 | 5 | 8321/73095 | 533/3334 | 1020/8585, 60%
P5, τ = 1 | 5 | (e) | 1248/8769 | 1020/8757, 59%
P6, τ = 1 | 6 | (e) | (e) | 3425/32352, 55%

Notes:
a. This is the exact probability of 3 consecutive dry steps.
b. Did not find the shortest solution.
c. Ran out of memory.
d. Test conducted using a Sparc 20 with 96 MBytes of RAM running Allegro Common Lisp.
e. Exceeded a 100,000-plan limit (plans generated).
30. These tests were conducted using a Macintosh Quadra 630 running MCL 3.0p2. Optimizations and virtual memory are off. Ephemeral GC is on. 15 MBytes of RAM were available for MCL.
