DECISION-THEORETIC PLANNING
A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ENGINEERING-ECONOMIC SYSTEMS AND OPERATIONS RESEARCH AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
Mark Alan Peot June 1998
Copyright © by Mark Alan Peot 1998 All Rights Reserved
I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
_______________________________ Ross D. Shachter (Principal Adviser)
I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
_______________________________ David E. Smith
I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
_______________________________ Edison T. S. Tse
Approved for the University Committee on Graduate Studies:
________________________________
Abstract In recent years, researchers in AI planning have been trying to extend the classic planning paradigm to handle uncertainty and partial goal satisfaction. This dissertation focuses on the development of two partial-order planners: DTPOP, a contingent decision-theoretic planner, and UDTPOP, its non-contingent sibling. Both planners assemble actions with conditional, uncertain effects into plans of high overall utility.
A number of innovations are claimed for UDTPOP. UDTPOP uses a new criterion for identifying actions that are relevant to utility. Steps are constrained to be effective; that is, they are guaranteed to change the world in ways that can result in high-utility outcomes. In nearly every other classical partial-order planner, an action can support an open precondition only if it has an outcome that can directly support that precondition. In UDTPOP, support can be indirect as long as the added step supports desirable utility outcomes and does not cause any other step to lose effectiveness. This new action selection criterion allows us to abandon the multiple support paradigm of the probabilistic planner Buridan. This, in turn, allows us to design a nearly systematic planner. The same innovation also allows us to develop a strict (and tight) upper bound on the utility of the plan that can be used as an admissible cost function for best-first search. UDTPOP is significantly more effective than previous methods. An empirical study compares a variant of UDTPOP against Buridan. In nearly every domain, the performance of UDTPOP is superior, sometimes by two to three orders of magnitude.
The second planner, DTPOP, is the first sound and complete contingent partial-order planner. Efficient contingency plans satisfy a "no augury" condition: information used to decide between plan branches must be relevant (in the sense of d-separation) to the utility of the branches under consideration. DTPOP uses this criterion directly to develop tests that can reveal the distribution over unknown, but relevant, events. A test is a subplan that has observable outcomes that are dependent on these hidden events. During planning, DTPOP reverses its direction of search, using regression to establish preconditions and progression to identify observations.
Acknowledgements This work was funded in part through DARPA contract F30602-91-C-0031, the Rockwell International Science Center, and by the National Science Foundation through a Graduate Fellowship.
The ideas in this dissertation benefited considerably from discussions with my advisor Ross Shachter, Jack Breese, Denise Draper, Ken Fertig, Nir Friedman, Steve Hanks, David (DE2) Smith, Phil Stubblefield, Edison Tse, Dan Weld, and Michael Wellman. The ideas in this dissertation arose through joint work performed with DE2 Smith and Jack Breese. Jim Martin and the Rockwell Palo Alto Laboratory provided the perfect working environment for my research.
Finally, I want to thank my friends, Brian Gregory, Michael Lippitz, Enrique Romero, David Smith, and Tim Stanley, for putting up with a lot of dissertation-related flakiness over the past years. Thanks also to my parents, Hans and Kathleen Peot, for their considerable moral support over the years.
My true love Lydia Chang managed my dissertation progress from Michigan.
Table of Contents
Table of Contents  vii
List of Tables  xiii
List of Illustrations  xiv
Notation  xxi

1.0 Introduction  1
  1.1 The Problem  1
  1.2 The Solution  4
    1.2.1 Non-contingent Solutions: UDTPOP  4
    1.2.2 DTPOP  4
  1.3 Roadmap of Regression Planner Development  5
  1.4 Contributions  7
  1.5 Outline of this dissertation  10
2.0 Action and Plan Representation  13
  2.1 Variables and Functions  14
  2.2 Steps  15
    2.2.1 Conditional Effects  15
    2.2.2 Conditional Cost Models  16
    2.2.3 Belief Network Representation for Steps  17
    2.2.4 Example: Stress World  18
    2.2.5 Frame Assumptions  21
  2.3 Goals and Utility  22
  2.4 Assembling Steps into Plans  23
  2.5 Example  26
  2.6 Related Work  27
    2.6.1 Attributes of Probabilistic Steps  27
    2.6.2 Attribute Cardinality  27
    2.6.3 World State Representation  30
    2.6.4 Distribution Symmetry  31
    2.6.5 Distribution Completeness  32
3.0 UDTPOP: Noncontingent Planning  35
  3.1 Overview  35
  3.2 The Basic Ideas  36
    3.2.1 Multiple Support  36
    3.2.2 Single Support  40
    3.2.3 Effectiveness  47
      3.2.3.1 Possibility  48
      3.2.3.2 Pertinence  48
      3.2.3.3 Effectiveness  49
  3.3 UDTPOP  53
    3.3.1 Plan Flaws: Open Conditions  55
    3.3.2 Plan Flaws: Threats  56
    3.3.3 Adding Support: Add-Step and Add-Link  56
    3.3.4 Resolving Threats: Promote and Demote  58
    3.3.5 Resolving Threats: Persist-Support  60
  3.4 Example  62
  3.5 Approximating Effectiveness  67
    3.5.1 Possibility  67
    3.5.2 Pertinence  68
    3.5.3 Effectiveness  69
  3.6 Evaluating Plans  70
    3.6.1 Search  70
    3.6.2 Model Construction in Complete Plans  72
    3.6.3 Model Construction in Partial Plans  78
      3.6.3.1 Modeling Open Conditions  79
      3.6.3.2 Modeling Open Conditions Using LPE  81
      3.6.3.3 Modeling Threats  89
      3.6.3.4 Model Construction and Evaluation for Partial Plans  91
      3.6.3.5 Persist-Support  95
    3.6.4 The Implementation of the Evaluator used in UDTPOP-B  97
  3.7 Formal Properties  98
    3.7.1 Soundness  99
      3.7.1.1 Markov Model  100
      3.7.1.2 Soundness  102
    3.7.2 Completeness  102
  3.8 Empirical Results  103
  3.9 Discussion  114
    3.9.1 Mutual Exclusion  114
    3.9.2 Comparison with Buridan  115
      3.9.2.1 Link Establishment Order  115
      3.9.2.2 Backtracking on Open Conditions  116
      3.9.2.3 Plan Evaluation and Mutual Exclusion  117
      3.9.2.4 Persist-Support vs. Confrontation  118
      3.9.2.5 Soundness  118
      3.9.2.6 The Buridan Heuristic  119
  3.10 Extensions  122
    3.10.1 Extending Relevance  122
    3.10.2 Extending Effectiveness  123
    3.10.3 Tightening the Upper Bound on Evaluation  124
    3.10.4 Systematicity  125
    3.10.5 Multiple Support and Ordering Constraints  126
      3.10.5.1 Multiple Support = Fewer Ordering Constraints (sometimes...)  126
      3.10.5.2 Commutativity  127
  3.11 Contributions  128
4.0 Relevance and Independence  131
  4.1 Notation and Definitions  132
    4.1.1 Belief Network Notation  132
    4.1.2 Influence Diagrams  133
  4.2 Observation Relevance  134
  4.3 Identifying Relevant Nodes in Belief Networks  137
    4.3.1 The Ne, Np, Nr and Ni sets  137
    4.3.2 The Bayes Ball Algorithm  139
    4.3.3 Ne, Np, Ni, and Nr Examples  142
      4.3.3.1 Collect-Requisite  142
      4.3.3.2 Collect Relevant  144
  4.4 Dynamic Influence Diagrams  146
5.0 DTPOP: Contingent Planning  149
  5.1 Overview  151
  5.2 Basic Ideas  152
    5.2.1 Constructing Contingent Plans  152
      5.2.1.1 Warplan-C  153
      5.2.1.2 CNLP  155
      5.2.1.3 C-Buridan  158
      5.2.1.4 Cassandra  162
      5.2.1.5 DTPOP  163
    5.2.2 Identifying Observations  165
  5.3 DTPOP  168
    5.3.1 DTPOP Plan Elements  171
    5.3.2 Mutual Exclusion Constraints and Execution Policies  172
    5.3.3 Plan Flaws  174
      5.3.3.1 Plan Flaws: Threats  174
      5.3.3.2 Open Uncertainties  175
    5.3.4 UDTPOP Modifications  177
    5.3.5 Threat Resolution: Branch  177
    5.3.6 Discovering Alternative Plans: Add-Branch  178
    5.3.7 Identifying Observations: Add-Link-Forward and Add-Step-Forward  178
    5.3.8 Remove-Open-Uncertainty  180
  5.4 Plan Optimization  185
    5.4.1 Contingent Plans and Asymmetric Influence Diagrams  186
    5.4.2 Plan Optimization  187
      5.4.2.1 Plan Model Construction  188
      5.4.2.2 Pass 1  191
      5.4.2.3 Pass 2  195
      5.4.2.4 Plan Optimization  195
      5.4.2.5 Simplifying the Decision Problem  196
    5.4.3 Approximating Plan Optimization  198
  5.5 Recognizing Open Uncertainties  204
    5.5.1 Background  204
    5.5.2 The Information Relevance Network  205
  5.6 Heuristics  214
    5.6.1 Inter-branch threats  214
    5.6.2 Closing and Constraining Open-Uncertainties  215
  5.7 Example  216
    5.7.1 The First Plan Branch  218
    5.7.2 Making the Initial Plan More Robust  220
    5.7.3 Inter-Branch Threats  222
    5.7.4 Searching for Observations  224
    5.7.5 Constructing the Decision Tree  225
    5.7.6 Solving the Decision Tree  229
  5.8 Formal Properties  232
    5.8.1 Soundness  232
    5.8.2 Completeness  233
  5.9 Discussion  234
    5.9.1 Related Work  234
      5.9.1.1 Causal Link Planners  234
      5.9.1.2 Markov Decision Processes  237
      5.9.1.3 MAXPLAN  241
      5.9.1.4 Knowledge-Based Model Construction  242
    5.9.2 Knowledge Preconditions  242
    5.9.3 Observation Actions  246
    5.9.4 Classes of Plans and Independence  248
  5.10 Extensions and Conclusions  250
    5.10.1 Experimental Validation  250
    5.10.2 Evaluation of Partial Plans  251
    5.10.3 Multiplicative Growth in Search Space with Plan Branches  252
  5.11 Contributions  253
6.0 Bibliography  255
A. UDTPOP Proofs  273
  A.1 Effective Support  273
  A.2 Pertinence  275
  A.3 Soundness  276
  A.4 Completeness  281
    A.4.1 Identifying Causal Structure  281
    A.4.2 The Clairvoyant Decision Policy and PertinentC  283
    A.4.3 Completeness  288
  A.5 Admissible Upper Bound  292
B. DTPOP Proofs  299
  B.1 Notation  299
  B.2 Eliminating Decisions  299
  B.3 Observation Relevance  300
  B.4 Soundness  302
  B.5 Completeness  305
    B.5.1 Identifying the Contingent Plan  305
    B.5.2 No Fusion  308
    B.5.3 Completeness  309
List of Tables
TABLE 2. The drink-coffee step.  19
TABLE 3. The effect of the write step on total number of pages written.  20
TABLE 4. The effect of the Write step on Harvey's state of alertness.  20
TABLE 12. UDTPOP-B vs. Buridan-R and Buridan-F  107
TABLE 13. Empirical Comparison in a Navigation Domain  122
List of Illustrations
FIGURE 1. Roadmap of Regression Planner Development  6
FIGURE 2. Relative performance of Buridan and UDTPOP-B  9
FIGURE 3. Step Model  17
FIGURE 4. Steps in Stress World  18
FIGURE 5. Causal Links  26
FIGURE 6. A short plan  27
FIGURE 7. A chain of causal links protecting attribute variable  29
FIGURE 8. Asymmetric conditional effects  31
FIGURE 9. A Buridan step  37
FIGURE 10. Increasing the probability of a precondition with a causal link  39
FIGURE 11. Increasing the probability of a precondition with multiple links from the same step  39
FIGURE 12. Increasing the probability of a precondition using a causal link from a different step  40
FIGURE 13. Single Support  42
FIGURE 14. Increasing the probability of a precondition using a causal link from a different step  42
FIGURE 15. Choices in Single and Multiple Support  44
FIGURE 16. Proliferation of Structure in Multiple Support Planners  45
FIGURE 17. A very simple navigation domain  51
FIGURE 18. A "state transition" diagram for the "Go North" step  52
FIGURE 19. Constraints  52
FIGURE 20. Effectiveness  53
FIGURE 21. Threats  56
FIGURE 22. Promotion  59
FIGURE 23. Demotion  59
FIGURE 24. Before Persist-Support  61
FIGURE 25. After Persist-Support  61
FIGURE 26. Initial Plan  63
FIGURE 27. The example after adding SW1  64
FIGURE 28. The example after adding SW2  64
FIGURE 29. The example after adding a link from SIC to SW2  65
FIGURE 30. The example after adding a link from SIC to SW1  65
FIGURE 31. The example after persist-support  66
FIGURE 32. The final complete plan  66
FIGURE 33. Best First Search  71
FIGURE 34. Model_CE  73
FIGURE 35. A Model for a Complete Plan  74
FIGURE 36. Model_CE(RGoal, SGoal)  74
FIGURE 37. Model_CE(Pages(SW1+), SW1)  75
FIGURE 38. Trace of Model_CE  76
FIGURE 39. Trace of Model_CE continued  77
FIGURE 40. Trace for Model_CE completed  78
FIGURE 41. A fallacious argument for using decisions for representing open conditions  80
FIGURE 42. A simple partial plan that breaks the 'straw' model construction algorithm  81
FIGURE 43. Using SA to support both preconditions maximizes utility  81
FIGURE 44. LPE Example I  83
FIGURE 45. LPE Example II  84
FIGURE 46. Active Set 1  86
FIGURE 47. Active Set 2  87
FIGURE 48. Active Set 3  88
FIGURE 49. Modeling every completion of a partial plan  89
FIGURE 50. EvaluateUB  92
FIGURE 51. Model_CE2  93
FIGURE 52. UB  94
FIGURE 53. Persist Support can cause dual support  95
FIGURE 54. The contradiction in the proof for Theorem 8  97
FIGURE 55. The cluster tree constructed implicitly by the UDTPOP evaluator  98
FIGURE 56. Markov Model  100
FIGURE 57. The Markov model for a 2 step plan  101
FIGURE 58. Relative Performance of UDTPOP-B and Buridan-R  108
FIGURE 59. Relative Performance of UDTPOP-B and Buridan-F  109
FIGURE 60. The branching factor of Buridan and UDTPOP-B as a function of search space depth in the Mocha Blocks World 0.899 domain  111
FIGURE 61. The branching factor of Buridan and UDTPOP-B as a function of search space depth in Diamond World  111
FIGURE 62. The relative search space sizes for Buridan-R and UDTPOP-B in Mocha Blocks World 0.899  112
FIGURE 63. The relative search space sizes for Buridan-R and UDTPOP-B in Diamond World  114
FIGURE 64. Simple Link Domain  117
FIGURE 65. The network of roads used for the navigation domain  121
FIGURE 66. The Move Operator  121
FIGURE 67. Multiple support can result in fewer ordering constraints  127
FIGURE 68. The Dry-With-Wet-Towel Action from Wet Towel World  127
FIGURE 69. Influence Diagram  133
FIGURE 70. A simple decision problem with one decision and no observations  135
FIGURE 71. A simple decision problem with one observation  136
FIGURE 72. Irrelevant observations  137
FIGURE 73. Bayes_Ball  140
FIGURE 74. Collect-Requisite  141
FIGURE 75. Collect-Relevant  141
FIGURE 76. Collect Requisite Cases  142
FIGURE 77. Relevance Examples  144
FIGURE 78. Planning in Warplan-C  154
FIGURE 79. Planning in CNLP  156
FIGURE 80. Outcomes for a gluing operation  159
FIGURE 81. Observation Plan Construction in DTPOP  167
FIGURE 82. The top level of the DTPOP planning algorithm  170
FIGURE 83. Threats  175
FIGURE 84. Voltage Measurement Example  176
FIGURE 85. Branch  178
FIGURE 86. Inefficient Plans  182
FIGURE 87. Remove-Open-Uncertainty Cases 1 and 2  183
FIGURE 88. Remove-Open-Uncertainty Cases 3 and 4  184
FIGURE 89. Plan Optimization  188
FIGURE 90. PM  191
FIGURE 91. PM_1  192
FIGURE 92. Model_Step  192
FIGURE 93. Model_CE  193
FIGURE 94. Observability and Model_CE  194
FIGURE 95. Model_CL  194
FIGURE 96. An algorithm for systematically generating partial orders with equivalent value topological sorts  198
FIGURE 97. A Decision Tree  200
FIGURE 98. Build-Tree  202
FIGURE 99. Prune-Mutex  202
FIGURE 100. Build-Action  203
FIGURE 101. Merge  204
FIGURE 102. Active Subgraph Example, Part I  206
FIGURE 103. Active Subgraph Example, Part II  207
FIGURE 104. The VOM calibration example  208
FIGURE 105. There are possibly many active paths for a single relevance relation  209
FIGURE 106. Relevance For Observable Nodes  211
FIGURE 107. Collect-Relevant  212
FIGURE 108. Collect-Requisite  213
FIGURE 109. Bayes_Ball_IRN  213
FIGURE 110. A non-contingent plan for partying outdoors  220
FIGURE 111. Expected Value of Party-Outdoors  220
FIGURE 112. The plan after starting another plan branch  221
FIGURE 113. The plan after completing the second plan branch  221
FIGURE 114. Expected Value for the party alternatives  222
FIGURE 115. Inter-branch threats  222
FIGURE 116. Mutex Sets After Branching  223
FIGURE 117. The Information Relevance Network  224
FIGURE 118. The Information Relevance Network after Add-Step-Forward  225
FIGURE 119. One possible topological sort for the contingent plan  226
FIGURE 120. The Full Decision Tree Constructed by Build-Tree  229
FIGURE 121. Decision Tree after Marginalizing Out the Unobserved Variables  230
FIGURE 122. Decision Maximization  231
FIGURE 123. Marginalization of Observable Variables  231
FIGURE 124. Value of the contingent plan  232
FIGURE 125. Simple Forecast Example  247
FIGURE 126. Multiple Forecasts  248
FIGURE 127. The transformation from Mm to Mm'  278
FIGURE 128. Discover Links  282
FIGURE 129. Action schemata for SA and SB  284
FIGURE 130. Constructing Q  293
FIGURE 131. Belief networks for Case I  296
FIGURE 132. Gadget for Representing a Disjunct in 3SAT  301
FIGURE 133. Representation for 3SAT  302
FIGURE 134. Discover_Contingent_Links  307
Notation

A, B, Variable :  Attribute variables (Generalized propositions).
A, B, Variables :  Sets of attribute variables.
a, b, value (a, b, values) :  Attribute values (sets of values).
S1, SA, SGo :  Actions.
Si- (Si+) :  The time immediately before (after) action Si executes.
S1 < S2 :  An ordering constraint, "S1 completes before S2 commences."
SE →V SC :  A causal link from SE to SC protecting attribute variable V.
ST ⊗ L :  ST threatens causal link L.
e | c :  The conditional outcome "e given c."
P{.} :  The upper bound on an interval probability distribution.
P{.} :  The lower bound on an interval probability distribution.
[S1, …, Sn] :  A sequence.
{SA, SB}⊥ :  A mutual exclusion constraint.
1.0 Introduction
One of the hallmarks of intelligent behavior is the ability to proactively prepare a detailed course of action that can be applied to accomplish one or more desirable objectives. This activity is called planning. The objective of this dissertation is to explore a number of issues surrounding the development of partial-order plans in uncertain domains. In this chapter, we define the basic problem addressed by this dissertation and provide a road map of related work in this subfield of planning. The final sections of this chapter summarize the remainder of the dissertation, including a brief synopsis of the contributions claimed.
1.1 The Problem
A plan is a sequence of actions that can be executed in order to achieve some objective. A planner assembles a set of primitive actions into an overall program of behavior that addresses the objective. For example, in a travel domain, we might have primitive actions for individual travel legs, such as flying or taking a cab. One objective in this domain might be to travel from Palo Alto, California to New Carlisle, Ohio. The planner's task is to assemble these primitive travel elements into a program of activity that accomplishes this objective. In this situation, such a plan might look like:
• Take a cab from Palo Alto to San Francisco International Airport.
• Fly from San Francisco to Minneapolis.
• Fly from Minneapolis to Dayton.
• Take a cab from Dayton International Airport to New Carlisle.
Most of the past work in planning is aimed at developing planners that solve the following problem: given a set of deterministic primitive actions, a set of deterministic initial conditions, and a goal, find a sequence of these primitive actions that accomplishes the goal with certainty. This problem is the classic planning problem. The objective of this dissertation is to discuss the design of two planners that generalize this classic planning paradigm in three ways:
• Actions may have multiple, uncertain outcomes.
• Goals have value.
• The outcomes of the primitive actions and initial conditions are observable.
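Before relaxing these restrictions, it may help to see how small the classic, deterministic setting is. The sketch below encodes the travel plan above; the action constructors, the state encoding, and the leg names are invented for illustration and are not part of any planner described here.

```python
# A minimal sketch of classic deterministic planning: given deterministic
# actions and initial conditions, executing the plan yields one predictable
# final state. All names below are illustrative.
initial = {"at": "Palo Alto"}

def cab(frm, to):
    return ("cab", frm, to)

def fly(frm, to):
    return ("fly", frm, to)

plan = [
    cab("Palo Alto", "SFO"),
    fly("SFO", "Minneapolis"),
    fly("Minneapolis", "Dayton"),
    cab("Dayton", "New Carlisle"),
]

def execute(state, plan):
    """Deterministic execution: each leg moves the agent to its destination."""
    for _, frm, to in plan:
        # Precondition of every travel leg: be at its origin.
        assert state["at"] == frm, "must be at the leg's origin"
        state = {"at": to}
    return state

final = execute(initial, plan)
```

Because every action is deterministic, `execute` always produces the same final state; the planners in this dissertation drop exactly that guarantee.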
Uncertain Actions
In the classic planning paradigm, primitive actions are deterministic. Application of the same primitive action under the same conditions always results in an identical outcome. We will relax this restriction, allowing uncertain primitive actions: actions that have multiple possible uncertain outcomes. This models domains, such as medicine, in which primitive actions can have wildly varying consequences. For example, a surgical procedure might be highly uncertain, with possible outcomes ranging from a perfect cure to a painful death. Several outcomes might result from the prescription of an antibiotic: the patient may take a full course of the drug, possibly curing the infection; the patient may stop taking the drug early, resulting in a higher probability of reinfection; or the patient might have an allergic reaction to the drug.
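One natural way to picture an uncertain primitive action is as a probability distribution over its outcomes. The sketch below encodes the antibiotic example this way; the outcome labels and probabilities are invented for illustration and do not come from the dissertation's action language.

```python
import random

# Hypothetical uncertain action: a distribution over mutually exclusive
# outcomes. Labels and probabilities are illustrative only.
prescribe_antibiotic = {
    "full-course-cured": 0.60,
    "stopped-early-reinfection-risk": 0.25,
    "allergic-reaction": 0.15,
}

def sample_outcome(action, rng=random):
    """Draw one outcome of the action according to its distribution."""
    outcomes, probs = zip(*action.items())
    return rng.choices(outcomes, weights=probs, k=1)[0]
```

Repeated execution of the same action under the same conditions can now produce different outcomes, which is exactly the property the classic paradigm forbids.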
Utility
In classical planning, planning objectives are captured using goals. A plan is not acceptable unless the plan achieves the goal with certainty. The objective for the planners
described in this dissertation, UDTPOP and DTPOP, on the other hand, is specified using an atemporal utility function. This utility function is composed of two components:
• a reward utility function that rates the world states resulting from the execution of a plan, and
• additive cost functions for each of the actions used in the plan.
The use of a utility function to capture planning objectives allows the planner to trade the achievement of various objectives against each other and against the cost of a plan. If the actions required to achieve an objective are too expensive, then the planner may return an empty plan, indicating that the optimal policy is to do nothing. If the planner cannot simultaneously achieve a set of objectives, it may still be able to identify a plan that balances partial satisfaction of some subset of the overall objectives against the cost of the plan. For example, during the treatment of a patient with a fatal disease, we need to balance the patient's quality of life against the cost or disutility of measures designed to extend the patient's life. Achieving total success, both in curing the disease and in minimizing the impact of treatment on the patient, may not be possible.
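The two-part utility model above can be sketched in a few lines. The function name and the numbers below are invented for illustration: they show only how a state reward and additive action costs combine, and why an empty plan can come out on top.

```python
# Illustrative sketch of the two-component utility function: a reward on
# the resulting world state minus the additive costs of the actions used.
def plan_utility(final_state_reward, action_costs):
    """Utility = reward(final world state) - sum of per-action costs."""
    return final_state_reward - sum(action_costs)

# A costly treatment plan vs. doing nothing (hypothetical numbers):
aggressive_treatment = plan_utility(final_state_reward=40.0,
                                    action_costs=[25.0, 30.0])
empty_plan = plan_utility(final_state_reward=0.0, action_costs=[])
```

With these numbers the empty plan scores higher, mirroring the point in the text that when objectives are too expensive to achieve, the optimal policy may be to do nothing.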
Observations
In the classical planning paradigm, both the initial conditions (the state of the world prior to plan execution) and the individual primitive actions are deterministic. Given the initial conditions, it is possible to predict exactly the world state that results from the application of one or more actions. In the planning domains addressed by this dissertation, both the initial conditions and the individual primitive actions are no longer deterministic. It is not possible in all situations to project a unique world state that results from the application of one or more primitive actions. The world state is not completely observable. The plan execution agent can only learn about the state of the world through actions with observable outcomes. The observable
outcomes of an action may not map exactly into the full set of outcomes of the action. In C-Buridan [Draper et al, 1995], the outcomes of actions are grouped into “distinguishable equivalence classes”, each consisting of a set of outcomes that cannot be distinguished after execution of the action. For example, there might be three outcomes from a “glue part” operation: “no bond”, “weak bond” and “strong bond.” It may be possible to distinguish the first outcome (no bond) from the second and third outcomes, but it may be impossible to distinguish between a weak and a strong bond without further testing.
1.2 The Solution
This dissertation discusses two partial-order planners: UDTPOP and DTPOP.1 Both of these planners assemble actions drawn from a library of action schemata into plans that maximize some atemporal utility function.
1.2.1 Non-contingent Solutions: UDTPOP
UDTPOP finds non-contingent solutions to planning problems. “Non-contingent” means that action execution is not a function of observations made during plan execution. Although the outcomes of individual actions can be a function of the outcomes of other actions, action execution is not. For example, say that our objective is to dry some object. UDTPOP might synthesize a plan that contains 3 drying operations in order to increase the probability that the item is dry. It cannot synthesize action sequences in which the dryness of the object is checked and used to condition the execution of future drying actions.
1.2.2 DTPOP
DTPOP can synthesize contingent plans. A DTPOP plan is a partial-order graph of contingent actions. The execution of each contingent action is conditioned on a function over previously made observations. When this function is true, the action can execute. For example, we might synthesize a plan in which the dry operation is only performed if we observe that the target object is wet.

1. Pronounced "you-dee-tee pop" and "dee-tee pop," respectively.
1.3 Roadmap of Regression Planner Development
Figure 1 shows the relationship between the work performed in this thesis and the contingent and probabilistic planning algorithms that are most closely related to UDTPOP or DTPOP. Section 5.9.1 describes work that is less closely related to the content of this dissertation, including the relationship between this dissertation and work on refinement planning and Markov decision processes. Underlined planners are probabilistic. Planners in bold are contingent.
FIGURE 1. Roadmap of Regression Planner Development. [Figure: a timeline from 1976 to 1997 relating WARPLAN-C (Warren), SNLP (McAllester & Rosenblitt), CNLP (Peot & Smith), SENSp (Etzioni, et al.), Cassandra (Pryor & Collins), UCPOP (Penberthy & Weld), Plinth (Goldman & Boddy), Buridan (Kushmerick, et al.), C-Buridan (Draper, et al.), ε-safe CNLP and ε-safe Plinth (Goldman & Boddy), Puccini (Golden), UDTPOP (Peot), and DTPOP (Peot).]
UDTPOP was inspired by examining the structure and performance characteristics of Buridan [Kushmerick, et al.], the first sound and complete probabilistic planner whose actions have uncertain and unobservable outcomes. UDTPOP uses a variant of Buridan's representation for conditional effects. In addition, UDTPOP's probabilistic threat-resolution operation, persist-support, is based on the probabilistic confrontation mechanism of Buridan.

DTPOP has its roots in a planner (named CNLP) developed by David Smith and myself in 1992 [Peot+Smith, 1992]. CNLP is a partial-order planner inspired by the contingent total-order planner WARPLAN-C [Warren, 1976]. CNLP plans for goals using regression. If CNLP needs to use an uncertain, but observable, outcome in order to achieve its goal, it derives a contingency plan for the alternative outcomes. CNLP introduced a new mechanism for threat resolution: conditioning (now referred to as branching [Draper, et al, 1994]) in order to resolve resource conflicts between contingent plan branches. Branching forces the planner to commit to one action or the other based on the results of an observable action that occurred earlier in the plan. This plan construction operation resolves resource conflicts between contingency plans by forcing the actions involved in the conflicts into different plan branches. The conflict disappears because the conflicting steps can never both be executed during any particular execution of the plan; the steps are contingent on mutually exclusive outcomes of some earlier observation action.

CNLP inspired the contingent version of Buridan: C-Buridan [Draper, et al, 1994]. In addition to extending Buridan, C-Buridan fixed flaws in CNLP's action representation language by using the conditional effect mechanism of Buridan to capture observation/observation dependencies and by introducing the notion of discernible equivalence classes to describe the outcomes of an action that can be distinguished from each other.
1.4 Contributions
UDTPOP
This dissertation outlines the design for a new probabilistic planner, UDTPOP. Specific contributions include:
• UDTPOP is a "relatively efficient" partial-order planner that can construct utility-maximizing plans from actions with probabilistic effects.
• UDTPOP is shown to be sound in the sense that the expected utility of a UDTPOP plan is identical to the utility of any Markov model of a topological sort of the plan.
• UDTPOP is shown to be complete in the sense that it is always guaranteed to find the plan of highest expected utility.
• An admissible evaluation function is developed that can be used with A* to identify the plan of highest overall utility.
• The causal link mechanism of Buridan [Kushmerick, et al, 1993] is re-engineered to improve planning efficiency. UDTPOP uses a 'fat' causal link that serves the same purpose as multiple Buridan links. This reduces the number of choices that must be made during planning, which in turn reduces the depth, d, of the overall search space. Since the time complexity for identifying a solution varies as O(b^d) given comparable branching factors b, the UDTPOP search space can be exponentially smaller than that of Buridan.
• A mechanism is developed that allows UDTPOP to 'close' open conditions without compromising completeness. UDTPOP only needs to consider one source of support for any precondition. If every precondition in the plan is closed (there is one source of support for that precondition) and there are no threats in the plan, then that plan is complete. A Buridan plan, on the other hand, is never complete. Buridan can always attempt to increase the probability of a desirable precondition by adding additional causal support. After Buridan identifies one source of support for each precondition in the plan, Buridan attempts to increase the probability of the plan by choosing to add a causal link to some already-supported precondition within the plan. This choice increases the branching factor, b, of the Buridan search space, increasing the time and memory required to identify a solution.
• A new 'relevance' criterion is introduced. UDTPOP uses this to identify actions that can increase the utility of a plan. This criterion is also used to constrain partial plans: if it is no longer possible for an action in a partial plan to contribute constructively to the objective, that partial plan is pruned.
• Empirical results are presented that convincingly demonstrate the performance gains claimed in the last two bullets. A summary of these results is shown in Figure 2. This log-scale plot graphs the number of plans created by UDTPOP and Buridan when solving problems in a variety of different domains.
FIGURE 2. Relative performance of Buridan and UDTPOP-B. The vertical axis is the log of the number of plans created while solving the problem instances listed along the horizontal axis. On several of these examples, UDTPOP outperforms Buridan by 3-4 orders of magnitude. [Figure: log-scale plot of plans created (1 to 1,000,000) by UDTPOP-B and Buridan-R on the problems UDT Simple Lens, Bomb&Toilet, Wet Towel, Bite Bullet, Slippery Blocks, Chocolate, Waste Time, IC5 Lens, IC6 Single Link, Mocha (several variants), Bomb&Toilet2, Diamond World, and P1-P6.]
DTPOP
This dissertation also outlines the design of a new contingent planner, DTPOP.
• DTPOP is demonstrated to be sound.
• DTPOP is complete in the sense that it identifies the best contingent plan of a given size.
• A mechanism is proposed to add relevant observation actions to the plan.
• A new threat resolution mechanism is introduced based on branching.
• Two techniques are proposed for evaluating and optimizing contingent plans.
1.5 Outline of this dissertation
Chapter 2: Action and Plan Representation

This chapter discusses the knowledge representation used for UDTPOP, including:
• Representation of events;
• Representation of actions, including conditional effects and conditional cost models;
• Plan representation; and
• The influence diagram induced by a partial-order plan.
Chapter 3: UDTPOP

This chapter outlines the design of UDTPOP, a non-contingent decision-theoretic partial-order planner, including:
• A description of the UDTPOP planning problem and algorithm,
• 'Fat' causal links,
• The persist-support operation for threat resolution,
• Evaluation techniques for partial plans, and
• Empirical results comparing the performance of UDTPOP with that of Buridan.
Chapter 4: Relevance and Independence

This chapter reviews past research on influence diagrams and probabilistic relevance. Topics discussed include:
• Definitions for probabilistic relevance;
• Definitions for relevant and requisite sets: sets of nodes that are relevant to a probability or are required for answering that probability query;
• A description of the Bayes Ball algorithm [Shachter, 98] for identifying the relevant and requisite nodes in belief networks or influence diagrams; and
• Conclusions and future work.
Chapter 5: DTPOP

This chapter outlines the design for DTPOP, a contingent decision-theoretic partial-order planner. Topics covered in this chapter:
• A detailed example of planning using DTPOP;
• A description of the plan construction and plan optimization algorithms;
• Theorems for soundness and completeness;
• Techniques for identifying open uncertainties; and
• Conclusions and future work.
Appendix A: UDTPOP Proofs
Appendix B: DTPOP Proofs

Appendices containing proofs for both UDTPOP and DTPOP.
2.0 Action and Plan Representation
This chapter defines the representation used for world states, steps, plans, and objectives in UDTPOP and DTPOP. This representation differs from that of classic planners in four respects:
• The initial world state and the outcomes of actions are uncertain;
• Utility and cost functions, rather than goals, drive the selection of actions during planning;
• Generalized propositions (variables with 2 or more possible states), rather than binary propositions, are used to capture salient properties of the world; and
• Step execution is not necessarily contingent on the values of the preconditions of the action (although the effects of the action may be).
In this chapter, I describe
1. The representation used by both UDTPOP and DTPOP for representing actions and the world.
2. The representation used for plans, which are partially-ordered sets of actions.
3. How planning objectives are specified.
4. The difference between this representation and that used by other probabilistic planners.
Plan and action representation features that are unique to DTPOP (observable outcomes and contingent plans) are discussed in Chapter 5.0.
2.1 Variables and Functions
DTPOP and UDTPOP capture the state of the world using a set of discrete domain variables, X = {X1, …, Xn}, each of which represents some salient attribute of the world that we wish to model. These attribute variables can each assume one of a finite set of mutually-exclusive values (called the domain of the variable). The state of the world at any one point in time is the conjunction of the values of these attribute variables. Boutilier [96] calls these variables generalized propositions. The analogue to a literal1 is an equality between the variable in a generalized proposition and one of its values (e.g. A = true or B = b).
Boolean functions can be defined on generalized literals.
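As a concrete illustration of generalized propositions (a sketch for exposition only; the variables, domains, and helper functions here are invented and are not part of UDTPOP's implementation):

```python
# Generalized propositions: each attribute variable has a finite domain of
# mutually exclusive values; a world state assigns one value to each variable.
DOMAINS = {
    "Alert": {"asleep", "conscious", "in-the-zone", "wired"},
    "Robot-Location": {"home", "lab", "cafe"},
}

def is_valid_state(state):
    """A state is valid iff every variable is bound to one value in its domain."""
    return (state.keys() == DOMAINS.keys()
            and all(v in DOMAINS[var] for var, v in state.items()))

# A generalized literal is an equality between a variable and one of its values.
def holds(state, var, value):
    return state[var] == value

state = {"Alert": "conscious", "Robot-Location": "home"}
print(is_valid_state(state))            # True
print(holds(state, "Alert", "asleep"))  # False
```

Boolean functions on generalized literals are then just ordinary predicates over such state dictionaries.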
The state of the world changes as a function of the execution of steps in a plan. In UDTPOP, we use generalized fluents to represent the evolution of attribute variables over time. The generalized fluent Robot-Location(t) maps time into variable values: Robot-Location(t): ℜ → domain(Robot-Location).
All changes in the world state in UDTPOP occur due to step execution; thus the only times of interest are the times at which individual steps are executed. The time immediately after (before) the execution of step S is S+ (S-). Step execution is modeled as atomic; no other step S2 can execute in the interval [S1-, S1+]. Sans serif capital roman letters or words will denote fluents and variables or propositions, e.g. A(t), Battery-Charge, Robot-Location, etc. The values of these functions or variables will be represented using lowercase (e.g. a, charged, home). Sets of functions, variables, or values will be indicated using overlines (e.g. Robot-Locations(t) and places). A capital X will denote the set of all variables used to model the world. The trajectory of world states over time is X(t).
1. Recall that a literal is either the proposition, p, or its negation, ¬p.
2.2 Steps
A plan contains a partially-ordered set of steps S. The description of each step S_i in this set consists of a set of conditional effects, Eff(S_i), and a set of step cost functions, Cost(S_i). Each conditional effect is a conditional probability distribution over a set of variable values given the world state prior to the execution of S_i. Each cost function describes one component of the total cost of executing S_i; the cost of executing S_i is the sum of the individual cost functions in Cost(S_i). In UDTPOP, the conditional effects of each step are unobservable. Although the effects of steps may be conditioned on the results of earlier steps in the plan, the set of steps executed during plan execution may not be contingent on the results of steps that were executed earlier in the plan.2 In Chapter 3.0, we will describe UDTPOP, an open-loop planner that can find optimal open-loop plans. In Chapter 5.0, we will extend the representation of S to include observable effects, which will, in turn, allow contingent execution.
2.2.1 Conditional Effects
The effect of a step is a conditional probability distribution over a set of effect variables given a set of values for precondition variables. The effect variables, E(S+), are those variables that can actually change as a function of step execution, evaluated at the time immediately after the execution of S. The precondition variables, C, are the variables that affect the outcome distribution over E. Each conditional effect distribution P_S{E(S+) | C(S-)} is a distribution over values for a set of effect variables E given the values of a set of precondition variables C.
2. In control theory terms, plan execution is open loop.
I will often refer to the individual nonzero components of the conditional effect distributions. These components are called the conditional outcomes of the conditional effect distribution. Each conditional outcome has the form

<(E1 = e1) ∧ … ∧ (Em = em) | (C1 = c1) ∧ … ∧ (Cn = cn)> = p

and captures the probability that a particular outcome E = e will result given the trigger or precondition C = c. The set of all of the conditional outcomes in a conditional effect distribution CE is COs(CE).
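A sketch of a conditional effect distribution and its conditional outcomes, using the dry-gripper step discussed in the next section (the dictionary encoding is illustrative, not UDTPOP's data structure):

```python
# A conditional effect distribution P{E(S+) | C(S-)} maps each assignment of
# the precondition variables C to a distribution over the effect variables E.
# Dry-gripper: drying succeeds with probability 0.5 when the gripper is wet,
# and does nothing otherwise.
dry_gripper_effect = {
    # C = (Gripper-Dry,): {E = (Gripper-Dry,): probability}
    ("false",): {("true",): 0.5, ("false",): 0.5},
    ("true",):  {("true",): 1.0},
}

def conditional_outcomes(effect):
    """Conditional outcomes <E = e | C = c> = p: the nonzero entries."""
    return [(e, c, p) for c, dist in effect.items()
                      for e, p in dist.items() if p > 0.0]

# Each row is a proper distribution over the effect values.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in dry_gripper_effect.values())
print(conditional_outcomes(dry_gripper_effect))
# three conditional outcomes: two for the wet case, one for the dry case
```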
2.2.2 Conditional Cost Models
The set of step cost functions Cost(S_i) describes how much it costs to execute each step S_i under different circumstances. Each conditional cost function maps a subset of the world state to the reals. Multiple cost functions on a step (or across several steps) combine additively.3 If it is possible for E{Cost(S)} ≤ 0, then the optimal plan might be infinitely long. To see why, consider the "slippery gripper" problem presented in the Buridan paper [Kushmerick, et al, 94]. The dry-gripper step dries a robot gripper with probability 0.5 if the gripper is wet and does nothing otherwise. The probability that the robot gripper is dry can be made arbitrarily close to 1.0 by concatenating a sufficiently large number of dry-gripper steps. If the cost of dry-gripper is negative or zero, then the optimal UDTPOP plan will have an infinite number of dry-gripper steps.

Theorem 1 (Sufficient Conditions for a Finite Solution): The non-contingent plan with the maximum expected value has a finite number of steps if there are a finite number of possible action schemata and, for each possible action schema S, ∀Cost_i ∈ Cost(S), Cost_i(X) > 0.4

3. They are additive subvalue nodes [Tatman+Shachter, 90].
We will restrict the domains of UDTPOP to contain only steps with positive costs in order to guarantee that we can find an optimal plan. Unfortunately, this "UDTPOP Domain Restriction" will not prevent the discovery of infinite contingent plans in DTPOP, for reasons that I will discuss in Chapter 5.0.
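The dry-gripper argument can be checked numerically. In the sketch below, the 0.5 success probability comes from the example above, while the reward of 1 for a dry gripper and the 0.05 step cost are hypothetical numbers chosen for illustration:

```python
# Why Theorem 1 requires strictly positive step costs: value of a plan with n
# dry-gripper steps, with reward 1 if the gripper ends up dry.
def plan_value(n_steps, step_cost, p_dry0=0.0):
    p_dry = 1.0 - (1.0 - p_dry0) * 0.5 ** n_steps  # P{dry after n attempts}
    return p_dry - step_cost * n_steps             # expected reward - cost

# With zero cost, every added step strictly improves the plan, so there is
# no finite optimum:
assert all(plan_value(n + 1, 0.0) > plan_value(n, 0.0) for n in range(50))

# With a positive cost, the value eventually decreases, so the best plan
# has a finite number of steps:
values = [plan_value(n, 0.05) for n in range(20)]
best_n = max(range(20), key=lambda n: values[n])
print(best_n)  # 4
```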
2.2.3 Belief Network Representation for Steps
FIGURE 3. Step Model. A belief network fragment representing a single step. A single step can have more than one cost function or conditional effect distribution; only one of each is shown here. [Figure: precondition variables C1…Cm feed a conditional effect distribution P{E|C}; deterministic nodes extract the individual effect variables E1…En; an additive subvalue node (+) holds the step cost function.]
A simple generic belief network representation can be used to represent steps. One such representation is shown in Figure 3. The central node models the joint distribution over all of the effects of the step. The deterministic nodes (double ovals) pull distributions over individual variables out of the joint conditional effect distribution so that they can be used to condition other conditional effect distributions. The diamond is an additive subvalue node [Tatman+Shachter, 90] and contains the cost function for the step. The expected cost for all of the steps in the plan is just the sum of the expected costs for each of the steps in the plan.

4. Williamson [96] allows steps with negative cost or resource use but solves the plan existence problem by adding the restriction that no step can increase the net amount of resources available to later steps. For example, we might sell one resource (say gasoline) to increase our wealth, but we cannot execute steps that increase the total value of all of the resources (gas and wealth) available to us.
2.2.4 Example: Stress World
We will use a simple example to illustrate the step and state representation used in UDTPOP.

FIGURE 4. Steps in Stress World. [Figure: belief network fragments for the drink-coffee and write steps. drink-coffee has precondition and effect variable Alert; write has precondition variables Alert and Pages and effect variables Alert and Pages; each step carries a cost node.]
Harvey Hacker5 is a typical caffeine-powered graduate student. In order to maximize his productive output (measured in pages written), he needs to carefully modulate his coffee intake. When he has quaffed exactly the right amount of caffeine, he is "in the zone" and is maximally productive. If he has quaffed too much or too little coffee, he is working outside "the zone" and his productivity suffers. The generalized propositions used to model Harvey's state in Stress World might include:

Alert: Harvey's state of "alertness," one of {asleep, conscious, in-the-zone, wired}.
Pages: The number of pages that Harvey has written. One of {0, …, n}.

5. Harry's brother, for those of you that care about that sort of thing...
Harvey's state is completely determined at any point in time by values for the variables X = {Alert, Pages}. At T1, his state might be X(T1) = {asleep, 0}; at another time, his state might be X(T2) = {wired, 10}; etc. Stress World contains two steps: drink-coffee and write.

drink-coffee: Coffee has an uncertain effect on Harvey's level of alertness that is dependent on Harvey's level of alertness prior to drinking coffee. The conditional outcomes describing the effect of coffee on Harvey are shown in Table 2. Alert is both the precondition and effect variable of drink-coffee.
Preconditions      Outcomes          Probability
Alert(S_java-)     Alert(S_java+)    P{Alert(S_java+) | Alert(S_java-)}

asleep             asleep            0.2
                   conscious         0.8
conscious          conscious         0.2
                   in-the-zone       0.8
in-the-zone        in-the-zone       0.2
                   wired             0.8
wired              wired             1.0

TABLE 2. The drink-coffee step.
<asleep | asleep> and <conscious | asleep> are conditional outcomes of this step because the probability of each of these conditional outcomes is greater than zero. <in-the-zone | asleep> is not, because P{in-the-zone | asleep} = 0.0.
write: Writing “burns up” the caffeine in Harvey’s blood stream (lowering his level of alertness) and produces pages of written material. write contains two independent conditional effect distributions:
• one describing the effect of Harvey's state of alertness and the amount written thus far on the total number of pages written, and
• one describing Harvey's state of alertness as a function of his previous state of alertness.

The precondition variables for the first conditional effect are {Alert, Pages}. The outcome variable is Pages. Harvey's output is highest when he has quaffed exactly the right amount of coffee and his state of alertness is "in-the-zone." His productivity drops to zero if he has quaffed too much caffeine ("wired") or if he is not very alert ("asleep"). The conditional probability distribution describing this effect is shown in Table 3.

Preconditions                        Outcomes          Probability
Alert(S_write-)    Pages(S_write-)   Pages(S_write+)

asleep             n                 n                 1.0
conscious          n                 n                 0.2
                                     n+1               0.8
in-the-zone        n                 n+1               1.0
wired              n                 n                 1.0

TABLE 3. The effect of the write step on the total number of pages written.
The precondition and effect variable for this second conditional effect is Alert. The conditional probability distribution describing this effect is shown in Table 4.

Preconditions      Outcomes          Probability
Alert(S_write-)    Alert(S_write+)   P{Alert(S_write+) | Alert(S_write-)}

asleep             asleep            1.0
conscious          asleep            0.1
                   conscious         0.9
in-the-zone        conscious         0.1
                   in-the-zone       0.9
wired              in-the-zone       0.1
                   wired             0.9

TABLE 4. The effect of the write step on Harvey's state of alertness.
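The conditional effects in Tables 2-4 can be transcribed directly as conditional probability tables. A sketch (Python dictionaries standing in for whatever structures an implementation would actually use):

```python
# Stress World conditional effects, transcribed from Tables 2-4.
drink_coffee = {  # P{Alert(S+) | Alert(S-)}
    "asleep":      {"asleep": 0.2, "conscious": 0.8},
    "conscious":   {"conscious": 0.2, "in-the-zone": 0.8},
    "in-the-zone": {"in-the-zone": 0.2, "wired": 0.8},
    "wired":       {"wired": 1.0},
}
write_alert = {  # P{Alert(S+) | Alert(S-)}: writing burns up the caffeine
    "asleep":      {"asleep": 1.0},
    "conscious":   {"asleep": 0.1, "conscious": 0.9},
    "in-the-zone": {"conscious": 0.1, "in-the-zone": 0.9},
    "wired":       {"in-the-zone": 0.1, "wired": 0.9},
}
def write_pages(alert, n):  # P{Pages(S+) | Alert(S-), Pages(S-) = n}
    return {"asleep":      {n: 1.0},
            "conscious":   {n: 0.2, n + 1: 0.8},
            "in-the-zone": {n + 1: 1.0},
            "wired":       {n: 1.0}}[alert]

# Sanity check: every row of every table is a proper distribution.
for cpt in (drink_coffee, write_alert):
    for row in cpt.values():
        assert abs(sum(row.values()) - 1.0) < 1e-9
print(write_pages("conscious", 3))  # {3: 0.2, 4: 0.8}
```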
2.2.5 Frame Assumptions

A conditional effect describes the explicit effect of a step on a variable or set of variables. If a variable, A, is not mentioned in a conditional effect of step S, then the value of the function corresponding to that variable does not change when S is executed, e.g. A(S+) = A(S-). This frame assumption6 [McCarthy+Hayes, 69] is shared with most other causal link planners. For example, drink-coffee does not explicitly influence the number of pages written. Thus, the number of pages written thus far is the same immediately before and immediately after the drink-coffee step is executed.

I will also assume no spontaneous action. If no step is executed between time t1 and time t2, then the state of the world, X, does not change, e.g. X(t1) = X(t2). The values of functions representing the world state may only change due to the execution of steps.7 All of the direct and indirect effects of a step must be captured explicitly in the conditional effect distribution of the step. One of the ramifications [Ginsberg+Smith, 88a; 88b] of inverting a container is that the contents of the container might drain out. If we desire to accurately model all of the direct and indirect ramifications of this step, then we need to explicitly represent the result of inverting the container when it is full of sugar, water, concrete, acid, etcetera.

6. Called the Law of Persistence [Georgeff, 86].
7. It is possible to model the effect of possibly relevant exogenous events by inserting them as dummy actions into the initial plan. See Blythe [96].
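The frame assumption amounts to a simple state-update rule: only the effect variables mentioned in the conditional outcome that occurred take new values, and everything else persists. A sketch (the `apply_outcome` helper and the Stress World values are illustrative):

```python
# The frame assumption as state update: executing a step changes only the
# variables mentioned in the sampled conditional outcome E = e; all
# unmentioned variables keep their old values.
def apply_outcome(state, outcome):
    new_state = dict(state)      # copy: unmentioned variables persist
    new_state.update(outcome)    # mentioned effect variables take new values
    return new_state

before = {"Alert": "in-the-zone", "Pages": 4}
after = apply_outcome(before, {"Alert": "conscious"})
print(after)  # {'Alert': 'conscious', 'Pages': 4} -- Pages persists
assert before["Alert"] == "in-the-zone"   # the prior state is left intact
```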
2.3 Goals and Utility
The objective of both UDTPOP and DTPOP is to identify the plan with maximum expected value.8 The utility of a plan in UDTPOP is composed of two components. The first component is the reward. The reward, denoted R(X(P+)), is a function of a subset of the variables describing the state of the world after all of the actions in plan P have been executed. The second component of the value function is a set of cost functions, each denoted Cost_{S,i}(X(S-)), that penalizes each step in the plan.9 The expected value of the plan is the expected reward minus the sum of the expected costs of the steps in the plan:

V(P) = Σ_{x ∈ X(P+)} R(X(P+) = x) P{X(P+) = x} − Σ_{S_j ∈ P} Σ_{Cost_{S_j,i} ∈ Cost(S_j)} Σ_{x ∈ X(S_j-)} Cost_{S_j,i}(X(S_j-) = x) P{X(S_j-) = x}    (1)
One popular kind of objective used in AI planning is the goal. A goal is a reward that provides a fixed reward iff the final world state is one of a set of goal world states. That is,

R(X(P+)) = K, if X(P+) ∈ goal; 0, otherwise    (2)

where goal is the set of desired world states.
where goal is the set of desired world states. With this definition, goals are provisional; if a goal is too expensive to achieve, the planner will abandon that goal.10 8.
Synonymous with utility when the decision-maker is risk neutral
9.
It is easy to generalize the cost function to be a function of both the outcomes and the preconditions of an operator.
UDTPOP can be used to emulate a probabilistic planner like Buridan by setting all of the step costs to zero and setting the reward function to a goal function with K set to 1.0; e.g. ∀S ∈ P, Cost_S(X) = 0 and R(X(P+)) = 1, if X(P+) ∈ goal; 0, otherwise. The value of this plan is exactly equal to the probability that the goal is achieved. We will use this trick in order to compare the performance of the utility-based planner UDTPOP with the goal-based Buridan planner in Section 3.7.
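Equation (1) and the goal emulation can be sketched in a few lines, given an enumeration of final-state probabilities (the distribution, goal, and cost numbers below are hypothetical):

```python
# Expected value of a plan per Equation (1): expected reward over the final
# state distribution minus the expected step costs. With all step costs zero
# and a 0/1 goal reward, V(P) is exactly the goal-satisfaction probability.
def plan_value(final_dist, reward, expected_step_costs):
    e_reward = sum(p * reward(x) for x, p in final_dist.items())
    return e_reward - sum(expected_step_costs)

# Hypothetical final-state distribution over (Alert, Pages):
final = {("conscious", 2): 0.98, ("conscious", 1): 0.02}
goal = lambda x: 1.0 if x[1] >= 2 else 0.0   # goal: at least 2 pages, K = 1

print(plan_value(final, goal, expected_step_costs=[]))          # ~0.98
print(plan_value(final, goal, expected_step_costs=[0.1, 0.1]))  # ~0.78
```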
2.4 Assembling Steps into Plans
A plan is a directed-acyclic graph of actions augmented with bookkeeping information that summarizes commitments made during the planning process. A UDTPOP plan is a tuple, <S, O, L, K>, where S represents the steps in the plan (the nodes of the directed graph); O denotes ordering constraints (the arcs of the directed graph); L represents the causal links between steps in the plan; and K represents a set of probabilistic constraints. The steps and ordering constraints define the actual plan. The causal links, L, commit to and protect the relationships between the effect of one step and the precondition of another step. UDTPOP constructs plans by identifying deficiencies in this causal structure. The probabilistic constraints, K, restrict the planner search space either to reduce search space redundancy or to prevent discovery of provably nonoptimal plans.
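The plan tuple can be pictured as a simple record. A sketch (field names and types are invented for illustration; the representation of the probabilistic constraints K is not specified here):

```python
# A UDTPOP plan <S, O, L, K> as a plain record.
from dataclasses import dataclass, field

@dataclass
class CausalLink:
    establisher: str   # S_E: step whose outcome variable supplies the support
    variable: str      # X: the protected variable
    consumer: str      # S_C: step whose precondition is supported

@dataclass
class Plan:
    steps: set = field(default_factory=set)          # S: nodes of the DAG
    orderings: set = field(default_factory=set)      # O: pairs (S1, S2), S1 < S2
    links: list = field(default_factory=list)        # L: causal links
    constraints: list = field(default_factory=list)  # K: probabilistic constraints

p = Plan(steps={"S_IC", "S_W1", "S_Goal"},
         orderings={("S_IC", "S_W1"), ("S_W1", "S_Goal")},
         links=[CausalLink("S_W1", "Pages", "S_Goal")])
print(p.links[0].variable)  # Pages
```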
Goal and Initial Condition Steps
There are two distinguished dummy steps in every plan: a goal step, S_Goal, and an initial conditions step, S_IC. These steps are added to the plan to simplify the design of the planner11; the same planner mechanisms used to handle links between the steps in the plan itself are used to construct links to the initial conditions and to the preconditions in the goal step. The initial conditions step, S_IC, is ordered before every other step in the plan and has no preconditions or cost functions. The conditional effect distributions of S_IC contain distributions over all of the attribute variables in the domain. Every variable that appears as a precondition of any other step must also be an outcome variable of the initial conditions step. The goal step, S_Goal, is constrained to occur after every other step in the plan and has no conditional effects. The cost function for S_Goal is the reward function capturing the objectives for the planner. The UDTPOP domain restriction does not apply to S_Goal; the reward function may return any bounded, real value.

10. Contrast this definition of goal with that of Wellman & Doyle [1992]: (paraphrased) A goal is a strict preference relation on world states. Any world in which the goal is achieved is superior to any world in which the goal has not been achieved.
Ordering Constraints
The temporal-ordering constraints O restrict the time of execution of some steps to occur before or after the time of execution of other steps. The constraint S1 < S2 indicates that S1 must complete its execution before the execution of S2 begins.
Causal Links
The plan's causal links, L, record commitments made by the planner in order to establish a source of support for preconditions. The causal link S_E →X S_C indicates that the planner has committed to using an outcome variable of S_E in order to establish a distribution over precondition variable X for step S_C. If there exists a causal link S_E →X S_C connecting two steps S_E and S_C, we will say that S_E (the establisher of the causal link) establishes precondition variable X for step S_C (the consumer of the causal link). A causal link represents the following commitment:

11. This is the standard trick for handling initial conditions and goals in planners [Weld, 94].
Definition 1 (Causal Link Commitment): If a complete plan P has a causal link S_E →X S_C and steps S, then
1. X is an outcome variable of S_E and a precondition variable of S_C,
2. S_E < S_C, and
3. for all S_T ∈ S, if it is possible for S_T to execute between the times that S_E and S_C execute, then X cannot be an effect variable of S_T.
Note that this definition of causal link is slightly different from that used in Buridan and other causal link planners [Kushmerick, et al, 94; 95; McAllester+Rosenblitt, 91; Penberthy+Weld, 92; Peot+Smith, 92]. Rather than protecting an individual variable value, this causal link protects the distribution over the set of mutually-exclusive values denoted by a single variable. The causal link commitment means that X_i(S_E+) = X_i(S_C-) in every completion of a partial plan.12 Each causal link corresponds to an arc (or set of arcs) from the node representing the effect X_i(S_E+) to the conditional effects of S_C that have X_i in their preconditions. This relationship is shown below in Figure 5. In this diagram, a single causal link corresponds to two arcs in the corresponding influence diagram.

The causal link commitment on a particular link S_E →X S_C is threatened if there exists a step S_T that can execute between S_E and S_C and that has the same effect variable X protected by the causal link. In order to preserve the commitment denoted by S_E →X S_C, UDTPOP will resolve this threat using one of a number of threat resolution techniques.

12. Actually, this commitment is partially rescindable. See Section 3.3.5.
FIGURE 5. Causal Links. Influence diagram arcs induced by a causal link. If the plan has a causal link between S_E and S_C protecting X_i, then the plan model will contain an arc from the probability node representing X_i in S_E to all of the conditional effect distributions in S_C that are conditioned by X_i. [Figure: above, the belief network (plan model) with an arc from X_i in S_E into the conditional distributions of S_C; below, the corresponding causal link S_E →X_i S_C in the plan.]
The plan is called complete if there are no flaws in the causal link structure of the plan, that is:
• each relevant precondition variable is supported by a causal link, and
• the causal link commitment for each causal link holds for each topological sort of the plan (e.g. there are no threats).
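Definition 1 suggests a direct procedure for detecting threats to a causal link. The sketch below (step names and the orderings-as-pairs encoding are illustrative, not UDTPOP's algorithm) flags any step that can possibly fall between establisher and consumer while having the protected variable among its effect variables:

```python
# Threat detection: a step S_T threatens a causal link S_E ->X S_C if S_T can
# possibly execute between S_E and S_C and X is among S_T's effect variables.
# `orderings` holds required pairs (S1, S2) meaning S1 < S2.
def must_precede(orderings, a, b):
    """True iff a < b is entailed by the transitive closure of the orderings."""
    frontier, seen = {a}, set()
    while frontier:
        s = frontier.pop()
        seen.add(s)
        for (x, y) in orderings:
            if x == s and y not in seen:
                frontier.add(y)
    return b in seen

def threats(orderings, effects, link):
    s_e, x, s_c = link
    return [s_t for s_t, eff in effects.items()
            if s_t not in (s_e, s_c) and x in eff
            and not must_precede(orderings, s_t, s_e)   # not forced before S_E
            and not must_precede(orderings, s_c, s_t)]  # not forced after S_C

orderings = {("S_IC", "S_D"), ("S_IC", "S_W"), ("S_D", "S_Goal"), ("S_W", "S_Goal")}
effects = {"S_D": {"Alert"}, "S_W": {"Alert", "Pages"}}
# S_W is unordered with respect to S_D, and both change Alert:
print(threats(orderings, effects, ("S_D", "Alert", "S_Goal")))  # ['S_W']
```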
2.5 Example
A complete UDTPOP plan is illustrated below. This plan models the effect of two write steps. In this plan, the reward utility function is a function only of the number of pages written. During planning, the planner commits to using S_W2 to establish the number of pages for the Pages precondition of the goal step. The single causal link from S_W2 to S_Goal captures this commitment.

Both of the write steps rely on the use of other steps to establish support for their preconditions, Pages and Alert. These commitments are captured by the causal links between the initial conditions step and S_W1 and the causal links between S_W1 and S_W2.
FIGURE 6. A short plan. Causal links protecting Pages and Alert connect SIC, SW1, SW2, and SGoal.
2.6 Related Work
The step and variable representation scheme proposed in this chapter is similar to schemes proposed by several authors. In this section, we will describe how these different representation schemes compare and justify some of the choices made for the representation used in UDTPOP.
2.6.1 Attributes of Probabilistic Steps
The knowledge representation schemes proposed for probabilistic planners can be characterized in terms of four attributes:
• Attribute cardinality: 2 or n.
• World state representation: factored vs. flat.
• Distribution symmetry: symmetric vs. asymmetric.
• Distribution completeness: complete vs. incomplete.
2.6.2 Attribute Cardinality
The attribute cardinality is the number of states that each attribute variable can assume. The alternatives are binary (propositions) or non-binary (the variable/state or generalized proposition [Boutilier, et al, 96] approach discussed in this chapter). Most probabilistic planners use propositions, including [Blythe, 94, 96; Draper, et al, 93, 94; Goldman+Boddy, 94a, 94b, 96; Haddawy, 91; Hanks 90; Kushmerick, et al, 95; Milani, 94; Pryor+Collins, 93], although a few others use variables [Boutilier, 94; Doan, 96; Doan+Haddawy, 95] or make no particular commitment to a technique for representing components of the world state [Dean, et al., 93]. UDTPOP uses generalized propositions for a number of reasons:
• The concept of mutual exclusion is very natural when describing the properties of objects within a domain. For example, objects might have only one location at one time. Things can be living or dead, but not both. When a propositional representation is used, only the mutual exclusion of Fact and ¬Fact is guaranteed. Other mutual-exclusion relationships have to be discovered through inference.
• In classical planners, mutual exclusion relationships between propositions are enforced in a relatively unnatural way by carefully engineering each action schema to maintain the mutual-exclusion relationship. For example, a movement operator that asserts at(object, new-location) must also assert ¬at(object, old-location). If this mutual exclusion relationship is violated in any one of the action schemata, then the plans generated by the planner are no longer guaranteed to be sound. For example, if we accidentally write that the only effect of a movement operator is at(object, new-location), then it may be possible for both of the fluents, at(object, new-location) and at(object, old-location), to be true at the same time.
• The mutual exclusion of variable bindings also allows us to infer whether it might be possible to add a step or resolve a threat to a link by examining the history of bindings for a particular variable. Suppose that there is a continuous chain of causal links protecting the distribution over a specific variable, X, that connects the initial conditions step with the goal step (see Figure 7). If there is a threat to any one of the causal links protecting X, then we know that that threat cannot be removed through promotion or demotion.
FIGURE 7. A chain of causal links protecting attribute variable X.
• The ‘easy’ probabilistic representation of an attribute variable is much more compact than the most obvious probabilistic representation for the equivalent set of propositions (unless extra machinery is used in the propositional planner to detect mutual exclusion). For example, in Stress World, there are 4 mutually-exclusive levels of alertness. If we use belief networks to model uncertainty in this domain, only one discrete variable with 4 possible states is required to model the level of alertness at any one point in time. In a naive propositional approach, we might use 4 binary variables, each representing one possible level of alertness. The joint distribution of these variables has 16 components (2^4). However, only 4 of these components can be non-zero. In order to derive this fact, the evaluator needs to consider all of the steps in the plan in order to derive the proper mutual-exclusion relationships, or rely on the user to explicitly declare the propositions to be mutually exclusive [Breese, 92]. If the number of possible states in the alertness variable were increased to 10, the joint distribution over the propositions representing the different states of this variable would contain 1024 components, only 10 of which are non-zero.

The shift of representation from propositions to attribute variables is not completely without cost. The variable representation makes it easy to model mutual exclusion of attributes when they vary across one dimension, but may make it more difficult to model more complicated mutual exclusion relationships. Say that we are attempting to model trains in a railroad domain. The variable representation of UDTPOP makes it easy to capture the mutual exclusion of possible locations for each train,13 but makes it more difficult to write step descriptions that encode the mutual exclusion of trains per location14 without the use of a second dependent variable. In these situations, it is still possible to fall back to a purely propositional representation by making each proposition a binary attribute variable.
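The compactness argument above is easy to check numerically. A small sketch (the two helper functions are illustrative names, not part of UDTPOP):

```python
def variable_rep_size(n_states: int) -> int:
    """One n-state attribute variable: its distribution has n entries."""
    return n_states

def naive_propositional_size(n_states: int) -> int:
    """n mutually-exclusive binary propositions: the naive joint
    distribution over them has 2^n components."""
    return 2 ** n_states

# 4 alertness levels: 4 entries versus 16 (only 4 of which can be non-zero);
# 10 levels: 10 entries versus 1024 (only 10 non-zero).
comparison = [(n, variable_rep_size(n), naive_propositional_size(n))
              for n in (4, 10)]
```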
2.6.3 World State Representation
Almost all probabilistic planners (including UDTPOP and DTPOP) reason with a propositional or factored representation based upon some variant of STRIPS [Fikes+Nilsson, 71]. All of these planners represent probabilistic steps as a (possibly factored) conditional probability distribution over a set of outcome variables given a set of precondition variables. An alternative to this representation is that of a traditional Markov Decision Process (MDP) [Dean, et al., 93]. An MDP uses a step representation based on a transition matrix between all of the possible world states. The factored or propositional representation is typically exponentially smaller than the equivalent MDP representation. Even though the representation is exponentially smaller, a planner that uses a factored representation is not necessarily more efficient than an MDP. The plan evaluation and plan existence problems for the MDP representation are PL-complete and NP-complete, respectively [Goldsmith, et al, 97].15 The plan evaluation and plan existence problems for UDTPOP, on the other hand, are PP-complete and NP^PP-complete, respectively, for polynomially bounded plans.16 The greater complexity of partial-order probabilistic planning is due to the relative sizes of the representations: the size of the problem specification for an MDP is exponential in the number of state variables used.

13. Each train can only be in one location.
14. Each location can host only one train.
15. The class PL is the set of problems that can be solved on a Turing machine in polynomial time and logarithmic space.
2.6.4 Distribution Symmetry
In a symmetric representation of conditional effects, the same precondition and outcome variables appear in each conditional effect. Many planners, notably Buridan [Kushmerick, et al., 94], DRIPS [Haddawy+Doan, 94], and the MDP planners developed by Boutilier [Boutilier, et al, 96], allow conditional effects to have the structure of an unbalanced tree. Such a tree is shown in Figure 8. The set of triggers Ci are collectively exhaustive and mutually exclusive boolean expressions on the set of all preconditions of the step.

FIGURE 8. Asymmetric conditional effects.
UDTPOP uses a symmetric representation where each of the Ci in the figure above is a conjunction over the same set of precondition variables, but an asymmetric representation is clearly a good idea: an asymmetric distribution is potentially exponentially smaller than one that is symmetric.

16. The notation NP^C-complete means that the problem would be NP-complete if there were an oracle for problems in class C. The canonical PP-complete problem is MAJSAT: “do the majority of assignments satisfy a 3CNF?” [Papadimitriou, 94] Theorists think that PP is really hard because every problem in the polynomial hierarchy can be reduced to P^PP [Toda, 89], the class of problems solvable on a polynomial-time Turing machine using a PP oracle. The canonical NP^PP-complete problem is E-MAJSAT: “given a 3CNF with proposition sets A and B, is there an assignment for A such that the 3CNF is satisfied for the majority of assignments for B?” [Littman, 97]
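To make the size gap concrete, here is a sketch comparing trigger counts under the two conventions. It assumes (as in the symmetric case above) that each trigger is a full joint assignment to the precondition variables; the helper name is hypothetical:

```python
from math import prod

def symmetric_trigger_count(domain_sizes):
    """A symmetric representation needs one conditional effect per joint
    assignment to all precondition variables, so the trigger count is the
    product of the precondition domain sizes."""
    return prod(domain_sizes)

# An unbalanced tree that branches on only one of 10 binary preconditions
# can have as few as 2 leaves; the symmetric table always has 2^10 = 1024.
asymmetric_leaves = 2
symmetric_leaves = symmetric_trigger_count([2] * 10)
```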
2.6.5 Distribution Completeness
Buridan uses a single tree to represent the outcomes of a step. This tree does not specify a complete distribution over an effect variable, but rather specifies how the step can change the state of the variable. This is illustrated by the “slippery gripper” example from Kushmerick, et al [94]. dry-gripper dries a robot gripper so that it can pick up a block successfully with high probability. The robot gripper’s state, Dryness(t), can be either wet or dry. dry-gripper dries the gripper with probability 0.5 if the gripper is wet and has no effect on the gripper if the gripper is dry. The effect of this step is modeled as a single (Buridan-style) conditional effect <∅, dry, 0.5>. This representation is incomplete because the probabilities of the conditional outcomes affecting Dryness do not sum to 1.0. In UDTPOP, we explicitly model the persistence of dryness. For this example, the equivalent complete conditional effect for dry-gripper would be:
P{Dryness(S+) = dry | Dryness(S-) = dry} = 1.0
P{Dryness(S+) = wet | Dryness(S-) = wet} = 0.5
P{Dryness(S+) = dry | Dryness(S-) = wet} = 0.5
The only effective outcome in this conditional effect is 〈Dryness(S+) = dry | Dryness(S-) = wet〉. The rest of the conditional outcomes in this step are designed to passively persist their preconditions.

UDTPOP uses a symmetric representation for conditional effects because this representation makes explicit that the probability of dryness after dry-gripper is executed depends on how dry the gripper was before the step was executed. This allows UDTPOP to increase the probability of success for dry-gripper by finding the appropriate support for the implicit precondition Dryness. In addition, most influence diagram and belief network algorithms (barring those listed in [Hanks, 90] and [Kushmerick, et al, 94]) are designed to take advantage of the Markov property: the distribution over any conditional event is independent of its non-descendants given the values of its predecessors. The distributions for incomplete nodes must be “completed” by the inference algorithm before the inference algorithm can reason with them. Obviously, any incomplete distribution can be turned into a complete distribution by adding passive conditional outcomes. The complete form of a conditional effect distribution is, in the worst case, exponentially larger than the incomplete form of the conditional effect (growth is exponential in the number of outcome variables).
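The completion step described above can be sketched as a small function. This is an illustration of the idea, not UDTPOP's code; the map-based encoding of a Buridan-style change distribution is a hypothetical choice:

```python
def complete_distribution(change, states):
    """Turn a Buridan-style 'change' distribution into a complete
    conditional distribution by adding passive persistence outcomes.
    `change[s]` maps a prior state s to {new_state: probability}; the
    probabilities may sum to less than 1, and the remaining mass is
    assigned to persisting the prior state."""
    table = {}
    for prior in states:
        dist = dict(change.get(prior, {}))
        leftover = 1.0 - sum(dist.values())
        dist[prior] = dist.get(prior, 0.0) + leftover
        table[prior] = dist
    return table

# dry-gripper: dries a wet gripper with probability 0.5, no effect when dry
dry_gripper = complete_distribution({"wet": {"dry": 0.5}}, ["wet", "dry"])
```

Applied to the slippery-gripper example, this recovers exactly the three complete conditional effects listed above.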
3.0 UDTPOP: Noncontingent Planning
UDTPOP is an open-loop1 decision-theoretic partial-order planner. Although individual steps may have effects that are dependent on the effects of previous steps, the sequence of steps that is actually executed is not dependent on information observed during runtime. The same step sequence is executed regardless of the state of the world. In this sense, UDTPOP plans are similar to the plans generated by Buridan [Kushmerick, et al, 1994]. This kind of planning is called conformant planning [Goldman+Boddy, 1994b].

UDTPOP constructs the plan that maximizes the value of a multi-attribute value function. This value function is the sum of a reward value function and step cost functions. UDTPOP can trade planning objectives against each other or against the overall cost of the plan. UDTPOP is not required to identify a plan that accomplishes the objective; if the objective is too expensive, the planner will abandon that objective.
3.1 Overview
This chapter includes:
• an introduction to the UDTPOP algorithm;
• an extended example;
• a description of the formal properties of UDTPOP;
• an empirical comparison of UDTPOP with Buridan; and
• a long discussion on design issues and extensions of UDTPOP.

1. This is a term from control theory: “open loop” means that you cannot observe the state of the system while you are controlling it. The opposite is “closed loop,” meaning that you can observe the system while controlling it and compensate for errors in control.
In order to simplify the presentation, many of the proofs have been moved to Appendix A.
3.2 The Basic Ideas
This section will focus on the key insights that led to the development of UDTPOP. These insights center around 1) the nature of the commitment implied by each causal link and 2) a simple technique for recognizing pertinent steps when steps are deterministic or nearly deterministic. In Sections 3.2.1 and 3.2.2, I will discuss the first of these insights. I will contrast two alternative strategies for adding causal links to probabilistic plans: multiple support, used by Buridan [Kushmerick, et al, 94], and single support, used by UDTPOP. We will argue that a single-support planner can be much more efficient than a multiple-support planner. In order to realize the potential efficiency gain of single-support planners in domains that contain steps with a strong degree of determinacy, we need to be able to quickly prune the steps that are not relevant to an open precondition variable. Section 3.2.3 describes a simple technique that allows the planner to determine whether a step is effective, that is, whether it contributes in a positive way to the overall mission of the plan.
3.2.1 Multiple Support
Multiple support was pioneered in the Buridan planner [Kushmerick, et al, 94]. In multiple support, a single precondition can be established by multiple steps. The multiple-support strategy can be characterized by the following three properties:
1. The step representation for a multiple-support planner describes how each step changes the world (recall 2.6.5).
2. This change-based representation allows a planner to use multiple causal links to establish a single precondition proposition.
3. No a priori restriction is made on the order of execution between multiple steps supporting the same precondition (ordering constraints may be added later to resolve threats).
Over the next few pages, I will explain what these properties mean and what their implications are for probabilistic planning. In Buridan, the effects of a step are described using a set of conditional outcomes. Each trigger, ci, of each conditional outcome is a set of propositions. The triggers for the set of conditional outcomes are mutually exclusive (∀ci ≠ cj, ci ∧ cj = ⊥) and collectively exhaustive (c1 ∨ … ∨ cn = T). Each effect ei is a set of propositions denoting the set of changes that are made to the state of the world if that outcome is the result of step execution. Any ek can be empty, indicating that the conditional outcome has no effect on the world state. The set of conditional effects defining a step can be represented as a probability tree. The leaves of this tree indicate the individual outcomes (ei). The path from the root of the tree to each leaf encodes the trigger, ci, of each conditional outcome. Figure 9 illustrates a hypothetical drink-coffee step. If the coffee is caffeinated, drink-coffee changes the drinker’s level of alertness to conscious with an 80% probability (Leaf 1 in Figure 9) and has no effect at all with a 20% probability (Leaf 2). If the coffee is decaffeinated, drink-coffee results in consciousness with a probability of only 5% and has no effect 95% of the time.

FIGURE 9. A Buridan step.
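The drink-coffee probability tree can be written down directly as a set of conditional outcomes. This is a sketch using a hypothetical dict-of-leaves encoding, not Buridan's actual syntax:

```python
# Each trigger (a proposition on the preconditions) maps to its leaves:
# (effect set, probability) pairs. Triggers are mutually exclusive and
# collectively exhaustive; an empty effect set means "no change".
drink_coffee = {
    ("Coffee", "caf"):   [({"Alert": "conscious"}, 0.80), ({}, 0.20)],
    ("Coffee", "decaf"): [({"Alert": "conscious"}, 0.05), ({}, 0.95)],
}

def leaf_prob_sum(step, trigger):
    """The leaves under each trigger must form a probability distribution."""
    return sum(p for _effect, p in step[trigger])
```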
Causal links in Buridan record the planner’s commitment to use the contribution of individual leaves of conditional effect distributions to establish support for precondition propositions. A “Buridan-style” or “thin” causal link links a single conditional outcome O of one step SE to a single precondition value p of another step SC. This thin causal link captures the commitment that p is true at the time that SC is executed because effect O of SE makes it true, at least some of the time. Buridan can increase the probability of p by adding more causal links to p, each originating in a distinct leaf from the same step or from a different step. The basic idea: the original thin causal link only causes p to be true some of the time; if we support p with additional causal links, the establishing steps for these additional causal links might make p true in case the original establishing step fails. Buridan can also increase the probability of an effect protected by a thin causal link by adding support to the trigger propositions of the conditional outcome that establishes the link. I define a multiple-support planner to be a planner that uses a change-based step representation with thin causal links.

Consider the following example from Stress World. Harvey desires to be conscious so that he can drive home safely. Harvey can increase the probability of a successful drive by drinking coffee before departing, because if the coffee is caffeinated, he will be conscious with a probability of at least 80%. The causal link capturing this planning commitment runs from the effect Alert = conscious of the leftmost branch of drink-coffee’s probability tree to the precondition Alert = conscious of drive-home (see Figure 10).
FIGURE 10. Increasing the probability of a precondition with a causal link.
The probability of driving home successfully might be increased still further by committing to use the other conscious effect of drink-coffee (Figure 11) or, better still, by drinking two cups of coffee (Figure 12). The reasoning: if either cup of coffee wakes Harvey up, then Harvey is guaranteed to get home.2

FIGURE 11. Increasing the probability of a precondition with multiple links from the same step.
2. In the latter case, Buridan does not need to commit to the order of the two drink-coffee steps. We will see later that a single-support planner, such as UDTPOP, must impose “artificial” ordering constraints on steps that all support the same precondition. Because of this, Buridan can find more general plans than UDTPOP. In Section 3.10.5.2, we show how some of this generality can be recovered.
FIGURE 12. Increasing the probability of a precondition using a causal link from a different step. Adding a second causal link increases the probability of consciousness from at least 0.8 to at least 0.96 if both cups of coffee are caffeinated.
If Buridan wanted to increase the probability of the partial plan in Figure 12 still further, it might attempt to add support to any conditional outcome that contributes a causal link. In this case, the “open preconditions” would be the trigger CoffeeType = caf on each of the drink-coffee steps. In addition, the Alert = conscious precondition on drive-home remains open, because Buridan could continue to add additional links (such as from more drink-coffee steps) to support it. Buridan would not attempt to support CoffeeType = decaf until it uses one of the conditional outcomes that depends on this trigger proposition.
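The numbers in Figure 12's caption follow from treating the cups as independent establishers, which is exactly the independence assumption multiple support relies on. A quick sketch of the arithmetic:

```python
def prob_conscious(n_cups: int, p_wake: float = 0.8) -> float:
    """Probability that at least one of n independent caffeinated cups
    wakes Harvey, each succeeding with probability p_wake. A sketch of
    the multiple-support bound, not a planner routine."""
    return 1.0 - (1.0 - p_wake) ** n_cups
```

One cup gives at least 0.8; two cups give at least 1 - 0.2 × 0.2 = 0.96, matching the figure.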
3.2.2 Single Support
UDTPOP and DTPOP are based around a competing notion, that of single support. Single support differs from multiple support in three respects:
1. The step representation describes a complete distribution over all of the variable values that can result after the execution of the step. If a variable appears in one conditional outcome, it must appear in all of them.
2. Only one causal link establishes support for each open precondition.
3. If multiple steps can support the same precondition, only one supports that precondition directly. The rest must add their contribution indirectly through the passive conditional outcomes of intervening steps.
In a single-support planner, the distribution over a particular variable of interest, V, is established by the last step that can affect the probability distribution over V. A single fat causal link is used to capture the distribution over the precondition variable.3 Once this primary causal link has been added to the plan, the formerly open precondition may only be supported indirectly by satisfying the preconditions of the step that established the causal link. There is a philosophical argument for the single support approach: if two actions both affect a single proposition, it is usually the case that the actions cannot be independent and cannot be combined independently as they are in Buridan. When this somewhat unrealistic independence assumption does not hold, the performance of Buridan suffers. In the coffee example used in the previous section, the drink-coffee step either increases the level of alertness or persists the level of alertness that existed prior to drinking the coffee. Harvey’s level of alertness, which was not a precondition of the Buridan-style step, is a precondition of drink-coffee in a single-support planner. In order to increase the probability that Harvey is alert, we add support that changes the probability of the preconditions of drink-coffee: we can either increase the probability that the coffee is caffeinated, or increase the probability that the driver was awake prior to drinking this particular cup of coffee (Figure 13).
3. This causal link is “fat” because it captures the same protection and commitment implied by several “thin” causal links.
FIGURE 13. Single Support. In single support, only one causal link is used to satisfy a precondition. In this case, Drink-Coffee establishes a distribution over Alert for Drive-Home.
Figure 14 illustrates the UDTPOP equivalent of multiple support. Rather than adding a second causal link to the Alert precondition variable of drive-home, a single-support planner attempts to increase the probability of Alert = conscious by increasing the probability of the Alert = conscious precondition of Drink-Coffee.
FIGURE 14. Increasing the probability of a precondition using a causal link from a different step.
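Chaining support through the Alert precondition, as in Figure 14, reaches the same bound as multiple support. The sketch below propagates P(Alert = conscious) through one single-support drink-coffee step at a time; the probabilities follow the running Stress World example, and the function is an illustration, not UDTPOP's evaluator:

```python
def drink_coffee_step(p_conscious: float, p_caf: float) -> float:
    """One single-support drink-coffee step: consciousness passively
    persists, and an asleep drinker wakes with probability 0.8 if the
    coffee is caffeinated, 0.05 if it is decaf."""
    p_wake = p_caf * 0.8 + (1.0 - p_caf) * 0.05
    return p_conscious + (1.0 - p_conscious) * p_wake

# Two cups, both known caffeinated, chained through the Alert variable:
p_after_one = drink_coffee_step(0.0, 1.0)          # 0.8
p_after_two = drink_coffee_step(p_after_one, 1.0)  # about 0.96
```

The second step's passive persistence of Alert carries the first step's contribution forward, which is how single support replaces the second thin causal link of Figure 12.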
In a single-support planner, new steps are always added to the leaves of the causal link tree rather than at arbitrary points as they are in a multiple-support planner. This ‘artificial’ restriction on step order is one of the key distinctions between a multiple-support planner like Buridan and a single-support planner like UDTPOP. The preconditions for a step are the variables4 that condition the conditional effect distribution used to establish that step’s “outbound” causal links. In deterministic or nearly deterministic domains, the techniques discussed in Section 3.2.3 are used to restrict the set of steps that are relevant to any particular precondition. This mechanism is the key to the efficiency of UDTPOP on deterministic domains.

In summary, the single support approach that I outline differs from the multiple support planner Buridan in three respects:
1. Single Support for Preconditions: A single source of support is used for each precondition.
2. Variables: Variables are used instead of propositions.
3. Fat Causal Links: If there is a causal link protecting a given effect variable, it protects all of the generalized propositions concerning that variable that are asserted by the establisher for the causal link.

The empirical section (Section 3.8) demonstrates that a single-support planner can be much more efficient than a multiple-support planner. We list (and briefly explain) several of the reasons below. We will revisit many of these issues in more detail in Section 3.9.
Decreased search space breadth:
When adding a causal link to a plan, the single-support planner has to choose between one of several possible steps that can influence a given precondition variable. Because a fat causal link is used to capture all of the effects on the given precondition variable, there is no need to select between the individual conditional outcomes of the step. A multiple-support planner, on the other hand, must not only decide between these steps, but must also choose which of the conditional outcomes of the step to use in order to support the precondition (Figure 15). This additional choice tends to increase the breadth of the search tree, increasing the overall size of the search space.

4. Not just a single value of the variable.
FIGURE 15. Choices in Single and Multiple Support. A single-support planner only needs to decide between steps that can influence a given precondition variable (a total of 2 options in the figure above). A multiple-support planner needs to also decide which of the specific effects should be used to establish a given precondition (a choice between 4 possible options).
Decreased search space depth:
The use of fat causal links reduces the depth of the single-support planner’s search space. A single-support planner only needs to add a single causal link to capture all of the influences that one step SA has on a variable Xi that is pertinent to another step SB. In order to capture the same set of influences, a multiple-support planner needs to add several causal links: one from each conditional outcome of SA that influences a desirable value of variable Xi.5 Each additional bit of structure forces the multiple-support planner to make additional plan construction decisions, increasing the depth of the search space.6

5. The behavior of Buridan depends on the evaluator used. The FORWARD evaluator [Kushmerick, et al, 93, 95] uses only the steps in the plan and the ordering constraints to determine the minimum probability of the goal; thus it is insensitive to the number of links between steps as long as enough links are present to force Buridan to add enough preconditions to the plan. The REVERSE evaluator, on the other hand, only uses the causal links that are actually in the partial plan to calculate the goal probability. When the goal probability is set sufficiently high, Buridan may have to add enough links between two steps to capture all of the influences on a single proposition.
FIGURE 16. Proliferation of Structure in Multiple Support Planners. A single-support planner needs to add only one fat causal link in order to capture all of the influences that an establishing step has on a single precondition variable. A multiple-support planner may need to add a causal link for each leaf of the conditional effect tree in order to model the full spectrum of effects of the establisher on the precondition variable. Adding each ‘thin’ causal link increases the depth of the solution in the search space.
Nonsystematic search:
The search space is said to be systematic if each partial plan appears in only one place in the search space. Search is highly nonsystematic in multiple-support planners. Imagine that step SA has a precondition Xi. Imagine further that step SB has several (say 10) conditional outcomes that all influence proposition Xi. When the multiple-support planner first considers linking SB to SA, it must choose one of 10 possible ways to add the first causal link. If the planner attempts to increase the probability of Xi by adding further links from SB, it finds additional support in one of the 9 unused conditional outcomes of SB. The problem: in one part of the search space, the planner might add a link from the first leaf of SB and then add a link from the second leaf of SB. In another part of the search space, the planner might add the links in the opposite order, duplicating the plan. Thus a multiple-support planner not only searches over all of the possible combinations of causal links between SA and SB, but also searches over all of the possible orders for adding these links to the plan. In this example, a multiple-support planner might identify as many as 10! = 3,628,800 plans that differ only in the order in which the links were added from SB to SA.7

6. The “confrontation” threat resolution mechanism in Buridan also greatly increases the number of planning decisions. When a threat is resolved via confrontation, additional safety (pre)conditions are added to the consumer of the threatened causal link and additional postconditions are added to the nonthreatening effects of the threatening step. Buridan decreases the probability of the threatening effects of the threat by adding causal links to safety conditions and to the preconditions that tend to increase the probability of those safety conditions on the threat. This additional structure increases the number of decisions that Buridan must make during the planning process, increasing the size of the search space.
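The duplication count quoted above is just the number of orders in which the same set of links can be added (a one-line sketch; the function name is illustrative):

```python
from math import factorial

def duplicate_link_orders(n_links: int) -> int:
    """Worst-case number of search-space nodes encoding the same final
    plan in a nonsystematic multiple-support planner: one per permutation
    of the n link-addition decisions."""
    return factorial(n_links)
```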
A single-support planner can add links in a more systematic fashion because of Single Support for Preconditions. Each precondition can only have one source of support; once a link is added to the open precondition, it is illegal to add another link to that same precondition. Support for open preconditions can be identified systematically without backtracking.
Redundant support:
It may be difficult to determine when additional support is redundant in a multiple-support planner. For example, say that the step Flip-Coin has two mutually exclusive and collectively exhaustive effects: Coin-Face = heads and Coin-Face = tails, each occurring with probability 0.5. The Buridan planner might add links from two separate instances of the Flip-Coin step in order to attempt to increase the probability that Coin-Face = heads is true. However, unless the execution of the second Flip-Coin step is made contingent on the first (cheating in a non-contingent planner), the second Flip-Coin step will erase any contribution from the first Flip-Coin step. The last Flip-Coin step executed always establishes the face of the coin that is showing.8
7. This is definitely true for the current Buridan implementation. It may be possible to impose some discipline on the order that links are added to a plan in order to increase the systematicity of the search space.
A single-support planner will never make this mistake. Since the result of the Flip-Coin step does not depend on the previous state of the coin, the step does not have Coin-Face as a precondition and only one Flip-Coin step can be used to establish the goal.9
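The Flip-Coin argument can be checked with a two-line model. In a complete-distribution representation, the step simply ignores its input (a sketch, not planner code):

```python
def flip_coin(_prior_face_dist):
    """Flip-Coin does not condition on the coin's previous state, so
    Coin-Face is not a precondition: the last flip alone fixes the
    distribution over the coin face."""
    return {"heads": 0.5, "tails": 0.5}

# A second flip erases any contribution the first flip could have made,
# even starting from a coin known to show heads.
after_two_flips = flip_coin(flip_coin({"heads": 1.0, "tails": 0.0}))
```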
3.2.3 Effectiveness
The second insight underlying the design of UDTPOP is the design of a mechanism for pruning irrelevant steps when conditional distributions are deterministic or nearly deterministic. Most steps comprising a probabilistic domain contain conditional effect distributions that are relatively sparse. Many steps are ‘designed’ to do nothing if their preconditions are not satisfied. For example, if we wish to command a robot to stack one block on top of another, we might require that 1) we are holding one of the blocks, 2) the second block is clear, 3) the robot is powered up, etcetera. If any of these conditions are false, the robot does nothing and passively preserves the conditions that were true before the step was “executed.” In UDTPOP, I require that each step in a plan be effective – there should be some non-zero probability that each step can change the state of the world in some way that contributes to the overall goal of the plan. There are three ingredients for effectiveness: 1. Each step must have a conditional outcome (called an effective conditional outcome) that can change the state of the world. 2. This conditional outcome must be possible, that is the joint probability of the precondition values in the trigger for this conditional outcome must be greater than zero. 8. It is easy to modify Buridan so that it recognizes when distributions are complete. This, unfortunately, does not solve the problem. Dependencies in the preconditions between multiple supporting steps can introduce correlations that make it impossible to accomplish the desired goal using some of the causal links in the plan. The general problem of determining whether a thin causal link can support a precondition in any situation is NP-complete. 9. It doesn’t need to have Coin-Face as a precondition since there is no possibility that the step will preserve the previous state of the coin.
3. The value of the conditional outcome should either help the plan accomplish the goal or influence the value of a step cost function (pertinence).

3.2.3.1 Possibility
A conditional outcome is possible when there is some completion of the partial plan in which the probability of its trigger, c, is greater than zero.

Definition 2 (Possible): A precondition or effect value is possible if its probability is greater than 0.0.

3.2.3.2 Pertinence
A step is pertinent10 to an open precondition if that step has an effect that makes it possible for the plan to achieve its objective. For example, say that our goal is to write a four-page paper. So far, we have assembled a partial plan consisting of one write step and wish to determine which of the preconditions of write should be satisfied in order to construct an efficient plan. Recall that a single write step can add 0 or 1 pages of written material to a paper and that the preconditions for write are Pages and Alert. In order for the final write step to achieve the objective of the plan, it must be the case that we have three or four pages of written material immediately before this final write step executes. A step S_A is pertinent to the open precondition Pages of the final write step if Pages = 3 or Pages = 4 are possible effects of S_A.
The set of conditional outcomes that are pertinent is contingent on the set of conditional outcomes that are possible. Say that it is known that the step immediately before the final write step in the example above results in three pages of written material with certainty. Then it is no longer possible to achieve the final objective (4 pages) by using a passive conditional outcome of write; the write step must produce 1 page of written material. This, in turn, implies that the writer (Harvey) must either be "conscious" (he produces 1 page of writing with probability 0.8) or "in-the-zone" (he produces 1 page of writing with probability 1.0). The set of pertinent values for Alert is a function of the set of possible values for Pages. A precondition value is pertinent if the precondition supports a value function and adding the step that supports that precondition value makes it possible to achieve better than the worst possible value for the goal. A plan should make it possible to at least partially achieve the goal; otherwise there is no purpose in pursuing that plan.

10. "Pertinence" is used to describe this relationship rather than the more obvious term "relevance." Relevance can be confused with probabilistic relevance, which will be used extensively in Chapters 4.0 and 5.0.
Definition 3 (Pertinence): A precondition value x is pertinent if the precondition supports a value function in the plan and setting the precondition to x makes it possible to achieve a value that is at least as high as the worst possible value in the reward utility function.

Definition 4 (Pertinent Step): A step is pertinent if it has a conditional outcome that can support a pertinent precondition.

3.2.3.3 Effectiveness
The conditional outcomes for each step are divided into two classes: conditional outcomes that change the state of the world (effective conditional outcomes) and conditional outcomes that persist the state of the world (passive conditional outcomes).
Definition 5 (Effective Conditional Outcome): An outcome ⟨c, e⟩ is effective with respect to variable A if either
• A ⊂ E but A ⊄ C, or
• ∃i, j such that A = E_i = C_j and e_i ≠ c_j.
Definition 6 (Passive Conditional Outcome): The conditional outcome ⟨c, e⟩ is passive with respect to variable A if ∃i, j such that A = E_i = C_j and e_i = c_j.
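To make Definitions 5 and 6 concrete, here is a small sketch in Python that classifies a conditional outcome as effective or passive with respect to a variable. The dictionary-based outcome representation is illustrative only, not UDTPOP's actual data structure.

```python
# A conditional outcome pairs a trigger (precondition values) with
# effects. Both are maps from variable name to value; this structure
# is a hypothetical stand-in for the dissertation's representation.

def is_effective(outcome, var):
    """Definition 5: the outcome can change the state of `var`."""
    trigger, effects = outcome
    if var in effects and var not in trigger:
        return True  # var is an effect variable but not a precondition
    return (var in effects and var in trigger
            and effects[var] != trigger[var])

def is_passive(outcome, var):
    """Definition 6: the outcome persists the prior value of `var`."""
    trigger, effects = outcome
    return (var in effects and var in trigger
            and effects[var] == trigger[var])

# Go-North from the navigation example: moving from city d to city c
# is effective; 'staying' in city a is passive.
move = ({"Loc": "d"}, {"Loc": "c"})
stay = ({"Loc": "a"}, {"Loc": "a"})
```

Note that an outcome can be neither effective nor passive with respect to a variable it does not mention at all.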
It is possible during planning to add steps that make subsequent steps superfluous. For example, the final write step above is superfluous if the step executed immediately before it can guarantee 4 pages of writing. In order to prevent UDTPOP from exploring or completing one of these provably inferior plans, UDTPOP adds a constraint that each step be effective: each step must have at least one effective conditional outcome that is both possible and pertinent.
Definition 7 (Effective Support): In order for a step S_E to provide effective support to another step S_C, there must exist some possible effective conditional outcome of S_E that supports a pertinent precondition of S_C.11
When UDTPOP uses a step to support a precondition, it adds an effectiveness constraint that guarantees that the step will provide effective support for the precondition in every completion of a partial plan. I will use the notation effective(S_E –V→ S_C) to denote the constraint "support for causal link S_E –V→ S_C should be effective." The opposite of effective support is passive support. During threat resolution, UDTPOP may require that a step provide passive support to persist desired values of a threatened causal link. In order to enforce this constraint, UDTPOP uses another kind of constraint, called a persistence constraint, denoted persist(S_E –V→ S_C).
11. Kambhampati proposes a similar constraint on effectiveness for his multi-contributor planner [Kambhampati, 94b]. A step is pruned if it cannot provide ‘effective’ support for at least one of the plan completions implied by the multi-contributor structure. Effectiveness in a UDTPOP plan, on the other hand, is a function of the multiple possible world ‘trajectories’ that might result from the execution of a plan. UDTPOP prunes a step if it cannot provide effective support for at least one of these ‘possible world trajectories.’
Definition 8 (Passive Support): In order for a step S_E to provide passive support to another step S_C, there must exist some possible passive conditional outcome of S_E that supports a pertinent precondition of S_C.

A simple constraint engine enforces effectiveness and persistence constraints.
EXAMPLE
The effectiveness and persistence constraints can restrict planning options considerably in domains with goal-like value functions and steps with relatively few effective conditional outcomes. Figures 17 through 20 illustrate the import of effectiveness constraints on step selection in a simple robot navigation domain (Figure 17). The nodes and arcs on this graph denote cities and roads. Our objective is to derive a sequence of "Go-XXX" actions that move the robot to city a. The reward function is 10 if the robot ends up at a and is 0 otherwise.

FIGURE 17. A very simple navigation domain. a, b, c, and d are cities connected by a single road.
One of the operations that our robot can execute is Go-North. This step moves the robot north one city at a time until it can go no further. Figure 18 illustrates the effect of this step; the robot’s location is advanced one step north unless the robot is already in City a .
FIGURE 18. A "state transition" diagram for the Go-North step. Go-North moves a robot north until it can't go any farther. In this domain, Go-North moves the robot to City c if it starts in d, to City b if it starts in c, and to City a if it starts out in b. If the robot is already in a, it stays in a.
A simple partial plan comprised of three Go-North steps is shown in Figure 19. The large triangular region in this figure captures the constraint that the steps be pertinent to the goal Loc = a .
The barred parallelograms denote the states that have to be possible in order for Go-North 3 and Go-North 2 to be effective. The intersection of these three regions is the set of states that must be possible in order for these steps to be effective.
FIGURE 19. Constraints. (The figure shows the three-step plan Go-North 1, Go-North 2, Go-North 3, together with the regions "Pertinent to goal," "Step 3 is Effective," and "Step 2 is Effective.")
The intersection of these constraints implies that Go-North 1 must be able to achieve Loc = b or Loc = c. If there were a step prior to Go-North 1, then that step should have Loc = c or Loc = d in its effects. This implies, for example, that we cannot add a fourth Go-North operation, because d is not one of that step's effects. This makes sense: if we can start no more than three cities south of a, it never makes sense to execute four Go-North steps. In such a plan, one of the Go-North steps (the last one) would be ineffective, persisting the location of the robot rather than moving the robot north. In fact, the effectiveness criterion alone provides pruning even if the robot's goal is to end up in any city (see Figure 20).
FIGURE 20. Effectiveness. Even if our goal is to get to any city, the effectiveness constraints still prevent the planner from adding a step that doesn't support d. (The figure shows the regions "Step 3 is Effective," "Step 2 is Effective," and "Step 1 is Effective" for the three Go-North steps.)
The Appendix shows that every step must provide effective support in the optimal plan:

Theorem 5 (Effective Support is Necessary for Plan Optimality): If plan P is optimal, then every step in P (except for S_IC and S_Goal) provides effective support.

Proof: See Appendix A.1.

Definition 9 (Effective Plan): If every step in a plan provides effective support to another step, then we will say that the plan is an effective plan.
3.3 UDTPOP

In the next six sections, we will describe the UDTPOP algorithm.
• This section outlines the top-level design of the planning algorithm.
• Section 3.4 illustrates the planning algorithm using an example from Stress World.
• Section 3.5 describes the details of effectiveness, pertinence, and possibility calculations.
• Section 3.6 describes how to calculate the expected value for complete plans and how to calculate bounds on the expected value of partial plans.
• Section 3.7 describes the formal properties of UDTPOP.
• Finally, Section 3.8 benchmarks the performance of UDTPOP against the performance of Buridan on a variety of domains.

UDTPOP solves the following problem: given a planning problem ⟨A, P_IC, R⟩, where R is the reward utility function, A is the set of allowable steps, and P_IC is the distribution over all of the variables in the domain immediately before the execution of any plan steps, UDTPOP finds a partial plan, assembled from the steps in A, that maximizes the expected value of the reward utility function R. UDTPOP returns no plan if there is no plan that can result in an outcome better than the worst outcome in the reward utility function.

At the top level, the design of UDTPOP is similar to other causal link planners. Complete-Plan starts with an empty plan consisting of only an initial conditions step and a goal step. UDTPOP incrementally completes this plan by repairing flaws (open conditions and threats) in the plan's causal structure. When UDTPOP can find no further flaws in the plan, UDTPOP returns both the plan (a partially ordered sequence of steps) and a set of causal links that capture the set of cause-and-effect relationships required to compute the plan's expected value. A sketch of the algorithm is shown below.
Complete-Plan(P, Flaws)
  if (constraint_violated(P)) return ∅
  else if (Flaws = ∅) return P
  else
    Choose a flaw, f, in Flaws.
    1. if (f is an open condition) either
      1.a. P′ = Add-Step(f, P, Flaws – f)
      1.b. P′ = Add-Link(f, P, Flaws – f)
    2. if (f is a threat) either
      2.a. P′ = Promote(f, P, Flaws – f)
      2.b. P′ = Demote(f, P, Flaws – f)
      2.c. P′ = Persist-Support(f, P, Flaws – f)
    return P′
All of the flaws in the plan are recorded in a flaw set Flaws. When this set is empty, there are no flaws left in the causal structure of the plan and the plan is complete. constraint_violated checks the effectiveness constraints in the plan. If any constraint is violated, the plan is pruned.
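The recursion above can be sketched as executable Python. This is a minimal skeleton only: the toy `repairs` generator stands in for the Add-Step/Add-Link and Promote/Demote/Persist-Support operators, and the nondeterministic "either" is realized as backtracking over the repair choices.

```python
# A minimal executable sketch of the Complete-Plan recursion, assuming
# a toy plan representation. `repairs(flaw, plan, rest)` yields
# (new_plan, new_flaws) pairs; `violated` plays the role of
# constraint_violated. These interfaces are illustrative.

def complete_plan(plan, flaws, repairs, violated):
    """Return a flawless plan reachable from `plan`, or None."""
    if violated(plan):
        return None          # an effectiveness constraint is violated
    if not flaws:
        return plan          # no flaws: the plan is complete
    flaw, rest = flaws[0], flaws[1:]   # choose a flaw (here: the first)
    for new_plan, new_flaws in repairs(flaw, plan, rest):
        result = complete_plan(new_plan, new_flaws, repairs, violated)
        if result is not None:
            return result
    return None              # every repair choice failed

# Toy usage: each "flaw" is repaired by appending its name to the plan.
def toy_repairs(flaw, plan, rest):
    yield plan + [flaw], rest

plan = complete_plan([], ["oc1", "threat1"], toy_repairs,
                     violated=lambda p: False)
```

A real implementation would also thread the causal links, orderings, and constraints of the plan tuple through each repair.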
3.3.1 Plan Flaws: Open Conditions
UDTPOP establishes explicit causal support for every pertinent precondition in a plan. A variable is a precondition if it conditions a cost function, conditions the reward utility function, or conditions a conditional effect distribution that is used to establish a causal link. A precondition V of step S is open if there is no causal link S_E –V→ S that establishes V. An open condition is denoted by (→Open S), where Open is the open precondition variable and S is the step containing the open precondition variable.
Open conditions can be repaired by adding causal links to the plan, either through add-link or add-step.
3.3.2 Plan Flaws: Threats

FIGURE 21. Threats. S_T threatens S_E –V→ S_C.

If a step (other than the establisher) can modify the variable protected by a causal link, then that step threatens the commitment denoted by that causal link and invalidates the underlying causal model.

Definition 10 (Threat): A step S_T is said to threaten a causal link S_E –V→ S_C if
1. S_T can possibly occur between S_E and S_C, and
2. V is an effect variable of S_T.

The notation S ⊗ L is used to represent a threat to causal link L from step S. Threats are resolved using promotion, demotion, and a variant on Buridan's confrontation operator, persist-support.
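Definition 10 translates directly into a threat-discovery routine. The sketch below is hypothetical: links are represented as (establisher, variable, consumer) triples and the plan's partial order is abstracted as a `possibly_between` predicate, neither of which is UDTPOP's actual interface.

```python
# Discover threats per Definition 10: a step threatens a link if it can
# fall between the link's establisher and consumer and has the
# protected variable among its effect variables.

def find_threats(steps, links, effect_vars, possibly_between):
    """Yield (step, link) pairs where `step` threatens `link`."""
    for link in links:
        establisher, var, consumer = link
        for step in steps:
            if step in (establisher, consumer):
                continue  # the establisher does not threaten its own link
            if (var in effect_vars[step]
                    and possibly_between(step, establisher, consumer)):
                yield step, link

# Toy usage mirroring Figure 21: S_T may fall between S_E and S_C and
# can modify V, so it threatens the link.
steps = ["S_E", "S_T", "S_C"]
links = [("S_E", "V", "S_C")]
effects = {"S_E": {"V"}, "S_T": {"V"}, "S_C": set()}
threats = list(find_threats(steps, links, effects,
                            lambda s, a, b: True))
```

This is the check the `newThreats` function of Section 3.3.3 would apply to newly added steps and links.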
3.3.3 Adding Support: Add-Step and Add-Link
Add-Step and Add-Link add new causal links to plans in order to repair open conditions. Add-Step selects a step (an action schema) from the set A of possible actions and copies it into the plan in order to establish support for an open precondition. Add-Link uses an effect variable from a step that is already in the plan in order to establish support. Both Add-Step and Add-Link use Add-Support to do all of the work.
Add-Step(oc = (→Open S_C), P = ⟨S, O, L, K⟩, Flaws)
  if there exists a step S_E in the domain description that has Open in
  its effect variables,
    return Add-Support(oc, S_E, ⟨S + S_E, O, L, K⟩, Flaws)
  else return ∅.

Add-Link(oc = (→Open S_C), P = ⟨S, O, L, K⟩, Flaws)
  if there exists a step S_E in S such that S_E is possibly before S_C
  and Open is an effect variable of S_E,
    return Add-Support(oc, S_E, P, Flaws)
  else return ∅.

Add-Support(oc = (→Open S_C), S_E, P = ⟨S, O, L, K⟩, Flaws)
  if (there is a conditional outcome ce = ⟨c, e⟩ of step S_E such that
      1. ∃j such that Open = E_j,
      2. e_j is pertinent for precondition Open of step S_C,
      3. c is possible, and
      4. ce is effective for Open)
  then {
    let K_new := effective(S_E –Open→ S_C)
    let P′ := ⟨S, O + (S_E < S_C), L + (S_E –Open→ S_C), K + K_new⟩
    let Flaws′ := Prune(Flaws) ∪ newOCs(S_E, Open, P′) ∪ newThreats(P′)
    return Complete-Plan(P′, Flaws′)
  } else return ∅
Each call to add-support adds:
• a causal link, S_E –Open→ S_C;
• an ordering constraint, S_E < S_C, that forces the establisher of the causal link to occur before the consumer; and
• a constraint, effective(S_E –Open→ S_C), that forces the step to provide effective support for the causal link.

The function newOCs adds new open conditions. A precondition variable is added to the open conditions list if it has not been established by any causal link and either: 1) is the precondition of a conditional effect distribution that is used to establish a causal link, 2) is the precondition of a step cost function, or 3) is a precondition of the reward utility function. The function newThreats discovers all of the new threats in the plan. Threats arise from two sources. When a new step is added, it can threaten existing causal links. When a new causal link is added, existing steps might threaten that new link. newThreats uses the threat definition to identify these new threats. When ordering constraints are added to the plan, it may no longer be possible for a step to threaten a causal link. The function prune removes threats that are resolved by the addition of ordering constraints.
3.3.4 Resolving Threats: Promote and Demote

If S_T threatens causal link S_E –V→ S_C, then it is possible for S_T to be executed between the times that S_E and S_C are executed. One standard method for resolving this threat is to require that S_T be ordered to execute either after S_C executes (promotion) or before S_E (demotion).
FIGURE 22. Promotion. Promotion resolves a threat by forcing the threat to execute after the causal link.

Promotion(T = S_T ⊗ (S_E –V→ S_C), P = ⟨S, O, L, K⟩, Flaws)
  If (S_C < S_T is possible) {
    O′ := O + (S_C < S_T)
    return Complete-Plan(⟨S, O′, L, K⟩, Prune(Flaws))
  } else return ∅.
FIGURE 23. Demotion. Demotion resolves a threat by forcing the threat to execute before the causal link.
Demotion(T = S_T ⊗ (S_E –V→ S_C), P = ⟨S, O, L, K⟩, Flaws)
  If (S_T < S_E is possible) {
    O′ := O + (S_T < S_E)
    return Complete-Plan(⟨S, O′, L, K⟩, Prune(Flaws))
  } else return ∅.
As in add-support, any time that ordering constraints are added to the plan, additional threat flaws may disappear.
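Both promotion and demotion hinge on whether a proposed ordering constraint is consistent with the plan's partial order. A sketch of that test, assuming orderings are stored as a set of (before, after) pairs: the constraint a < b is possible iff b < a is not already entailed transitively.

```python
# Check whether an ordering constraint can be added to a partial order
# without creating a cycle. The pair-set representation is illustrative.

def entails(orderings, x, y):
    """True if x < y follows transitively from the ordering pairs."""
    frontier, seen = [x], set()
    while frontier:
        s = frontier.pop()
        if s == y:
            return True
        if s in seen:
            continue
        seen.add(s)
        frontier.extend(b for a, b in orderings if a == s)
    return False

def ordering_possible(orderings, before, after):
    """Can we add `before < after` without contradiction?"""
    return not entails(orderings, after, before)

# In the example of Section 3.4, S_W2 is forced between S_IC and S_W1,
# so ordering S_W1 before S_W2 is impossible.
O = {("S_IC", "S_W2"), ("S_W2", "S_W1")}
```

When the test fails for both promotion and demotion, persist-support is the only remaining threat resolution operation.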
3.3.5 Resolving Threats: Persist-Support
Persist-Support is similar in intent to the confrontation operation used in Buridan and UCPOP [Penberthy+Weld, 92]. Persist-support resolves the threat by making it impossible for the threat to change the state of the protected variable. It does this by redrawing the plan's causal link structure so that the effect protected by the link passes through the passive conditional outcomes of the threatening step.12 Causal support can only be persisted by a threat if the threat has passive conditional outcomes that can serve as a "tunnel" to carry the desired effect of the establisher of the threatened link to the consumer via the threat. Since these passive conditional outcomes are now pertinent, UDTPOP will take steps to increase the probability of these conditional outcomes, effectively "widening the tunnel" (Figures 24 and 25).
12.In this respect, UDTPOP’s persist-support operation is more similar to the confrontation operation of UCPOP. UCPOP adds ordering constraints that force the threat to occur between the establisher and consumer of the threatened link. Buridan, on the other hand, does not. Confrontation does not constrain the order of the establisher with respect to the threat.
FIGURE 24. Before Persist-Support. The threatening step S_T must have passive conditional outcomes for V.
FIGURE 25. After Persist-Support. The threatened causal link is 'threaded' through the 'tunnel' formed by the passive conditional outcomes of S_T. Constraints ensure that the tunnel can possibly persist the state of V for pertinent values of V. Optional constraints ensure that S_T has no effective conditional outcome that can influence a pertinent value of V for the new causal link S_T –V→ S_C.
Persist-Support(T = S_T ⊗ (S_E –V→ S_C), P = ⟨S, O, L, K⟩, Flaws)
  if (there exists a conditional outcome ce = ⟨c, e⟩ in S_T such that
      ∃j such that E_j = V,
      ce is a passive conditional outcome for V,
      c is possible, and
      e_j is a pertinent value for V)
  then {
    let L′ := L – (S_E –V→ S_C) + (S_E –V→ S_T) + (S_T –V→ S_C)
    let O′ := O + (S_E < S_T) + (S_T < S_C)
    let K_new := persist(S_T –V→ S_C)13
    let P′ := ⟨S, O′, L′, K + K_new⟩
    let Flaws′ := Prune(Flaws) ∪ NewOCs(S_T, V, P′)
    return Complete-Plan(P′, Flaws′)
  } else return ∅
It is possible for persist-support to redraw the structure of the plan so that two causal links provide support to the same precondition variable of S_T. This is a temporary state: if the causal links supporting S_T originate in two different steps,14 each step will threaten the other's support of S_T. Resolving these threats resolves the dual support problem: only one causal link will support S_T.
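The link-rewiring core of persist-support can be sketched as follows, using the same illustrative (establisher, variable, consumer) link triples as earlier blocks. The real operator also records the persistence constraint and the new open conditions on the tunnel.

```python
# Splice a threatening step into a threatened causal link: remove the
# old link and thread the protected variable through the threat.
# The triple/pair representations are illustrative only.

def persist_support(links, orderings, threat_step, link):
    """Return (new_links, new_orderings) after splicing `threat_step`."""
    s_e, var, s_c = link
    new_links = [l for l in links if l != link]
    new_links += [(s_e, var, threat_step), (threat_step, var, s_c)]
    # The threat is now ordered between establisher and consumer.
    new_orderings = set(orderings) | {(s_e, threat_step),
                                      (threat_step, s_c)}
    return new_links, new_orderings

# Step 6 of the example in Section 3.4: splice S_W2 into the Alert link.
links, O = persist_support([("S_IC", "Alert", "S_W1")], set(),
                           "S_W2", ("S_IC", "Alert", "S_W1"))
```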
3.4 Example
In this section, we will illustrate some of the principles of UDTPOP using a simple example drawn from Stress World. Harvey's objective is to write a short paper (1 to 2 pages). One written page is worth $10, 2 pages are worth $20, and 0 pages are worth $0. In the initial state, Alert = conscious and Pages = 0. The initial plan consists of an initial conditions step with two conditional effects capturing the initial conditions, Alert = conscious and Pages = 0, and a goal step capturing the reward utility function.

13. The systematicity of UDTPOP can be improved by adding a constraint that S_T not provide effective support for S_C. Currently, UDTPOP's constraint engine cannot enforce the negative constraint ¬effective(S_T –V→ S_C) (see Section 3.5).
14. If they originate in the same step, both causal links are actually the same link.
FIGURE 26. Initial Plan: The initial plan for the example has only two steps, S_Goal and S_IC.
Step 1: Use Add-Step to repair open condition (→Pages S_Goal)

The only flaw to repair in the initial partial plan is the open condition (→Pages S_Goal). In order to achieve one of the desirable effects of S_Goal with some probability, we will need to support Pages using a new step (the effect of S_IC would result in a payoff of $0). Add-Step inspects the steps available to the planner and discovers that write has at least one conditional outcome that can change the state of the world to a state that is rewarded by the reward function. Add-Step adds
• a new write step, S_W1, to the plan;
• a causal link, S_W1 –Pages→ S_Goal;
• a constraint that guarantees that write provides effective support for the goal; and
• two new open conditions, (→Pages S_W1) and (→Alert S_W1).
FIGURE 27. The example after adding S_W1.
Step 2: Use Add-Step to repair open condition (→Pages S_W1)

We will choose to use add-step to add another write step to establish support for (→Pages S_W1) (we could have used add-link to establish support from the initial conditions). In order for S_W1 to be effective, the new write step must be able to establish that Pages = 0 or Pages = 1. There are effective conditional outcomes in write that can accomplish at least one of these goals, so write is a pertinent step for (→Pages S_W1).
FIGURE 28. The example after adding S_W2.
Step 3: Use Add-Link to repair (→Pages S_W2)

The next open condition that we will attack is (→Pages S_W2). If we link S_IC to S_W2, then it is still possible for
• Pages to be 1 or 2 in the final reward function, and
• Pages to be 0 or 1 for the precondition of S_W1.
We can, therefore, add the causal link S_IC –Pages→ S_W2.

FIGURE 29. The example after adding a link from S_IC to S_W2.
Step 5: Use Add-Link to repair (→Alert S_W1)

In a similar fashion, we resolve the open condition (→Alert S_W1) by adding the causal link S_IC –Alert→ S_W1.

FIGURE 30. The example after adding a link from S_IC to S_W1.
Step 6: Resolve the threat S_W2 ⊗ (S_IC –Alert→ S_W1) using Persist-Support

S_W2 threatens S_IC –Alert→ S_W1. This threat cannot be resolved by promotion or demotion because S_W2 is forced to occur between S_IC and S_W1. The only threat resolution operation is persist-support. We can use persist-support if it is possible for S_W2 to execute without changing the value of Alert. One of the conditional outcomes of write is ⟨(Alert = conscious), (Alert = conscious)⟩, so this is possible. Persist-support removes the threat to S_IC –Alert→ S_W1 by splicing S_W2 into the middle of this causal link, resulting in two new causal links: S_IC –Alert→ S_W2 and S_W2 –Alert→ S_W1. The old causal link, S_IC –Alert→ S_W1 (with its effectiveness constraint), is removed from the plan. Notice that persist-support also repairs the final open condition.

FIGURE 31. The example after persist-support.
At this point, there are no further flaws in the plan: every pertinent precondition is supported by a causal link and there are no further threats. The plan is complete.
FIGURE 32. The final complete plan.
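As a sanity check on this plan, we can compute its expected value by enumerating outcomes. The sketch below assumes, as in the earlier pertinence discussion, that a conscious Harvey produces a page with probability 0.8 on each write, that Alert persists throughout the plan, and that the two write outcomes are independent given Alert; these modeling choices are illustrative, not taken from the dissertation's evaluation code.

```python
from itertools import product

# Reward utility function from the example: $10 per page, up to 2 pages.
reward = {0: 0.0, 1: 10.0, 2: 20.0}
p_page = 0.8  # assumed P(write adds a page | Alert = conscious)

# Enumerate the four joint outcomes of the two write steps.
ev = 0.0
for w1, w2 in product([0, 1], repeat=2):
    prob = (p_page if w1 else 1 - p_page) * \
           (p_page if w2 else 1 - p_page)
    ev += prob * reward[w1 + w2]
```

Under these assumptions the plan's expected value is 0.32 × $10 + 0.64 × $20 = $16.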
Note that we should be able to derive a better plan if we are allowed to use contingent steps–steps whose execution is a function of the number of pages written thus far and Harvey’s level of alertness. Contingent steps and observations will be discussed in Chapter 5.0.
3.5 Approximating Effectiveness

Recall (Section 3.2.3) that UDTPOP uses effectiveness constraints to improve the efficiency of planning when plan steps have conditional effect distributions that are deterministic or near-deterministic. The definition of effectiveness relies on two concepts: possibility and pertinence. Possibility and pertinence are defined using probability queries on the model for a complete plan. In this section, I present a tractable technique for computing pertinence and possibility in partial plans that preserves completeness. This definition of pertinence and possibility is used by the constraint engine in both UDTPOP and DTPOP, as well as in the completeness proofs for both planners. The approximation for pertinence and possibility does not require the construction and evaluation of a belief network. Instead, pertinence and possibility are estimated by tracing through the nonzero probabilities in the conditional effect distributions.
3.5.1 Possibility
Every value for a precondition variable is assumed to be possible if that precondition is open. Otherwise, a precondition is possible if any of its establishing causal links make it possible. A conditional outcome of a step is possible if all of the precondition values in its trigger are possible.
Definition 11 (PossibleP Precondition): A precondition value V = v in step S is possibleP if:
• V is open, or
• there is a causal link S_E –V→ S and V = v is a possibleP effect of S_E.15

Definition 12 (PossibleP Effect): An outcome O = o of step S is possibleP if there exists a conditional outcome ⟨c, e⟩ of S, with O = o among its effects, such that ∀c_i ∈ c, possibleP(c_i).

3.5.2 Pertinence
Definition 13 (PertinentP Effect): An outcome V = v of step S is pertinentP when ∀(S –V→ S_C) ∈ L, V = v is a pertinentP precondition value of S_C.

Definition 14 (PertinentP Precondition): A precondition value, V = v, is pertinentP if either:
• V = v is a precondition of a cost function,
• V = v is a precondition of the reward function and there exists a utility outcome such that R > R_min, or
• V = v is a precondition of a conditional effect ⟨c, e⟩ and e is pertinentP.
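The possibleP calculation can be sketched as a recursive trace over causal links, using the same illustrative data structures as earlier blocks (the real constraint engine's representation is not specified here). An effect value is possibleP if some conditional outcome produces it and every value in its trigger is possibleP in turn.

```python
# possibleP: a precondition value is possible if it is open, or some
# establishing link can deliver it; an effect value is possible if some
# conditional outcome produces it and its whole trigger is possibleP.
# `outcomes[step]` maps a step to (trigger, effects) dict pairs; links
# are (establisher, var, consumer) triples. All names are illustrative.

def possible_precond(step, var, val, links, outcomes, open_conds):
    if (step, var) in open_conds:
        return True  # open preconditions are assumed possible
    return any(possible_effect(s_e, var, val, links, outcomes, open_conds)
               for (s_e, v, s_c) in links
               if v == var and s_c == step)

def possible_effect(step, var, val, links, outcomes, open_conds):
    for trigger, effects in outcomes[step]:
        if effects.get(var) != val:
            continue
        if all(possible_precond(step, t_var, t_val, links, outcomes,
                                open_conds)
               for t_var, t_val in trigger.items()):
            return True
    return False

# Go-North supported by an initial-conditions step that can only place
# the robot in c or d: Loc = b is then a possibleP effect, Loc = a is not.
outcomes = {
    "S_IC": [({}, {"Loc": "c"}), ({}, {"Loc": "d"})],
    "Go-North": [({"Loc": "d"}, {"Loc": "c"}),
                 ({"Loc": "c"}, {"Loc": "b"}),
                 ({"Loc": "b"}, {"Loc": "a"}),
                 ({"Loc": "a"}, {"Loc": "a"})],
}
links = [("S_IC", "Loc", "Go-North")]
```

The pertinentP trace runs the same way but forward along causal links, from cost and reward functions back into triggers.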
We prove in the Appendix (Section 1.2) that if an effect is possible, then it is certainly possibleP. Likewise, if a step is pertinent, it is also pertinentP.

Theorem 6 (PertinentP and Pertinence): In a complete plan, pertinent(e) ⇒ pertinentP(e) and possible(e) ⇒ possibleP(e).
15. V can be established by more than one link before threat resolution.
3.5.3 Effectiveness
Definition 15 (EffectiveP): In order for a step S_E to provide effectiveP support to another step S_C, there must exist some effective conditional outcome of S_E that is possibleP and supports a pertinentP precondition of S_C.

The completeness proof (Sections 3.7.2 and A.4.3) demonstrates that pertinentP and possibleP are sound for determining step effectiveness even when there are threats in the plan. This proof relies on a trick that will be used throughout this chapter: we can place an extra persistence constraint on the persist-support step without losing completeness. The constraint that we will place on threat resolution is the following:

Corollary 6 (Persist-Support Constraint): Say that we wish to resolve the threat S_T ⊗ (S_E –V→ S_C). We can add the following constraints after persist-support without compromising completeness: if ⟨c, e⟩ is an effective conditional outcome of S_T, and c is possible, then V = v cannot be a pertinentP precondition of S_C.

This is a restatement of the clairvoyant decision policy lemma (Lemma 3) used by the completeness proof. The basic idea behind the proof (the actual proof is quite a bit more complicated than this):

Suppose that S_T threatens S_E –V→ S_C. If S_T had an effective conditional outcome that was relevant to precondition variable V of S_C in some completion of the plan, then it would have been possible to use either add-link or add-step to establish the causal link between S_T and S_C. If we had drawn the links in this order when constructing the plan, there would be no threat to resolve. Without losing completeness, we can restrict persist-support so that the threatening step S_T will never contribute active support to S_C.
If this persist-support constraint is used, we can also guarantee that threat resolution will never increase the utility of a partial plan. This fact will allow us to derive an evaluation function (see the next section) for A* search that guarantees that we can always find the best plan in the search space.
3.6 Evaluating Plans

This section discusses three topics: 1) the basic search algorithm, 2) evaluation and model construction for complete plans, and 3) model construction and evaluation of partial plans.
• Section 3.6.1 describes the basic UDTPOP search algorithm.
• Section 3.6.2 presents an algorithm for constructing a belief network for a complete plan and demonstrates that the expected value computed from this belief network is exactly the same as the expected value of the plan.
• Section 3.6.3 presents a model construction algorithm for partial plans that uses interval probabilities to represent open conditions. The upper bound on expected value calculated from this model is an upper bound on expected value for any essential completion of a partial plan.
3.6.1 Search

UDTPOP uses best-first search (with pruning16) over the space of partial plans in order to identify the plan of highest possible utility (Figure 33). This routine selects and prunes partial plans according to an upper (UB) and a lower (LB17) bound on the expectation of the best completion of each partial plan. The upper bound is used to select plans to expand: if the upper bound is guaranteed to be at least as large as the expected value of the best completion of that partial plan, then the search algorithm is guaranteed to identify the plan of maximum expected value. One of the central objectives of this section is to identify an evaluation function for partial plans that is a tight upper bound on the utility of the best completion of the partial plan. The optional lower bound is used to prune plans from the best-first search queue. If we can prove that there is at least one plan with an expected value of LB, we can eliminate from consideration all partial plans that have an upper bound on value, UB, that is less than LB.

16. This is the search strategy proposed for Pyrrhus [Williamson, 96].
17. This is not the expected value of the worst plan. LB is the best expected value that can be guaranteed for some completion of the partial plan.
A lower bound can be computed by finding any completion of any partial plan. The upper bound is trickier and will be discussed in Section 3.6.3.

Let GLB := V(P_∅)   (Optional)
Let Open := { P_Initial }
Loop {
  If Open = ∅ then return ∅
  Remove P_best from Open, s.t. P_best has the largest UB.
  If P_best is complete, stop and return P_best.
  If LB(P_best) > GLB then GLB := LB(P_best)   (Optional)
  Let P be the set of plans that result from fixing one flaw in P_best.
  Let Open := Open ∪ P.
  Prune the plans in Open that have an upper bound that is less than or equal to GLB.   (Optional)
}

FIGURE 33. Best First Search. P_∅ is the empty plan consisting of S_IC, S_Goal, and causal links for all of the preconditions of S_Goal. GLB is the largest lower bound on utility.
If pruning is critical (because of memory limitations, for example), it may make sense to set GLB to a higher value to force more pruning (this sacrifices completeness).
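The loop of Figure 33 can be sketched in a few lines of Python. The `ub`, `lb`, `is_complete`, and `expand` callables below are hypothetical stand-ins; in UDTPOP the bounds come from the model construction of Sections 3.6.2 and 3.6.3, and `expand` is the flaw-repair step.

```python
import heapq

def best_first(initial, ub, lb, is_complete, expand):
    """Return a best complete plan, or None. For the result to be
    optimal, ub(p) must upper-bound the value of every completion."""
    glb = float("-inf")                # largest lower bound seen (GLB)
    tie = 0                            # tiebreaker so plans never compare
    open_q = [(-ub(initial), tie, initial)]  # max-UB via negated heap
    while open_q:
        neg_ub, _, plan = heapq.heappop(open_q)
        if is_complete(plan):
            return plan
        if -neg_ub <= glb:
            continue                   # pruned by the global lower bound
        glb = max(glb, lb(plan))
        for child in expand(plan):     # fix one flaw in every way
            tie += 1
            heapq.heappush(open_q, (-ub(child), tie, child))
    return None

# Toy domain: a "plan" is a tuple of up to two 0/1 choices; the value
# of a complete plan is the sum of its entries.
is_complete = lambda p: len(p) == 2
lb = lambda p: sum(p)                  # complete the plan with zeros
ub = lambda p: sum(p) + (2 - len(p))   # optimistically add ones
expand = lambda p: [p + (0,), p + (1,)]
best = best_first((), ub, lb, is_complete, expand)
```

Because UB is admissible in the toy domain, the search returns the maximum-value completion without ever expanding the pruned branch.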
3.6.2 Model Construction in Complete Plans
Belief networks are used for plan evaluation [Pearl, 87]. The causal links, cost functions, and conditional effect distributions in the plan imply a belief network that models all of the essential distributions in the plan. This belief network comprises probability nodes representing conditional effect distributions, deterministic nodes18 representing distributions over individual effect variables, and additive subvalue nodes19 [Tatman+Shachter, 90] representing the cost and reward functions. This network is constructed in a fairly straightforward fashion by tracing causal links back from the final reward function and step cost functions in the plan. If there is a causal link protecting Xi between two steps SEstablisher and SConsumer, then there are one or more arcs from variable Xi(SEstablisher+) to conditional effect distributions or cost functions in SConsumer.
An algorithm that constructs a belief network from a complete plan is illustrated in Figure 34. There are two global variables: the plan P and a belief network M = ⟨N, A⟩ consisting of distributions N and arcs A. The model construction algorithm is started by calling Model_CE on the plan's goal step and reward function (Model_CE(R, SGoal)). The algorithm includes a conditional effect distribution only if it is a requisite element for calculating the expectation of any of the value functions in the plan [Shachter, 88; 98] (see also Section 4.2).
18. A deterministic node is a probability node that has a probability distribution that consists entirely of 1's and 0's. A double oval (two concentric ovals) will be used to denote a deterministic node in a belief network.
19. Value nodes will be denoted by diamonds in belief networks.
Model_CE( CE, Step ) {
    if CE ∉ N then {
        N := N + CE
        for all V ∈ PreVars(CE) {
            let SE →V Step ∈ L be the causal link protecting V
            for all CostE ∈ Cost(SE), Model_CE( CostE, SE )
            find CEE ∈ Eff(SE) such that V ∈ OutVars(CEE)
            Model_CE( CEE, SE )
            if V(SE+) ∉ N then {                                              *
                E := EffVars(CEE)                                             *
                V(SE+) := the distribution P{ V(SE+) = v | V = v, E\V } = 1.0 *
                N := N + V(SE+)                                               *
                A := A + ( CEE, V(SE+) )                                      *
            }
            A := A + ( V(SE+), CE )
        }
    }
}

FIGURE 34. Model_CE. An algorithm for constructing a belief network from a complete UDTPOP plan.
Each conditional effect distribution of a step may contain more than one effect variable. If a conditional effect distribution contains more than one variable, then the starred lines of Figure 34 establish intermediate deterministic nodes that each represent a single effect variable of the conditional effect distribution. For example, if a distribution over variable V is needed and the belief network contains only a conditional effect distribution over U and V, then the starred lines of Model_CE will add a fragment consisting of an arc from the node UV to a new deterministic node V, extracting variable V from UV.
Figure 35 illustrates the belief network that Model_CE constructs for the final plan in the example of Section 3.4. Since every conditional effect distribution in StressWorld has a single effect variable, Model_CE does not need to add any deterministic nodes to the plan model.
FIGURE 35. A Model for a Complete Plan.
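The tracing performed by Model_CE can be sketched as a recursive traversal. This is a hedged sketch, not the dissertation's code: the plan interface (`pre_vars`, `causal_link`, `costs`, `effect_for`) is invented for illustration, and the model is a plain pair of Python sets.

```python
def model_ce(ce, step, plan, model):
    """Recursively add to `model` the distributions needed to evaluate `ce`.

    `model` is a pair (nodes, arcs).  The plan interface used here is
    hypothetical: plan.causal_link(v, step) returns the establishing step of
    the link protecting v, plan.effect_for(s, v) returns the conditional
    effect distribution of s with v among its output variables, and
    plan.costs(s) returns the cost functions of s.
    """
    nodes, arcs = model
    if ce in nodes:
        return                                  # already modeled; nothing to do
    nodes.add(ce)
    for v in plan.pre_vars(ce):
        s_e = plan.causal_link(v, step)         # establisher of the link protecting v
        for cost in plan.costs(s_e):
            model_ce(cost, s_e, plan, model)    # cost functions are always requisite
        ce_e = plan.effect_for(s_e, v)
        model_ce(ce_e, s_e, plan, model)
        marginal = (v, s_e)                     # deterministic node for V(SE+)
        if marginal not in nodes:
            nodes.add(marginal)                 # extract v from a multi-variable effect
            arcs.add((ce_e, marginal))
        arcs.add((marginal, ce))                # arc V(SE+) -> consumer's distribution
```

The membership test on `nodes` plays the role of the `CE ∉ N` guard in Figure 34, so shared substructure is added only once.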
Let’s step through Model_CE to see how it constructs this plan model.
FIGURE 36. Model_CE(RGoal, SGoal).
Initial Call: Model_CE(RGoal, SGoal), M = ⟨∅, ∅⟩. M is empty, so Model_CE adds the distribution RGoal to N. Model_CE loops over all of the precondition variables in RGoal looking for causal links. There is only one such variable (Pages), which is established by a distribution over Pages(SW1+) in step SW1. Model_CE recurses, calling Model_CE(Pages(SW1+), SW1) in order to construct a node for Pages(SW1+) so that Model_CE can add an arc between Pages(SW1+) and RGoal.

Model_CE(Pages(SW1+), SW1): Pages(SW1+) is not in the network, so Model_CE adds it. Model_CE then constructs a model for the first precondition of SW1, Pages.
FIGURE 37. Model_CE(Pages(SW1+), SW1).
Model construction continues in this depth-first fashion until Model_CE encounters a node with no preconditions. At this point, the stack begins to unwind: as each call to Model_CE exits, Model_CE adds an arc between the newly constructed node and the node in the argument to Model_CE. If precondition Pages always comes before Alert in the preconditions list, the sequence of model construction actions will evolve as shown in Figures 37 and 38.
FIGURE 38. Trace of Model_CE. (Panels: Model_CE(Pages(SW2+), SW2); Model_CE(Pages(SIC+), SIC); Model_CE(Alert(SIC+), SIC); Return from Model_CE(Pages(SW2+), SW2).)
FIGURE 39. Trace of Model_CE continued. (Panels: Model_CE(Alert(SW2+), SW2); Model_CE(Alert(SIC+), SIC); Return from Model_CE(Alert(SW2+), SW2); Return from Model_CE(Pages(SW1+), SW1).)
FIGURE 40. Trace for Model_CE completed. (Panel: Return from Model_CE(RGoal, SGoal).)
3.6.3 Model Construction in Partial Plans
The goal of model construction in partial plans is to derive an upper bound on the utility of the partial plan, so that any completion of the partial plan has utility less than or equal to this bound. When the upper bound has this property, we can guarantee that the best possible plan will be found by using the upper bound to guide best-first search.

In Section 3.5, we described a technique for discarding sections of the search space without compromising completeness. If we can use this persist-support constraint to discard a partial plan, we will say that the completions of that partial plan are not essential. An upper bound is admissible if the upper bound on the utility of each partial plan is at least as large as the expected value of any of the essential completions of that partial plan. In order to derive this admissible upper bound, we will need to study carefully the effect of threats and open conditions on an evolving model of the plan.

The next three sections show how UDTPOP estimates the upper bound on expected value for a partial plan. Modeling open conditions appears to be easy: we can replace all of the open conditions with decisions and identify the best possible combination of decision values to maximize the expected value of the completed portions of the plan. Section 3.6.3.1 demonstrates that
this intuitive approach is wrong! Section 3.6.3.2 reviews a probability interval calculus developed by Draper [96] and demonstrates that probability intervals are, in fact, the correct method to use for modeling open conditions. Section 3.6.3.3 describes how to properly model and evaluate partial plans with threats.
3.6.3.1 Modeling Open Conditions
One plausible way to model open conditions is to model them as decisions. This seems reasonable because, after all, if we could choose the optimal value for each of these open condition decisions, then we should be able to realize an expected value that is higher than if these decision variables were replaced by any other joint probability distribution. This (incorrect) argument is illustrated in Figure 41. Say that the set of open conditions is O and that the expected utility of the plan given any value for O is V{O}. It is always the case that maxO V{O} ≥ ΣO P{O}·V{O}.20 Furthermore, the contribution due to the cost functions of each new step can only reduce the expected utility; only the UDTPOP reward function can have a utility that is greater than zero. It seems plausible, then, that the expected value of every completion of the partial plan must be less than or equal to maxO V{O}. This argument, however, is incorrect. What is the flaw with this argument?
20. The second expression is a convex combination of the conditional utilities V{O}, the largest of which is equal to maxO V{O}; therefore the sum cannot be larger.
FIGURE 41. A fallacious argument for using decisions for representing open conditions. The figure on the left represents the model for a partial plan. Open conditions are all replaced by decisions. By picking appropriate values for these decisions, we can always achieve a utility that is at least that of replacing the decisions by any joint distribution (including that represented by a plan) over the open conditions.
The problem with this argument is this: the add-link operation can be used to link the preconditions of steps to other steps that already exist in the plan. This, in effect, can allow the value for the precondition decision Oi to be conditioned on the outcomes of existing steps in the plan.

A simple example will illustrate why the naive decision approach fails. Suppose that our goal is to find a plan that sets binary variable A equal to the value of binary variable B. The reward function is 1 if A is equal to B and is zero otherwise. There is one action schema in this problem domain, SA. SA has no preconditions and has two conditional outcomes: A = B = true with probability 0.5 and A = B = false with probability 0.5. The cost for SA is 0.
FIGURE 42. A simple partial plan that breaks the ‘straw’ model construction algorithm.
The partial plan in Figure 42 illustrates why the decision approach will not work. If we substitute a decision D for the open condition B on SGoal, then we compute an expected value of 0.5 regardless of the value of D (because of the uncertainty in the outcome of SA). But note that add-link can link precondition variable B of SGoal to the B outcome variable of SA. This completion of the plan has an expected value of 1.0, double that computed by replacing B with a decision (Figure 43).
FIGURE 43. Using SA to support both preconditions maximizes utility. Since A is always equal to B in SA, using SA to support both preconditions of SGoal results in a plan that has an expected utility of 1.0.
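The two expected values in this example are easy to check numerically. The sketch below hard-codes the two equally likely outcomes of SA assumed above; it is an illustration, not part of UDTPOP.

```python
# SA's two equally likely outcomes: A and B are set together.
outcomes = [((True, True), 0.5), ((False, False), 0.5)]

# Decision model: the open condition B on SGoal is replaced by a fixed
# decision d, independent of SA.  Reward is 1 exactly when A == B.
def value_with_decision(d):
    return sum(p * (1.0 if a == d else 0.0) for (a, _b), p in outcomes)

best_decision_value = max(value_with_decision(d) for d in (True, False))

# Real completion: both preconditions of SGoal are linked to SA's outcome.
linked_value = sum(p * (1.0 if a == b else 0.0) for (a, b), p in outcomes)

print(best_decision_value)  # 0.5 -- no fixed choice of B can beat a coin flip
print(linked_value)         # 1.0 -- A always equals B inside SA
```

The gap between 0.5 and 1.0 is exactly the failure mode described above: a decision cannot be conditioned on SA's outcome, but an add-link completion can.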
3.6.3.2 Modeling Open Conditions Using LPE
The right way to model open conditions21 is to use interval probabilities to model the plan. Localized partial evaluation (or LPE) [Draper, 96] is a family of techniques for approximating multiply-connected belief networks. Using interval arithmetic, LPE computes bounds on the joint and conditional probability distributions for subgraphs (called active sets) of belief networks. The bounds achieved using only the active sets are guaranteed to contain the joint or conditional distribution that would result from the evaluation of the complete belief network.

Figure 44 and Figure 45 illustrate the application of LPE to a simple inference problem. The belief network on the left in each figure represents the original multiply-connected belief network. The middle figure highlights the active set that will be used to approximate the belief network. The right figure illustrates the interval belief network that the LPE algorithm constructs to bound the probability distributions. The contribution that the “nonactive” set makes to the active set is summarized using a set of vacuous probability distributions (distributions that contain probabilities that are bounded only to be anywhere between zero and one). For every arc that leaves the active set, LPE adds a vacuous likelihood distribution that is conditioned on the node at the tail of the deleted arc. For every arc that enters the active set, LPE adds a vacuous prior distribution with the same number of states as the deleted node. Given any belief network and any active set, the LPE algorithm computes a set of bounds on any probability query that is guaranteed to contain the value of the query that would be computed using the full belief network.

In Figure 44, we illustrate the effect of approximating the belief network on the left with the active set in the middle. This active set consists only of the nodes R and B. The space of all possible contributions from the belief network outside the active set is summarized by two vacuous prior distributions representing the missing predecessor nodes, A and C. Say that we want to find a bound on Pleft{R = r}.22 The probability of this query is bounded above and below by the LPE bound computed from the interval belief network at the right of this diagram, that is, Pleft{R = r} ∈ Pright{R = r}.
21. Actually, given a topological sort of the plan, it is possible to draw informational arcs from each of the effects in the partial plan to each of the open condition decisions that can possibly occur after that effect. In order to compute an upper bound, we would need to evaluate one of these influence diagrams for every topological sort of the steps that either contain an open condition or are possibly before a step containing an open condition.
FIGURE 44. LPE Example I.
Figure 45 is an example using a different active set. This active set is missing only one arc (the arc between A and C). The contributions from A to C and from C to A are summarized using a vacuous prior and a vacuous likelihood distribution, respectively.
22. In this section, we will use a subscript on P to denote the belief network (not the active set) that is used to compute the query; for example, we might write PA{X | Y} to denote the conditional probability distribution over X given Y using belief network A. When the active set is a subset of the full belief network, PA{X | Y} is a probability interval. We will use subset notation to compare probability intervals of interval or exact belief networks: P1 ∈ P2 means that exact distribution P1 is contained inside the interval distribution P2, and P1 ⊆ P2 means that the interval distribution for P1 is contained inside P2. When we need to refer to the upper bound of a probability interval, we will use the notation P̄A{x}; the lower bound for the same query is P̱A{x}.
FIGURE 45. LPE Example II.
It is not necessary to know the details of LPE in order to understand the remainder of the chapter. The following points, though, are important:
• If an arc leaves the active set in the LPE approximation, we replace that arc with an arc to a vacuous likelihood distribution.
• If an arc enters the active set, we replace the arc with an arc from a vacuous prior distribution.
• Adding a vacuous likelihood λ′(· | X) = [0, 1] to node X in interval belief network α results in a set of bounds over probabilities that is at least as large as the bounds computed from the original belief network: Pα{X | Y} ⊆ P[α + λ′ + (X → λ′)]{X | Y}.
• If there is a likelihood λ′(· | X) = [0, 1] in belief network α, then P{X} = [0, 1] (a vacuous distribution). [Draper, 96]
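For the simplest case, a single query node whose only parent has a vacuous prior, the LPE bound reduces to the minimum and maximum of the CPT entries, since any prior over the parent yields a convex combination of them. A small sketch (the CPT and the "true" prior are invented for illustration):

```python
def bounds_given_vacuous_prior(cpt):
    """Bounds on P{R = 1} when R's only parent X has a vacuous prior [0, 1].

    cpt[x] is P{R = 1 | X = x}.  Any normalized prior over X yields a convex
    combination of these entries, so the bounds are simply their min and max.
    """
    return min(cpt.values()), max(cpt.values())

cpt = {0: 0.2, 1: 0.9}           # P{R = 1 | X}
lo, hi = bounds_given_vacuous_prior(cpt)

# Whatever the deleted part of the network says about X, the exact query is
# guaranteed to fall inside the LPE-style interval.
true_prior = {0: 0.3, 1: 0.7}    # a prior the full network might imply
exact = sum(true_prior[x] * cpt[x] for x in cpt)
print((lo, hi))                  # (0.2, 0.9)
print(round(exact, 2))           # 0.69
assert lo <= exact <= hi
```

This is the containment guarantee stated above in miniature: the interval computed without the non-active part of the network always contains the exact answer.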
Now, let’s get down to the business of using this machinery to model open conditions. Our strategy will be to assume that if a precondition is open, then any step that can possibly precede it might conceivably form the root of a chain of steps that eventually supports the open precondition. The partial plan can then be modeled as an active set of the complete plan. Since there are possibly several (unknown) completions
for each partial plan, the model for the partial plan needs to be the union of the active sets for all possible completions. We are, in essence, pretending that the partial plan model is an active set of a larger belief network that contains arcs and distributions that correspond to plan construction operations that occur deeper in the planner’s search space.23

In figures 46 through 48, we illustrate that the model for a partial plan is an active set of one of the completions of that plan (neglecting threats). An interval belief network can be constructed for each of these active sets that consists of
• probability nodes for each of the conditional effect distributions in the partial plan that is connected to a cost or reward function via a set of causal links,
• deterministic nodes for each of the utility and cost functions in the partial plan,
• vacuous probability distributions for each of the open conditions, and
• vacuous likelihood functions for each probability distribution that contributes a causal link in the complete plan but not in the partial plan.

In the following figures, we will simplify the plan model so that the relationship between the plan model (a belief network) and the plan itself (a partial-order graph of steps) is more obvious. We will assume, for example, that each step has only one conditional effect distribution with one outcome and that there is only one utility subvalue function (the reward) in each plan. The names used for each conditional effect distribution will be the same as the step that defines it.
23. Actually, this is not quite accurate: persist-support can delete arcs from the belief network used for the partial plan. The admissibility proof (Appendix A.5) demonstrates that we can ignore threats when we are computing an upper bound on the plan’s utility.
FIGURE 46. Active Set 1. Diagram A illustrates the original partial plan. This partial plan has a single flaw: an open condition on SB. A single arc from SIC to SB is used to complete the plan (B). The full model for B is shown in Figure C. The dotted region denotes the active set of the complete model that corresponds to the partial plan shown in Figure A. An interval belief network (D) can be constructed from the active set in C. Construct this belief network by replacing every arc that crosses the boundary of the active set with either:
• a vacuous likelihood distribution P{· | X} = [0, 1], if the arc leaves the active set; or
• a vacuous prior probability distribution P{X} = [0, 1], if the arc enters the active set.
The “plaid” nodes in this diagram denote vacuous distributions. The bounds on the expected utility of the reward node computed using network D contain the expected utility calculated from the full belief network (C).
Figures 46-48 each contain a partial plan, a completion of that plan, the model for the completion, and the LPE approximation of the model for the completion using only the nodes in the model for the partial plan. Figures 46.a-48.a all illustrate the same partial plan. Figures 46.b-48.b each illustrate one possible completion of this partial plan. A simplified version of the plan model computed by Model_CE is shown in each of Figures 46.c-48.c. The lightly shaded region illustrates the active set that corresponds to the distributions and arcs that are contained in the original partial plan. Figures 46.d-48.d depict the interval belief networks that represent the active sets shown in 46.c-48.c.
FIGURE 47. Active Set 2. In this figure, the open condition in A is repaired by inserting a step SD that, in turn, links to SA.
FIGURE 48. Active Set 3. Any number of steps might be added to complete the partial plan. These steps can link to any number of conditional effect distributions scattered throughout the steps of the plan that can possibly precede the steps with the open preconditions. In this example, a 2-step subplan is used to satisfy the open condition. This subplan is conditioned on the outcome variables of both SA and SIC.
In each case, note that a vacuous prior distribution replaces the open condition. In each case, one or more of the nodes that can possibly precede the open condition have vacuous likelihoods. This observation leads to our strategy:
1. Replace every open condition with a vacuous prior probability distribution.
2. If a step SA possibly occurs before any step containing an open condition, attach a vacuous likelihood distribution to each of the relevant conditional effect distributions in SA.
This strategy is illustrated in Figure 49.
FIGURE 49. Modeling every completion of a partial plan.
3.6.3.3 Modeling Threats
In order to properly model unresolved threats, we will take advantage of the “clairvoyant persist-support constraint”. We will use this constraint to argue that threat resolution steps can only reduce the upper bound on the essential completions of the partial plan.

Imagine that a plan is complete except for some number of unresolved threats. If these threats are resolved via promotion or demotion, then there will be no impact on expected utility. The only way to change the expected utility of the plan is to change the causal link structure. The only threat resolution operation that can change the causal structure of the plan is persist-support.

The clairvoyant persist-support constraint says that persist-support will never cause a threatening step ST to provide effective support to the condition protected by the threatened link SE →V SC. This means that if ST contains an effective conditional outcome establishing V = v that is possible, then V = v cannot be a relevant precondition value for SC. This
has two implications:
• First of all, passive conditional outcomes have no effect on the variable V and, therefore, have no direct effect on the utility or cost functions in the plan.
• Say that ST has effective conditional outcomes over variable V that are possible. It cannot be the case that all of the values for variable V are relevant, since otherwise ST would be providing effective support to V. Since all of the variable values for cost functions are relevant, this implies that the causal link only impacts the reward function. All of the possible effective conditional outcomes of ST support precondition values v that are not pertinent and, therefore, the outcomes are not pertinent. This, in turn, implies that these outcomes can only “transfer” probability mass from the superior alternatives of the reward function to the worst possible outcome.
Thus, if we assume the persist-support constraint of Section 3.5.3, threat resolution can have no effect on the expected values of the cost functions that are present in the plan prior to threat resolution. Furthermore, threat resolution cannot increase the expected value of the reward function.

Persist-support may force the planner to search for support for the preconditions that become relevant because they condition the passive conditional outcomes used by persist-support. The planner finds support for these open conditions either by using steps whose only purpose is to support these “secondary preconditions” or by using steps that have other purposes in the plan. In the first case, the utility of the plan can only decrease because every cost function is strictly positive. In the second case, using the outcomes of steps that are already in the plan cannot decrease the cost of the plan. Thus threat resolution either has no impact on the utility of the plan or reduces the utility of the plan. (A detailed proof is provided in Appendix A.)
3.6.3.4 Model Construction and Evaluation for Partial Plans
The discussion in the last two sections suggests a rather surprising model construction and evaluation algorithm for providing an upper bound on the utility of the essential completions of any partial plan. This algorithm, EvaluateUB, consists of two phases. In the first phase, Model_CE2 constructs a single model for the entire plan. In the second phase, UB crafts a subgraph of this model in order to evaluate an upper bound for the expectation of the reward function and the expectation of each of the cost functions in the plan. The LPE-based model construction algorithm is illustrated in Figures 50 through 52.24

Model_CE2 (Fig. 51) is identical to Model_CE (Fig. 34) except for the starred lines. The starred lines do the following:
• If a precondition is open, the distribution over the precondition variable is modeled using a vacuous probability interval. This vacuous probability interval indicates that the precondition can have any prior probability in the interval [0, 1].

EvaluateUB calls UB to individually evaluate each utility function in the plan. UB adds vacuous likelihood nodes to the network to model possible sources for causal links to open conditions in the plan. UB prunes the network constructed by Model_CE2 to the smallest network that can be used to evaluate a specific reward or cost function. Every vacuous distribution in an interval belief network tends to increase the width of all of the interval probability queries that can be computed from the belief network. If we can prune extra vacuous distributions from the belief network, we can often find an interval around the expectation of the subvalue node or cost function that is significantly tighter than if we had not pruned the network.
24. UDTPOP never actually needs to construct the plan model. The model construction algorithm describes how the model is “hidden” inside of the plan; the plan evaluator can compute the utility of the plan directly from the structure of the plan itself rather than by constructing an independent belief network.
EvaluateUB( Plan ) {
    Let M := ⟨∅, ∅⟩                     // the model (a global variable)
    Util := R ∪ ⋃Si ∈ S { –Cost(Si) }
    Model_CE2( R, SGoal, Plan )         // modifies M
    return ΣUi ∈ Util UB( Ui, M )
}

FIGURE 50. EvaluateUB.
Model_CE2( CE, Step, Plan ) {
    if CE ∉ N then {
        N := N + CE
        for all V ∈ PreVars(CE) {
            if ∃ ( SE →V Step ∈ L ) then {
                for all CostE ∈ Cost(SE), Model_CE2( CostE, SE, Plan )
                find CEE ∈ Eff(SE) such that V ∈ OutVars(CEE)
                Model_CE2( CEE, SE, Plan )
                if V(SE+) ∉ N then {
                    E := EffVars(CEE)
                    V(SE+) := the distribution P{ V(SE+) = v | V = v, E\V } = 1.0
                    N := N + V(SE+)
                    A := A + ( CEE, V(SE+) )
                }
                A := A + ( V(SE+), CE )
            } else {                                          // V is an open condition
                if ∃ a probability node for precondition var V of Step then {      *
                    let OV := that node                                            *
                } else {                                                           *
                    let OV := a probability node w/ states Domain(V)               *
                              // a vacuous distribution: P{ OV } = [ 0, 1 ]        *
                    N := N + OV                                                    *
                }                                                                  *
                A := A + ( OV, CE )
            }
        }
    }
}

FIGURE 51. Model_CE2. Model_CE2 is identical to the Model_CE algorithm except for the starred lines. These lines add a vacuous distribution for each open precondition.
UB( Ui, M = ⟨N, A⟩ ) {
    Let M′ = ⟨N′, A′⟩ be the subgraph of M that is in the Np set of Ui [Shachter, 98].
    For all Si ∈ S {
        if ∃ some Sj such that Sj is possibly after Si and Sj has an open precondition then {
            Let NS be the set of nodes in N′ that correspond to conditional effect distributions of Si.
            For all Nk ∈ NS {
                Let X be a vacuous likelihood function P{ · | Nk } = [ 0, 1 ]
                N′ := N′ + X
                A′ := A′ + ( Nk, X )
            }
        }
    }
    Use LPE to compute the posterior bounds P̄M′{ Ui }.
    Let uMax := maxu ∈ Domain(Ui) u
    return max( uMax, Σu ∈ Domain(Ui) u · P̄M′{ Ui = u } )
}

FIGURE 52. UB. UB first extracts the subset of the belief network that is relevant to Ui (remember that if Ui corresponds to a step cost function, then Ui = –Ci). After extracting this subset, UB adds a vacuous likelihood to each distribution that might possibly condition one of the open precondition variables, either directly or indirectly. LPE uses this interval belief network to compute an upper bound on the expectation for Ui.
Theorem 7 (Upper Bound): If there is only one source of support for every precondition in plan P, then EvaluateUB(P) computes an upper bound, UB(P), on expected utility, and this bound is admissible. When the plan is complete, the expected value of the plan V(P) and the upper bound UB(P) are equal.
Proof
See Appendix A.5.
3.6.3.5 Persist-Support
Appendix A proves that the evaluation algorithm illustrated in Figures 50-52 computes an admissible upper bound for any partial plan as long as each open condition is established by at most one causal link. Unfortunately, persist-support can sometimes cause an open condition to have multiple sources of support (let’s call this the dual support problem). Say that ST threatens SE1 →V SC and there is already a causal link SE2 →V ST (see Fig. 53). After persist-support resolves the threat, there will be TWO causal links supporting the same precondition.
FIGURE 53. Persist-support can cause dual support.
This is a temporary condition because the establishers for the dual links threaten each other’s links; future threat resolution activity will eventually resolve these flaws. Unfortunately, even temporary dual support introduces a difficult utility evaluation problem. Fortunately, it is possible to avoid dual support by selecting an appropriate sequence for resolving the threats. Theorem 8 proves that it is always possible to find a threat to resolve that will not introduce dual support. Theorem 8 (Guaranteeing Single Support): It is always possible to order threat
resolution operations so that no precondition will ever be established by more than one causal link.
Proof: We will show that there exists a threat ST ⊗ (SE →V SC) such that either:
1. the V precondition of ST is open, or
2. there is a causal link SE →V ST.
In either case, resolving the threat does not introduce dual support.

Find any topological sort of the steps in the plan. One or more of these steps will threaten a causal link. Pick the earliest of these threatening steps in the topological sort. This step may threaten one or more causal links. We will select T = SFirst ⊗ (SE1 →V SC) such that the establisher SE1 is at least as early in the topological sort as the establishers of the other causal links threatened by SFirst.

There are three ways to resolve this threat. Promotion and demotion do not affect the causal links in the plan, so they cannot introduce dual support. Persist-support can only cause dual support when there exists a causal link SE2 →V SFirst and SE1 ≠ SE2, so we only need to consider this case.25 Now we know that SE2 cannot threaten SE1 →V SC because SE1 is before SFirst in any topological sort and SFirst is, by construction, the earliest threatening step. We know that SE1 must threaten SE2 →V SFirst because
1. SE1 > SE2 (otherwise SE2 ⊗ (SE1 →V SC), a contradiction) and
2. it must be possible for SFirst > SE1 because SFirst ⊗ (SE1 →V SC).
We can resolve SE1 ⊗ (SE2 →V SFirst) by any means because there can be no causal link S →V SE1. If this causal link did exist, then it would also be the case that SE1 < SFirst (because otherwise SFirst would threaten S →V SE1, and S is before the earliest establisher, SE1). This, in turn, leads to a contradiction: SE1 ⊗ (SE2 →V SFirst) and SE1 is before SFirst in any topological sort of the steps. (This situation is illustrated in Figure 54.)

25. UDTPOP treats two causal links as being instances of the same link if all of the particulars of the causal link (the establisher, consumer, and variable) are the same.
FIGURE 54. The contradiction in the proof for Theorem 8. (The link S →V SE1 cannot exist.)
3.6.4 The Implementation of the Evaluator used in UDTPOP-B
Any belief network algorithm can be used to evaluate a complete plan model. The algorithm used by UDTPOP-B (the variant of UDTPOP used in Section 3.8) constructs a simple join tree [Jensen, et al., 90a, 90b] directly from a topological sort of the partial plan. Given a topological sort Q = { S0, …, Sn } of the plan, we can define a function α(k) that represents the set of variables that are protected by causal links that run from steps in { S0, …, Sk } to steps in { Sk+1, …, Sn }. Let β(k) denote the union of the cost functions for Sk and the conditional effects that Sk uses to establish causal links in the plan. Let Pre(N) be a function that extracts all of the preconditions of all of the conditional effects in N. Likewise, Eff(N) returns all of the effect variables in N. The UDTPOP-B evaluator implicitly constructs a cluster tree that is much like the cluster tree illustrated in Figure 55, below.26
FIGURE 55. The cluster tree constructed implicitly by the UDTPOP evaluator. The clusters form a chain: Out(β(0)), then Pre(β(k)) ∪ Out(β(k)) for each step S_k, ending with Pre(R); adjacent clusters are joined by separators built from the protected-variable sets, e.g. α(0), α(1) ∪ α(2), …, α(k) ∪ α(k + 1).
This cluster tree is far from optimal in terms of total cluster size, but UDTPOP can construct and evaluate this cluster tree in milliseconds for small plans. Compressing the zeros out of the potentials would also significantly improve performance [Jensen+Andersen, 90; Kushmerick, 95].
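The evaluator's sweep over the topological sort can be sketched as follows. This is a minimal illustration of the idea, not the UDTPOP-B implementation; the state representation and function names are invented for the sketch.

```python
def evaluate_plan(steps, initial, reward):
    """Sweep a topological sort of the plan, keeping a distribution over
    only the variables still protected by causal links that cross the
    current cut (the separator alpha(k) in the text).

    `initial` maps states (frozensets of (var, value) pairs) to
    probabilities.  Each step is a pair (transition, separator):
    transition(state) returns a dict of successor states with their
    probabilities, and `separator` is the set of variables needed by
    later steps or the reward.  Returns the expected reward."""
    dist = initial
    for transition, separator in steps:
        nxt = {}
        for state, p in dist.items():
            for succ, q in transition(state).items():
                # Sum out every variable that no later factor mentions.
                key = frozenset((v, x) for v, x in succ if v in separator)
                nxt[key] = nxt.get(key, 0.0) + p * q
        dist = nxt
    return sum(p * reward(state) for state, p in dist.items())
```

For example, a single step that sets a variable A to true with probability 0.5 yields an expected reward of 0.5 against the reward "1 if A is true"; intermediate distributions never grow beyond the separator variables.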
3.7 Formal Properties

Section 3.7 explores some of the formal properties of UDTPOP. Section 3.7.1 argues that the algorithm is sound by demonstrating that:
1. If a plan is complete, the plan model constructed by UDTPOP is identical to the Markov model implied by the individual step distributions.
2. UDTPOP only returns complete plans; therefore, UDTPOP is sound.
Section 3.7.2 argues that the algorithm is complete by showing that:
1. Optimal plans can't include ineffective steps.
26.The details of the algorithm are similar to the bucket-elimination algorithm [Dechter, 96].
2. Every linear series of effective steps can be represented as a UDTPOP plan graph.
3. Every optimal plan is in the search space of UDTPOP, except possibly the empty plan.

3.7.1 Soundness
We define UDTPOP to be sound if every plan that it returns is complete and the underlying plan model is correct: the model always computes the correct utility for the plan. In order to prove soundness, we will show that all of the "essential parts" of the Markov model representing a particular step sequence are captured in the plan model generated for the equivalent UDTPOP plan. The "essential parts" of the Markov model are the distributions required for evaluating the expectation of each of the additive subvalue nodes. Shachter [88, 90, 98] defines a function, N_p,27 that determines the set of distributions required for answering any conditional probability query on a belief network. We will show that the belief network constructed by UDTPOP is equivalent to the N_p set of the Markov model implied by the individual step distributions.

Definition 16 (N_p): [Shachter, 88; 98] N_p(J | E, M) is the set of requisite probability elements for J given E in belief network M. This set contains all of the conditional probability distributions that are required for computing a joint probability distribution for the variables in J given the evidence variables E.

Theorem 9 (N_p(J | ∅, M)): [Shachter, 88; 98] When no observations have been made (E is empty), the set of requisite probability elements required for computing the joint probability over J is the union of J and all of the ancestors of the nodes in J. We will abbreviate this function N_p(J, M).
27. The set of relevant probability elements N_p for a query used to be called N_π.
3.7.1.1 Markov Model

Definition 17 (Action Model): The effect of each step is a conditional probability distribution,

P_S{X(S+) | X(S-)} = ∏_{P_{S,q} ∈ CEs(S)} P_{S,q}{E_{S,q}(S+) | C_{S,q}(S-)} · ∏_{X_k ∈ X \ E_S} Δ{X_k(S+), X_k(S-)}

where the P_{S,q}{E_{S,q}(S+) | C_{S,q}(S-)} are the conditional effect distributions of S. Δ(X, Y), a delta function, is 1.0 when X = Y and 0 otherwise. The model for an individual step is shown below. The nodes labeled C_S denote utility subvalue nodes.
FIGURE 56. Markov Model. The Markov model for the distribution over all the variables in X(S+) given X(S-). The conditional effect distributions P_{S,q} and the cost function C_S are depicted toward the top of the diagram. The "persistence" distributions Δ(X_i(S+), X_i(S-)) are shown at the bottom. The persistence distributions ensure that X_i(S+) = X_i(S-) when X_i is not an effect of the step.

The layers of the full Markov model are separated by outcome variable distributions, X, representing all of the variables modeled in the domain.
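Definition 17 can be read operationally: a step's transition is the product of its conditional effect distributions with delta (persistence) factors for every variable the step does not touch. A minimal sketch under that reading (the data layout is invented for illustration):

```python
def apply_step(state, cond_effects, all_vars):
    """Distribution over successor states X(S+) for one step.

    `state`: dict var -> value, the X(S-) layer.
    `cond_effects`: list of (effect_var, dist_fn); dist_fn(state) returns
    {value: prob} for that effect variable given X(S-).
    Untouched variables persist, mirroring the delta factors."""
    touched = {v for v, _ in cond_effects}
    # Persistence part: X_k(S+) = X_k(S-) for every untouched variable.
    base = {v: state[v] for v in all_vars if v not in touched}
    results = [(dict(base), 1.0)]
    for var, dist_fn in cond_effects:
        new = []
        for partial, p in results:
            for val, q in dist_fn(state).items():
                succ = dict(partial)
                succ[var] = val
                new.append((succ, p * q))
        results = new
    # Merge duplicate successors into a single distribution.
    out = {}
    for succ, p in results:
        key = frozenset(succ.items())
        out[key] = out.get(key, 0.0) + p
    return out
```

Applying a step whose only conditional effect randomizes A leaves every other variable unchanged with probability 1, exactly as the delta factors require.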
Definition 18 (Markov Model): The distribution over effect variables that results from executing the sequence of steps {S_i}_{i=1..n} is

∏_{i=1..n} P_{S_i}{X(S_i+) | X(S_{i-1}+)}

where X(S_i-) = X(S_{i-1}+) and P_{S_0}{S_0+} is the distribution over the initial states. The plan model is the concatenation of the belief network fragments shown in Figure 56 with an initial distribution P_{S_0}{S_0+} and a final reward distribution R(X_n+). Wherever there is an arc from X(S_i-) to P_{S_i,q}(I_{S_i,j}) in the model for an individual step, we will replace this with an arc from X(S_{i-1}+) to P_{S_i,q}(I_{S_i,j}). Call this belief network M_m = ⟨N_m, A_m⟩, where N_m denotes the distributions in the Markov model and A_m denotes the arcs between these distributions.
FIGURE 57. The Markov model for a 2-step plan.
3.7.1.2 Soundness

Lemma 1 (The Plan Model and N_p): Say that one of the topological sorts of the steps in the complete UDTPOP plan P is Q = {S_i}_{i=1..n}. Let M_m be the Markov model of Q. The plan model of P, M_p, is equivalent to N_p(R ∪ (∪_{i=1..n} C_i), M_m'), where M_m' is a modification of the Markov model that has the same variables and the same joint distribution as M_m. Thus the utility calculated from M_p is identical to the utility calculated from M_m.

Proof: The proof of this theorem appears in Appendix A.3.
Definition 19 (Correct Causal Structure): A UDTPOP plan is said to have correct causal structure if the utility of the plan model is identical to the utility derived from the Markov model of every topological sort of the original plan.

Theorem 10 (Soundness): Every plan returned by UDTPOP is complete, effective, and has correct causal structure.

Proof: There are only two ways to exit from UDTPOP. If UDTPOP either cannot complete a plan step or discovers a constraint violation, then complete-plan returns ∅, indicating that no plan exists in this branch of the search space. The only other point where UDTPOP can return is when the goal agenda is empty in complete-plan. Such a plan has no threats or open conditions and, therefore, must be complete. All complete plans have correct causal structure by Lemma 1; therefore UDTPOP is sound.
3.7.2 Completeness

We will prove that UDTPOP is complete by using a "clairvoyant proof". A clairvoyant proof [McDermott, 91] uses a clairvoyantly-known exemplar plan to provide search control. This allows the planner to make all of the 'right' choices when duplicating the exemplar.
Since the exemplar is any plan satisfying the UDTPOP optimality criterion, UDTPOP is complete. Recall that a planning problem is a tuple ⟨R, A, IC⟩, where R is the reward function, A is the set of step schemata that can be used to construct a plan, and IC is the set of distributions over all of the variables used in the domain.

Theorem 11 (Completeness): Let Q be a non-empty sequence of steps that is an optimal solution to the planning problem D = ⟨R, A, IC⟩, where the cost functions for the possible steps in A are all greater than zero. UDTPOP with the appropriate search control strategy can identify a partial-order plan P' that has a topological sort that is identical to Q.

Proof: See Appendix A.4.
3.8 Empirical Results

In this section, we compare the performance of a variant of UDTPOP with the performance of Buridan. UDTPOP and Buridan construct similar plans, but have dissimilar goals. UDTPOP is designed to optimize utility. Buridan is designed (roughly) to identify the minimum-complexity plan satisfying a probability threshold. In order to perform a direct comparison, we change the termination and evaluation functions in UDTPOP to produce a new planner, UDTPOP-B. The changes:
• Termination test: A normal UDTPOP plan is complete if it has no flaws. UDTPOP-B declares that a plan is complete if it has no flaws and the probability of success is above the specified threshold.
• Cost function: Rather than using the upper bound on utility to guide best-first search, we use the default Buridan cost function: #Open_Conditions + #Steps + #Causal_Links.28
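The substituted cost function is simple to state as code. A hypothetical sketch (the `Plan` record and its field names are invented; neither planner is implemented in Python):

```python
import heapq
from collections import namedtuple

# Stand-in partial-plan record; field names are invented for this sketch.
Plan = namedtuple('Plan', 'open_conditions steps links')

def buridan_cost(plan):
    """Default Buridan cost: #Open_Conditions + #Steps + #Causal_Links."""
    return len(plan.open_conditions) + len(plan.steps) + len(plan.links)

def push(queue, plan, tie_breaker):
    """Best-first search keys the priority queue on this cost."""
    heapq.heappush(queue, (buridan_cost(plan), tie_breaker, plan))
```

Lower-cost partial plans are popped first, so a flawless one-step plan is refined before a two-step plan with an open condition.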
These modifications may penalize UDTPOP-B more than Buridan. If a plan has no flaws and the success probability is below the termination threshold, Buridan can continue to improve the plan. UDTPOP-B, on the other hand, is forced to discard this plan: no further refinement is possible.

Table 12 and Figures 58-63 illustrate the performance of two versions of Buridan against UDTPOP-B. Buridan-R uses the REVERSE algorithm [Kushmerick, et al, 95] for evaluating the success probability. The REVERSE algorithm calculates a lower bound on the probability of success using only the causal effect distributions, causal links, and threats in the partial plan. The REVERSE algorithm is analogous to the assessment algorithm used in UDTPOP-B: both algorithms use only the explicit structure of the plan for evaluation. Buridan-F uses the FORWARD algorithm [Kushmerick, et al, 95] for computing the probability of goal satisfaction. The FORWARD algorithm simulates the application of a topological sort of the steps in a partial plan to find the probability of goal satisfaction. In order to find a lower bound on the probability of success, the FORWARD algorithm evaluates the success probability for every topological sort of the plan that is consistent with the ordering constraints and returns the smallest of these probabilities.29 Buridan-F will return any partial plan as a solution to a problem if it can prove that it can always reach the threshold probability using any topological sort of the steps contained in the partial plan.
28. For the initial tests, I selected #OCs + #Steps for two reasons: (1) Gerevini & Schubert [95, 96] present an analysis of cost functions for causal-link planners that seems to indicate that this cost function will be more efficient; our own experience bears this out. (2) Buridan and UDTPOP-B use very different linking strategies (with Buridan generating several links corresponding to a single UDTPOP-B causal link). This cost function provided very good results for both planners on a subset of the problems. Unfortunately, it caused Buridan to loop (infinitely) on several problems: Buridan can add links ad infinitum without increasing the number of open conditions or plan steps. In order to factor out the effect of the cost function on search-space size, we also present representative plots that illustrate the estimated size of the entire search space. These plots are not sensitive to the cost function (though they are sensitive to flaw-selection strategies).
29. FORWARD can exactly evaluate any topological sort of an underspecified plan (threats between operators, etcetera). Different topological sorts of an incomplete plan can have different utilities.
Most of the domains used in the experiment were derived from the domain descriptions distributed with the publicly-available source code for Buridan. Four statistics are illustrated for each of these algorithms.
• Nodes-Visited: the number of partial plans that were refined (interior nodes). This is the number of plans that were popped off the top of the search queue in the best-first search algorithm.
• Nodes-Generated: the total number of partial plans constructed by the algorithm.
• Time: the total amount of time used for search.30 These times are included to demonstrate that the time per refinement is comparable for both planners. Neither planner has been seriously optimized, so timing information is suspect.
• Branching Factor Reduction: the percent reduction in branching factor due to the use of effectiveness and relevance constraints.
A successful plan is one that exceeds the termination threshold τ. The first column of the chart lists both the domain and the termination threshold used in each experiment.
TABLE 12. UDTPOP-B vs. Buridan-R and Buridan-F.
Key: nodes-visited/nodes-generated, branching-factor reduction (%) for UDTPOP-B, time (s).

DOMAIN                                   STEPS IN   BURIDAN-R              BURIDAN-F           UDTPOP-B
                                         SOLUTION
DETERMINISTIC DOMAINS
Simple Lens World, τ = 1.0                  3       4/6, 0.476 s           4/6, 0.433 s        3/3, 67%, 0.066 s
Lens World, τ = 1.0                        10       586/733, 20.0 s        581/721, 115 s      110/128, 55%, 30.4 s
Chocolate Blocks World, τ = 1.0             3       44/74, 1.59 s          43/70, 1.60 s       14/14, 44%, 0.500 s
UNCERTAIN DOMAINS
Bite Bullet World, τ = 0.8                  3       26/35, 1.24 s          19/21, 1.45 s       11/17, 30%, 1.43 s
Bomb and Toilet World, τ = 1.0              2       11/16, 0.945 s         7/11, 0.676 s       8/15, 6%, 0.614 s
Bomb and Clogging Toilet World, τ = 0.9     2       >50000, several hrs    236/932, 13.5 s     11/18, 5%, 1.62 s
Waste Time World, τ = 0.9                   2       109/261, 1.25 s        37/85, 2.21 s       9/15, 0%, 0.693 s
Single Link World, τ = 0.81                 1       720/2944, 44.5 s       8/32, 0.92 s        3/4, 20%, 0.094 s
IC5, τ = 0.999                              0       207/327, 6.65 s        1/1, 0.266 s        1/1, 0%, 0.021 s
IC6, τ = 0.999                              0       1238/1958, 86.0 s      1/1, 0.171 s        1/1, 0%, 0.021 s
Wet Towel World, τ = 0.65625 (a)            3       19/29, 1.49 s          16/24, 1.36 s       4/7, 13%, 0.203 s
Slippery Blocks World, τ = 0.9              3       17/42 (b), 0.992 s     6/14, 0.71 s        4/7, 13%, 0.176 s
Diamond World, τ = 0.6                      3       >6200/>50000, hours    154/858, 14.0 s     7/10, 9%, 0.339 s
Mocha Blocks World, τ = 0.8                 2       4753/26628, 441 s      116/333, 6.37 s     9/12, 31%, 0.522 s
Mocha Blocks World, τ = 0.85                3       5925/30004, 595 s      789/3333, 42.0 s    10/14, 28%, 0.725 s
Mocha Blocks World, τ = 0.88                4       (c)                    4003/24193, 420 s   16/26, 20%, 1.46 s
Mocha Blocks World, τ = 0.89                5       (c)                    (c)                 23/41, 14%, 2.80 s
Mocha Blocks World, τ = 0.899               7       >72.9K/>500K,          (c)                 116/316, 3%, 32.4 s
                                                    >9 hrs 10 min (d)
P1, τ = 1                                   3       27/79                  10/27               26/150, 73%
P2, τ = 1                                   3       38/156                 20/64               48/307, 70%
P3, τ = 1                                   4       625/3135               67/349              88/624, 66%
P4, τ = 1                                   5       8321/73095             533/3334            1020/8585, 60%
P5, τ = 1                                   5       (e)                    1248/8769           1020/8757, 59%
P6, τ = 1                                   6       (e)                    (e)                 3425/32352, 55%

a. This is the exact probability of 3 consecutive dry steps.
b. Did not find the shortest solution.
c. Ran out of memory.
d. Test conducted using a Sparc 20 / 96 MByte running Allegro Common Lisp.
e. Exceeded a 100,000-plan limit (plans generated).

30. These tests were conducted using a Macintosh Quadra 630 running MCL 3.0p2. Optimizations and virtual memory are off. Ephemeral GC is on. 15 MBytes of RAM were available for MCL.
Notes on the tests:
• Neither algorithm has been optimized in any way. For example, the best-first priority queue in both algorithms is represented using a list (O(N) for insertions) rather than a more efficient data structure such as a heap (O(lg N) for insertions).
• UDTPOP-B only evaluates the plan when it is complete. Buridan evaluates every partial plan.
• The constraint-checking algorithm for UDTPOP-B is evaluated every time that a plan is generated.
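The priority-queue note can be made concrete. A sketch in Python (the original planners were written in Lisp; this only illustrates the O(N) versus O(lg N) insertion cost):

```python
import bisect
import heapq

class ListQueue:
    """Sorted-list queue, as in the unoptimized planners: an insertion
    may shift up to N entries, so each push costs O(N)."""
    def __init__(self):
        self.items = []
    def push(self, cost, item):
        bisect.insort(self.items, (cost, item))
    def pop(self):
        return self.items.pop(0)

class HeapQueue:
    """Binary-heap queue: each push costs O(lg N)."""
    def __init__(self):
        self.items = []
    def push(self, cost, item):
        heapq.heappush(self.items, (cost, item))
    def pop(self):
        return heapq.heappop(self.items)
```

Both queues pop plans in increasing cost order; only the insertion cost differs, which matters once the fringe holds thousands of partial plans.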
FIGURE 58. Relative Performance of UDTPOP-B and Buridan-R (plans generated per domain, log scale). Buridan-R exceeded a pre-determined search limit on several of these domains. Search in Diamond World and Mocha Blocks Worlds 0.88, 0.89, and 0.899 was limited to 500K plans generated. Search in P5 and P6 was limited to 100K plans generated.
FIGURE 59. Relative Performance of UDTPOP-B and Buridan-F (plans generated per domain, log scale). Buridan-F hit a pre-determined search limit of 100K plans before finding a solution for P6.
The number of plans generated by Buridan and UDTPOP-B during planning is shown in Table 12 and graphed in Figures 58 and 59. The search-space cutoff was set to 100K plans generated for problems P1-6. The search cutoff for the rest of the problem suite was set to 500K plans generated. These figures demonstrate the dominance of UDTPOP-B on nearly every test domain, including domains P1 through P6, which were designed to be Buridan-friendly. There are several reasons for the difference in performance:
• UDTPOP's search space is "almost systematic."31 Buridan's search space is highly nonsystematic (see Section 3.9.2.1).
• The branching factor for Buridan is larger than that of UDTPOP. There are more ways to add a single causal link between two steps: several identical effects on a single step may all be candidates for the add-link operation.
• After Buridan has established one causal link for each precondition in a plan, the branching factor of the search space explodes. Buridan backtracks on open-condition selection in order to determine the best spot to add multiple support. Every precondition of every conditional outcome that contributes a causal link is an open condition. This growth in the branching factor for the plan is illustrated in Figures 60 and 61. As the search-space depth increases, the branching factor of Buridan increases dramatically. The branching factor for UDTPOP, on the other hand, increases slowly or not at all. It is interesting to note that the branching factor for UDTPOP-B is one (meaning that there are no planning choices at all) for the first six levels of the search tree in Mocha Blocks World.
• Buridan needs to establish more causal links than UDTPOP-B in order to represent the same plan; thus the shallowest solution for a given problem lies much deeper in the search tree than it does for UDTPOP. For example, the first Buridan-R solution for Mocha Blocks World 0.899 is approximately 25 layers deep in the search tree; the first UDTPOP solution is 20 layers deep. Diamond World was engineered to be especially difficult for Buridan. In this domain, the first UDTPOP solution is found 5 layers deep in the search space; the first Buridan-R solution is found at depth 20.
31. The only UDTPOP plan-construction operation that can introduce duplication in the search space is persist-support. The "persist-support constraint" was not used for these tests.
FIGURE 60. The branching factor of Buridan and UDTPOP-B as a function of search-space depth in the Mocha Blocks World 0.899 domain.
FIGURE 61. The branching factor of Buridan and UDTPOP-B as a function of search-space depth in Diamond World.
Since both the branching factor and the depth of the search space are much smaller for UDTPOP, one might expect that the overall search space is dramatically smaller. This is, in fact, the case. Figures 62 and 63 illustrate the relative search-space sizes of the two planners. A plan is labeled "visited" if either it is a solution to the problem or there was a flaw in the plan and a number of refinements to this plan were generated. "Generated" plans include any partial plan constructed by the planner. The difference between the number of generated partial plans and the number of visited plans is the number of partial plans on the fringe of the search tree. In the A* algorithm [Hart, 74], the term "visited" is synonymous with "closed." The set of "generated" plans, on the other hand, is the union of both the "open" and "closed" lists of A*.

FIGURE 62. The relative search-space sizes for Buridan-R and UDTPOP-B in Mocha Blocks World 0.899. No plans were generated beyond depth 19 for Buridan-R.
An estimate of the total size of the search space can be computed using the branching factor at each level of the search space and the depth of that level in the search space. The "estimated total" search space in Figures 62 and 63 is estimated using the average branching factors of the "visited" plans at each level in the search space. This estimate is exact for a level in the search space if all of the plans in the previous level have been expanded, but may overestimate the number of partial plans at any level in the search space. Figures 62 and 63 illustrate the relative size of the search space for Buridan and UDTPOP-B in Mocha Blocks World 0.899 and Diamond World. Note that the shape of the estimated search space is not a function of the assessment function (REVERSE or FORWARD) or the cost function. It is purely a function of the number of refinements available to Buridan or UDTPOP at any particular level in the search tree and the order in which flaws are repaired in the planner. These figures illustrate that the search space for UDTPOP can be dramatically smaller than that of Buridan, sometimes by several orders of magnitude.
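The "estimated total" curves are computable from the average branching factors alone. A sketch (the branching factors in the example are fabricated, not measured values from the experiments):

```python
def estimated_search_space(avg_branching):
    """Estimate the number of partial plans per level from the average
    branching factor of the *visited* plans at each level: level 0 holds
    the single root plan, and level k+1 is estimated as
    (plans at level k) * (average branching factor at level k)."""
    levels = [1.0]
    for b in avg_branching:
        levels.append(levels[-1] * b)
    return levels
```

With fabricated branching factors [1, 1, 2, 3] this yields per-level estimates [1, 1, 1, 2, 6]; summing the levels gives the total estimated search-space size.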
FIGURE 63. The relative search-space sizes for Buridan-R and UDTPOP-B in Diamond World.
The experiments also demonstrated that the effectiveness constraint has the greatest effect on deterministic or nearly deterministic domains. The reduction in branching factor for the deterministic domains varied between 44% and 67% (Table 12). The branching factor reduction for non-deterministic domains varies from 0% to 31%. There was no reduction in branching factor for the IC5 and IC6 domains, because no steps are needed in order to solve these trivial problems.
3.9 Discussion

3.9.1 Mutual Exclusion
Other researchers [Breese, 92; Goldman and Boddy, 93] have constructed special mechanisms to allow theorem provers to infer when propositions are mutually exclusive. These mechanisms are directly applicable to UDTPOP (and to Buridan, for that matter). The epsilon-safe planner [Goldman and Boddy, 93] organizes propositions into families related by conditional outcome statements. These propositions are mutually exclusive and collectively exhaustive: exactly one is true at any one time. The mechanism used in this chapter provides much of the same power and allows the planner to identify whether two events codesignate in constant time (the planner just needs to test whether both events influence the same variable), at the cost of a loss in flexibility.
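The constant-time codesignation test reduces to comparing the variables that two value assignments mention. A minimal sketch (the event representation is invented for illustration):

```python
def mutually_exclusive(event1, event2):
    """Two value assignments are mutually exclusive iff they assign
    different values to the same variable.  The test is constant time
    because each event names exactly one variable."""
    var1, val1 = event1
    var2, val2 = event2
    return var1 == var2 and val1 != val2
```

For example, ('towel', 'wet') and ('towel', 'dry') are mutually exclusive, while assignments to different variables never are.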
3.9.2 Comparison with Buridan

The UDTPOP representation and algorithm were partially motivated by the difficulty that Buridan exhibits on problems with symmetric distributions and multiple support. A symmetric distribution is one where a probability is specified for every value of the event variable and the distribution is conditioned on all of the possible values of the precondition variables.
3.9.2.1 Link Establishment Order

Say that there are several ways for Buridan to support a particular precondition. Because Buridan can use multiple effects to support this precondition, Buridan searches over all combinations of support for the precondition. Unfortunately, this search process is highly unsystematic: Buridan not only searches over all combinations of support, but also searches over all of the sequences of operations that it can use to establish these links. For example, suppose that there are 10 conditional effects on one step (the establisher) that are relevant to a precondition on another step (the consumer). When Buridan attempts to complete this plan, it initially tries to link individually to each of the 10 applicable conditional effects, generating 10 new plans. It then selects one of these plans and attempts to add a link to each of the 9 remaining conditional effects, generating 9 plans. This process continues with Buridan selecting from 8 remaining conditional effects, then 7, and so on, until there are no conditional effects left for Buridan to link to. If we use breadth-first search with Buridan, Buridan explores

∑_{n=0..9} 10!/(10 − n)! = 6235301

partial plans.32 UDTPOP will explore only 1.
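The count can be checked directly: breadth-first search expands every partial plan at depths 0 through 9, and there are 10!/(10 − n)! link-addition sequences at depth n:

```python
from math import factorial

def plans_explored(m):
    """Partial plans expanded by breadth-first search over all orders of
    adding m links: the sum over depths n = 0..m-1 of m!/(m-n)!."""
    return sum(factorial(m) // factorial(m - n) for n in range(m))

print(plans_explored(10))  # -> 6235301
```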
3.9.2.2 Backtracking on Open Conditions

Buridan cannot systematically add support to a plan because it never "closes" an open condition. Buridan first establishes one source of support for every open condition of every conditional effect that supports the goal. After establishing this "nominal" amount of support, Buridan searches through all of the ways of adding support to this plan in order to increase its probability. Since Buridan is a multiple-support planner, Buridan can choose to add support to any precondition that supports a tree of causal links supporting the goal step. In a sense, Buridan never closes the original open condition. Any point where Buridan has added a single source of support is also a target for multiple support. Since Buridan cannot determine a priori which preconditions need multiple support and which do not, Buridan treats the selection of the open condition as a choice point. Since the number of 'open' conditions increases monotonically as planning progresses, the branching factor of the Buridan search space increases dramatically as we search for larger and larger plans.33
32. This behavior was unexpected. I initially tried this type of problem to test the performance of the REVERSE evaluator on links with lots of mutually-exclusive states. I constructed a domain with no action schemata and an initial-conditions step that had 10 effects that were relevant to the goal. I constructed the domain so that Buridan would have to add links to each of the effects of the initial-conditions step in order for REVERSE to conclude that the goal probability was over a given threshold. Buridan was unable to identify the solution to this problem with breadth-first search.
33. It is possible to modify Buridan so that it does not backtrack on open-condition selection. One way to do this: after adding a link to an open condition, modify Buridan so that it chooses explicitly to leave the open condition or remove it from the list of plan flaws. If this modification is made, Buridan can always choose to work on the first open condition in the flaw agenda without compromising completeness.
UDTPOP, on the other hand, terminates the exploration of an open condition with the addition of the first supporting causal link. Because UDTPOP does not backtrack over open condition selection, the search space for UDTPOP can be much smaller.
Initial conditions: A = false
Goal: A = true
Actions:
  Action: ONLY-STEP
  Preconditions: None
  Effects: A = true with probability 0.5; A = false with probability 0.5.

FIGURE 64. Simple Link Domain.
Buridan will pursue multiple support for a node even if it is impossible to resolve the threats resulting from multiple support. In the "simple link domain" (Figure 64), Buridan will pursue plans that have more than one step even though it is impossible to resolve the threats in those plans in any interesting way (the steps at the start of the plan end up being irrelevant to the goal). This happens because Buridan doesn't examine all of the causal links that connect two steps when it attempts to resolve flaws. The Buridan search space for this simple domain is infinite. The search space for UDTPOP contains only one partial plan.
3.9.2.3 Plan Evaluation and Mutual Exclusion

Buridan's exact evaluators do not recognize that symmetric links are mutually exclusive until relatively late in the evaluation process. This makes it very expensive to evaluate plans such as the empty plan discussed in Section 3.9.2.1. Both of the recursive exact evaluators discussed in [Kushmerick, et al, 95] allocate over 1 MByte of scratch space when evaluating the probability of this simple plan.
UDTPOP assumes that a step will use all of the available outcomes of an effect variable if it links to that step. It only allows confrontation or multiple support if there is “space” in the passive conditional outcomes of the threat to allow it. This increases the complexity of the design of the planning algorithm, particularly the portions concerned with open condition generation and consistency checking, but allows the evaluation algorithm to solve symmetric problems efficiently.
3.9.2.4 Persist-Support vs. Confrontation

The Persist-Support operation in UDTPOP adds more ordering constraints to a plan than does the Confrontation operation in Buridan. UDTPOP adds an ordering constraint in order to guarantee that the threat clobbers the causal link. Although this does result in a more committed plan, our past experience with several causal-link planners [Peot+Smith, 93] and with separation seems to indicate that adding an additional ordering constraint may have the effect of reducing the overall size of the planning search space.
3.9.2.5 Soundness

Buridan's criterion for soundness is based on the probabilistic evaluation of the partially specified plan. All of the evaluation algorithms for Buridan underestimate the probability that the plan will accomplish a goal. The planner does not return a solution unless the pessimistic estimate of the probability of that solution is above the goal threshold. If an evaluation algorithm such as the FORWARD algorithm is used, Buridan may be able to identify a solution very quickly without constructing very many causal links, especially when the FORWARD algorithm is modified to identify the best possible topological sort of the operators in a plan [Kushmerick, et al, 95]. The soundness of UDTPOP, on the other hand, is based on the soundness of the underlying causal model. If the model is not sound, the plan will be pruned, even though some topological sort of the steps may constitute a plan with a high expected value.
Since a topological sort of a sequence of steps implies a causal model, it should be possible to delay or eliminate the addition of links due to add-link and to use polynomial-time model-construction algorithms to complete plans based on a topological sort of the partial plan. This approach may be able to provide the same benefits as the Buridan/"optimistic" FORWARD combination while allowing the application of more sophisticated plan evaluation algorithms.
3.9.2.6 The Buridan Heuristic

There is a 'hidden' open-condition selection heuristic in the Buridan planner that improves the performance of the planner considerably on deterministic problems. Buridan can add multiple sources of support to any precondition in a plan, but chooses not to do so until it has established one source of support for each open condition. On deterministic problems, this can result in greatly improved performance: the planner does not consider multiple support because every open condition only needs one source of support.

The current version of UDTPOP does not share this performance advantage on deterministic domains. Since UDTPOP is a regression planner, it doesn't really know whether the preconditions of steps can be achieved with probability 1.0 until the chain of steps achieving these preconditions is tied to the initial-conditions step or some other step with a deterministic outcome. This means that UDTPOP will attempt to support steps indirectly through the passive conditional outcomes of other steps even though the probability of these passive conditional outcomes may be zero.34

There are several solutions to this problem. The first is to use multiple support. Buridan can attempt to accomplish its goal using deterministic operators and then add additional support when the early attempt at a deterministic plan fails to produce a plan that satisfies the threshold probability. It does so by enhancing the probability of desirable open conditions through the use of additional steps or through support from other steps in a plan.

34. UDTPOP does prune these plans because they violate constraints.
UDTPOP can use the same strategy. Add-support can be modified so that UDTPOP steps can only directly support the effective conditional outcomes of the step that they directly link to. There are several important implications of this approach for UDTPOP:
• UDTPOP would lose its ability to “close” open conditions. While the absence of open conditions would imply a complete plan, this complete plan isn’t “finished”: subsequent addition of steps might significantly improve its utility. It may be difficult to identify dead-end plans–any plan might be improved arbitrarily through the addition of supporting steps.
• The utility upper bound calculation depends on the fact that steps cannot be added to the plan that increase the upper bound on the utility of the plan.
• UDTPOP is ‘almost systematic’ because it does not allow multiple support. When uncertainty is present, Buridan can construct the same step sequence a large number of ways through combinations of confrontation and add-support. There are only a few situations where UDTPOP can construct the same sequence of steps using add-support and persist-support. Section 3.10.4 shows that it is possible to eliminate all redundancy in the space of solutions.
• After one source of support has been established for every open condition in Buridan, open condition selection becomes a choice point. Every precondition in the plan is a potential target for additional support. This dramatically increases the branching factor of the planner and (equally dramatically) kills its performance. Allowing multiple support in UDTPOP would cause the same kinds of problems.
FIGURE 65. The network of roads used for the navigation domain.
The navigation domain (illustrated in Figure 65) was designed to illustrate the superiority of Buridan in some deterministic domains. The navigation domain models the navigation of a robot along a network of roads. Each operator has the form illustrated in Figure 66. The Move step only has an effect if the robot is at from. If the robot is at any other location, it stays at that location.

Move(from, to)
  if (robot-location = from) then (robot-location = to)
  else (robot-location = from)

FIGURE 66. The Move Operator.
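The Move operator’s conditional-effect semantics can be sketched in a few lines of Python (the function and state representation below are ours, for illustration only; the dissertation defines operators over conditional effect distributions, not code):

```python
# Illustrative sketch of Move(from, to): the step changes the world only
# when the robot is at `frm`; in every other state it has no effect.
def move(state, frm, to):
    if state['robot-location'] == frm:
        return {**state, 'robot-location': to}
    return dict(state)  # passive outcome: the robot stays put

s = {'robot-location': 'A'}
s = move(s, 'A', 'B')  # effective: the robot was at A
s = move(s, 'C', 'D')  # vacuous: the robot is at B, not C
```

Since every branch of the conditional either moves the robot or leaves the state untouched, the operator is deterministic given the current location.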
Several different problems using this domain are described below. In each problem, the goal is to get to location G. The problems are as follows:
• P1: The robot starts out in location A.
• P2: The robot has an equal probability of starting from A or C.
• P3: The robot starts from A, B, or C.
• P4: The robot starts from A, B, C, or F.
• P5: The robot starts from A, B, C, E, or F.
• P6: The robot starts from A, B, C, D, E, or F.
            Buridan-R     Buridan-F    UDTPOP-B          UDTPOP-C
P1 (τ=1)    27/79         10/27        26/150   73%      4/26
P2 (τ=1)    38/156        20/64        48/307   70%      5/29
P3 (τ=1)    625/3135      67/349       88/624   66%      5/41
P4 (τ=1)    8321/73095    533/3334     1020/8585  60%    6/55
P5 (τ=1)    (a)           1248/8769    1020/8757  59%    6/56
P6 (τ=1)    (a)           (a)          3425/32352 55%    7/72

a. Exceeded a 100,000 plan limit (plans generated).
TABLE 13. Empirical Comparison in a Navigation Domain
The results (Table 13) illustrate that, while UDTPOP-B creates more plans than Buridan-R or Buridan-F in the deterministic case, its performance falls off more gradually as the amount of uncertainty in the initial conditions increases. UDTPOP-C is a version of UDTPOP that uses a cost function to steer UDTPOP rapidly to the correct answer. This column illustrates that many of the performance deficiencies of UDTPOP (and Buridan, incidentally) can be improved through the development of realistic cost functions.
3.10 Extensions

3.10.1 Extending Relevance

The relevance measure used in UDTPOP is extremely conservative. It allows the planner to develop support for low utility outcomes as long as these outcomes are not the worst possible. While this strategy ensures that the absolute maximum utility plan is always chosen, it results in insufficient discrimination in some domains. One obvious modification is to throw out all of the alternative outcomes that are not worth pursuing, both in the reward function and in the individual step cost functions. For example, say that the reward function for a domain has possible values {100, 70, 60, 50, 10, 0}. We can decide a priori that we will not add any steps to the plan if they are only relevant to the n bottom-most outcomes. In this case, we might decide that a step is not relevant if it supports only 0 or 10. If we identify a plan that has a high expected utility, we may elect to prune more values from the list of relevant reward outcomes. Let’s say that the expected utility of the empty plan is 65 due to especially favorable initial conditions. We might further restrict the relevance function for the remaining plans. For example, we might decide that only 100 and 70 are desirable enough to pursue.
This technique can be extended in an obvious way to step cost functions. If a step cost function has one or more expensive outcomes, we may introduce the constraint that the step may only be used if it is possible to incur only one of the less expensive outcomes.
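The outcome-pruning idea above can be sketched as a one-line filter (the helper below is ours, not part of UDTPOP):

```python
# Hypothetical helper: keep only the reward outcomes still worth pursuing
# once a plan with expected utility `utility_floor` is already in hand.
def relevant_outcomes(reward_values, utility_floor):
    return [v for v in sorted(reward_values, reverse=True) if v > utility_floor]

rewards = [100, 70, 60, 50, 10, 0]
print(relevant_outcomes(rewards, 65))  # after finding a plan worth 65 -> [100, 70]
```

The same filter could be applied to a step's cost outcomes, pruning steps whose only reachable outcomes are the expensive ones.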
3.10.2 Extending Effectiveness

A step is defined to be effective if it can possibly have an outcome that is relevant to the goal utility function or some intermediate step cost function. While this definition results in a complete planner, it is relatively weak. An effective step can actually reduce the probability of relevant outcomes, possibly reducing the utility of the overall plan. It would be desirable to find a stronger definition of effectiveness that can be used to prune plans that contain “effective” steps that reduce utility. For example, we might use the following (difficult to compute) criterion for effectiveness: a step is effective if deleting the step reduces the utility of the overall plan.
3.10.3 Tightening the Upper Bound on Evaluation

One of the missing elements in the computation of the upper bound on expected value is the bounds on the cost and efficacy of the steps that are needed to complete the partial plan. Interval probabilities can be used within an operator graph [Smith+Peot, 93] to derive upper and lower bounds on the probabilities of outcomes of steps, by propagating interval probabilities through an operator graph until an equilibrium is reached. The probability intervals on the distributions of preconditions in the operator graph can replace the open condition distributions in the model for the partial plan. In a similar fashion, an operator graph can also be used to derive a lower bound on the cost of steps required to achieve (with greater than probability zero) any variable value given any combination of initial conditions. An operator graph can also be used to identify steps that can be the causal predecessors of open conditions in the plan. If a step in a partial plan can never be in the chain of causal links that supports one of the open preconditions in the plan, then there is no need to hang a vacuous likelihood off of the causal effect distributions of that step. Reducing the number of vacuous likelihood distributions in the plan model can narrow the bounds returned by LPE. The quality of the bound computed from the active sets can also be improved by using a more complex interval probability calculus. The most compact set of probability bounds for an interval influence diagram is not necessarily a rectangular interval; it is a convex polytope that is the convex hull of a number of points that is exponential in the number of interval distributions in the belief network. During calculation, LPE routinely finds a rectangular interval that bounds a more complex polytope. Every time that LPE simplifies the polytope, some information is lost and the bounds may begin to widen.
One straightforward approach to handling this problem is the following:
1. If any distribution X in the interval belief network M is the predecessor of a vacuous distribution, replace X with a vacuous distribution.
2. Delete all of the predecessor arcs of every vacuous distribution in M.
3. Delete all of the vacuous distributions in M that have no successors.
4. Replace every vacuous distribution in M with a decision.
5. Optimize the probability of the desired query by setting the decisions in M. This will be the smallest upper bound implied by the interval belief network.
3.10.4 Systematicity

A planner is systematic, loosely speaking, if no two plans in the planner’s search space are the same. In the Appendix (A.4.2), I demonstrate that UDTPOP is not systematic, because UDTPOP can partially retract planning commitments using persist-support. The intent of persist-support is to allow the planner to correct mistakes made during plan construction. The relevance restrictions prevent the threatening conditional effect distribution from being added through add-link or add-step. The completeness proof suggests one such constraint (the persist-support constraint): If persist-support is used to resolve the threat S_T ⊗ (S_E →V S_C), then the conditional outcomes of S_T cannot provide effective support for variable V. This provides only a partial solution to the problem of systematicity. This does result in a systematic search space, but only in the sense of solution systematicity [Kambhampati, 1993]. A partial plan is equivalent to another partial plan in the search space if there is a bijective mapping between the two plans for all of the causal links, ordering constraints and steps in these plans. A strongly systematic planner will not generate equivalent partial plans during planning. UDTPOP with persist-support constraints is not strongly systematic. Equivalent plan graphs can appear in the interior of the search space, but the sets of completions for these partial plans are mutually exclusive. UDTPOP can only prune a plan graph if it can bound the probability of the preconditions of the effective conditional outcomes away from zero for at least one persisted link. Unfortunately the planner may not be able to do this for a considerable amount of time after the
constraint is posted (possibly not until the last refinement is made to the plan!). This means that there will still be a lot of duplication of the interior nodes of the search space, even though the space of completions is systematic. It is an open question whether adding these constraints will help control duplication in the search space.35 In a domain like Mocha-Blocks-World (Section 3.8), persist-support is used frequently (72 persist-support operations out of the 316 total plan refinement operations in the largest MBW problem solved), so the potential for savings is large.
3.10.5 Multiple Support and Ordering Constraints

3.10.5.1 Multiple Support = Fewer Ordering Constraints (sometimes...)
One of the appealing properties of Buridan is that it does not need to order steps when they provide synergistic support for the same open condition. For example, in the Dry Gripper World [Kushmerick, 95] more than one dry gripper operator can support the goal gripper-dry without clobbering each other–each operator can turn a wet gripper into a dry one, but not vice versa. In this kind of problem, Buridan doesn’t require that the Dry-Gripper steps be ordered with respect to each other. This is illustrated in Figure 67.
35. The constraint checker implemented in UDTPOP is not sophisticated enough to represent this constraint: it cannot bound the probability of variable values away from 0.
FIGURE 67. Multiple support can result in fewer ordering constraints.
In other domains, such as the Wet Towel World (Figure 68), Buridan must resolve threats between operators through confrontation and ordering – the Dry-with-wet-towel operations clobber each other. One operation may undo the effects of another. In this situation, Buridan is forced to linearize all of the Dry-with-wet-towel steps even though their effects are commutative.
FIGURE 68. The Dry-With-Wet-Towel Action from Wet Towel World.
3.10.5.2 Commutativity
The real reason why Buridan can avoid introducing ordering constraints (and, indeed, benefit from multiple support) is that many steps are commutative. Two steps commute if their effect is identical regardless of the order in which they are applied. Any planner, including
UDTPOP, can take advantage of this property by aggregating steps that are mutually commutative. Actions can be individually tested for commutativity off line prior to plan generation. How can we take advantage of this? If steps are not distinguishable, there is no real benefit to aggregation. If steps are distinguishable, however, there can be a real benefit. For example, in the navigation domain discussed in Section 3.9.2.6, many steps are commutative. Obvious combinations include pairs such as [move-a-b, move-b-a], but less obvious combinations such as [move-a-b, move-c-d, move-e-g] are fully commutative. When two steps commute, we can take advantage of this fact by only adding the operations in some canonical order. For example, say that the set of steps [move-a-b, move-c-d, move-e-g] commute. We can artificially restrict the possible steps that can be added in order to satisfy the open conditions of these steps. move-c-d and move-e-g cannot be used to support move-a-b. move-e-g cannot be used to support move-c-d. When any step that is lower in the canonical order is added to the plan, we add the step to an unordered aggregation consisting of mutually commutative steps. For example, say that move-c-d is used to satisfy a precondition. The precondition of move-c-d can be supported by move-a-b but not move-e-g. The step move-c-d is replaced by the aggregate [move-a-b, move-c-d].
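The off-line commutativity test described above can be sketched directly (the helpers below are ours; state is reduced to the robot's location from the navigation domain):

```python
# Two steps commute if applying them in either order yields the same state,
# for every state. Here a step is a function from location to location.
def commutes(step_a, step_b, states):
    return all(step_a(step_b(s)) == step_b(step_a(s)) for s in states)

def make_move(frm, to):
    return lambda s: to if s == frm else s  # Move is vacuous elsewhere

states = list('ABCDEFG')
move_ab = make_move('A', 'B')
move_cd = make_move('C', 'D')
move_bc = make_move('B', 'C')
print(commutes(move_ab, move_cd, states))  # True: the moves touch disjoint locations
print(commutes(move_ab, move_bc, states))  # False: one move writes the location the other reads
```

Pairs that pass this test are candidates for the unordered aggregates described above.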
3.11 Contributions

This chapter described the design of UDTPOP, a new utility-directed probabilistic partial-order planning algorithm. The algorithm is efficient, producing a search space that is orders of magnitude smaller than that of the first probabilistic least-commitment planner, Buridan. Other contributions include:
• A proof that UDTPOP is sound. UDTPOP is sound in the sense that the causal link structure of the plan correctly models the joint distribution over world states within any returned plan.
• A proof that UDTPOP is complete. The plan of highest utility is guaranteed to lie somewhere within UDTPOP's search space.
• The derivation of an admissible upper bound on the utility of all completions of a particular partial plan. This admissible upper bound, in conjunction with the completeness proof, guarantees that UDTPOP will always find the plan of the highest possible expected utility when the utility bound is used with best-first search.
• A new and efficient causal link mechanism. The probabilistic planner Buridan [Kushmerick et al, 1994] uses a “thin” causal link structure that allows the planner to decide exactly which outcomes of a step will be used in order to support a precondition. UDTPOP uses a less general “fat” causal link. A causal link in UDTPOP represents the commitment that the effect distribution of the establishing step will be used to establish the preconditions for some consumer. This simplified causal link structure reduces both the breadth and depth of the UDTPOP search space. This causal link structure, however, forces UDTPOP to make more step ordering commitments than Buridan. An extension is proposed, based on the commutativity of conditional effect distributions, that allows UDTPOP to recover some of this lost generality.
• A constraint propagation mechanism that reduces the size of the search space in domains that contain steps with deterministic or near-deterministic conditional effect distributions.
• An empirical comparison of UDTPOP to the probabilistic planner Buridan, including a detailed discussion and a comparison of the relative search space sizes for Buridan and a simplified version of UDTPOP called UDTPOP-B. This comparison demonstrates that the UDTPOP search space is smaller, but that Buridan can find more general plans than UDTPOP. This section describes tractability problems in Buridan that were not generally understood by the AI planning community prior to this dissertation.
4.0 Relevance and Independence
Chapter 5.0 describes DTPOP, a planner that can use observations to guide the selection of appropriate future steps. I will argue that this planner is complete because the structure of DTPOP is isomorphic to the structure of an algorithm for identifying probabilistically relevant nodes in an influence diagram. The purpose of this chapter is to provide the background material needed to understand that contribution. All of the material in this chapter is review material. I will discuss four topics in this chapter:
• Section 4.1 defines notation and basic independence concepts for both belief networks and influence diagrams.
• Section 4.2 briefly illustrates why observations must be relevant to value.
• Section 4.3 defines the requisite element sets (Ne and Np) and relevant element sets (Ni and Nr) for probabilistic queries [Shachter, 90, 98].
• Section 4.3.2 reviews the Bayes Ball algorithm for identifying the relevant and requisite sets within a belief network with observed and observable nodes [Shachter, 98].
4.1 Notation and Definitions

4.1.1 Belief Network Notation
A belief network [Pearl 88] is a directed acyclic graph (DAG) comprised of a set of probability (or chance) nodes N and a set of directed arcs A. Pa(J) will denote the set of nodes corresponding to the immediate graph predecessors (or parents) of node J. Similarly Ch(J), De(J) and An(J) will denote the children, descendents, and ancestors of J, respectively. Each node J in the belief network represents an uncertain variable with a conditional probability distribution P{J | Pa(J)}. Deterministic distributions are conditional probability distributions that consist only of 0's and 1's.
A deterministic node is a chance node that contains a deterministic probability distribution. If node J is deterministic and the values are known for the parents of J, then only one value for J is possible – J is functionally determined given the values for the variables in Pa(J). If the values for the parents of J are known or are themselves functionally determined, then deterministic node J is also functionally determined [cf. Shachter, 98]. A belief network is a graphical representation for the joint probability distribution

P{N} = ∏_{Ji ∈ N} P{Ji | Pa(Ji)}.    (3)
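Equation 3 says the joint distribution factors into one conditional per node. A minimal numeric check for a two-node network A → B (the network and numbers below are ours, for illustration only):

```python
# P{A, B} = P{A} * P{B | A}: the chain-rule factorization of Equation 3.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.9, False: 0.1},   # P{B | A=True}
               False: {True: 0.2, False: 0.8}}  # P{B | A=False}

def joint(a, b):
    return p_a[a] * p_b_given_a[a][b]

# The factored joint still sums to 1 over all configurations.
total = sum(joint(a, b) for a in (True, False) for b in (True, False))
print(joint(True, True), total)
```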
Given a belief network B = &lt;N, A&gt; and sets J, K, L ⊆ N, the set of uncertain variables J is conditionally independent of L given K in belief network B if P{J | K, L} = P{J | K}. This relationship will be denoted by J ⊥ L | K. If the nodes in K are observed and J ⊥ L | K, then it is not possible to learn more about the joint distribution of the nodes in J by observing any subset of the nodes in L. The opposite of conditional independence is conditional (or probabilistic) relevance. If ¬(J ⊥ L | K), then the variables in L are conditionally relevant to the variables in J given observations for the variables in K.
One intuitive property of this independence relation is the relationship between the conditional independence of sets of nodes and the conditional independence of individual nodes within those sets: if J ⊥ L | K then Jm ⊥ Ln | K for all Jm ∈ J and Ln ∈ L [Shachter, 98; Pearl, 88].
4.1.2 Influence Diagrams
FIGURE 69. Influence Diagram. An influence diagram is a directed acyclic graph that represents a decision problem. Decisions (D), Value Nodes (V), and Probability Nodes (N1 and N2) are denoted by rectangles, diamonds and ovals, respectively. The objective of the decision maker is to determine the decision policy, a function d(N1), that maximizes the expectation of the value node.
An influence diagram [Howard+Matheson, 81] is a directed acyclic graph representing a decision problem. Influence diagram M = &lt;N, A, D, V&gt; is comprised of a set of uncertain nodes, N (denoted by ovals in the figure); a set of directed arcs, A; a set of decision nodes, D (rectangles); and a value node V (diamond). Decisions represent factors that a decision maker can control. Arcs into decision nodes will be called information arcs and the set of nodes that are parents of a decision node will be called information predecessors of that decision node. The information predecessors of a decision node D are the set of probability and decision nodes that can be observed before making decision D. The value node represents the reward function that the decision maker is attempting to maximize through the selection of appropriate values for each of his decisions. Value nodes may not be parents of any other kind of node in the influence diagram.
The objective of the decision maker is to select decision policies for each of the decisions represented by the influence diagram in order to maximize the expected value of the value node. In order to do so, I will usually require that the decisions be totally ordered D1, ..., DN.
The decision policy for each decision is a function d*_i = f(I(D_i)), where I(D_i) is the set of information predecessors for D_i. In general, the information set for a node D_i is the union of the parents of D_i, all of the previous decisions, and all of the information predecessors for all of the previous decisions. That is,

I(D_i) = Pa(D_i) ∪ ⋃_{j=1}^{i-1} ({D_j} ∪ Pa(D_j)).
The definition of information set reflects the fact that it is not rational (in general) to forget the values of past observations or decisions when making a decision policy for a later decision.1 There may be implicit no-forgetting arcs in the influence diagram from old decisions and the information predecessors of these decisions to newer decisions. It is possible in some circumstances to limit the size of some of these information sets. We can intentionally “forget” past information when there are intervening observations that render future value nodes independent of past observations. For example, suppose that we can perfectly observe the consequences of the first k decisions in a decision problem before making a subsequent decision. If we know all of the consequences of the early decisions, then we do not need to know the decisions and information that led to those consequences [Tatman+Shachter, 90; Shachter+Peot, 92; Shachter, 98].
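The no-forgetting information set I(D_i) defined above can be sketched as follows (the representation is ours: decisions are indexed 0..n and parents[j] holds Pa(D_j)):

```python
# No-forgetting: D_i sees its own parents plus every earlier decision and
# that earlier decision's information predecessors.
def information_set(i, parents):
    info = set(parents[i])
    for j in range(i):
        info |= {f'D{j}'} | set(parents[j])
    return info

pa = [{'O1'}, {'O2'}]                  # Pa(D0) = {O1}, Pa(D1) = {O2}
print(sorted(information_set(1, pa)))  # ['D0', 'O1', 'O2']
```

The second decision remembers the first decision and its observation, exactly as the union formula above prescribes.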
4.2 Observation Relevance
A node in an information set is relevant to decision making only if that node is probabilistically relevant to the value node. This observation is the key to the development of DTPOP.
1. By “not rational,” I mean that the proscribed behavior is not compatible with maximizing expected utility.
DTPOP identifies good observation plans by identifying nodes that are relevant to one or more of the value or cost nodes in a DTPOP plan. An information predecessor (or observation) can only increase the expected value of a decision problem when that information predecessor is conditionally relevant to the value node [Poh+Horvitz, 96; Shachter, 86].2 The following three examples illustrate why this must be the case. In each of these examples, the decision problem is represented by an influence diagram possessing a single decision node D, a value node V, and a single probability node O.
FIGURE 70. A simple decision problem with one decision and no observations. O is not an information predecessor of D.
In the first problem (Fig. 70), the objective is to select the single value of D that maximizes the expected value of V if O cannot be observed prior to selecting D. The optimal course of action is the one that maximizes the expected value of V conditioned on D,

E[V | D] = Σ_{o∈O} Σ_{v∈V} v P{o} P{v | o, D}.

The value-maximizing decision policy is

d* = argmax_{d} E[V | d]

with expected value E[V | D = d*].
2. If this point is obvious, feel free to skip on to Section 4.3.
FIGURE 71. A simple decision problem with one observation.
Observations can increase the expected utility of a decision. Let’s say that we can observe the outcome of variable O prior to making decision D. In this case, our best decision policy is a function of the observation,

d*(o) = argmax_{d∈D} E[V | d, o], where E[V | d, o] = Σ_{v∈V} v P{v | d, o}.

The expected value of this decision scenario is

Σ_{o∈O} P{o} E[V | d*(o), o].

It is only rational to condition the decision policy on observation O if the expectation of the decision policy that uses O is larger than the expectation of the decision policy that is not a function of O:

Σ_{o∈O} P{o} max_{d∈D} E[V | d, o] > max_{d∈D} Σ_{o∈O} P{o} E[V | d, o].    (4)

The difference between the two sides of Equation 4 is called the Expected Value of Perfect Information (or EVPI) on observing O prior to making decision D [Howard, 66, 67; Poh+Horvitz, 96].

EVPI_M(D | O) = Σ_{o∈O} P{o} max_{d∈D} E[V | d, o] − max_{d∈D} Σ_{o∈O} P{o} E[V | d, o].    (5)
FIGURE 72. Irrelevant observations. If O is conditionally irrelevant to V (left), then conditioning the decision policy on O results in an expected value that is no greater than if O were not observed (right).
In order for O to change the expected value of the decision, it must be the case that the expected value for one or more of the decision options is a function of O. Say that V is conditionally independent of O (Fig. 72); then the EVPI for O is zero:

EVPI_M(D | O) = Σ_{o∈O} P{o} max_{d∈D} E[V | d] − max_{d∈D} Σ_{o∈O} P{o} E[V | d]
              = (Σ_{o∈O} P{o}) max_{d∈D} E[V | d] − max_{d∈D} (Σ_{o∈O} P{o}) E[V | d]
              = 0.    (6)
Generally speaking, it does not make sense to condition a decision on an observation if the expected values of the decision alternatives are independent of the result of the observation. This (somewhat obvious) observation will be the basis for the identification of observations in DTPOP. DTPOP identifies candidate observations by constructing observation plans that are conditionally relevant to a value node in the plan model.
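A toy numeric check of Equations 4-6 (the payoff tables below are ours, for illustration): EVPI is positive when the value depends on the observation, and exactly zero when it does not.

```python
p_o = {0: 0.5, 1: 0.5}  # prior over the observable O

def evpi(value):
    """value(d, o) plays the role of E[V | d, o] in Equations 4 and 5."""
    with_obs = sum(p_o[o] * max(value(d, o) for d in (0, 1)) for o in p_o)
    without = max(sum(p_o[o] * value(d, o) for o in p_o) for d in (0, 1))
    return with_obs - without

def value_relevant(d, o):    # acting to match O pays off
    return 1.0 if d == o else 0.0

def value_irrelevant(d, o):  # V ignores O entirely
    return 0.6 if d == 0 else 0.4

print(evpi(value_relevant))    # 0.5: observing O lets the policy always match it
print(evpi(value_irrelevant))  # 0.0: conditioning on an irrelevant O cannot help
```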
4.3 Identifying Relevant Nodes in Belief Networks

4.3.1 The Ne, Np, Nr and Ni sets
The structure of the belief network can reveal many of the independence relationships between uncertain variables. There are several classes of techniques for identifying independencies in belief networks and related structures [Shachter, 86, 88, 90, 98; Geiger, et al, 89; Lauritzen, et al, 90]. DTPOP is based on the Ne and Np sets3 defined by Shachter [98]. In particular, DTPOP’s observation discovery mechanism resembles the “Bayes Ball” algorithm [Shachter, 98] described at the end of this section. Np and its siblings were originally developed to identify those portions of a belief network
or influence diagram, B, that are actually required for the computation of a specific probability query P{J|K}.

Ne(J|K,B) denotes the set of requisite observations for computing probability query P{J|K} in belief network B. Ne(J|K,B) is the subset of K required in order to compute P{J|K}; that is, P{J|K} = P{J | Ne(J|K)}. In probabilistic planning, the Ne sets will be used to identify observations. If the belief network node describing the observable outcomes of an observation step So is not in an Ne set for one of the value nodes in the plan, then it is not possible to exploit the observable outcomes of So to increase the expected value of the plan.

Np(J|K,B) denotes the set of requisite probability nodes required to compute P{J|K}. All of the conditional probability distributions for the nodes in Np(J|K,B) are required in order to calculate P{J|K}. If none of the distributions in a step in a complete plan are in either the Ne or Np set of a value node, then there is no reason to execute that step: it cannot contribute to the expected value of the plan.

Ni(J|K,B) denotes the set of nodes that are irrelevant for computing P{J|K}. If a set of nodes L is irrelevant, changes to the nodes in L (either distributions or values) will have no effect on the joint distribution of J. L ⊆ Ni(J|K) implies that J ⊥ L | K.
3. These sets were originally called Nπ and NΩ sets [Shachter, 86, 88, 90].
Nr(J|K,B) denotes the set of nodes that are probabilistically relevant for computing P{J|K}. ¬(J ⊥ L | K) implies that L ⊆ Nr(J|K). If a node Xi is in the Nr set for a particular query P{J|K}, it is not necessarily a requisite element for that probability query. However, if a node Xk is observed that is either the same as Xi or is one of Xi’s successors, then Xi will be a member of Ne(J|K, Xk) or Np(J|K, Xk). The planner can discover observation plans by finding observable successors for relevant nodes in the influence diagram that models the plan.
4.3.2 The Bayes Ball Algorithm4
The Bayes Ball algorithm [Shachter, 98] computes the requisite and relevant sets for a probability query P{J|K} by passing messages around a belief network. The algorithm is initiated by calling Collect-Requisite (defined below) on each of the nodes in J. As the routines in the Bayes Ball algorithm visit belief network nodes, those nodes are added to the relevant or requisite sets. Bayes Ball consists of two routines:
• Collect-Relevant implements the top-down message illustrated in Figure 77. The routine is called “Collect-Relevant” because the routine always makes its argument into a probabilistically relevant node. This routine might also be called something like “Search for Requisite Observations”: the routine initiates a chain of calls that descend the belief network until an observation is discovered. When the observation is discovered, Collect-Requisite will travel back up the network telling nodes that they are required parts of the query calculation.
• Collect-Requisite implements the bottom-up message illustrated in Figure 76. This routine is called “Collect-Requisite” because the routine always adds either the value or the distribution of its argument to one of the requisite element sets (Ne or Np).
4. The Bayes Ball algorithm gets its name from the way that the messages bounce through the belief network (the grayed out arcs in Figures 77 and 76).
In both routines, the condition in parentheses on the starred (*) lines has been added to prevent looping. The algorithm is initiated by calling the top-level routine Bayes_Ball with the query variables and belief network. After Bayes_Ball executes, the global variables will contain the sets of interest: Ne(J|K,B), Ni(J|K,B), Np(J|K,B), and Nr(J|K,B).

//Global Variables
NodeSet D    //Deterministic nodes
NodeSet K    //Observed nodes
NodeSet Ne   //Nodes with requisite values
NodeSet Np   //Nodes with requisite distributions
NodeSet Nr   //Nodes that are structurally relevant
NodeSet Ni   //Nodes that are irrelevant

Bayes_Ball(NodeSet Jin, NodeSet Kin, BeliefNet B = &lt;N, A&gt;) {
  //Initialization
  D := { i ∈ N | i is deterministic }
  K := Kin
  Ne := ∅
  Np := ∅
  Nr := ∅
  for all j ∈ Jin, Collect-Requisite(j, B)
  Ni := (N – Nr) ∪ Kin
  Nr := Nr – Kin
}

FIGURE 73. Bayes_Ball. In order to compute the requisite and relevant sets for the query P{J|K} in belief network B, call Bayes_Ball(J, K, B).
Collect-Requisite(Node j, BeliefNet B) {
*  if j ∈ K and (j ∉ Ne) {              //j is observed.
     Nr := Nr + j                       //j is structurally relevant.
     Ne := Ne + j                       //j has a requisite value.
   }
*  if j ∉ K and j ∉ D and (j ∉ Np) {    //j is not deterministic.
     Nr := Nr + j                       //j is structurally relevant.
     Np := Np + j                       //j has a requisite distribution.
     for all k ∈ Pa(j, B), Collect-Requisite(k, B)
     for all k ∈ Ch(j, B), Collect-Relevant(k, B)
   }
*  if j ∉ K and j ∈ D and (j ∉ Np) {    //j is deterministic.
     Nr := Nr + j                       //j is structurally relevant.
     Np := Np + j                       //j has a requisite distribution.
     for all k ∈ Pa(j, B), Collect-Requisite(k, B)
   }
}
FIGURE 74. Collect-Requisite.
Collect-Relevant(Node j, BeliefNet B) {
*  if j ∈ K and (j ∉ Np) {              //j is observed.
     Nr := Nr + j                       //j is structurally relevant.
     Np := Np + j                       //j has a requisite distribution.
     Ne := Ne + j                       //j has a requisite value.
     for all k ∈ Pa(j, B), Collect-Requisite(k, B)
   }
*  if j ∉ K and (j ∉ Nr) {              //j is not observed.
     Nr := Nr + j                       //j is structurally relevant.
     for all k ∈ Ch(j, B), Collect-Relevant(k, B)
   }
}
FIGURE 75. Collect-Relevant.
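For readers who prefer running code, the pseudocode of Figures 73-75 transcribes almost line-for-line into Python. The sketch below is ours (a network is a dict mapping each node to its parent list; `det` marks deterministic nodes); it is a transcription, not the dissertation's implementation:

```python
# Bayes Ball, following Figures 73-75: returns (Ne, Np, Nr, Ni) for P{J|K}.
def bayes_ball(J, K, parents, det=frozenset()):
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    Ne, Np, Nr = set(), set(), set()

    def collect_requisite(j):
        if j in K and j not in Ne:        # j is observed: its value is requisite
            Nr.add(j); Ne.add(j)
        elif j not in K and j not in Np:  # j's distribution is requisite
            Nr.add(j); Np.add(j)
            for k in parents[j]:
                collect_requisite(k)
            if j not in det:              # deterministic nodes stop the downward pass
                for k in children[j]:
                    collect_relevant(k)

    def collect_relevant(j):
        if j in K and j not in Np:        # observed: bounce back up to the parents
            Nr.add(j); Np.add(j); Ne.add(j)
            for k in parents[j]:
                collect_requisite(k)
        elif j not in K and j not in Nr:  # unobserved: keep descending
            Nr.add(j)
            for k in children[j]:
                collect_relevant(k)

    for j in J:
        collect_requisite(j)
    N = set(parents)
    # Ni is computed before K is removed from Nr, as in Figure 73.
    return Ne, Np, Nr - set(K), (N - Nr) | set(K)

# Chain A -> B -> C, query P{C | A}: A's value and the distributions of
# B and C are requisite; nothing else is relevant.
Ne, Np, Nr, Ni = bayes_ball({'C'}, {'A'}, {'A': [], 'B': ['A'], 'C': ['B']})
print(Ne, Np)
```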
4.3.3 Ne, Np, Ni, and Nr Examples
I will use a few ‘representative’ inference problems in order to illustrate the purpose of each component of the Bayes Ball algorithm. Each of these inference problems corresponds to a single case in Bayes Ball.
4.3.3.1 Collect-Requisite
In the examples shown in Fig. 76, R is known to have a distribution that is a requisite element of an unnamed probability query Q. In order to compute Q, we know that we will also need the distribution or the value for all of the parents of R (because the posterior distribution for R is a function of R’s parents). The Bayes Ball algorithm sends a message (via Collect Requisite) to T informing T that either its value or its distribution is required for the computation of the query.
FIGURE 76. Collect Requisite Cases. The double oval in the third of these diagrams denotes a deterministic node. In these three figures, the distribution associated with R is known to be a requisite element for the calculation of probability query Q. Depending on the evidence observed for T and the type of the distribution over T, Collect-Requisite will send Collect-Requisite and/or Collect-Relevant messages to its neighbors.
Case A: T is observed
If T is observed (Fig. 76.A), T's value is a requisite element in the calculation of the target query, but T insulates its own distribution as well as the distributions and values of its neighbors from R. If T’s value is known with certainty, no knowledge of the neighbors of T can reveal anything more about the value of T.
Case B: T is an unobserved node with an uncertain distribution.
If T is an uncertain node and is not observed (Fig. 76.B), then the distributions or values for T's parents will also be required for the computation. In addition, if there is a likelihood function (that is, an observation) below T that would change the probability of T, then the distributions and values required for the computation of that likelihood function will become requisite elements of the calculation of the overall probability query. In this example, T will send a message to its parents (via Collect Requisite) informing them that they contain requisite elements of the query calculation. In addition, T also sends a message to its children (via Collect Relevant) informing them that they are relevant (not requisite) to the calculation by merit of their effect on the distribution of T. If one of these children or one of its successors is observed, that child will also become a requisite distribution for the probability query.
Case C: T is an unobserved node with a deterministic distribution.
If T is not observed and is deterministic (Fig. 76.C), T's parents are required for the computation of the target query because the distribution or value of these parents determines the distribution or value of T. The Bayes Ball algorithm will wait, however, to declare that the children of T are relevant to the target probability query. If T's parents are known or are functionally determined, then T itself will be functionally determined, and no knowledge about the children of T can change the distribution over T. If one of the ‘requisite’ ancestors of T is uncertain, that ancestor will send a message downward to all of its descendents telling them that they are relevant (Fig. 76.A).
4.3.3.2 Collect Relevant
While Collect Requisite collects distributions and observation values that are known to be requisite for a probability query, Collect Relevant identifies nodes that would be requisite if their values or the values of any of their descendents were known. Suppose that it is known that one parent R of a node T is structurally relevant to a probability query Q (see Fig. 77). If the value for T or some descendent of T is known, then R (and T) will become requisite elements for the probability query Q.
FIGURE 77. Relevance Examples. Both of the figures above depict fragments of a belief network. If a node is shaded (right), then we know its value. In both of these figures, node R is known to be relevant (but is not necessarily requisite) to the calculation of a probability query. Which of the neighbors of T are relevant and which are requisite observations or requisite probability distributions?
Case A: T is observed
In Fig. 77.a, T is observed. T breaks the ‘line of communication’ between the children of T and the predecessors of T (including R). For example, say that T is a variable with a distribution over some set of numbers. The successors of T are numeric functions of T. If T is known, the successors of T are determined exactly. No further information about the parents of T will give us any more information about the successors of T.
This observation does, however, correlate the parents of T. Suppose that T is the sum of the values of the parents of T. The value for any one parent can be computed given T and values for the other parents of T. Since the distributions for the parents of T are correlated, they all are relevant (and requisite) for computing the distribution over R.
In order to “collect” these requisite distributions, T sends a message to all of its parents telling them that they are now requisite parts of the calculation for probability query Q. Note that observations tend to make the descendents of the observation mutually independent (Figure 76) and tend to make the ancestors of the observation mutually dependent.
Case B: T is not observed
In Fig. 77.b, T is not observed. T is probabilistically relevant to the query–that is, if we could observe T, T’s distribution and value would become requisite elements of the current probability computation. For example, let’s assume that our goal is to compute the probability distribution P{R}. T is defined to be the sum of R and another unobserved parent S. If T is unknown and all of its descendents are unknown, then T is not a requisite element for the calculation of P{R}. If T or any of its descendents is known, however, T does become relevant. For example, R might stand for the variable “Rain in Palo Alto on Monday.” T might stand for the variable “Sunday’s forecast for Monday’s weather.” If we do not know the forecast, T, or any of the consequences of the forecast (for example, other people’s actions), then our belief in the probability of R is not changed by the possibility of there being an unobserved forecast. If the forecast T or its consequences are known, however, our belief in R will change. Figure 77.b illustrates how the Bayes Ball algorithm captures these ideas. If R is relevant, it sends a message to T informing it that it is relevant to the calculation at hand and, if observed, will be a requisite element for the calculation of Q. T, in turn, passes the same message to its children, both to tell them that they are relevant and to tell them to initiate a search for a likelihood distribution (that is, an observation) that is downstream of T. At some later point during the Bayes Ball algorithm, one of T’s descendents might report back that there is an observation Z that is downstream of T. At this point, the first case (Fig. 77.a) applies and Collect Requisite will walk back upstream telling T and other ancestors of Z that they are now requisite elements of the calculation for probability query Q.
4.4 Dynamic Influence Diagrams
The plan model for a complete DTPOP plan is a dynamic influence diagram M [Tatman+Shachter, 90]. A dynamic influence diagram is a normal influence diagram with multiple subvalue nodes5 {V1, …, Vn} instead of the normal single value node. Subvalue nodes have no descendents. The set of subvalue nodes represents a single additive utility function whose value is V = V1 + … + Vn.6 An additive utility function can make it easier to compute decision policies. The set of subvalue nodes that are relevant for computing Di is V ∩ De(Di), the set of all of the subvalue nodes that are descendents of Di [Shachter+Peot, 92]. The Ne, Ni, Np, and Nr sets can be computed on the dynamic influence diagram using a variant of the Bayes Ball algorithm in time linear in the size of the influence diagram [Shachter, 98].
This algorithm can be used to prune the set of information predecessors for each of the decisions in the influence diagram, resulting in smaller decision policies and faster inference.

Theorem 14 (Ne in Dynamic Influence Diagrams) [Shachter+Peot, 92]: Let M be a dynamic influence diagram with subvalue nodes V = {V1, …, Vn} and decision nodes D = {D1, …, Dm}. The value function for M is the sum V = V1 + … + Vn. Let Di be any decision node in M. The set of value nodes that are relevant when computing Di is Vi = V ∩ De(Di). The set of information predecessors that are relevant to Di is Ne(Vi | Di ∪ I(Di)).

Proof: This proof can be found in Shachter+Peot [92].

5. These nodes are called subvalue nodes because they are only components of the implicitly represented value function.
6. Tatman and Shachter also consider multiplicative value functions. In this dissertation, we will only consider additive value functions.
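Read operationally, Theorem 14 says that Vi is just the set of subvalue nodes reachable from Di by following arcs forward, after which a Bayes-Ball pass on Vi prunes the information predecessors. The forward-reachability half is a few lines; the sketch below is my own illustration (helper names and the toy diagram are assumptions, not the dissertation's code):

```python
from collections import deque

def descendants(children, node):
    """All nodes reachable from `node` by following arcs forward (BFS)."""
    seen, frontier = set(), deque([node])
    while frontier:
        for c in children[frontier.popleft()]:
            if c not in seen:
                seen.add(c)
                frontier.append(c)
    return seen

def relevant_subvalues(children, subvalue_nodes, decision):
    """Vi = V ∩ De(Di): only subvalue nodes downstream of Di matter."""
    return set(subvalue_nodes) & descendants(children, decision)

# Toy diagram: D1 -> X -> V1, and a separate D2 -> V2.
children = {"D1": ["X"], "X": ["V1"], "V1": [], "D2": ["V2"], "V2": []}
print(relevant_subvalues(children, {"V1", "V2"}, "D1"))  # -> {'V1'}
```

Only V1 is downstream of D1, so V2 (and everything requisite only for V2) can be ignored when computing D1's policy.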
5.0 DTPOP: Contingent Planning
In UDTPOP, step execution is unconditional. Observations made during plan execution have no effect on the choice of future actions. In this chapter, I will describe a contingent planner1 called DTPOP. The objective of DTPOP is to develop plans that have multiple contingencies, each designed to handle certain anticipated sources of uncertainty.

Suppose that our objective is to plan a party. If we know that it will be sunny outside, the optimum plan is to hold the party outdoors. If we know, however, that the weather is uncertain, we may develop one or more contingency plans in order to improve the utility of our “party plan” when the weather is bad. This kind of reasoning is performed by a contingent planner. A contingent planner constructs a plan, looks for sources of uncertainty in this plan that might result in poor outcomes, and adds contingencies to make the plan robust to these anticipated sources of uncertainty.

A plan is contingent if the execution of a step is conditioned on observations collected while executing other steps. An execution policy is associated with each step that dictates whether that step should execute or not. This execution policy is a function from the values of previous observations and step execution decisions to either “execute this step” (true) or “don’t execute this step” (false). One contingent plan for the party problem described above might be:

1. Historically, such plans were called conditional plans [Warren, 76]. Several authors [Pryor, 93; Draper, 94] have proposed to call these plans contingent plans to distinguish them from plans composed of actions whose effects (but not execution) are contingent on the outcomes of earlier actions [Pednault, 88a, 88b, 89; Pemberthy+Weld, 92].
1. Reserve an indoor location.
2. Reserve an outdoor location.
3. Watch the weather forecast.
4. If the weather forecast is for rain, then hold the party indoors.
5. If the weather forecast is for sun, then hold the party outdoors.
Notice that in general, a contingent plan may include “setup” actions (like reserving a location) prior to the sensory action and contingent branch.
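The execution-policy view of this plan is easy to make concrete. In the sketch below (my own illustration; step names and the single observation variable are assumptions), each step carries a policy mapping the observations made so far to an execute/don't-execute decision; the setup and sensing steps execute unconditionally and the two "hold the party" steps are contingent:

```python
# Each step's execution policy maps the observations made so far to
# True ("execute this step") or False ("don't execute this step").
plan = [
    ("reserve-indoor",  lambda obs: True),                        # setup
    ("reserve-outdoor", lambda obs: True),                        # setup
    ("watch-forecast",  lambda obs: True),                        # sensing
    ("party-indoors",   lambda obs: obs.get("forecast") == "rain"),
    ("party-outdoors",  lambda obs: obs.get("forecast") == "sun"),
]

def execute(plan, forecast):
    """Run the plan in order, recording which steps actually execute."""
    obs, executed = {}, []
    for name, policy in plan:
        if policy(obs):
            executed.append(name)
            if name == "watch-forecast":
                obs["forecast"] = forecast   # the only observation here
    return executed

print(execute(plan, "rain"))
```

With a rain forecast, both reservations and the sensing step execute, followed only by the indoor branch; the outdoor branch's policy returns false and the step is skipped.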
The Contingent Planning Problem
DTPOP solves the following problem: given a planning problem and a bound N, where

R   is the reward utility function,
A   is the set of allowable actions,
IC  is the initial distribution over all of the variables in the domain, including the distribution over observations made prior to plan execution, and
N   is an upper bound on the number of steps in the plan,
DTPOP identifies the contingent plan with N or fewer steps that has the highest expected value, or no plan if no plan can result in an outcome better than the worst possible.2 The bound on the number of steps is essential: DTPOP cannot identify the plan with the highest possible expected value, even if actions have a strictly positive cost. This is because some contingent planning problems have an optimal solution that has, in the worst case, an infinite number of steps. Suppose that we are offered the following lottery: we pay $10 in order to flip a fair coin (a coin that has a 50% probability of heads and a 50% probability of tails). If the coin lands heads up, then we win $100 and the game is over. If the coin lands tails up, then we get nothing, but have the opportunity to play again. The optimal plan in this situation is to
2. That is, the goal is impossible.
continue to play until we win regardless of the amount of money spent thus far.3 Since we are certain that the coin is fair, the history of coin flips is irrelevant to future decision making–after losing each time, we are, in essence, offered the original deal. If it was rational to accept the original deal (and we make decisions commensurate with the Δ property), then it is rational to accept the new deal. Note that this deal has a finite expected value even though the number of steps is unbounded:

(1/2 + 1/4 + …) · $100 – (1 + 1/2 + 1/4 + …) · $10 = $100 – $20 = $80
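The series can be checked numerically: on play k (reached with probability (1/2)^k) the player has paid 10k dollars in entry fees and won $100. A quick sketch:

```python
# Expected net value of the repeated coin-flip lottery, truncated far
# enough out that the remaining tail is negligible.
expected = sum((0.5 ** k) * (100 - 10 * k) for k in range(1, 200))
print(round(expected, 6))  # -> 80.0
```

The $100 term sums to 100 (we win with certainty eventually) and the fee term to 10 times the expected number of plays, 2, recovering the $80 figure.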
DTPOP cannot represent this plan with a finite number of steps because it cannot represent loops.4 Step execution order in a DTPOP plan is represented as a partial-order, thus only a finite number of steps can be executed in any particular trace of plan execution.
5.1 Overview
This chapter describes DTPOP, a complete contingent partial-order planner.
• Section 5.2 reviews several contingent planners and uses the problems encountered by these planners to motivate the design of DTPOP.
• Section 5.3 describes DTPOP’s plan construction phase.
• Section 5.4 describes DTPOP’s model construction and plan optimization phase.
• Section 5.5 describes how DTPOP identifies open uncertainties that can profitably serve as roots for observation plans.
• Section 5.6, Heuristics, discusses some of the goal-ordering choices that can be made in the implementation of DTPOP.
• Section 5.7 provides an extended example of the planning algorithm.
• Section 5.8 summarizes the formal properties of the planner, including proofs of soundness and completeness.
• Section 5.9 relates the approach used in DTPOP to recent planning research.
• Finally, Section 5.10 discusses the limitations of DTPOP and proposes further work.

3. This strategy relies on our adherence to the Δ-property [Howard, 68], our absolute belief that the coin is fair, and only a moderate amount of risk aversion. DTPOP assumes the decision maker is risk neutral (that is, he makes decisions according to the expected value of the outcomes).
4. Williamson and Smith [95] and Hanks [97] have derived planners that can identify plans with loops in restricted domains.
5.2 History
In this section, I will review four influential contingent planners in order to 1) show how contingent plans are constructed and 2) outline the limitations of each of these planners.
• Warplan-C: I use Warplan-C [Warren, 76] to demonstrate the basic plan construction strategy used by most contingent planners.5 Warplan-C demonstrates how contingent plans can be constructed incrementally through the addition of successive plan branches, each designed to handle a specific outcome of an uncertain step.
• CNLP: I use CNLP [Peot+Smith, 92] to illustrate two mechanisms: a threat resolution operation called branching and a simple labelling technique for computing the step execution policies.
• Cassandra [Pryor+Collins, 93]: The threat resolution and labeling scheme used for Cassandra was the inspiration for DTPOP’s threat resolution scheme.
• C-Buridan: C-Buridan [Draper, et al, 93; 94a; 94b] is an extension of the Buridan planner [Kushmerick, et al, 94; 95] that uses the branching and labeling mechanisms proposed for CNLP. I will use C-Buridan to introduce the action representation scheme used by DTPOP. The representation scheme proposed for C-Buridan extends the CNLP knowledge representation scheme to separate sensing effects from effects that materially affect the world.

In Section 5.9.1, I review several other contingent planners that have a less direct relationship to DTPOP.

5. Section 5.9.1 reviews research on a competitive approach: partially-observable and fully-observable Markov decision processes.
5.2.1 Warplan-C
The first contingent planner that I will describe is Warplan-C [Warren, 76], the first contingent total-order6 planner. Warplan-C allows certain actions (called multi-outcome actions)7 to have two or more mutually-exclusive effects. An example of such a step might be “Coin-Flip,” which has two mutually-exclusive effects: Coin-Face = heads or Coin-Face = tails. Warplan-C constructs a contingent plan by adding a contingent branch for each outcome of each multi-outcome action used by the plan. During the first pass of the planning algorithm, Warplan-C ‘pretends’ that multi-outcome conditional actions are really normal “single” outcome actions. For example, if a step has two mutually-exclusive outcomes P or ¬P, Warplan-C will pretend that this step is really two single outcome actions, one with the effect P and the other with the effect ¬P. Warplan-C first develops a non-contingent plan, selecting effects from normal and multi-outcome actions in order to satisfy preconditions. If Warplan-C uses a conditional step, it conditions the execution of the remainder of that branch on the specific outcome used by that branch.

6. A total-order plan is one that consists of a sequence of actions that are always executed in the same order that they appear in the sequence. Every step is ordered with respect to every other step. Contrast this with the idea of a partial-order planner, like UDTPOP, that allows unrelated steps to be executed in any order. Before approximately 1994, total-order and partial-order planners were referred to as linear and nonlinear planners, respectively.
7. In the original Warplan-C paper, contingent planning was called conditional planning and multi-outcome actions were called conditional actions.
FIGURE 78. Planning in Warplan-C. Step execution order is from top to bottom. The fork at the conditional step M in 78.d is a choice point. If the outcome of M is p, the branch on the left executes, otherwise, the branch on the right executes.
For example, in Figure 78-a, Warplan-C uses effect p of multi-outcome action SM to satisfy a precondition of S3. Steps S3 and SG may only execute if p is the result of executing SM. After Warplan-C completes a branch, it looks for dangling “else” clauses: unplanned-for outcomes of multi-outcome actions. If it finds one, it reinvokes itself to plan for that else clause. For this second pass, Warplan-C starts with an initial plan consisting of a copy of the goal step and a copy of the initial segment of steps up to and including the multi-outcome step. The planner replaces the multi-outcome step in this sequence with a step whose preconditions are the fluents of the initial segment that were used by S3 and SG.8 The outcome of this new step is ¬p (see Fig. 78.b). Warplan-C finds a new plan using this new initial segment (Fig. 78.c), combines this result with the original plan (Fig. 78.d), and looks for more dangling else clauses. This process repeats until Warplan-C has constructed a plan for each dangling clause. In our example, the plan depicted in Figure 78.d is complete. Note that a Warplan-C plan can add steps before the multi-outcome step in order to satisfy the ‘else’ part of the multi-outcome step (for example, S4 in Fig. 78-c,d). These steps are added after the multi-outcome step and are demoted to resolve threats with other components of the plan.

8. Warplan-C adds these new preconditions to prevent future planning activities from threatening the “causal links” added from the initial segment to the first contingent branch of the plan.
Limitations of Warplan-C
1. Total-Order Planner: Warplan-C overcommits to step-ordering. Since a single partial-order plan can represent many total orders, this may mean that there are fewer plans in the search space of a partial-order variant on Warplan-C.
2. Perfect Observations: Warplan-C does not explicitly represent sensory actions. It is assumed that every uncertain event is perfectly observable at the time that that event occurs.
3. No Branch Fusion: Warplan-C makes no attempt to fuse the branches of the contingent plan; once step execution diverges at a choice point, the branches of the plan never rejoin. Of all of the planners that I will discuss (including DTPOP), only C-Buridan [Draper, et al, 93] can fuse the effects of two or more actions in separate branches in order to satisfy a single precondition of another step.9

5.2.1.1 CNLP
CNLP (the “Conditional NonLinear Planner”10) [Peot+Smith, 92] extends the basic plan construction technique of Warren [76] to partial-order planning. In CNLP, a step Si becomes contingent if one of the outcomes of a multi-outcome step supports one of the preconditions of Si. A step can also become contingent due to a threat-resolution operation called branching.11 Branching resolves threats by forcing the threat and the consumer of the threatened link to be contingent on mutually-exclusive outcomes of an earlier conditional step.

9. C-Buridan can do this because it is based on the multiple-support noncontingent planner Buridan.
10. Russell+Norvig [95] use the more modern term CPOP: “Contingent Partial-Order Planner.”
FIGURE 79. Planning in CNLP.
FIGURE 80. Threat Resolution in CNLP.
The process used by CNLP when constructing plans is illustrated in Figures 79 and 80. CNLP first attempts to construct a non-contingent plan (Fig. 79-a). When a multi-outcome step (for example, Sm) is added to the partial plan, CNLP labels each of the mutually-exclusive outcomes of this step with a unique observation label (in this case, there are two mutually-exclusive outcomes labelled ¬α and α). If any step Si is supported by one of these labelled outcomes, CNLP adds the observation label to the context label of Si. This context label for Si is a set of observation labels that summarizes the observation values that Si depends on. A contingent step is allowed to execute if the elements of its context label are consistent with the labels of the outcomes resulting from the execution of earlier multi-outcome steps. For example, suppose that the context label for step S1 (Figure 79-a) contains the observation label α from multi-outcome step Sm. If Sm executes and the outcome labelled α results, then S1 can execute. If the result of executing Sm is the mutually-exclusive outcome ¬α, then S1 cannot execute because one element of the context label of S1, α, is not consistent with the observed outcome, ¬α.

Each step inherits the union of the context labels from the steps that establish its preconditions; thus each step can execute only if the steps that establish its preconditions have executed. If a CNLP plan can accomplish the goal for every combination of observation labels, that plan is complete. Otherwise, the planner identifies a conjunction of observation labels that does not have a corresponding plan branch and constructs a new plan branch (Fig. 79-b). CNLP adds a new goal step labelled with the unplanned-for conjunction of observation labels. This label is distinct from the context label and is called the reason for that plan branch. The reason for any step Si is the union of the reasons of the goal steps that Si supports. During planning, CNLP uses effects from older branches in the plan to support preconditions in a new branch as long as the establisher of the new causal link has a context label that is compatible with the reason label of the consumer in the new branch. In Figure 79-c, for example, S3 might use S2 to satisfy a precondition because every observation label in the context of the initial conditions step (empty) is consistent with the reason, ¬α, of the goal, SG2, that S3 supports. On the other hand, S1 cannot be used to support S3 because the context for S1 (α) is not consistent with the reason for S3 (¬α).

11. Referred to as “conditioning” in the original CNLP paper. The new term was proposed by Draper, et al [93, 94a, 94b].
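The context-label test in the example above reduces to a small consistency check. In this sketch (my own; the (step, outcome-label) pair representation is an assumption), a step may execute only if no label in its context contradicts an outcome that has already been observed:

```python
def consistent(context, outcomes):
    """True iff no context label names a step whose observed outcome
    differs from the outcome the label requires. Steps that have not
    executed yet impose no constraint here."""
    return all(outcomes.get(step, label) == label for step, label in context)

# Sm has executed and produced the outcome labelled "alpha".
outcomes = {"Sm": "alpha"}
assert consistent({("Sm", "alpha")}, outcomes)          # S1 may execute
assert not consistent({("Sm", "not-alpha")}, outcomes)  # S3 may not
```

The reason-label compatibility test used when reusing establishers across branches is the same check applied between a context set and a reason set.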
Threats between steps in separate plan branches can be resolved via the normal threat resolution operations, promotion or demotion, or via branching. Figures 80-a and 80-b illustrate branching. S1 threatens the causal link between S5 and S3 (Fig. 80-a). Since S1 and S3 each support goal steps with mutually-exclusive reason labels, it is possible to resolve the threat by making the execution policy for S1 and S3 contingent on mutually-exclusive outcomes of an earlier multi-outcome step. The threat is resolved by adding a conditioning link from outcome label ¬α of Sm to S3. This conditioning link adds ¬α to the context labels of S3 and all of the successors of S3. Note that S4 and S5 are not contingent. It is possible (though not necessary) to execute these steps even if the outcome of Sm is α. If every label in the reason for a step is inconsistent with the observation labels of past events, then that step should not be executed.
Limitations of CNLP

1. Perfect Observations: CNLP, like Warplan-C, assumes that every event is perfectly observable at the time that the event occurs.
2. No Fusion: CNLP makes no attempt to fuse the branches of plans.

5.2.1.2 Cassandra
Cassandra [Pryor+Collins, 93] uses a different labelling system, based on positive and negative labels, instead of reason and context labels to identify branches and to resolve branchable threats. The solution that I propose was suggested by the branching mechanism used by Cassandra.
• Positive labels (analogous to reason labels) indicate the set of goals that are supported by each plan element. The positive label for a step Si in the plan is the union of the positive labels of Si's immediate successors. The positive label for a goal step is the set of contingencies that the branch is designed to handle.
• Negative labels indicate the set of goals that would be impossible to accomplish if a given plan element were to be executed. Each step inherits the negative labels of its immediate predecessors. For example, say that SE establishes precondition p for step SC. SE is a multi-outcome step that has three distinct outcomes, α1, α2, and α3. α2 is the only outcome that contains p. In this case, if either outcome α1 or α3 results from executing SE, then we should not execute SC. The negative label for SC would include both α1 and α3.

Say that a threat ST threatens SE → SC whenever the set of outcome labels C is true. Cassandra can resolve this threat by making it impossible, for example, to execute both ST and either SE or SC when the set of outcome labels C is true. Cassandra does this by doing one of the following: Either
• Cassandra adds C to the negative label of ST, or
• Cassandra adds C to the negative label of SE or SC.
The design of Cassandra’s branch operator suggests an alternate approach that will allow us to delay specific decisions about contingencies: say that the negative reason for a step S1 is actually a list of steps that should have execution policies that are mutually-exclusive to the execution policy of S1. The meaning of the negative label becomes: “S1 can execute as long as none of the steps in its negative label have been executed.”

DTPOP is based on two basic sets of ideas. The first set of ideas, presented in Section 5.2.2, concerns how contingent plans should be constructed and, in particular, how they use information in execution policies. The second set of ideas, presented in Section 5.2.3, concerns the identification of observations that are relevant to one or more of the contingencies in the plan.
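The step-based negative label suggested above has a one-line semantics. A sketch (my own; the step names are assumptions):

```python
def may_execute(negative_label, executed):
    """A step can execute as long as none of the steps in its negative
    label (steps with mutually-exclusive execution policies) have run."""
    return negative_label.isdisjoint(executed)

# S_T and S_C must never both execute, so each lists the other.
negative = {"S_T": {"S_C"}, "S_C": {"S_T"}}
assert may_execute(negative["S_T"], executed={"S_E"})           # S_C not run
assert not may_execute(negative["S_T"], executed={"S_E", "S_C"})
```

Note that this defers the choice of which observation separates the two steps: the mutual exclusion is recorded now, and a concrete branching observation can be selected later.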
5.2.1.3 C-Buridan
C-Buridan [Draper, et al, 93, 94a, 94b] solves a serious knowledge representation problem in CNLP. CNLP used a 3-valued logic to model the result of multi-outcome actions. In this representation scheme, an observation step for observing proposition p might have two mutually-exclusive fluents, obs(p) and obs(¬p), as effects and have the fluent unknown(p) as a precondition. The predicate unknown is used in this step to prevent the planner from making a desired observation outcome for a step true by executing the step repeatedly–a fact can only be observed if it was initially unknown. Several authors [Goldman+Boddy, 94c; Pryor+Collins, 93, 96; and Draper, et al., 93] point out that this knowledge representation scheme works only in relatively limited situations, for example when sensors are perfect observers of facts. Pryor+Collins [93] point out, correctly, that this knowledge representation scheme confuses the distinction between observing a fact and ‘accomplishing’ that fact.

In order to resolve this problem, C-Buridan introduced a new knowledge representation scheme that separates the observation content of a step from the effect of the step. In C-Buridan, the outcomes of a multi-outcome step are grouped into distinct sets (called discernible equivalence classes or DECs). If it is possible for the plan execution agent to distinguish between two outcomes, those outcomes will be in distinct DECs. If it is not possible to distinguish between two or more outcomes of a multi-outcome action, those outcomes will be collected into a single DEC. The label for the DEC is the only observable fact resulting from step execution. We must infer the distribution over uncertain effect variables using only the DEC labels resulting from the execution of this step and other steps in the plan.

For example, a gluing action (Figure 81) might result in no bond, a weak bond, or a strong bond. We can immediately distinguish between no bond and a stronger bond by observing the directly observable effects of the action; if the part doesn’t bond, it will fall apart. Suppose further that we cannot distinguish between a strong or weak bond using only the observable outcomes of this step. The effects “strong bond” and “weak bond” will be grouped into the same DEC. The fact that we cannot immediately distinguish between the outcomes of a step does not preclude us from using another step to further resolve the ambiguous outcome. For example, “test-bond” might have DECs that allow us to distinguish between strong bonds and weak bonds.
FIGURE 81. Outcomes for a gluing operation.
Actions can have purely observational effects: a single action can have two or more empty DECs. A pure observation has no impact on the world, but may reveal information concerning uncertain events. A conditional effect is observable if it has more than one DEC. An effect variable of an action is perfectly observable if there is a distinct DEC for each of the mutually-exclusive outcomes of that action. Similarly an action is a perfect observer of its preconditions if there is a distinct DEC for each possible value of the precondition. C-Buridan uses DECs to resolve threats in almost exactly the same way that CNLP uses context labels. When C-Buridan wishes to resolve a threat via branching, it chooses some new or existing step that it can possibly order before both the threat and one of the steps (establisher or consumer) in the threatened causal link. The threat and the chosen causal link step are made contingent on mutually-exclusive sets of DEC labels by adding conditioning links that add those DEC labels to the context labels of the steps involved in the threat.
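The gluing example can be encoded directly: the action's outcomes are partitioned into DECs, and the DEC label is all that execution reveals. A sketch (my own; the names are assumptions):

```python
# Outcomes of the gluing action, partitioned into discernible
# equivalence classes as in Figure 81.
glue_decs = {
    "not-bonded": {"Bond=No"},
    "bonded":     {"Bond=Weak", "Bond=Strong"},   # indistinguishable
}

def observe(outcome, decs):
    """The DEC label is the only observable fact resulting from execution."""
    for label, outcomes in decs.items():
        if outcome in outcomes:
            return label
    raise ValueError(f"unknown outcome: {outcome}")

assert observe("Bond=Weak", glue_decs) == observe("Bond=Strong", glue_decs)
assert observe("Bond=No", glue_decs) == "not-bonded"
```

A later “test-bond” step would simply carry its own DEC partition, one that splits {Bond=Weak, Bond=Strong}, letting the plan resolve the ambiguity left by the gluing step.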
C-Buridan solves many of the representation problems of CNLP. Since the information content of a step is represented separately from the outcomes of the step, it is easy to model actions with noisy observations or actions with partially-observable outcomes. There are, however, several problems with the CNLP or C-Buridan approach that make it difficult to develop decision-theoretic planners that are either efficient or complete.

1. The Branch Replication Problem: C-Buridan makes actions contingent only to resolve threats.12 Say that we are trying to plan to treat a patient that has one of two possible diseases. One of the roles of the plan will be to discover exactly what disease the patient has using a number of noisy observations. C-Buridan might construct a plan branch for treatment of the first disease, a plan branch for treatment of the second disease, and then add a single observation to remove any resource contention threats between these two treatment regimens. In order to make the two branches of the plan contingent on further observations, C-Buridan has to add identical copies of these two treatment plans to generate further branchable threats. CNLP suffers from a similar problem.

The Impact: The depth of the search space, d, for the planner is proportional to the number of actions, arcs, and threats in the plan. The number of plans in this search space, in turn, is an exponential function of its depth. The number of branches (and, hence, the number of actions) in a C-Buridan or CNLP plan can be exponential in the number of observations in the plan, leading to an explosion in the size of the search space.

2. The Observation Relevance Problem: In order to be "complete,"13 C-Buridan has to allow for the addition of new observation actions in order to resolve threats via branching. For example, in the medical treatment decision above, once we have added the second, conflicting treatment, there may be no actions in the partial plan with observable outcomes. When C-Buridan attempts to resolve this threat, it may need to branch using a new observation action. The current implementation of C-Buridan makes a nondeterministic choice between all possible observation actions because it cannot determine a priori which of the observation actions will be the best to resolve the threat.
In Section 5.2.3, we will argue that it is impossible for C-Buridan to decide which observation to use because it doesn't have enough information at the time that the branching operation is applied to make an informed choice.

12. This is not quite true. C-Buridan can make the predecessors of a contingent step contingent themselves if these predecessors are used only to support the contingent step.
13. Actually, neither C-Buridan nor CNLP is complete in any compactly describable way.
The Impact: The Observation Relevance problem in C-Buridan increases the branching factor, b, of the search space substantially. C-Buridan searches over all possible ways to branch the threat, using pre-existing observation actions in the plan in addition to new actions generated from the domain description. Increasing the average branching factor results in a substantial increase in the size of the search space.

3. The Early Commitment Problem: At the time that branching is used in C-Buridan and other contingent planners of that ilk [Goldman+Boddy, 94a, 94b, 94c], there is not enough information in the plan to allow the planner to make an informed choice on how, precisely, to make subsequent steps contingent on the alternative observable outcomes of the multi-outcome action. In order to be moderately complete, these planners must branch on every possible partition of the DECs of each observable step. For example, say that we wish to branch a threat using a step S_i that has n DECs. If neither the threat nor the causal link is already contingent on an outcome of S_i, then there are 2^n − 2 possible ways to resolve the threat via branching. Ideally, we would pick the one partition that results in the highest overall utility for the completed contingent plan and explore only that partition. Unfortunately, at the time that we are forced to make this decision in an "early execution policy commitment" planner such as C-Buridan, we cannot know how the utility of the plan depends on the observation.14

The Impact: Early Commitment increases the branching factor exponentially with the number of DECs in the plan.

Section 5.2.2 (this section) describes the approach used for solving the first and third problems (Branch Replication and Early Commitment) as well as one aspect of the second problem (Observation Relevance). A solution for the second problem will be described in Section 5.2.3.
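The 2^n − 2 count is easy to verify by enumeration. A sketch (function and label names mine):

```python
from itertools import combinations

def branch_partitions(dec_labels):
    """All ways to split a step's DECs into two non-empty sets, one
    gating the threatening step and one gating the causal link.
    The two roles differ, so (A, B) and (B, A) are distinct splits."""
    labels = list(dec_labels)
    splits = []
    for r in range(1, len(labels)):
        for subset in combinations(labels, r):
            rest = tuple(l for l in labels if l not in subset)
            splits.append((subset, rest))
    return splits

# A step with n DECs admits 2^n - 2 such splits.
for n in range(2, 7):
    assert len(branch_partitions([f"dec{i}" for i in range(n)])) == 2**n - 2
print(len(branch_partitions(["d0", "d1", "d2"])))  # 6
```

An early-commitment planner must treat each of these splits as a distinct search branch, which is the source of the exponential blowup.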
5.2.2 Constructing Contingent Plans
The first set of ideas that I will discuss concerns how a planner should use the results of observations to select between plan branches. DTPOP delays determining step execution policies until relatively late in the planning process. In this section, I review the relevant past work in contingent planning and use this work to motivate both the representation and the "late execution policy commitment" strategy used in DTPOP.

14. Why? One reason for this is that the preconditions of the observation action may not be established. The outcomes for a new observation action added to the plan only become correlated with events in other plan branches when enough causal links have been added to the plan in order to establish an active path [Geiger, et al., 90] between the observation and the events in the other plan branches. In particular, the shared establisher must have a conditional effect distribution that is uncertain and is either unobservable or partially observable.
5.2.2.1 DTPOP
DTPOP uses the strategy proposed in the last section to delay both observation selection and execution policy commitments until the plan is otherwise complete. DTPOP separates construction of the plan into two distinct phases, a plan construction phase that is similar to the plan construction algorithms discussed in this section and a plan optimization phase that computes the execution policy for each step in the plan.15
Plan Construction in DTPOP
DTPOP introduces only one new plan element type to the set of plan elements already used by UDTPOP (steps, causal links, ordering constraints, and relevance constraints). This new plan element is called a mutex constraint. A mutex constraint constrains a set of steps {S_1, …, S_n}⊥ to have mutually-exclusive execution policies. A DTPOP step S_i can execute if the steps that established its preconditions have executed and S_i is not a member of any mutex constraint that contains another step S_j that has already been executed. Section 5.3.2 describes how the structure of the plan can be used along with the mutex constraints to derive a general test for the mutual exclusion of any two steps. If two steps are not mutually exclusive, they are said to be compatible.

DTPOP uses a reason label that is analogous to CNLP's reason label or Cassandra's positive labels. Like CNLP, the reason for a step S is the set of terminal steps that S supports. A terminal step is any step that can logically terminate a branch. This includes all of the goal steps in the plan as well as some observation steps.
15. The latter phase solves a restricted Partially-Observable Markov Decision Process (POMDP).
Add-Link uses the reason label with the mutex constraints to determine when it can add a causal link from one step (for example, S_E) to another step (S_C). We can only add the causal link S_E →V S_C if S_E is compatible with all of the terminal steps in the reason of S_C. Why? If S_E and any terminal step in S_C's reason are mutually exclusive, then adding the causal link will make it impossible to execute that particular terminal step.
Mutex constraints simplify branching. In DTPOP, S_T threatens S_E →V S_C iff
• it is possible for S_T to execute before S_C and after S_E,
• V is an effect variable of S_T, and
• S_T is compatible with S_C.
Branch repairs the flaw by making the last condition in the definition of the threat false. If the intersection of the set of reasons for S_T and the set of reasons for S_C is empty, then we can resolve the flaw S_T ⊗ (S_E →V S_C) by adding the single mutex constraint {S_T, S_C}⊥. The requirement that reason(S_T) ∩ reason(S_C) be empty prevents Branch from making it impossible to execute the terminal steps in reason(S_T) ∩ reason(S_C).

Let's compare the branch operator of DTPOP to that of C-Buridan/CNLP. Our new branch operator has a (search space) branching factor of 1. By comparison, the branch operator of C-Buridan/CNLP has a branching factor of

Σ_{S_i ∈ O} (2^|DECs(S_i)| − 2),

where O is the set of all possible observation steps16 and DECs(S_i) is the set of discernible equivalence classes for S_i.
16. This is the union of all of the observation steps in the domain description with the set of all observation steps in the plan.
DTPOP defers the relatively “high branching factor” process of matching observations and outcomes until the plan is otherwise complete. This basic strategy of deferring plan expansion operators with a high branching factor has been shown to reduce the complexity of the search space considerably [Peot+Smith, 93; Joslin+Pollack, 94, 95].
Plan Optimization
Once a plan is constructed, DTPOP computes the execution policy for each step using dynamic programming (this is explained in Section 5.4). The mutual exclusion constraints constrain the decision policies in the influence diagram used to model the plan, preventing the creation of threats.
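As a toy illustration of what plan optimization computes, consider a single noisy test gating two mutually exclusive treatment steps. All of the numbers below are invented, and the brute-force enumeration of policies stands in for the dynamic programming of Section 5.4:

```python
from itertools import product

# Hypothetical numbers: a prior over two diseases, a noisy test, and
# utilities for treating each disease correctly or incorrectly.
p_disease_a = 0.3
p_pos_given_a, p_pos_given_b = 0.9, 0.2          # test accuracy
utility = {("treat-a", "a"): 10, ("treat-a", "b"): -5,
           ("treat-b", "a"): -5, ("treat-b", "b"): 10}

def expected_utility(policy):
    """policy maps each test outcome to exactly one treatment step,
    respecting the mutex constraint {treat-a, treat-b}."""
    eu = 0.0
    for disease, p_d in (("a", p_disease_a), ("b", 1 - p_disease_a)):
        p_pos = p_pos_given_a if disease == "a" else p_pos_given_b
        for obs, p_obs in (("pos", p_pos), ("neg", 1 - p_pos)):
            eu += p_d * p_obs * utility[(policy[obs], disease)]
    return eu

# Enumerate all execution policies and keep the best one.
best = max((dict(zip(("pos", "neg"), choice))
            for choice in product(("treat-a", "treat-b"), repeat=2)),
           key=expected_utility)
print(best, expected_utility(best))
```

With these numbers the optimal policy treats disease a on a positive test and disease b otherwise; committing to either treatment unconditionally does strictly worse, which is exactly why policy selection is deferred until the observations are in the plan.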
5.2.3 Identifying Observations
The missing element from the story in the last section is an approach for solving the Observation Relevance problem. Recall that C-Buridan must add random observation actions to the plan in order to resolve threats. DTPOP solves the Observation Relevance problem by interleaving regression on preconditions with forward planning from relevant uncertainties in order to incrementally construct an active path connecting a value function to an observation that can be used for threat resolution. In this section, I will use the conditional independence properties of belief networks presented in the last chapter in order to motivate the design of the open uncertainty mechanism in DTPOP.
Plan Generation
In order to motivate the open uncertainty mechanism, I will describe how the completeness proof for DTPOP uses probabilistic relevance. The completeness of both UDTPOP and DTPOP is proved via a clairvoyant proof [McDermott, 91]. A clairvoyant proof relies on the use of a "clairvoyant meta-reasoner" to make all of the correct choices in a non-deterministic planning algorithm. In the proof, the clairvoyant (magically) knows the form of the final plan (the exemplar) and uses this exemplar to devise a sequence of plan construction operations that replicates the exemplar. Since the clairvoyant finds one possible path to an arbitrary solution in the search space, every solution must be in the search space.

In order to reconstruct the exemplar plan, the clairvoyant needs to have some way to systematically explore the exemplar. In the proof for UDTPOP, the clairvoyant reconstructed the plan by systematically reconstructing the chain of support for preconditions. The basic strategy for DTPOP will be to reconstruct the set of conditional effects that are probabilistically relevant to value functions. All actions must be relevant to the expected value, because otherwise an action could be deleted from the plan without decreasing its expected value. The DTPOP clairvoyant uses the Bayes Ball algorithm discussed in the last chapter in order to systematically identify conditional effects in the exemplar that are relevant to value. The sequence of actions used for identifying these nodes will be used to drive the planner, which is itself isomorphic to the Bayes Ball algorithm. While the Bayes Ball algorithm needs only to search through the plan in order to identify relevant probability nodes, the planning algorithm will construct new steps or links whenever it attempts to search forward or backward to a step that doesn't exist.

The Collect-Requisite routine in the Bayes Ball algorithm collects all of the distributions that are required for the evaluation of queries involving the value nodes in a plan. Collect-Requisite chains back through the predecessors of a node using exactly the same mechanism that is used for regression on preconditions in UDTPOP: the points in a partial plan where Collect-Requisite "runs out of plan" to chain back on are called open preconditions.
When there is an open precondition, UDTPOP uses Add-Step and Add-Link to construct the requisite support for the preconditions, step by step. In order to develop a complete planner for domains with noisy observations17, we need to develop a plan construction analogue to the Collect-Relevant routine. The observation construction mechanism in DTPOP is based on the following simple mechanism: at every point where the Bayes Ball algorithm might explore forward for relevant observations, DTPOP inserts an open uncertainty that acts as a root for new observation plans in exactly the same way that open conditions act as the root for subplans that support the preconditions of goals or cost functions. When DTPOP encounters one of these open uncertainties, DTPOP searches forward to find a step that has a conditional effect with a precondition variable that matches the variable in the open uncertainty. DTPOP constructs these new segments of the plan using either Add-Step-Forward or Add-Link-Forward.
[Figure: an exemplar plan (left) and the reconstructed plan (right); a winding arrow traces the active path from the value node through an open uncertainty and a relevant observation to an observation step]
FIGURE 82. Observation Plan Construction in DTPOP.
This process is illustrated in Figure 82. The figure on the left illustrates the full plan model for a small exemplar plan. The winding arrow follows the active path explored by the Bayes Ball algorithm when it is searching for observations that are relevant to the value node. Open uncertainties are unobserved nodes that are probabilistically relevant to a value function in the plan. These open uncertainties represent points in the plan where the Bayes Ball algorithm would call Collect-Relevant if the open uncertainty had a child. Instead of tracing along a pre-established belief network, DTPOP extends the plan by adding causal links that point forward in time to new steps (Add-Step-Forward) or to unused conditional effects of pre-existing steps (Add-Link-Forward).

17. I will argue in Section 5.9.4 that a planner only needs to explore open uncertainties if the plan is partially observable.
The approach of using a probabilistic relevance criterion to drive model construction is not new. AlterID [Breese, 92], Frail3 [Goldman+Charniak, 93] and Haddawy [94] use a similar technique to assemble influence diagrams from a rule-like declarative knowledge representation scheme. Frail3 has even been applied to the development of probabilistic underpinnings for a CNLP-style planner [Goldman+Boddy, 94c]. However, this dissertation represents the first time that probabilistic relevance criteria have been applied directly to plan construction.
5.3 DTPOP
In the next 5 sections, I will describe the DTPOP algorithm.
• This section will outline the top level design of the planning algorithm.
• Section 5.4 describes the plan model construction and plan optimization algorithm.
• There are a few choices for ordering planning decisions. Some of these choices have implications with respect to completeness. Section 5.6 presents some heuristics for decision-ordering in DTPOP.
• Section 5.7 presents an extended example of DTPOP.
• Section 5.8 describes the formal properties of DTPOP.
The DTPOP planner is a modification of UDTPOP. These are the primary differences between UDTPOP and DTPOP:
• In DTPOP, there are two phases in the plan generation process. In the first phase (plan construction), a set of alternative plans is identified using a regression planner similar to the UDTPOP planner. In the second phase (plan optimization), the planner identifies the optimum step execution order and computes the optimal step execution policies. There is no analogue for this latter phase in UDTPOP.
• The UDTPOP planner works backward from the goal, establishing support for unsupported (open) preconditions. In addition to precondition satisfaction, DTPOP also projects the consequences of open uncertainties in order to identify relevant observations.
• DTPOP modifies the add-link and persist-support operations for contingent planning. A causal link may not be added if addition of the link makes it impossible to satisfy one of the goal steps due to a mutex constraint.
• DTPOP adds new plan construction operations for threat resolution (branch), for observation identification (add-link-forward, add-step-forward) and for the addition of new branches (add-branch).

The problem specification for DTPOP is similar to that of UDTPOP. DTPOP solves the following problem: given a planning problem and a bound on the number of steps N, where R is the reward utility function, A is a set of uncertain, partially-observable actions, and IC is the distribution over all of the variables prior to the execution of the plan, DTPOP finds the contingent partial-order18 plan with N or fewer steps that is of maximum expected value. DTPOP returns no plan if there is no plan that can result in an outcome better than the worst outcome in the reward utility function.

In the first phase of plan construction, DTPOP constructs a partial-order plan that resembles that produced by UDTPOP. A partial plan is a tuple of the plan elements used by UDTPOP together with a new field, M, the set of mutex constraints added by branch. In the second phase of plan construction, DTPOP uses the partial-order plan constructed in the first phase to support the construction of a decision tree or influence diagram to compute the optimal step execution policies. The causal links and ordering constraints together
18. Actually, we will see that there are a lot more restrictions on the partial order in DTPOP than there were in UDTPOP.
guarantee that the evaluation algorithm will always compute the correct expectation for any set of step execution decisions consistent with the mutual exclusion constraints.

The final plan is a topological sort [S_1, …, S_N] of the steps in the plan annotated with a set of step execution policies. The step execution policy for S_i is a function from the values of observable variables and step execution decisions in [S_1, …, S_{i−1}] into a binary value. If this binary value is true (false), S_i should (should not) execute.

A sketch of the first phase of the algorithm is shown below; the second phase of the algorithm will be described in Section 5.4.

Complete-Plan( P, Flaws )
  if ( constraint_violated(P) ) return ∅
  else if ( Flaws = ∅ ) either
*   1.a. P′ = Add-Branch( P )
*   1.b. P′ = Optimize( P )
  else
    Choose a flaw, x, in Flaws.
    2. if ( x is an open condition) either
      2.a. P′ = Add-Step( x, P, Flaws )
      2.b. P′ = Add-Link( x, P, Flaws )
    3. if ( x is a threat) either
      3.a. P′ = Promote( x, P, Flaws )
      3.b. P′ = Demote( x, P, Flaws )
      3.c. P′ = Persist-Support( x, P, Flaws )
*     3.d. P′ = Branch( x, P, Flaws )
    4. if ( x is an open uncertainty) either
*     4.a. P′ = Add-Link-Forward( x, P, Flaws )
*     4.b. P′ = Add-Step-Forward( x, P, Flaws )
*     4.c. P′ = Remove-Uncertainty( x, P, Flaws )
  return P′

FIGURE 83. The top level of the DTPOP planning algorithm. Starred lines indicate the major differences with respect to UDTPOP.
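The nondeterministic "either"/"Choose" structure of Figure 83 is typically realized as a search over a stack or queue of partial plans. A minimal sketch of that control loop, with every refinement operator stubbed out and all names mine:

```python
# Sketch of the search behind Complete-Plan: the nondeterministic
# "either" becomes a stack of alternative refinements. The operator
# and plan representations here are placeholders, not DTPOP's.

def complete_plan(initial_plan, flaws_of, refinements, is_violated, finish):
    stack = [initial_plan]
    while stack:
        plan = stack.pop()
        if is_violated(plan):
            continue                  # constraint violated: dead end
        flaws = flaws_of(plan)
        if not flaws:
            result = finish(plan)     # stands in for Add-Branch/Optimize
            if result is not None:
                return result
            continue
        flaw = flaws[0]               # "Choose a flaw, x, in Flaws"
        for op in refinements(flaw):  # 2.a/2.b, 3.a-3.d, or 4.a-4.c
            child = op(flaw, plan)
            if child is not None:     # operator returned the empty plan
                stack.append(child)
    return None

# Toy instantiation: "plans" are sets of established conditions, and
# the only refinement establishes the chosen open condition.
goals = {"g1", "g2"}
plan = complete_plan(
    initial_plan=frozenset(),
    flaws_of=lambda p: sorted(goals - p),
    refinements=lambda flaw: [lambda f, p: p | {f}],
    is_violated=lambda p: False,
    finish=lambda p: p,
)
print(sorted(plan))  # ['g1', 'g2']
```

A best-first variant would replace the stack with a priority queue ordered by an admissible utility bound, as the dissertation's abstract describes for UDTPOP.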
In the discussion in the next few sections, I will emphasize only the differences between DTPOP and UDTPOP and will not discuss the operations in the planner that are the same.
5.3.1 DTPOP Plan Elements
Before launching into the description of DTPOP, I need to refine the informal definitions of "support," "terminal step," "branch," "mutex," "threat," and "reason" introduced in the first sections. A plan branch is a complete UDTPOP plan that is crafted in order to accomplish some goal in a restricted set of circumstances. For example, a plan branch might be a plan to make a particular observation or to accomplish the goal when a particular condition is true. The goals for plan branches will be called terminal steps. The plan branch itself will be identified by its terminal step. Each step in a plan may be a component of more than one plan branch. The reason label on a step lists all of the terminal steps (plan branches) that the step "supports."
Definition 20 (Causal Support): A step S_A provides causal support to step S_B if there is a directed chain of causal links from S_A to S_B. "Support" will be used as a synonym for "provide causal support to."
Definition 21 (Terminal Step): A step S_T is called a terminal step if either
1. S_T is a goal step, or
2. S_T is an observation step that is relevant to a utility function in a branch that S_T does not provide causal support for.

Definition 22 (Branch): A terminal step S_T along with the set of steps that support S_T is called a branch.
Definition 23 (Reason): The reason for a step S_A is the set of terminal steps that S_A supports.
The definition of pertinence in UDTPOP was based on the existence of a single terminal step, the goal. We will need to modify the definition of pertinence to account for multiple terminal steps. Each step needs to be pertinent to all of the goals that it supports. If a step doesn’t provide effective support for a goal, then it must provide support for a step cost function or for a terminal observation.
Definition 24 (Pertinent Preconditions in Complete Plans): Let R_min be the minimum possible value of the reward utility function, R_min = min_X R(X). Value o_j of precondition variable O_j in a complete plan is pertinent when it is the graphical predecessor of a cost subvalue node, reward subvalue node, or terminal observation; o_j is possible; and P{R > R_min | O_j = o_j} > 0 for every goal supported by O_j.
Definition 25 (PertinentP Precondition): A precondition value, A = a, is pertinentP if one of the following is true:
• A = a is a precondition of a cost function.
• A = a is a precondition of a terminal observation that is not otherwise pertinentP.
• A = a is a precondition of the reward function and there exists a utility outcome such that R > R_min.
• A = a is a precondition of a conditional effect e, and e is pertinentP.
5.3.2 Mutual Exclusion Constraints and Execution Policies
In order to resolve threats between alternative contingency plans, DTPOP introduces mutual exclusion constraints between steps. A mutual exclusion constraint, {S_A, S_B}⊥, restricts the execution policies of S_A and S_B so that it is impossible to execute both of these steps.
Both DTPOP and UDTPOP reject the notion of multiple support for actions; thus a step can only execute if the steps in its support set execute.19 This implies that two steps are mutually exclusive if they share a mutual exclusion constraint or their causal predecessors share a mutual exclusion constraint. This can be determined easily using the co-mutex and mutex sets defined below.
Definition 26 (Support Set): The support set of S_A in plan P is all of the steps that support S_A in P. This set is denoted support(S_A, P).

Definition 27 (Mutex Constraint): A mutual-exclusion (mutex) constraint is a set {S_1, …, S_n}⊥ of steps. The steps in the mutex constraint are forced to have mutually-exclusive execution policies.

The co-mutex set of S_A is the set of steps in the support set of S_A that also belong to mutex constraints. The mutex set of S_A is the set of steps that share a mutex constraint with any step in the support set of S_A. If one of the steps in the mutex set of S_A is executed, then S_A cannot execute, because one of the steps in its support set wasn't executed. If the intersection of the mutex set of one step and the co-mutex set of another step is not empty, then the steps are mutually exclusive.
Definition 28 (Co-Mutex Set): A step S_B is in the co-mutex set of a step S_A in plan P iff S_B is in {S_A} ∪ support(S_A, P), and S_B is an element of a mutex constraint in M, the mutex constraints of P.

Definition 29 (Mutex Set): A step S_B is in the mutex set of a step S_A in plan P iff there exists some S_X in {S_A} ∪ support(S_A, P) such that S_X ≠ S_B and both S_X and S_B appear in the same mutex constraint.
19. I will call this the DTPOP Single Support hypothesis.
Given these definitions and single support, two plan steps S_A and S_B are mutually exclusive if the intersection of co-mutex(S_A, P) and mutex(S_B, P) is non-empty.

Theorem 15 (Mutually exclusive execution): S_A and S_B are mutually exclusive if co-mutex(S_A, P) ∩ mutex(S_B, P) ≠ ∅.
Definition 30 (Compatible execution): If two steps S_A and S_B are not mutually exclusive, we will say that they are compatible.
5.3.3 Plan Flaws
There are two changes to the set of plan flaws in DTPOP:
1. The definition of threats is changed to account for mutual exclusion, and
2. A new flaw type, the open uncertainty, is introduced to identify points in the plan where observations might be added in order to increase the utility of the plan.
5.3.3.1 Plan Flaws: Threats

In order for a step S_T to threaten a causal link S_E →V S_C, it must be possible for both S_T and S_C to be executed in the same execution trace of the plan. Suppose that S_T and S_C are mutually exclusive. If S_T executes, then there is no causal link to threaten since S_C cannot execute. If S_C executes, then it must be the case that the threatening step S_T was never executed.
[Figure: a step S_T with effect variable V threatening the causal link S_E →V S_C]
FIGURE 84. Threats. S_T threatens S_E →V S_C in DTPOP only if S_T and S_C can both be executed in the same execution trace (that is, they are compatible).
Definition 31 (Threat): An action S_T is said to threaten a causal link S_E →V S_C if
1. S_T can possibly occur between S_E and S_C,
2. V is an effect variable of S_T, and
3. S_T and S_C are compatible.

5.3.3.2 Open Uncertainties
An uncertainty is open when it is possible to learn something from the uncertainty that will allow us to influence the value of the plan. The test for open uncertainties relies on the definition of the plan model given in Section 5.5.2. In this section, I will present a brief preview of the technique.

Briefly, an uncertain variable Open is an open uncertainty (denoted S_E →Open) if
1. Open is not observable.
2. There is a connected subgraph B′ of the information relevance network B for the plan that includes both S_E and a value node V_i.
3. Open is relevant to V_i in B′.
4. The steps corresponding to the nodes in B′ are mutually compatible.
5. All of the steps in B′ with the exception of V_i are possibly before V_i.

Let's discuss each of these requirements. Requirement 3 declares that the open uncertainty is probabilistically relevant to the utility function in a subset of the plan model (see Section 5.4). In addition, in order for it to be possible for the open uncertainty to influence the utility node, we must be able to observe both the open uncertainty and any other observation that makes the open uncertainty probabilistically relevant prior to making a decision that can influence the utility function (requirements 4 and 5).
[Figure: a belief network in which VOM Bias influences VOM Calibration and VOM Reading, and Battery Voltage influences VOM Reading and the utility node U]
FIGURE 85. Voltage Measurement Example. Nodes with observable outcomes are shaded.
A simple example (Figure 85) will illustrate why 4 and 5 are necessary. Say that the utility of a step SV in a plan is dependent on the amount of charge stored in a battery. We have an old volt-ohm meter (VOM) that we can use to test the battery, but we have reason to doubt its accuracy. One way of increasing the confidence that the charge measured by the VOM is correct is to test the VOM itself using a known voltage. The result of calibrating the VOM is relevant to the utility function of SV, but we can only take advantage of this information if we observe both the result of the calibration and the actual measurement prior to making a decision concerning SV. In order for this to happen, it must be possible for both of the observation steps to precede SV (condition 5) and it must be possible to execute both of these observation steps in the same execution trace (condition 4).
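The relevance test behind this example is ordinary d-separation, which Bayes Ball computes by reachability. Below is a sketch of the standard traversal applied to the network of Figure 85 (node names mine). The calibration result is irrelevant to U on its own, but becomes relevant exactly when the VOM reading is also observed, which is why requirements 4 and 5 insist that both observations be executable before the decision:

```python
def relevant(parents, start, observed):
    """Nodes d-connected to `start` given `observed`, via the standard
    Bayes-Ball/reachability traversal over a belief network."""
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].add(n)
    # Ancestors of the observed set (needed to open colliders).
    anc, stack = set(observed), list(observed)
    while stack:
        for p in parents[stack.pop()]:
            if p not in anc:
                anc.add(p); stack.append(p)
    visited, reach = set(), set()
    stack = [(start, "child")]            # pretend we arrived from a child
    while stack:
        node, came_from = stack.pop()
        if (node, came_from) in visited:
            continue
        visited.add((node, came_from))
        if node not in observed:
            reach.add(node)
        if came_from == "child" and node not in observed:
            stack += [(p, "child") for p in parents[node]]
            stack += [(c, "parent") for c in children[node]]
        elif came_from == "parent":
            if node not in observed:      # chain continues downward
                stack += [(c, "parent") for c in children[node]]
            if node in anc:               # collider opened by evidence below
                stack += [(p, "child") for p in parents[node]]
    return reach

# Figure 85: Bias -> {Calibration, Reading}; Battery -> {Reading, U}.
parents = {"Calibration": ["Bias"], "Reading": ["Bias", "Battery"],
           "Bias": [], "Battery": [], "U": ["Battery"]}
print("U" in relevant(parents, "Calibration", set()))        # False
print("U" in relevant(parents, "Calibration", {"Reading"}))  # True
```

The collider at VOM Reading blocks the path until the reading is observed; observing it "explains away" between bias and battery voltage and activates the path to U.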
Every plan construction operation that adds or modifies a causal link (Persist-Support, Add-Step, Add-Step-Forward, Add-Link and Add-Link-Forward) also inspects the plan to discover new open uncertainties. DTPOP delays the discovery of an open uncertainty until it is certain that all of the conditional effect nodes in the chain from the open uncertainty to the utility node that it influences are uncertain.
5.3.4 UDTPOP Modifications
Add-Step, Add-Link, Add-Support and Persist-Support are identical to their UDTPOP counterparts with two exceptions:
• Add-Link and Persist-Support must ensure that the reasons of the consumer of each new causal link are compatible with the establisher. If the establisher and any goal step in the consumer's reason are mutually exclusive, then adding the causal link will make it impossible to execute that particular terminal step.
• Add-Support and Persist-Support must check for any new open uncertainties created by the addition of a causal link and add them to the list of flaws. Add-Support and Persist-Support will not add open uncertainties if they have been identified previously by other plan construction operations.20
5.3.5 Threat Resolution: Branch
DTPOP introduces a new threat resolution operation, Branch. Branching resolves a threat S_T ⊗ (S_E →V S_C) by constraining the execution of the threatening step S_T and the consumer of the causal link S_C to be mutually exclusive via the addition of a mutex constraint {S_T, S_C}⊥. This mutex constraint signals the plan optimization algorithm that the execution policy for the plan should select to execute only one of these two steps. In order to ensure that Branch will not make it impossible to execute any terminal step, the intersection of the set of reasons for the threat and the consumer must be empty. It is impossible to use Branch to resolve a threat unless there are at least two terminal steps in the plan.
20. This restriction allows DTPOP to eventually remove all of the open uncertainties from the flaw agenda.
[Figure: the mutex constraint {S_C, S_T}⊥ added between the threatening step S_T and the consumer S_C of the causal link S_E →V S_C]
FIGURE 86. Branch. Branch signals the plan optimization algorithm that it must choose execution policies for S_T and S_C that make it impossible to execute both steps during the same execution trace.
Branch( T = S_T ⊗ (S_E →V S_C), P, Flaws )
  if ( Reason(S_T) ∩ Reason(S_C) = ∅ ) then {
    Complete-Plan( P with the mutex constraint {S_T, S_C}⊥ added to M, Prune(Flaws) )
  } else { return ∅ }

5.3.6 Discovering Alternative Plans: Add-Branch
DTPOP adds additional contingency plans via Add-Branch. Add-Branch adds a new goal step that serves as the root for a new plan branch. Add-Branch also adds mutex constraints to prevent the planner from reaping the reward from more than one goal step.

Add-Branch( P ) {
  let S_G = a new goal step
  Flaws = { Open | Open ∈ Pre(S_G) }
  M_new = { {S_G, S_i}⊥ | S_i ∈ Goals(P) }
  return Complete-Plan( P with S_G and the constraints M_new added, Flaws )
}

5.3.7 Identifying Observations: Add-Link-Forward and Add-Step-Forward
DTPOP identifies observations for open uncertainties by projecting forward sequences of actions that depend in some way on the open uncertainty. Add-Step-Forward adds new steps to the plan. Add-Link-Forward adds new links forward to unused or open preconditions of steps that are already in the plan. In either case, the brunt of the work for these operations is performed by Add-Support-Forward. Add-Support-Forward adds the new causal link, adds any new open conditions required, identifies new threats, and adds newly discovered open uncertainties. Open conditions may be added to both the consumer and the establisher of the new causal link. Open conditions are added to the establisher of the causal link to ensure that support will be added for each precondition that can influence the probability of the variable protected by the causal link. Open conditions are also added to the observable conditional effect distributions of the consumer of the causal link. If a conditional effect distribution, P_CE, is observable and the open uncertainty is a precondition of P_CE, then we should always observe the outcome of P_CE because it is relevant to utility and we can observe it for free.
Note that Add-Support-Forward does not remove the open uncertainty. Repeated observation steps can be used to reveal more and more about a specific uncertainty. For example, imagine that we wish to estimate the probability of heads for a biased coin. Every coin flip reveals more information about the probability distribution that describes the coin. Observing the result of any one coin flip does not necessarily eliminate the value of further observations. DTPOP makes an explicit decision to close an open uncertainty using Remove-Open-Uncertainty, described in the next section.
Add-Step-Forward( Open Uncertainty ou = ( S_E -Open-> ), Plan P, Flaws ) {
    if (there exists an action S_C in the set of action schema A such that
        Open is a precondition variable for a conditional effect of S_C) {
        return Add-Support-Forward( ou, S_C, P, Flaws )
    } else {
        return ∅
    }
}
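The coin-flip intuition above can be made concrete with a conjugate Beta-Bernoulli model. This is my illustration, not a model DTPOP prescribes: each observed flip updates a Beta belief about the coin's bias, and in this run the posterior variance keeps shrinking, so no single observation closes the uncertainty.

```python
from fractions import Fraction

def beta_update(alpha, beta, heads):
    """Conjugate update of a Beta(alpha, beta) belief after one coin flip."""
    return (alpha + 1, beta) if heads else (alpha, beta + 1)

def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution: a proxy for the
    remaining uncertainty about the coin's bias."""
    a, b = Fraction(alpha), Fraction(beta)
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

belief = (1, 1)                          # uniform prior over the bias
variances = [beta_variance(*belief)]
for flip in [True, False, True, True]:   # an observed sequence of flips
    belief = beta_update(*belief, flip)
    variances.append(beta_variance(*belief))

# In this run every observation shrinks the posterior variance, so closing
# the open uncertainty after a single flip would forgo useful information.
assert all(v2 < v1 for v1, v2 in zip(variances, variances[1:]))
```

(Posterior variance can occasionally rise for a surprising observation; the monotone decrease above holds for this particular sequence.)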
Add-Link-Forward( Open Uncertainty ou = ( S_E -Open-> ), Plan P, Flaws ) {
    if (there exists an action S_C in S such that
        S_C is possibly after S_E,
        S_E is compatible with the reasons of S_C,
        Open is a precondition variable of a conditional effect of S_C, and
        there is no causal link ( S_E -Open-> S_C ) or open condition ( -Open-> S_C )) {
        return Add-Support-Forward( ou, S_C, P, Flaws )
    } else {
        return ∅
    }
}
Add-Support-Forward( ou = ( S_E -Open-> ), S_C, Plan P, Flaws ) {
    L′ := L + ( S_E -Open-> S_C )
    let P′ := P with links L′
    ou_new := newOUs(P′)
    threats_new := newThreats(P′)
    CE_obs := a conditional effect that is observable and has Open in its preconditions
    pres_obs := the preconditions of CE_obs
    oc_link := newOCs(S_E, Open, P′) [21]
    oc_obs := { ( -V-> S_C ) : V ∈ pres_obs ∧ no link ( S -V-> S_C ) ∈ L′ } [22]
    Flaws′ := ( Prune(Flaws) − ou ) ∪ threats_new ∪ oc_link ∪ oc_obs ∪ ou_new
    return Complete-Plan( P′, Flaws′ )
}
5.3.8 Remove-Open-Uncertainty
In many cases, it is not desirable to attempt to develop observation plans for an open uncertainty. Remove-Open-Uncertainty removes open uncertainties unless doing so would abandon a half-created observation plan that terminates in an unobservable terminal
21. These are new open conditions for the establisher of the link that was just added. 22. These are new open conditions for the observable outcomes of S C that have Open as one of their preconditions.
step. Eventually, Remove-Open-Uncertainty must be called to remove every open uncertainty created in a plan. The two situations that we would like to avoid are the addition of extraneous steps to the plan and the addition of justifications (causal links) to conditional effect distributions that are never used. For example, if for some reason the planner constructs a plan with a terminal step that is neither a goal nor observable, then that step can be pruned from the partial plan without decreasing its expected value.23

Consider an open uncertainty ou = ( S_E -Open-> ), where Open is an outcome variable of conditional effect P_CE. ou can be removed if one of the following four cases holds:
1. One of the outcome variables (other than Open) of P_CE is an open uncertainty.
2. P_CE is observable.
3. One of the outcome variables of P_CE contributes a causal link to another step.
4. P_CE shares a precondition variable with another conditional effect on the same step that satisfies Case 1, Case 2, or Case 3. Furthermore, if there is a causal link or open condition on any precondition variable of P_CE, that precondition variable must be a precondition of a conditional effect that satisfies Case 1, Case 2, or Case 3.

At first blush, these restrictions seem complicated: let's examine each of these cases in turn.

Case 1: Even if ou is closed, there is another open uncertainty on P_CE that can prevent P_CE from being a terminal unobservable step (see Figure 88).
Case 2: If P_CE is observable, then a plan branch containing P_CE might constitute a useful observation plan.
23. Actually, these extraneous terminal steps can be pruned by the plan optimization algorithm (the execution policy for these steps will always be set to false). Still, it is inefficient to consider provably dominated plans: they increase the size of the search space and are also more complex to evaluate than plans that consist only of relevant steps.
Case 3: If P_CE contributes a causal link to another step, then P_CE is relevant to a goal.
FIGURE 87. Inefficient Plans. Remove-Open-Uncertainty prevents the planner from returning these two types of suboptimal plans. The plan on the left has a terminal step that is not observable. This step can be elided without decreasing utility. Even though the second step in the plan on the right is non-terminal, it contains an unused conditional effect that has “established” preconditions.
Case 4: Say that there is no open uncertainty attached to P_CE. If P_CE is not an observation and does not contribute a causal link to the plan, then the plan construction algorithm does not need to evaluate the probability of P_CE and, therefore, does not have to evaluate the probability of its preconditions. Causal links (and open conditions24) added to support the preconditions of P_CE are extraneous justifications (see Figure 87).
24. Every open condition eventually becomes a causal link.
FIGURE 88. Remove-Open-Uncertainty Cases 1 and 2. If the dark triangle were the only open uncertainty in Case 1, then removing the open uncertainty would create an unobservable terminal step. The open uncertainty can be removed, however, if there is another open uncertainty on the node that will be “forced” to eventually add an observable conditional effect distribution. In Case 2, we can remove the open uncertainty because an observable node is an allowable kind of terminal node.
FIGURE 89. Remove-Open-Uncertainty Cases 3 and 4.
In Case 3, the open uncertainty can be removed because the conditional effect already donates a causal link to the rest of the plan. In this case, either the conditional effect contributes to a goal, observation, or utility node, or there is a descendent of this node that has an open uncertainty that will eventually add an observation node. In Case 4, there is no need to expand the open uncertainty because all of the causal links into the step are used by conditional effect distributions that are either observable or donate a causal link to the rest of the plan.
Remove-Open-Uncertainty( ou = ( S_E -Open-> ), P, Flaws ) {
    Let P_CE := the conditional effect distribution of S_E that has Open in its
                outcome variables.
    if (    ∃ ( S_E -I-> ) with I ≠ Open                                    // Case 1
         or Observable(P_CE)                                                // Case 2
         or ∃ ( S_E -V-> S_C ) with V ∈ Outs(P_CE)                          // Case 3
         or ( ( ∃ ( S_E -V-> S_C ) with V ∈ Pres(P_CE)                      // Case 4
                or ∃ ( -V-> S_E ) with V ∈ Pres(P_CE) )
              and there exists P_CE2 such that V ∈ Pres(P_CE2), P_CE ≠ P_CE2, and
                  ∃ X ∈ Outs(P_CE2) such that ( ∃ ( S_E -X-> S_C ) ∨ Observable(P_CE2) ) ) )
    then return Complete-Plan( P, Flaws − ou )
}
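The four removal cases can be sketched in executable form. The interface below, dicts with `pres`, `outs`, `observable`, and `links_out` fields, is a hypothetical stand-in for DTPOP's conditional effect distributions, not the dissertation's actual data structures.

```python
def can_remove_open_uncertainty(effects, ce, open_var, open_uncertainties):
    """Schematic rendering of the four cases of Section 5.3.8. Each effect
    is a dict with 'pres' (precondition variables), 'outs' (outcome
    variables), 'observable' (bool), and 'links_out' (outcome variables
    that donate causal links) -- an illustrative stand-in only."""
    def satisfies_case_1_to_3(e):
        return (any(v in open_uncertainties and v != open_var
                    for v in e['outs'])          # Case 1: another open uncertainty
                or e['observable']               # Case 2: effect is observable
                or bool(e['links_out']))         # Case 3: donates a causal link

    if satisfies_case_1_to_3(ce):
        return True
    # Case 4: a sibling effect on the same step shares a precondition
    # variable with ce and itself satisfies Case 1, 2, or 3.
    return any(e is not ce
               and set(e['pres']) & set(ce['pres'])
               and satisfies_case_1_to_3(e)
               for e in effects)
```

For example, an unobservable effect with no links of its own can still be removed when a sibling effect sharing a precondition is observable (Case 4).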
5.4 Plan Optimization
All of the savings generated by the delay of branching decisions is paid for in the plan optimization phase. Recall that the branching operation resolves threats by declaring only that two branches should have mutually exclusive execution policies. In older contingent planners, these execution policies are determined incrementally at the time of threat resolution by making execution of the threat and the threatened branch contingent on mutually exclusive DECs of an observation step that is chosen either from the plan or generated from the domain description. This step is very expensive. Recall that the branching factor for the branch operation is

    Σ_{S_i ∈ O} ( 2^{|DECs(S_i)|} − 2 ).

Furthermore, the planner may end up expanding identical search spaces for plans that differ only in their execution policy. By delaying step execution policy decisions to the end of the planning process, we reduce the branching factor of branch (recall Section 5.2.2.1), but do not eliminate the overall exponential complexity of the planner. This complexity is pushed into the plan evaluation and optimization algorithm. Rather than using ‘relatively’ fast probabilistic inference algorithms to evaluate each plan (as is done in C-Buridan), DTPOP uses an exponential optimization algorithm on exponentially fewer partial plans.
Section Outline
• Section 5.4.1 describes the simple strategy used in DTPOP to model asymmetries in the plan model.
• Section 5.4.2 describes a technique to optimize step execution decisions that preserves completeness and describes some techniques for improving its performance.
• Section 5.4.3 describes an approximate optimization technique that optimizes only step execution decisions that are forced.
5.4.1 Contingent Plans and Asymmetric Influence Diagrams
An influence diagram is asymmetric if the existence of the event described by a node in the influence diagram is a function of the state of the graphical ancestors of that node. For example, suppose that we are constructing an influence diagram to model two very different career paths: that of a doctor and that of a rock star. The “doctor” half of this influence diagram refers to many events that only exist if we decide to go into medicine. For example, the uncertain variable “medical board scores” would not exist in the “rock star” half of the influence diagram.

A DTPOP plan is asymmetric because of step execution decisions. If we opt to execute a certain step, the mutual exclusion constraints in the plan might prevent us from executing another. The events described by the unexecuted step and the steps that it supports will not exist given past execution decisions.

DTPOP uses a simple representation scheme to model asymmetry in contingent plans. For quantitative purposes, DTPOP captures action execution asymmetry by adding an extra distinguished value, “not applicable” or na, to every uncertain variable in the plan.25 This ‘not applicable’ value indicates that the event corresponding to the variable will not occur.
25. NAs were first used in Pathfinder [Heckerman+Nathwani, 92; Heckerman, et al. 92] and are a component of Knowledge Industries’ belief network tool DXPress and SimNet [Heckerman, ***].
All of the step execution decisions in DTPOP have two possible values: T (meaning “execute”) and F (meaning “don’t execute”). This step execution decision conditions all of that step’s conditional effects and cost functions. If the step execution decision for Si is F, or if any of the preconditions of Si are na, then the conditional effects of Si are na with probability 1.0 and the cost function for Si is 0.0 (because the step is not executed). If Si is executed (because its step decision is to execute and none of its precondition variables are na), then the probability for its conditional effects and the value for its cost function are the same as the distributions specified in the definition of the action.

Given values for the decisions, the ‘na status’ of each of the variables in a plan execution model is completely determined, and we can use the normal Bayes Ball algorithm to determine the conditional relevance of nodes, ignoring any nodes that are na given the decisions.26 Given a set of mutex constraints, we can also determine whether there is any set of decisions consistent with the mutex constraints that allows one uncertain variable to be relevant to another. In order to determine the independence or relevance of uncertain variables in the network, we need to know only the step execution decisions or the constraints on those decisions.
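The na-propagation semantics described above can be sketched as follows; the dictionary-based distributions and helper names are illustrative assumptions, not DTPOP's implementation.

```python
def effect_distribution(execute, parent_values, base_dist):
    """na-propagation for a DTPOP plan model node: if the step's execution
    decision is F or any precondition variable is na, the conditional
    effect is na with probability 1.0; otherwise the action's own
    distribution applies. (Dicts mapping outcome -> probability are an
    illustrative stand-in for the real distributions.)"""
    if not execute or 'na' in parent_values:
        return {'na': 1.0}
    return base_dist

def step_cost(execute, parent_values, base_cost):
    """A step that does not execute contributes zero cost."""
    if not execute or 'na' in parent_values:
        return 0.0
    return base_cost
```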
5.4.2 Plan Optimization
Execution policies for the plan are computed by constructing and evaluating an influence diagram representing the plan. Sections 5.4.2.1 through 5.4.2.3 describe how the plan model is constructed. Section 5.4.2.4 suggests how this plan model is evaluated. Evaluating this plan model is very expensive; Section 5.4.2.5 provides some suggestions for reducing the complexity of policy evaluation.
26. This concept is similar to “subset independence” in Bayesian Multinets [Heckerman+Geiger, 93]. In a Bayesian Multinet, the structure of a belief network (instead of the list of ‘active’ variables) is conditioned on a single hypothesis node (instead of a set of decisions).
One major change from UDTPOP is the loss of action execution flexibility in DTPOP. In UDTPOP, we were allowed to execute actions in any order permitted by the ordering constraints in the partial-order graph. In DTPOP, on the other hand, the optimal step execution decision is a function of observations revealed during plan execution. When we change the order in which steps are executed, we may change the order in which observations are revealed. As a result, the execution decision policy on each step is subject to a different set of information. The plan model for DTPOP allows the planner to evaluate and optimize the utility of any action sequence commensurate with the ordering constraints of the partial-order plan; however, the value of some of the sequences will dominate the value of others.

The outline for the plan optimization algorithm is illustrated below. The optimization routine constructs a decision model for every topological sort of the plan, evaluates the decision model, and returns the topological sort and step execution policy with the highest expected value.

Optimize( P ) {
    let T := all topological sorts of the plan
    return argmax_{T_i ∈ T} evaluate( PM(P, T_i) )
}
FIGURE 90. Plan Optimization.
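A brute-force rendering of Figure 90 in Python, with `evaluate` standing in for constructing and solving the influence diagram PM(P, T_i). The enumeration via `permutations` is my simplification and is only sensible for small plans.

```python
from itertools import permutations

def topological_sorts(steps, before):
    """All linearizations of a partial order (brute force; fine for small
    plans). `before` is a set of (a, b) pairs meaning a must precede b."""
    for order in permutations(steps):
        pos = {s: i for i, s in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in before):
            yield list(order)

def optimize(steps, before, evaluate):
    """Skeleton of Figure 90: score the plan model for each topological
    sort and keep the best. `evaluate` stands in for building and solving
    the influence diagram PM(P, T_i)."""
    return max(topological_sorts(steps, before), key=evaluate)
```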
5.4.2.1 Plan Model Construction
The function PM constructs an asymmetric influence diagram given a complete plan P and a topological sort T = [S_i]_{i=1}^{n} of that plan. The input plan is a tuple P; the main difference between its structure and that of a UDTPOP plan lies in the mutex constraints, M. The mutex constraints play three roles in DTPOP:
• Each mutex constraint represents a “forced” decision. The plan execution agent (and the optimizer) must choose between the steps in the mutex using information collected from the steps that execute prior to the earliest step in the mutex. If no relevant information is available at the time that the earliest step in the mutex executes, then the decision policy for this first step cannot be contingent. This has the effect of forcing at least one of the steps in the mutex to have a non-contingent “don’t execute” policy. This is discussed further on pg. 197.
• Two probability nodes cannot be relevant to each other if they have mutually exclusive execution policies. Thus, the constraints on the execution policies allow DTPOP to conclude that there are more independencies in the plan than are apparent only from the graphical structure of the plan model. This allows DTPOP to reject more observation plans as being irrelevant. (See Sections 5.3.3.2 and 5.5.)
• The mutex constraints constrain the allowable policies for each execution decision. These constraints, in conjunction with other execution policy constraints, can greatly reduce the complexity of the plan optimization algorithm because the optimizer needs to consider fewer combinations of decisions/observations.
Modelling Plan Failure
Since DTPOP relies on UDTPOP’s pertinence and effectiveness criteria to eliminate bad branches, it is not possible for DTPOP to add a contingency if that contingency makes it impossible to achieve the goal. We would like to ensure that exactly one goal step is executed after executing the rest of the plan, no matter what the outcome is, in order to make sure that all of the outcomes of the plan are accurately evaluated. If it is impossible to achieve the goal in a particular context, then DTPOP has no way of adding a goal step that scores that context. If the reward utility function can be negative, this means that DTPOP will over-estimate the utility of plans that have some chance of failing. One solution to this problem might be to add an explicit “fail branch” alternative, as was done in CNLP [Peot+Smith, 92]. CNLP ensures that the execution policies for the goal
steps form a tautology. If it is impossible to achieve the goal in some context, CNLP adds an explicit failure action that declares that the goal is no longer possible to achieve. The approach that I have adopted instead is to rescale the reward utility function so that the worst possible outcome has value 0.0. If DTPOP does not execute any goal step, then the utility of plan failure is represented implicitly as being equivalent to the worst possible reward. The plan evaluator can underestimate the utility of an incompletely defined DTPOP plan, but it can never overestimate utility.27 The basic philosophy underlying this approach is that if it were possible to achieve a non-zero utility outcome, then there is a plan somewhere in the planner search space that will add exactly the right branch to take advantage of this outcome. While DTPOP will underestimate the expected value of an “incomplete” plan, the estimate for the fully elaborated final plan will be exact.
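The rescaling itself is a one-line additive shift; the dictionary representation of the reward function here is an illustrative assumption.

```python
def rescale_reward(reward):
    """Additively shift a reward function so its worst outcome has value
    0.0. An execution trace that scores no goal step then implicitly
    receives the worst-case reward, so the evaluator can underestimate,
    but never overestimate, plan utility."""
    worst = min(reward.values())
    return {outcome: value - worst for outcome, value in reward.items()}
```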
Modelling the Plan
PM is organized into two passes. The first pass assembles a belief network from the conditional effects and value functions contained in the plan. In the second pass, PM adds decisions and information arcs to turn this belief network into an influence diagram that is consistent with the selected topological sort.28
27. This policy does lead DTPOP to make less than optimal choices for decisions. If DTPOP aborts execution of the plan, the resulting plan state might have an outcome that partially satisfies the goal. Since DTPOP underestimates the utility of aborting the plan, it also underestimates the utility of any plan branch that might be aborted. 28. DTPOP, like UDTPOP, does not actually construct a distinct belief network to represent the plan model. Instead, it has algorithms that know how to interpret the plan itself as a belief network. The presentation here is intended to show what the equivalent belief network might look like.
PM( Plan P, T = [S_i]_{i=1}^{n} ) {
    B := PM_1(P)     // constructs a belief network capturing the cause and effect
                     // relationships between uncertain variables
    M := PM_2(B, T)  // turns the PM_1 belief network into an influence diagram
    return M
}
FIGURE 91. PM. PM_1 constructs a belief network from the conditional effects and value functions in the plan. PM_2 turns this belief network into a dynamic influence diagram by adding decisions and information arcs. There are two versions of PM_2.
5.4.2.2 Pass 1
PM_1 is composed of several component construction routines that each construct a fragment of the overall belief network. Each of these routines returns a belief network fragment of the form B_frag = ⟨N, A, O, V⟩, where N is a set of probability nodes, A is a set of arcs, O ⊆ N is the subset of N that is observable, and V is a set of subvalue nodes. I will abuse the union operator to combine the components of these belief network tuples: B_1 ∪ B_2 = ⟨N_1, A_1, O_1, V_1⟩ ∪ ⟨N_2, A_2, O_2, V_2⟩ = ⟨N_1 ∪ N_2, A_1 ∪ A_2, O_1 ∪ O_2, V_1 ∪ V_2⟩.
PM_1 calls Model_Step for each of the steps in the plan. Model_Step adds a conditional effect model for each of the conditional effects in the step that has established preconditions. Even when a conditional effect distribution and its outcome variables are not used in a causal link, its DECs might provide useful information for a future decision step. After PM_1 has added nodes and arcs to model each of the steps in the plan, Model_CL adds one or more belief network arcs for each causal link in the plan. Each of these arcs connects an outcome variable of one step to a conditional effect distribution of another.
PM_1( Plan P ) {
    B := ⟨∅, ∅, ∅, ∅⟩
    for each step S_i ∈ S
        B := B ∪ Model_Step( S_i, P )
    for each causal link L_j ∈ L
        B := B ∪ Model_CL( L_j, P )
    return B
}

FIGURE 92. PM_1. PM_1 constructs a belief network for the components in the complete plan P by concatenating models for the causal links and steps in the plan.
Model_Step( Step S_i, Plan P ) {
    B_step := ⟨∅, ∅, ∅, ∅⟩
    for each CE_k in Eff(S_i)
        if (there is a causal link for each precondition of CE_k)
            B_step := B_step ∪ Model_CE( CE_k, S_i )
    B_step := B_step ∪ ⟨∅, ∅, ∅, Cost(S_i)⟩
    return B_step
}

FIGURE 93. Model_Step. Model_Step adds belief network fragments for each supported conditional effect and adds value nodes to the step model.
Model_CE( Conditional Effect CE_i, Step S_k ) {
    N := { CE_i }, O := ∅, A := ∅
    if ( CE_i is perfectly observable ) then {
        O := O + CE_i
    } else {
        if ( CE_i is observable ) {
            let DEC_i be a deterministic distribution that maps the conditional
                outcomes of CE_i to their corresponding DECs
            N := N + DEC_i
            O := O + DEC_i
            A := A + ( CE_i, DEC_i )
        }
        for each E_j in OutVars(CE_i) {
            let E_j(S_k+) be a deterministic distribution that maps the
                conditional outcomes of CE_i to values in E_j
            N := N + E_j(S_k+)
            A := A + ( CE_i, E_j(S_k+) )
            if ( E_j is perfectly observable ) {
                O := O + E_j(S_k+)
            }
        }
    }
    return ⟨N, A, O, ∅⟩
}

FIGURE 94. Model_CE. Model_CE adds a subvalue node for each cost or reward function in the step. Recall (start of Section 5.5.2) that the reward utility function (and only the reward utility function) must be additively scaled so that the worst possible outcome of the reward function has a value of zero.
FIGURE 95. Observability and Model_CE. This figure illustrates the types of belief network fragments that are constructed by Model_CE. Nodes that are observable are shaded. Deterministic nodes are denoted by double ovals. The belief network on the left is a model for a conditional effect that is unobservable (all conditional outcomes are in the same DEC). This is the same structure that is constructed by UDTPOP’s Model_CE. The deterministic nodes to the right of the conditional effect node marginalize out the distributions for each of the effect variables. Arcs added by Model_CL attach to these distributions, not to the conditional effect itself. The belief network in the middle is a model for a conditional effect that is partially observable. In this example, one of the effect variables is perfectly observable. The belief network on the right models a perfectly observable conditional effect. In all cases, the pruning algorithms applied after constructing the dynamic influence diagram can eliminate extra structure.

Model_CL( Causal Link S_E -V-> S_C, Plan P ) {
    A := ∅
    for each CE_k in Eff(S_C) {
        if ( V ∈ PreVars(CE_k) and there is a causal link for each
             precondition of CE_k ) {
            A := A + ( V(S_E+), CE_k )
        }
    }
    for each UCE_k in Cost(S_C) {
        if ( V ∈ PreVars(UCE_k) ) {
            A := A + ( V(S_E+), UCE_k )
        }
    }
    return ⟨∅, A, ∅, ∅⟩
}
FIGURE 96. Model_CL. Model_CL connects effect variables to the preconditions of conditional effects and cost functions.
5.4.2.3 Pass 2
In the second pass of the model construction algorithm, PM_2 constructs an influence diagram for a given topological sort T = [S_i]_{i=1}^{N} of the steps of the plan. PM_2 does the following:
1. PM_2 adds a decision node Di, representing the execution decision, to each step Si that follows the first observation in the plan. This decision node has two values: T (meaning “execute”) and F. The execution decision for each step is connected to each conditional effect and value node in Si that has satisfied preconditions.
2. PM_2 adds an na value to every probability node that is a descendent of a decision. The value of each probability node is na with probability 1.0 given that any of its predecessors is na or the associated step execution decision is F. Otherwise, the conditional effect distributions are unchanged.
3. PM_2 modifies the goal reward function and step cost functions to have a value of zero if any of their predecessors are na or the step execution decision is F.
4. PM_2 adds information predecessors for each of the decisions. The initial set of information predecessors for decision Di on Si is the set of all decisions and all observable variables that have satisfied preconditions on steps that execute prior to Si in T.
5. PM_2 adds a constraint to each step execution decision that requires the execution decision to be F if the execution decision for any step in the support set is T.
6. Finally, for every mutex constraint, PM_2 adds a constraint between the decisions of the steps participating in the mutex constraint: a step decision policy cannot be T if the execution policy of an earlier step in the mutex constraint is T.
In Section 5.4.3, I will propose an alternative approach for constructing the plan model using only decisions required for threat resolution. This greatly reduces the total number of decisions. Unfortunately, the resulting execution strategies cannot be guaranteed to be optimal.
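Step 4 of the pass can be sketched as follows; representing steps by name and passing observables and decision steps as sets is my simplification of the plan model.

```python
def information_predecessors(sort, observables, decision_steps):
    """Step 4 of PM_2, simplified: the execution decision on each step sees
    every earlier decision and every earlier observable in the chosen
    topological sort."""
    preds = {}
    informative = []               # decisions and observables seen so far
    for step in sort:
        if step in decision_steps:
            preds[step] = list(informative)
        if step in decision_steps or step in observables:
            informative.append(step)
    return preds
```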
5.4.2.4 Plan Optimization
Given a plan P and a topological sort T, PM constructs an influence diagram that can be optimized using normal influence diagram evaluation techniques [Shachter+Peot, 92; Jensen, et al., 94]. However, because of the large amount of asymmetry in the diagram, an algorithm like Cooper’s [88] might make more sense.
5.4.2.5 Simplifying the Decision Problem
Because of the large number of decisions introduced by PM_2, it is important to eliminate decisions, information predecessors, and dominated topological sorts before optimization.
Eliminating Decisions without Decreasing Utility
The first strategy for decreasing the time required for plan optimization is to eliminate decisions from the plan that cannot improve the utility of the plan.

Theorem 16 (No Intervening Observation): Sj and Sk are two actions in the topological sort T = [S_i]_{i=1}^{n} for plan P. If there are no compatible observation actions in [Sj, Sj+1, ..., Sk-1] and the reasons for Sj and Sk are the same, then the execution policies for Sj and Sk should be identical.

Theorem 17 (Shared Reason): Sj and Sk are two steps in the topological sort T = [S_i]_{i=1}^{n} for plan P and Sj < Sk. If the reason for Sk is a subset of the reason of Sj, then there is never a reason to execute Sk if Sj is not executed.

Early Recognition of Bad Topological Sorts
Recall that each mutex constraint constrains two or more steps to have mutually exclusive execution policies. One of the goals of the optimizer is to find a topological sort for the plan elements that makes every plan element essential in some way. For example, say that the execution policy for some step in the optimized plan is always false (don’t execute). If the search procedure for phase 1 of DTPOP is complete, then there is guaranteed to be an equivalent, simpler plan in the search space that has the same topological sort and fewer steps prior to optimization. The optimizer should reject this dominated topological sort as a solution.

This kind of problem can be noticed early during the optimization process by examining the information requirements of the execution policy decisions using Theorem 12. Certain of the decisions in the unoptimized plan are forced. In particular, the mutex constraints force the optimizer to make a contingent decision at the point that the earliest step in the mutex constraint is executed. In order to make this decision, the optimizer needs to find an observation that is relevant to a value function that is a descendent of one of the steps in the mutex constraint. If no such observation exists, the topological sort can be discarded immediately. The following two restrictions on terminal observations and mutex constraints can rule out many topological sorts without checking relevance relationships:
• Each terminal observation must be necessarily before every step in at least one mutex constraint in the topological sort.
• For each mutex constraint mk in the plan, there must be some observation that is necessarily before every step in mk in the topological sort.
After checking these ordering constraints, we can compute the requisite information sets for the forced (mutex) decisions using a variant of the Bayes Ball algorithm [Shachter, 98].
Minimizing the Number of Topological Sorts
Iterating over all topological sorts in a plan is expensive. One way to reduce this expense is to determine equivalence classes of topological sorts. The algorithm of Figure 97 computes a set of partial orders; every topological sort of each partial order returned by topo_classes has the same utility. Note that this algorithm does not eliminate very many topological sorts if there are a lot of observations in the plan and there is very little mutual exclusion between steps.
topo_classes( plan P ) {
    select S_A and S_B from P such that
        1. S_A is compatible with S_B,
        2. S_A is an observation, and
        3. S_A is not ordered with respect to S_B in P.
    if (there is no such S_A and S_B) {
        return { P }
    } else {
        return topo_classes( P + (S_A < S_B) ) ∪ topo_classes( P + (S_A > S_B) )
    }
}

FIGURE 97. An algorithm for systematically generating partial orders with equivalent-value topological sorts.
Theorem 18 (Equivalence Classes): Let P be a partial order generated from topo_classes. The expected value of every plan model generated for every topological sort consistent with P is the same.
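A sketch of topo_classes over an explicit set of precedence pairs. This toy version omits the transitive closure and step compatibility checks a real implementation would need.

```python
def topo_classes(orders, pairs):
    """Sketch of Figure 97: repeatedly split the partial order on an
    unordered (observation, step) pair until every listed pair is ordered.
    `orders` is a set of (earlier, later) precedence pairs; transitive
    closure and compatibility tests are ignored in this toy version."""
    for a, b in pairs:
        if (a, b) not in orders and (b, a) not in orders:
            # Branch on both orderings of the unordered pair.
            return (topo_classes(orders | {(a, b)}, pairs)
                    + topo_classes(orders | {(b, a)}, pairs))
    return [orders]
```

Per Theorem 18, all topological sorts within one returned partial order share the same expected value, so only one representative per class needs evaluation.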
5.4.3 Approximating Plan Optimization
The final decision optimization step for a DTPOP plan is very expensive. In general, there is a decision associated with each step in the plan and the result of every decision and step is a relevant information predecessor to a later decision. This means that the time required to determine the best set of execution policies is exponential in the number of steps in the plan. I don’t believe that this solution is very satisfying. One possible objective for the planner is to develop a conditional plan that gives a decision-maker some insight into the possible courses of action that she might take. If the planner returns a complicated dynamic program with dozens of decisions with complicated decision policies, then it might be difficult for the decision maker to gain any insight from the plan. One obvious way to reduce this complexity is to eliminate most of the decisions from the plan. After all, the only forced decisions in the plan model (from the standpoint of soundness) are decisions that resolve threats between two competing courses of action. If we
adopt such a strategy, we only need to add one decision for each mutex constraint in the plan. I use the following heuristic for adding decisions to DTPOP: for every mutex constraint M introduced by branch, I add a single n-way decision with one value, execute-Sk, for each step in the mutex constraint. This decision is added immediately after the last observation before the first step in M in the topological sort. This is not necessarily the optimal point to make the decision; making the decision earlier might eliminate some preparatory steps that execute before the steps in the threat. It does, however, guarantee that all of the information possible is used in the decision.

It is extremely difficult to construct an influence diagram that models the efficient plan execution sequences. The trickiest part of this process is determining how arcs should be added that describe the effect of each decision on the execution of subsequent steps. Every time that a branch decision is made, DTPOP declares that it is not possible to execute some step that, in turn, supports some number of terminal steps. Since it is impossible to execute those terminal steps, it is not rational to execute steps that only support these eliminated terminal steps. It is difficult to determine the conditions that govern when a given step can be skipped due to other threat resolution decisions. In DTPOP, I completely finesse the issue by incrementally constructing and evaluating a decision tree from a topological sort of the plan. The algorithm works in two passes. The forward pass constructs a decision tree. The second pass optimizes the decisions. The notation that I will use for the decision tree is based on the CPT-tree29 representation proposed by Boutilier, et al. [96].
29. CPT: Conditional Probability Table
FIGURE 98. A Decision Tree.
A decision tree (Figure 98) is a map from paths to valuations. The path (Π) is a sequence of generalized literals30 that describes a branch of the decision tree. For example, the leftmost (bold) branch of the decision tree in Figure 98 is described by the path [(A = a1), (B = b1), (C = c1)].

A valuation is a pair [p, v] where p is the probability of the joint event represented by the conjunction of the generalized literals in the path and v is the value of the history of events represented by the conjunction of the generalized propositions in the path.

Our strategy for constructing a decision tree will be to simulate every possible trace through a topological sort of the plan, constructing the tree along the way. The pseudocode for constructing the tree is listed in Figures 99 through 101. I make the following assumptions in order to simplify the presentation:
• There is only one conditional effect and one cost function per step.
• The mutex decision is made immediately before the execution of the earliest step in the mutex constraint, not after the last observation before this step.
• Unobserved variables are not marginalized out early. It is more efficient, but slightly more complicated, to marginalize out unobserved variables immediately after the last of the variables that they influence are added to the decision tree.31

30. variable assignments
31. In the spirit of bucket elimination [Dechter, 96]
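To make the path/valuation map concrete, here is a minimal sketch in Python. The representation (a path as a tuple of variable assignments, the tree as a dict from paths to [p, v] valuations) and the numbers are illustrative assumptions, not the dissertation's implementation.

```python
# Hypothetical representation: a path is a tuple of (variable, value)
# generalized literals; the decision tree maps each path to its
# valuation [p, v] (joint probability and history value).
tree = {}

def extend(path, literal, p, v):
    """Extend `path` with one generalized literal and record its valuation."""
    new_path = path + (literal,)
    tree[new_path] = [p, v]
    return new_path

# The bold branch of Figure 98: [(A = a1), (B = b1), (C = c1)],
# with made-up probabilities and values.
p1 = extend((), ("A", "a1"), 0.5, 0.0)
p2 = extend(p1, ("B", "b1"), 0.5 * 0.4, 10.0)
p3 = extend(p2, ("C", "c1"), 0.5 * 0.4 * 0.3, 10.0)
```

Each call records the running joint probability of the branch alongside the value of the history so far, which is exactly what the valuation pair carries.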
The general strategy used during tree construction is to incrementally construct the path for each of the branches in the decision tree. This path records past choices that allow us to determine the feasibility or desirability of future actions. An action is feasible if its context is compatible with the action execution decisions contained in the path.32 An action is desirable if the reasons for that action are compatible with the action execution choices recorded in the path.

The tree construction algorithm is called Build-Tree. Build-Tree looks at each action in a topological sort of the plan and extends the path based on the relationship between that action and the older portions of the path. If an action is not compatible with the path, or the reasons for the action are not compatible with the path, Build-Tree skips the step. If Build-Tree encounters an action that is compatible with the path, it adds the cost function and conditional effects of that action to the path's valuation. If Build-Tree encounters an action that is mentioned in one of the mutex constraints, Build-Tree adds a decision comprised of the action execution choices that are compatible with past execution decisions.
32. See Section 5.3.2
Build-Tree(n, T, Π, M, p, v) {
  //n is an index over the actions in the topological sort.
  //T is the topological sort.
  //Π is the path constructed thus far.
  //M is the set of mutex constraints in the plan.
  //p is the joint probability of the events in the path.
  //v is the value of the events in the path.
  if (n > N) {                          //Termination condition
    return { <Π, p, v> }
  } else {
    Let Tn := the n-th step of T.
    if (∃m ∈ M such that Tn ∈ m) {
      Let Dn,m := a new decision with values equal to the set of
                  actions in the mutex constraint m.
      return ∪ over d ∈ Dn,m of
        Build-Tree(n, T, Π + (Dn,m = d), Prune-Mutex(M - m, Π), p, v)
    } else {
      return Build-Action(n, T, Π, M, p, v)
    }
  }
}

FIGURE 99. Build-Tree.
Prune-Mutex(M, Π) {
  For each mutex constraint mk ∈ M {
    For each action Sj ∈ mk {
      if (no goal gi in reason(Sj) is compatible with the path Π) {
        Remove Sj from mk
      }
    }
    if (mk has zero or one actions) {
      Remove mk from M.
    }
  }
  return M.
}

FIGURE 100. Prune-Mutex. Removes all of the mutex constraints that are determined by past decisions.
Build-Action(n, T, Π, M, p, v) {
  Let Tn := the n-th step of T.
  if (¬compatible(Tn, Π)) {             //Action n can't execute.
    return Build-Tree(n+1, T, Π, M, p, v)
  } else {
    Let con := the conditional outcomes in CEs(Tn) that have triggers
               that are consistent with the variable assignments in Π.
    Let ucon := the utility outcome in Cost(Tn) that is consistent with Π.
    return ∪ over co ∈ con of
      Build-Tree(n+1, T, Π + rename(outcome(co)), M,
                 p ⋅ P{outcome(co) | trigger(co)}, v + utility(ucon))
  }
}

FIGURE 101. Build-Action. "rename" generates a new name for each variable in the plan by concatenating the action's location in the topological sort with the variable name. The variable "Dec" is used to denote discernible equivalence classes.
Evaluating the Tree
Given the decision tree constructed by Build-Tree, it is fairly straightforward to evaluate it. The first step is to multiply the probability and value at the termini of each path in order to form an expectation. The second step is to marginalize out the uncertain variables that are not observable. In order to do this, I merge the subtrees under the marginalized variable, summing the expectations for matching branches. This merge is easy because the order of the variables in each of the subtrees is exactly the same. The fact that there may be missing branches in some subtrees complicates this merge only slightly: when this happens, the merge between the missing branch and the 'present' branch is equal to the present branch. The merge algorithm (similar to [Cheuk+Boutilier, 97]) is shown in Figure 102. Let Treej(A) denote the j-th subtree of tree A. If the j-th subtree of A is missing, Treej(A) returns ∅. The function Maketree constructs a new tree using its arguments as subtrees. To simplify the presentation, I will assume that we are merging only two subtrees.
merge(TA, TB) {
  if TA = ∅ return TB
  else if TB = ∅ return TA
  else if TA and TB are leaves return TA + TB
  else return Maketree( merge(Tree1(TA), Tree1(TB)),
                        merge(Tree2(TA), Tree2(TB)),
                        . . .,
                        merge(Treen(TA), Treen(TB)) )
}
FIGURE 102. Merge.
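The merge can be sketched directly in Python. This is a hedged reconstruction, assuming subtrees are dicts keyed by branch label, leaves are numeric expectations, and a missing subtree is represented by None:

```python
def merge(ta, tb):
    """Merge two subtrees, summing expectations at matching leaves."""
    if ta is None:                 # missing branch: keep the present one
        return tb
    if tb is None:
        return ta
    if not isinstance(ta, dict):   # both are leaves: sum the expectations
        return ta + tb
    # Recurse over the union of branch labels; the variable order in the
    # two subtrees is assumed to be identical, as in the text.
    return {label: merge(ta.get(label), tb.get(label))
            for label in set(ta) | set(tb)}
```

Merging the two subtrees under a marginalized variable then sums the expectations of matching branches and keeps any branch that is present on only one side.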
After the unobservable variables are marginalized, the policies can be solved for via the use of tree “rollback.” Unlike a traditional decision tree, the joint probability for all of the events in the decision tree is included in the expectation on each branch of the tree. Decisions are marginalized by selecting the alternative with the maximum expected value.33 Uncertain variables are marginalized by summing the expectations of their outcomes [Shachter+Peot, 92; Jensen, et al, 94]. The tree rollback process is illustrated on pg. 230. This procedure correctly computes the final expected utility for the plan, as well as the correct policy for each decision. Note, however, that the expected utility computed for an intermediate node is wrong by a constant factor equal to the joint probability of the observation events on the path between the intermediate node and the root of the decision tree.34
33. This expectation is equal to the probability of getting to the decision point times the value of the plan up to that decision point. 34. The decision tree is in normal form instead of the usual extensive form presented in decision analysis texts.
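A sketch of normal-form rollback, under an assumed node encoding: a leaf is an expectation (joint probability times value, as described above), and an internal node is a tuple (kind, name, branches) where kind is "decision" or "chance". The encoding and example numbers are illustrative, not DTPOP's data structures.

```python
def rollback(tree, policy=None):
    """Return the expected value of a normal-form decision tree; record
    the maximizing alternative of each decision in `policy`."""
    if policy is None:
        policy = {}
    if not isinstance(tree, tuple):       # leaf: p * v already folded in
        return tree
    kind, name, branches = tree
    values = {label: rollback(sub, policy) for label, sub in branches.items()}
    if kind == "decision":
        # Each branch expectation carries the joint probability of reaching
        # this node, so alternatives can be compared directly.
        best = max(values, key=values.get)
        policy[name] = best
        return values[best]
    return sum(values.values())           # chance node: sum the expectations

# Party-style example: choose a location, then the weather resolves.
party = ("decision", "location", {
    "outdoors": ("chance", "weather", {"sun": 60.0, "rain": 0.0}),
    "indoors":  ("chance", "weather", {"sun": 30.0, "rain": 24.0}),
})
```

Note that this sketch evaluates, and records policies inside, every alternative, including branches the optimal policy never reaches; that is harmless here because decisions are resolved bottom-up.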
5.5 Recognizing Open Uncertainties

5.5.1 Background
In this section, I discuss how the probabilistic relevance criteria discussed in Sections 4.3 and 4.3.2 can be used to identify good points for adding observation plans.
5.5.2 The Information Relevance Network
The first phase of the plan construction algorithm, PM_1, constructs a belief network that can be used to model many topological sorts of the plan. We will use PM_1 to define an informational relevance network (or IRN) that captures many of the independence properties that are captured by the influence diagram constructed from PM_2. An informational relevance network can be constructed for the plan as follows:
1. Add a binary decision node Di for every step i in the plan.
2. Add an arc from the decision node Di to every conditional effect distribution and value function constructed from step i. The value for the conditional effect distribution is na with probability 1.0 if any one of that node's graphical predecessors is na or the associated step execution decision is false.

Given any set of values for the decisions, we can treat the resulting network as a belief network. We will only assign values to the decisions that are consistent with the mutex constraints of the original plan.

The partial-order information contained in the plan graph allows us to determine whether a node cannot be an information predecessor for each decision. Let step be a function from IRN nodes to their corresponding step in the plan graph. We will say that node A in the IRN is before node B iff step(A) < step(B). If a decision node is necessarily before an observable node, then that observable node is not observed at the time that the decision is made. If a decision node is necessarily after an observable node, then that observable node is observed at the time that the decision is made.
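The "necessarily before" relation can be tested by reachability in the plan's ordering graph. A small sketch follows; the adjacency-dict representation is an assumption standing in for the plan graph.

```python
def necessarily_before(a, b, after):
    """True iff step a precedes step b in every topological sort, where
    after[s] lists the steps directly ordered after s."""
    frontier, seen = list(after.get(a, [])), set()
    while frontier:
        s = frontier.pop()
        if s == b:
            return True
        if s not in seen:
            seen.add(s)
            frontier.extend(after.get(s, []))
    return False
```

An observable node is known at decision Di exactly when its step is necessarily before Di's step; steps left unordered by the partial order are neither necessarily before nor necessarily after.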
An information relevance network implies more independence than a belief network. For example, two nodes might be relevant in a belief network that is structurally isomorphic to the IRN, but these same two nodes can be independent in the IRN when ordering constraints and mutex constraints are considered. Let's consider two examples: one where the mutual-exclusion constraints in the IRN imply independence and one where the ordering constraints imply independence.
Mutual Exclusion Example
In order for two uncertainties to be relevant to each other in an information relevance network, both of the uncertainties and some set of the nodes connecting those uncertainties must be mutually compatible. For example, say that nodes C and D are mutually exclusive in Figure 103. In the influence diagram that corresponds to the belief network of the figure, there must be a decision node (or nodes) that conditions both C and D. These decisions will always be made before C or D can be observed. A and B are conditionally independent given the values for these decision nodes (see Figure 104).
FIGURE 103. Active Subgraph Example, Part I. Say that C and D are mutually exclusive. In the final influence diagram corresponding to this information relevance network, there will be a decision node that selects between node C and node D. Given the value of this implicit decision node, A and B are independent.
FIGURE 104. Active Subgraph Example, Part II. In this figure, the belief network is conditioned on the "implicit" decision described in Figure 103. The decision depicted (left) has two mutually exclusive sets of values. For some values of the decision, the step associated with node C is executed and D is not (middle). For other values of the decision, the step associated with D is executed and C is not (right). However, the mutual-exclusion constraint makes it impossible for both C and D to exist simultaneously. Therefore A and B are conditionally independent given the decision to execute C or D.
Ordering Information and Probabilistic Relevance

The VOM example presented at the start of the chapter illustrates how ordering information affects relevance. Suppose that we are trying to identify uncertainties that are relevant to an unspecified decision that affects a given value node. In order for VOM Calibration to be relevant to any decision that affects value node U, both the VOM Calibration and VOM Reading observations must be made before the step containing U executes. In fact, the only step execution policies that can be influenced by the VOM Calibration observation are the execution policy decisions for steps that are possibly after both VOM Calibration and VOM Reading.
FIGURE 105. The VOM calibration example.
Define an active subgraph to be the minimal portion of any belief network that supports a given probabilistic relevance relationship.35 For example, the active subgraph for "VOM Reading is relevant to U given VOM Calibration" consists of VOM Reading, Battery Voltage and U (and the arcs connecting those nodes). There may be many active subgraphs that support a given relevance relationship (see Figure 106).

Definition 32 (Active Subgraph): An active subgraph is a minimal set of nodes in a belief network that supports a given probabilistic relevance relation.

In order for an active subgraph to support a given relevance relation on an information relevance network, we have to identify an active subgraph that consists entirely of compatible nodes.
Definition 33 (Compatible IRN Nodes): Let the function step(A) return the plan step associated with value or probability node A. Two IRN nodes A and B are compatible iff step(A) is compatible with step(B).

Theorem 19 (Active Subgraphs on Information Relevance Networks): If A is probabilistically relevant to B given C and some value for the decisions D, then there is an active subgraph of the information relevance network supporting this relation that consists entirely of nodes that are mutually compatible.

35. This idea is similar to the concept of active paths in D-separation, but is more relevant to the operation of the Bayes Ball algorithm and considers explicitly observation nodes that are descendants of 'head-to-head' nodes in the active path.
FIGURE 106. There are possibly many active paths for a single relevance relation.
Computational Complexity
The Bayes Ball algorithm described in the last chapter identifies all of the nodes that are relevant to the computation of a probability query in time linear in the size of the belief network (number of arcs). Determining probabilistic relevance in an information relevance network, on the other hand, is intractable. There are many possible active subgraphs of a belief network for a specific relevance relation. In order to prove that two nodes are relevant in a normal belief network, we only need to identify one. In order to prove that two nodes are relevant in an information relevance network, we need to identify an active subgraph that satisfies the mutual compatibility restriction. Since the number of active subgraphs increases exponentially with graph size, this task is intractable.

Theorem 20 (IRN Relevance is NP-complete): It is NP-complete to determine whether A is relevant to B given C and some combination of decisions D in an IRN.

Open Uncertainties
Now we have the tools that we need to define an open uncertainty. An open uncertainty is any probability node that is relevant to a value node and that can be observed before executing the step containing the value node. In order to compute relevance, we need to explicitly consider the ordering information in the IRN. Say that we are trying to determine possible information predecessors for Si. If Sk has an observable effect and Si precedes Sk in the plan, then the observable effect of Sk cannot be observed at the time that we make the execution decision for Si. If Sk precedes Si, on the other hand, the observable effect of Sk will always be observed at the time that we decide whether to execute Si.

Theorem 21 (Relevant Node for a Decision): N is a relevant node for a decision Di for some topological sort T of the plan graph if there exists a subgraph B of the IRN such that the event described by N is before Di in T, and N is relevant to a value function V that is a descendent of Di in B given the observable nodes that precede Di and the decisions that precede Di.
Definition 34 (Open Uncertainty): An open uncertainty is an unobserved node that is relevant to some decision Di.

Given a set of decisions in the IRN, there is a tractable algorithm based on the Bayes Ball algorithm that identifies all of the nodes that are relevant to a value function (plus a few that aren't). Given a partial order and a decision Di, the nodes in the IRN can be broken down into three classes: 1) the set of nodes that are definitely before Di, 2) the set of nodes that are definitely after Di, and 3) the set of nodes that are unordered with respect to Di. If an observable node is in the first set, then it is effectively observed when the decision Di is made. If an observable node is in the latter set, then it is not observed when Di is made. Rather than iterating over all of the topological sorts of these nodes with respect to the decision (an exponential process), we can "pretend" that these nodes are in the most advantageous state for the purposes of determining relevance. We can re-engineer the Bayes Ball algorithm to take advantage of known ordering relationships between observable nodes to prune more 'inactive paths' (see Figure 107).

Suppose that chance node T is observable and is unordered with respect to Di in the partial-order graph. Say that an active path connecting value node V to T passes through a predecessor node R (Figure 107-a). In this case, the active path cannot pass from R to the successors of T: T is necessarily observed before the nodes that are successors to T and, therefore, will block active paths to those successors. Suppose that the active path connecting the value node to T passes through a successor Q of T (Figure 107-b). T does not necessarily block active paths to the predecessors of T: it is possible that T will execute after Di, meaning that T will not be observed at the time that this decision is made. T does, however, block active paths connecting the successors of T. At the time any of T's successors execute, T is observed.
FIGURE 107. Relevance For Observable Nodes. The shaded node (T) is an observable node that is not ordered with respect to a given decision. Observable nodes correlate their predecessors, pass relevance through to past observations (for decisions that are evaluated before T is observed) and block communications between their successors (T is always observed before the successors are observed).
A variant of the Bayes Ball algorithm that exploits this ordering information is shown in Figures 108 through 110. BB_IRN computes the set of nodes that are relevant to a value node/decision pair.
//Global Variables
NodeSet F        //Deterministic nodes. Unsatisfied preconditions
                 //are treated as deterministic nodes.
NodeSet K        //Observable nodes
//The bottom and top sets prevent looping.
NodeSet Nbot     //Nodes that sent a msg downward.
NodeSet Ntop     //Nodes that sent a msg to the top.
NodeSet Nr       //Nodes that are structurally relevant

Collect-Relevant(Node j, Decision Di) {
  //The only relevant steps are before Si
  if (step(j) is possibly before Si) then {
    //j is observable.
    if j ∈ K and (j ∉ Ntop) {
      Nr := Nr + j               //j is structurally relevant.
      Ntop := Ntop + j
      for all k ∈ Pa(j, B), Collect-Requisite(k, B)
    }
    //j is not observable
    if j ∉ K and (j ∉ Nbot) {
      Nr := Nr + j
      Nbot := Nbot + j
      for all k ∈ Ch(j, B), Collect-Relevant(k, B)
    }
  }
}
FIGURE 108. Collect-Relevant.
Collect-Requisite(Node j, Decision Di) {
  //j is possibly before Di and is a chance variable
  //and is observable.
  if j ∈ K and j is possibly before Di and (j ∉ Ntop) {
    Nr := Nr + j
    Ntop := Ntop + j
    for all k ∈ Pa(j, B), Collect-Requisite(k, B)
  }
  //j is not observable
  if j ∉ K {
    Nr := Nr + j
    if j ∉ Nbot {
      Nbot := Nbot + j
      for all k ∈ Pa(j, B), Collect-Requisite(k, B)
    }
    //j is not deterministic
    if j ∉ F and j ∉ Ntop {
      Ntop := Ntop + j
      for all k ∈ Ch(j, B), Collect-Relevant(k, B)
    }
  }
}
FIGURE 109. Collect-Requisite.
BB_IRN(Decision Di, ValueNode V, IRN B) {
  //Initialization
  F := { i ∈ N | i is deterministic }
  K := { i ∈ N | i is observable }
  Ntop := ∅
  Nbot := ∅
  Nr := ∅
  Collect-Requisite(V, Di, B)
}
FIGURE 110. Bayes_Ball_IRN. Computes the set of nodes that are relevant to value. Bayes_Ball_IRN does not consider the mutual exclusion relationships (see next section).
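The two passes can also be rendered as a runnable Python sketch. The graph representation (parent/child adjacency dicts) and the possibly_before predicate are assumptions that stand in for the IRN and the partial-order test of Figures 108 through 110.

```python
def bb_irn(value_node, pa, ch, K, F, possibly_before):
    """Collect the nodes structurally relevant to `value_node`, as in
    Figures 108-110. K: observable nodes; F: deterministic nodes;
    possibly_before(j): partial-order test against the decision's step."""
    n_top, n_bot, relevant = set(), set(), set()

    def collect_relevant(j):
        if not possibly_before(j):
            return                        # only steps before Si are relevant
        if j in K and j not in n_top:     # observable: bounce the ball up
            relevant.add(j); n_top.add(j)
            for k in pa.get(j, []):
                collect_requisite(k)
        if j not in K and j not in n_bot:  # unobservable: pass it down
            relevant.add(j); n_bot.add(j)
            for k in ch.get(j, []):
                collect_relevant(k)

    def collect_requisite(j):
        if j in K and possibly_before(j) and j not in n_top:
            relevant.add(j); n_top.add(j)
            for k in pa.get(j, []):
                collect_requisite(k)
        if j not in K:
            relevant.add(j)
            if j not in n_bot:
                n_bot.add(j)
                for k in pa.get(j, []):
                    collect_requisite(k)
            if j not in F and j not in n_top:   # j is not deterministic
                n_top.add(j)
                for k in ch.get(j, []):
                    collect_relevant(k)

    collect_requisite(value_node)
    return relevant
```

As in the pseudocode, the top and bottom visited sets prevent looping, and the ordering test prunes branches that cannot be observed in time.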
Identifying Open Uncertainties.
In order to make certain that only relevant open uncertainties are identified, we need some scheme for ensuring that we identify only compatible paths in the IRN. One way to do so is illustrated below.

Find_Open_Uncertainty(IRN B; MutexConstraints M) {
  Open := ∅
  loop over all compatible combinations m of values for the mutex constraints {
    Let B' := the IRN derived by pruning all of the nodes in B
              that are not compatible with m.
    D := the decision nodes in B'.
    for all Di ∈ D {
      Let V := the value nodes that are successors of Di
      for all Vi ∈ V {
        Open := Open ∪ BB_IRN(Di, Vi, B')
      }
    }
  }
  return Open
}
5.6 Heuristics
In this section, I discuss a couple of agenda-ordering heuristics for improving the performance of the planner. The second heuristic discussed has important implications for the completeness of the planner.
5.6.1 Inter-branch Threats

Say that ST threatens the causal link SE →V SC. If the intersection of the reasons for ST and SC is empty, I will call this threat an inter-branch threat. If the intersection of the set of reasons for ST and SC is not empty, I will call the threat an intra-branch threat. A threat can change from an inter-branch threat to an intra-branch threat through the addition of causal links, but an intra-branch threat can never become an inter-branch threat.

Inter-branch threats are, generally, easier to resolve than intra-branch threats. In fact, it is guaranteed that branch can always be used to resolve the threat as long as the threat remains an inter-branch threat. This suggests a heuristic for ordering threat resolution: delay the resolution of any threat as long as it is an inter-branch threat.36 If there are several inter-branch threats to resolve, then there is some advantage to resolving the "earliest" of these threats first; resolution of early inter-branch threats via branching can make later threats disappear.
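The inter/intra distinction reduces to a set intersection. A tiny sketch (the function name and reason labels are hypothetical):

```python
def threat_kind(threat_reasons, consumer_reasons):
    """Classify a threat by the reason sets of the threatening step and
    the consumer of the threatened causal link."""
    if set(threat_reasons) & set(consumer_reasons):
        return "intra-branch"
    return "inter-branch"
```

Since causal links can only add reasons, a threat classified intra-branch stays intra-branch, which is why deferring inter-branch threats is safe.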
5.6.2 Closing and Constraining Open-Uncertainties
One of the topics that was left deliberately vague during the definition of open uncertainties was the idea of the "closed" open uncertainty. In order to improve the systematicity of our planner, it would be desirable to resolve an open uncertainty and then never work on it again. DTPOP uses the following strategy: the first time that an open uncertainty is identified, it is added to the flaw agenda. Once the open uncertainty is removed by remove-open-uncertainty, the planner puts it on a closed list and will never "open" that uncertainty again.

Unfortunately, if we allow open uncertainties to be closed, then we also have to allow for the possibility that subsequent steps will add additional reasons for the original open uncertainty to be open. For example, say that we discover that Sj is an open uncertainty because of a value function on step Sk. While attempting to discover an observation plan for this open uncertainty, we find an observation plan that has a terminal action So that is necessarily after Sk. Since this observation plan is useless for decisions concerning Sk, we would like to prune this plan from the search space. This creates a problem. The planner might complete this plan by adding another contingency plan that has a step Sm that could execute after So, where Sm has a value function that is possibly relevant to the observable outcome of So.
36. This kind of flaw ordering is done in [Peot+Smith, 93].
I can think of three strategies for closing open uncertainties that maintain completeness of the planner:
• Once an open uncertainty is closed, it is never reopened. Identification of observation plans and identification of other contingencies can be interleaved. If this strategy is used, then no subsequent action can prune an observation plan because the observation plan loses its relevance to the current partial plan. Subsequent planning actions might restore the relevance of the observation plan.
• Once an open uncertainty is closed, it can be reopened if a step or causal link is added that makes the open uncertainty relevant to a new value function. Identification of observation plans and the addition of new contingencies can be interleaved. This strategy allows the planner to prune plans with irrelevant observation sequences and allows the planner to place probabilistic relevance constraints on observations. For example, the planner might place the constraint that observation O be relevant to a particular value function or set of value functions and use this constraint to prune subsequent plans.
• Once an open uncertainty is closed, it can never be reopened. The planner does not call add-branch after the first "add-forward" function is called.

This last strategy (used by DTPOP) will be called the DTPOP Open Uncertainty Heuristic. The Open Uncertainty Heuristic allows us to restrict the definition of open uncertainties and the "add-forward" plan construction operations without compromising the completeness of DTPOP. It is not obvious what the best strategy is for closing open uncertainties.
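The closed-list bookkeeping behind the third strategy might look like the following sketch; the class and method names are illustrative, not DTPOP code.

```python
class OpenUncertaintyAgenda:
    """Flaw agenda with a closed list: once an open uncertainty is
    closed, it never re-enters the agenda."""

    def __init__(self):
        self.agenda = []      # open uncertainties awaiting resolution
        self.closed = set()   # closed uncertainties: never reopened

    def open_uncertainty(self, u):
        if u not in self.closed and u not in self.agenda:
            self.agenda.append(u)

    def close(self, u):
        self.agenda.remove(u)
        self.closed.add(u)
```

Attempting to reopen a closed uncertainty is silently ignored, which is the behavior the heuristic requires.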
5.7 Example
I will illustrate the underlying principles of DTPOP using a variant of the party problem [Howard, 1993].
The objective of the party problem is to determine the party location that maximizes the expected value of a party. We have a choice of throwing the party indoors or outdoors. The success of each of these parties depends on the weather: if the weather is good, an outdoor party has the highest payoff. If the weather is bad, we may be better off with an indoor party. The actions in this domain include:37
• Invitations
- Invite-indoors: Tells people that the party is indoors.
  Effect: Announced-Location = Indoors
  Cost: 0
- Invite-outdoors: Tells people that the party is outdoors.
  Effect: Announced-Location = Outdoors
  Cost: 0
• Party Actions38 - Party-Indoors Effect: Held-Party=True Cost Function:
Preconditions
Cost
Announced-Location
Weather
Indoors
Rain Sun Rain Sun
not(Indoors)
40 50 200 200
- Party-Outdoors
  Effect: Held-Party=True
  Cost Function:

  Announced-Location   Weather   Cost
  Outdoors             Rain      100
  Outdoors             Sun         0
  not(Outdoors)        Rain      200
  not(Outdoors)        Sun       200

37. This domain description contains several conditions to ensure that only one party is planned. For example, the cost of a party is higher if you tell people to go to the wrong location. This is one of the interesting roles for planning: to make explicit assumptions that may be artificially limiting the alternatives available to the decision maker.
38. Each party action has negative utility; the benefit of the party is contained in the final reward utility function. This is a somewhat artificial restriction. If we had a richer language for expressing temporal and other resource restrictions, we could add actions to the planner if they either 1) satisfy the preconditions of subsequent actions or 2) create their own positive utility outcomes. For example, an MDP or POMDP (see Section 5.9.1.2) can reason over an infinite-horizon plan if the value for future rewards is discounted by an exponential factor. The planner attempts to maximize the Net Present Value of the discounted rewards.
• Initial Conditions - Weather: {(0.4, Weather=Rain), (0.6, Weather=Sun)} - Announced-Location = None - Held-Party = False
The reward utility function is 100 when Party=True and is 0 otherwise. The 'party domain' only has one action with observable effects: Watch-Forecast. Our first (flawed39) attempt at describing this action is shown below. The discernible equivalence classes ('Rainy' and 'Sunny') are noted using the prefix 'Dec.' These observation labels are used only to distinguish between discernible equivalence classes; there is no significance to the fact that the labels in this action are "Sunny" and "Rainy."40
- Watch-Forecast
  When (Weather=Rain) {(0.2, Dec(Sunny)), (0.8, Dec(Rainy))}
  When (Weather=Sun) {(0.8, Dec(Sunny)), (0.2, Dec(Rainy))}
5.7.1 The First Plan Branch
DTPOP uses the Open Uncertainty Heuristic described on pg. 217. DTPOP identifies successive contingency plans to maximize the expected value of the goal. Once some number of contingency plans have been identified, DTPOP looks for a set of observations or observation plans that can be used to select between these contingencies. DTPOP will not begin to look for observation actions until the plan is otherwise complete. Until DTPOP completes the first branch of the plan (closes all of the open conditions and resolves all of the threats), DTPOP acts exactly like UDTPOP.

Initially, DTPOP finds an unconditional plan. One such plan is {Invite-Outdoors, Party-Outdoors}, which has a total expected value of 60 (Figure 111). The utility of this plan is, however, conditioned on the (unobservable) state of the weather. The value of this plan is 100 when the weather is sunny and 0 when rainy (see Figure 112). If we could find an alternative plan that was less sensitive to rainy weather, we might be able to increase the utility of the overall plan by selecting this alternative when rain is likely (shaded line).

39. This formulation of the watch-forecast operator allows us to use a large number of identical observations of the forecast to increase the certainty of the forecast. For example, imagine that we can observe a very large number of weather forecasts (millions). If 20% of these forecasts are for rain, we can conclude that the weather is going to be "sun" with a probability approaching 1.0. In order to resolve this problem, we will need to extract the uncertainty from the forecast operator. We will revisit this problem in Section 5.9.3.
40. Indeed, we could switch the labels.
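The quoted expected value of 60 follows directly from the domain numbers (reward 100 for holding the party; Party-Outdoors costs 0 in sun and 100 in rain once the outdoor location is announced). The variable names below are illustrative.

```python
p_sun = 0.6
reward = 100

# Party-Outdoors with Announced-Location = Outdoors
outdoor_cost = {"Sun": 0, "Rain": 100}
ev_outdoors = (p_sun * (reward - outdoor_cost["Sun"])
               + (1 - p_sun) * (reward - outdoor_cost["Rain"]))  # ~ 60

# The indoor alternative, for comparison (costs 50 in sun, 40 in rain)
indoor_cost = {"Sun": 50, "Rain": 40}
ev_indoors = (p_sun * (reward - indoor_cost["Sun"])
              + (1 - p_sun) * (reward - indoor_cost["Rain"]))    # ~ 54
```

Without any observation of the weather, the outdoor plan dominates; the indoor branch only becomes attractive when evidence lowers the probability of sun.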
FIGURE 111. A non-contingent plan for partying outdoors.
FIGURE 112. Expected Value of Party-Outdoors. The expected value of the "Party-Outdoors" plan is a function of the probability that the weather will be sunny. In this case, the prior probability that the weather is sunny is 0.6, so the expected value of the plan is 60. When it is rainy, the utility of this plan is zero. Accordingly, it would be highly desirable to find a backup plan that has a higher expected value where the value of the original plan branch is low.41
5.7.2 Making the Initial Plan More Robust
When the first branch is complete, DTPOP can either return that plan or attempt to enhance the plan by adding further branches. In our example, we call Add-Branch to continue planning after we complete the first branch.

Add-Branch adds a new goal step, "Goal 2", to the plan and adds a mutex constraint to prevent both "Goal 1" and "Goal 2" from appearing in the same execution trace of the plan. The goal steps provide the reward for accomplishing the goal: we should not be able to reap the reward function twice by executing both goal steps.

M = { {S Goal 1, S Goal 2}⊥ }

41. I should add that DTPOP does none of the meta-analysis described in the example, though it may be desirable to do so.
FIGURE 113. The plan after starting another plan branch.
For a moment, we will ignore the inter-branch threats42 and construct the second plan branch using the normal UDTPOP operations (Figure 114).
FIGURE 114. The plan after completing the second plan branch.
The expected values of the two plans as a function of the uncertain variable "Weather" are illustrated in Figure 115. We can see that the "Party-Indoors" alternative has the potential to improve the expected value of the plan if the probability of Sun is low.

42. Recall Section 5.6.1.
FIGURE 115. Expected Value for the party alternatives. The optimal choice as a function of the posterior probability of sun is shown in bold. If we make some observation that reduces the probability of sunshine, then the “Party-Indoors” branch will be more attractive than the “Party-Outdoors” branch.
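We can check the crossover in Figure 115 numerically. The sketch below uses the domain's numbers (branch values 100/0 outdoors and 50/60 indoors, forecast likelihoods 0.8/0.2 from Watch-Forecast); the helper names are illustrative, not DTPOP code.

```python
p_sun = 0.6

def ev_outdoors(p):        # outdoor branch: value 100 in sun, 0 in rain
    return 100 * p

def ev_indoors(p):         # indoor branch: value 50 in sun, 60 in rain
    return 50 * p + 60 * (1 - p)

def likelihoods(forecast):
    """(P(forecast | sun), P(forecast | rain)) for a 0.8-accurate forecast."""
    return {"Sunny": (0.8, 0.2), "Rainy": (0.2, 0.8)}[forecast]

def posterior_sun(forecast):
    """Bayes update of P(sun) on the forecast."""
    l_sun, l_rain = likelihoods(forecast)
    num = l_sun * p_sun
    return num / (num + l_rain * (1 - p_sun))

def p_forecast(forecast):
    l_sun, l_rain = likelihoods(forecast)
    return l_sun * p_sun + l_rain * (1 - p_sun)

# Choose the better branch after each forecast, weighted by its probability.
ev_with_forecast = sum(
    p_forecast(f) * max(ev_outdoors(posterior_sun(f)),
                        ev_indoors(posterior_sun(f)))
    for f in ("Sunny", "Rainy"))
```

A rainy forecast drops the posterior probability of sun below the crossover, so the indoor branch is chosen there, and acting on the forecast raises the overall expected value above the 60 of the unconditional outdoor plan.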
5.7.3 Inter-Branch Threats

FIGURE 116. Inter-branch threats.
Figure 116 illustrates the multitude of inter-branch threats created by the construction of the second plan branch. Each of these threats is an inter-branch threat because the intersection of the reasons for the threat and the reasons for the consumer of the threatened causal link is empty.
Section 5.6.1 suggests that the best inter-branch threats to resolve are those threats between actions that are executed earliest. Let's select one of the two threats originating in one of the "Invite" actions and see why.

Suppose that we choose the threat S_Invite-Outdoors ⊗ (S_Invite-Indoors →^Location S_Party-Indoors). Inviting people to attend an outdoor party certainly threatens a plan to invite people to an indoor party. In this example, we will resolve the threat by requiring the decision policies for S_Invite-Outdoors and S_Party-Indoors to be mutually exclusive. When we add the mutex constraint {S_Invite-Outdoors, S_Party-Indoors}⊥ to the plan, we resolve all but one of the threats. Figure 117 depicts the mutex and co-mutex sets (Section 5.3.2 on page 173) for each of the actions in the plan. If the co-mutex set of one action shares elements with the mutex set of another action, the execution policies for those actions are mutually exclusive. All of the execution policies for all of the actions in "Branch 1" are mutually exclusive with "Party-Indoors" and "Goal 2" in Branch 2. The early branch operation resolves all of the later inter-branch threats.
FIGURE 117. Mutex Sets After Branching. "M" and "C" denote the mutex and co-mutex sets, respectively.
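The mutual-exclusivity test described above can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation; the set contents are read off Figure 117, with "IO" abbreviating Invite-Outdoors and "PI" abbreviating Party-Indoors.

```python
# Two actions have mutually exclusive execution policies when the co-mutex
# set of one shares an element with the mutex set of the other.
def mutually_exclusive(a, b):
    """a and b are dicts with 'mutex' and 'comutex' sets of step names."""
    return bool(a["comutex"] & b["mutex"]) or bool(b["comutex"] & a["mutex"])

# Hypothetical sets read off Figure 117:
party_outdoors = {"mutex": {"PI"}, "comutex": {"IO"}}
party_indoors  = {"mutex": {"IO"}, "comutex": {"PI"}}
goal_1         = {"mutex": {"PI"}, "comutex": {"IO"}}

print(mutually_exclusive(party_outdoors, party_indoors))  # True
print(mutually_exclusive(goal_1, party_indoors))          # True
```

Every action in Branch 1 carries co-mutex {IO}, so each one intersects Party-Indoors' mutex set {IO}, which is why the single early mutex constraint excludes the entire opposing branch.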
We can resolve the last inter-branch threat via demotion or promotion. In this case, we will use demotion, adding the ordering constraint S_Invite-Indoors < S_Invite-Outdoors to the plan.
5.7.4 Searching for Observations

FIGURE 118. The Information Relevance Network.
Figure 118 depicts the information relevance network that models the partial plan of Figure 117. There is only one open uncertainty in this network. All of the "Announced-Location" and "Held-Party" nodes are certain. "Weather" is the only uncertain variable in the network that is relevant to a later value or cost function; in this case, "Weather" is relevant to the cost subvalue functions in "Party-Indoors" and "Party-Outdoors."

There are three plan construction operations that can resolve the open uncertainty S_IC →^Weather: Add-Link-Forward, Add-Step-Forward, or Remove-Open-Uncertainty. Add-Link-Forward is not applicable in this instance; "Weather" is not the precondition of a conditional effect in any other action in the partial plan. We might use Remove-Open-Uncertainty to remove this flaw, but in this instance this would leave us with a plan that has at least one decision and no relevant observations (see pg. 197).
The only plan construction operation that makes sense is Add-Step-Forward. Add-Step-Forward adds an instance of "Watch-Forecast," the only action in the domain description in which "Weather" is a precondition of a conditional effect. The new plan is shown in Figure 119.

FIGURE 119. The Information Relevance Network after Add-Step-Forward.
No new open uncertainties can be identified after Add-Step-Forward executes; there are no uncertain variables in Watch-Forecast that can serve as the basis for a causal link. In order to resolve the last flaw in the plan, we can call Remove-Open-Uncertainty.
5.7.5 Constructing the Decision Tree
After Remove-Open-Uncertainty is called, there are no further flaws in the plan and the plan can be optimized. The first step in optimizing the contingent plan is to select a topological sort for the plan that is consistent with all of the ordering constraints. Figure 120 illustrates one possible topological sort. The order of the actions in this figure is indicated by the vertical position of each action; actions are executed from the top down. After selecting the topological sort, we check to make sure that every terminal observation is before some mutex constraint (yes) and that all of the steps in every mutex constraint are after some observation (yes); see Section 5.4.2.4.
FIGURE 120. One possible topological sort for the contingent plan: IC, Watch-Forecast, Invite-Indoors, Invite-Outdoors, Party-Indoors, Party-Outdoors, Goal 1, Goal 2. In this topological sort, every step is ordered after the only step with observable outcomes, "Watch-Forecast."
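The admissibility check described above can be sketched as follows. This is a rough sketch under stated assumptions, not the dissertation's code: every observation must precede some step named in a mutex constraint, and every step in a mutex constraint must follow some observation. The step abbreviations follow the example (IC, WF, II, IO, PI, PO, G1, G2).

```python
# Check a candidate topological sort for admissibility.
def admissible(sort, observations, mutex_constraints):
    pos = {step: i for i, step in enumerate(sort)}
    obs_positions = [pos[o] for o in observations]
    # Each observation must precede at least one mutexed step.
    for o in observations:
        if not any(pos[o] < pos[s] for group in mutex_constraints for s in group):
            return False
    # Each mutexed step must follow at least one observation.
    for group in mutex_constraints:
        for s in group:
            if not any(p < pos[s] for p in obs_positions):
                return False
    return True

sort = ["IC", "WF", "II", "IO", "PI", "PO", "G1", "G2"]  # Figure 120
print(admissible(sort, ["WF"], [{"PI", "IO"}, {"G1", "G2"}]))  # True
```

Placing "Watch-Forecast" second satisfies both conditions, which is why every other sort consistent with the ordering constraints must also keep it ahead of the mutexed steps.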
Once we have found an admissible topological sort, we can use Build-Tree to construct the plan model (in this example, I will only talk about Build-Tree and will neglect PM2). Build-Tree recursively constructs all of the paths in the decision tree by simulating the application of each of the operations in the plan. I will describe how Build-Tree constructs just one of these paths and then illustrate the entire decision tree that would be constructed for the example plan. In the following description, I will trace the calls to Build-Tree, Build-Action, and Prune-Mutex. Throughout, I will use the first initials of the actions in the plan to denote each action; for example, S_PI = Party-Indoors. The initial topological sort is T = [S_IC, S_WF, S_II, S_IO, S_PI, S_PO, S_G1, S_G2].
Recall that the arguments to Build-Tree are
n: the index of the current step (initially 1),
T: the topological sort,
Π: the path thus far (initially empty),
M: the mutex constraints,
p: the joint probability of the events in the path (initially 1.0), and
v: the value of the events in the path (initially 0.0).

1. Build-Tree(n = 1, T, Π = [ ], M = { {S_PI, S_IO}⊥, {S_G1, S_G2}⊥ }, p = 1.0, v = 0.0)
In the initial call, Build-Tree calls Build-Action to model S_IC. Build-Action constructs a decision tree branch for each value of Weather. I will simulate just the path generated for Weather = sun. Build-Action adds this assignment to the path and multiplies p by P{Weather = sun}:
p = P{Weather = sun} = 0.6
Π = [Weather_IC = sun].
2. Build-Tree(n = 2, T, Π = [Weather_IC = sun], M = { {S_PI, S_IO}⊥, {S_G1, S_G2}⊥ }, p = 0.6, v = 0.0)
Build-Tree calls Build-Action to model S_WF. Again, I will simulate just one path, the path generated for DEC = sunny:
p = 0.6 × P{DEC_WF = sunny | Weather_IC = sun} = 0.48
Π = [Weather_IC = sun, DEC_WF = sunny].
3. Build-Tree(n = 3, T, Π = [Weather_IC = sun, DEC_WF = sunny], M = { {S_PI, S_IO}⊥, {S_G1, S_G2}⊥ }, p = 0.48, v = 0.0)
S_WF was the last observable action in the plan, so Build-Tree decides that this is a good time to make some decisions. (The description of Build-Tree delays decisions until they are forced; see the third bullet on page ….) Build-Tree adds a decision variable D1 to the path with alternatives {execute-S_PI, execute-S_IO} and calls Prune-Mutex on the remaining mutex constraint. The set of goal steps supported by S_PI is {S_G2}. If we choose execute-S_IO, it is impossible to execute S_G2. Similar reasoning applies to S_G1. Thus making the decision between S_PI and S_IO also makes the decision between S_G1 and S_G2. I will simulate the path where D1 = execute-S_PI.

4. Build-Tree(n = 3, T, Π = [W_IC = sun, DEC_WF = sunny, D1 = exec-S_PI], M = { }, p = 0.48, v = 0.0)
S_II is compatible with the path. Build-Action adds Loc_II = indoors to the path.
5. Build-Tree(n = 4, T, Π = [W_IC = sun, DEC_WF = sunny, D1 = exec-S_PI, Loc_II = indoors], M = { }, p = 0.48, v = 0.0)
S_IO is not compatible with the path. Build-Tree skips this action and goes on to the next action.

6. Build-Tree(n = 5, T, Π = [W_IC = sun, DEC_WF = sunny, D1 = exec-S_PI, Loc_II = indoors], M = { }, p = 0.48, v = 0.0)
S_PI is compatible with the path, so Build-Tree calls Build-Action. "Party-Indoors" has both a conditional effect (Held-Party = T) and a cost function (this is the first action that we have explored that has a value function). Build-Action adds the value from this cost function to the past value, 0.0:
Π = [W_IC = sun, DEC_WF = sunny, D1 = execute-S_PI, Loc_II = indoors, Held-Party_PI = T]
v = 0.0 − Cost((W_IC = sun), (Loc = indoors)) = −50.0

7. Build-Tree(n = 6, T, Π = [W_IC = sun, DEC_WF = sunny, D1 = execute-S_PI, Loc_II = indoors, Held-Party_PI = T], M = { }, p = 0.48, v = −50.0)
S_Party-Outdoors is not compatible with the path.

8. Build-Tree(n = 7, T, Π = [W_IC = sun, DEC_WF = sunny, D1 = execute-S_PI, Loc_II = indoors, Held-Party_PI = T], M = { }, p = 0.48, v = −50.0)
S_Goal 1 is not compatible with the path, either.

9. Build-Tree(n = 8, T, Π = [W_IC = sun, DEC_WF = sunny, D1 = execute-S_PI, Loc_II = indoors, Held-Party_PI = T], M = { }, p = 0.48, v = −50.0)
S_Goal 2 is compatible with the path and contributes a reward of 100.0.

The final path is:
Π = [W_IC = sun, DEC_WF = sunny, D1 = execute-S_PI, Loc_II = indoors, Held-Party_PI = T]
The value and probability associated with this path are 50.0 and 0.48, respectively. Build-Tree repeats this process for every branch in the tree. The resulting decision tree is shown in Figure 121. The path constructed in the example is shown in bold.
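The recursion traced above can be sketched as a small path enumerator. This is a simplified, hypothetical re-implementation, not the dissertation's Build-Tree: the probabilities and values (P{sun} = 0.6, forecast accuracy 0.8, indoor costs 50 in sun and 40 in rain, reward 100) are read off Figure 121, and the real algorithm works over topological sorts, mutex constraints, and Build-Action rather than this hard-coded structure.

```python
# Enumerate every (path, probability, value) leaf of the example's decision tree.
def build_tree(weather=None, forecast=None, decision=None, p=1.0, v=0.0):
    if weather is None:                       # chance node: Weather
        for w, pw in [("sun", 0.6), ("rain", 0.4)]:
            yield from build_tree(w, forecast, decision, p * pw, v)
    elif forecast is None:                    # observable chance node: DEC_WF
        p_sunny = 0.8 if weather == "sun" else 0.2
        for f, pf in [("sunny", p_sunny), ("rainy", 1.0 - p_sunny)]:
            yield from build_tree(weather, f, decision, p * pf, v)
    elif decision is None:                    # decision node: D1
        for d in ["exec-S_PI", "exec-S_IO"]:
            yield from build_tree(weather, forecast, d, p, v)
    else:                                     # leaf: apply costs and rewards
        if decision == "exec-S_PI":
            v += 100.0 - (50.0 if weather == "sun" else 40.0)
        else:
            v += 100.0 if weather == "sun" else 0.0
        yield (weather, forecast, decision), round(p, 2), v

for path, p, v in build_tree():
    print(path, p, v)
```

The leaf for ("sun", "sunny", "exec-S_PI") has p = 0.48 and v = 50.0, matching the path traced in steps 1 through 9 above; the eight leaves match the [probability, value] pairs of Figure 121.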
FIGURE 121. The Full Decision Tree Constructed by Build-Tree. The tree branches on Weather_IC (sun/rain), Dec_WF (sunny/rainy), the decision D1 (exec-S_PI/exec-S_IO), and the resulting Loc and Held-Party assignments. The leaf [probability, value] pairs and expectations are: sun/sunny/exec-S_PI: [0.48, 50], 24; sun/sunny/exec-S_IO: [0.48, 100], 48; sun/rainy/exec-S_PI: [0.12, 50], 6; sun/rainy/exec-S_IO: [0.12, 100], 12; rain/sunny/exec-S_PI: [0.08, 60], 4.8; rain/sunny/exec-S_IO: [0.08, 0], 0; rain/rainy/exec-S_PI: [0.32, 60], 19.2; rain/rainy/exec-S_IO: [0.32, 0], 0. Corresponding branches are combined when unobserved variables are marginalized. The path constructed in the example is shown in bold.
5.7.6 Solving the Decision Tree
The first step in solving the decision tree is to multiply the probability and value on each leaf of the decision tree (see the last column in Figure 121). The next step is to marginalize out the unobserved variables by adding the expectations of corresponding branches of the decision tree (see Figure 122). Once the unobserved variables are marginalized out, we can use the resulting tree to determine the decision policies and the expected value for the entire plan. Decisions are removed by selecting the decision alternative with the maximum expected value. Observable variables are removed by summing over the expectations of the leaves attached to the uncertainty. Note that the expectations for intermediate branches are multiplied by the probabilities of the events on the path leading to those intermediate branches. Since the probability of the path is "rolled" into this expectation, there is no need to multiply the expectation by the probability of the observable variable given past observations; these probabilities are already included in the expectation.

Figures 122 through 124 illustrate the rollback procedure. Figure 122 shows the effect of combining (adding) corresponding branches when marginalizing out the unobserved variable (Weather). After the unobserved variables are eliminated, the remaining variables are removed working from the leaves toward the root of the tree. The first variable to remove is the decision. The decision is removed by selecting the alternative that maximizes the utility of the decision (Figure 123). Finally, the remaining uncertainty (the result of Watch-Forecast) is removed by adding the remaining leaves of the tree. It should be repeated that the procedure used to compute policies correctly computes only the final expected utility. The utilities at any interior uncertainty or decision are off by a constant factor: the probability of the observations that separate that uncertainty or decision from the root of the decision tree.
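The rollback can be sketched numerically against the leaves of Figure 121. This is a hedged illustration hand-coded for the example, not general DTPOP code: each leaf expectation is p × v, the unobserved Weather is summed out, the decision is maximized, and the observed forecast is summed.

```python
# Leaves of Figure 121: (weather, forecast, decision) -> (probability, value).
leaves = {
    ("sun", "sunny", "PI"): (0.48, 50), ("sun", "sunny", "IO"): (0.48, 100),
    ("sun", "rainy", "PI"): (0.12, 50), ("sun", "rainy", "IO"): (0.12, 100),
    ("rain", "sunny", "PI"): (0.08, 60), ("rain", "sunny", "IO"): (0.08, 0),
    ("rain", "rainy", "PI"): (0.32, 60), ("rain", "rainy", "IO"): (0.32, 0),
}

# 1. Marginalize the unobserved variable (Weather): add the expectations of
#    corresponding (forecast, decision) branches.  (Figure 122)
exp = {}
for (w, f, d), (p, v) in leaves.items():
    exp[(f, d)] = exp.get((f, d), 0.0) + p * v

# 2. Remove the decision by maximizing expected value.  (Figure 123)
policy = {f: max(("PI", "IO"), key=lambda d: exp[(f, d)])
          for f in ("sunny", "rainy")}

# 3. Remove the observable forecast by summing the remaining leaves.  (Figure 124)
plan_value = sum(exp[(f, policy[f])] for f in ("sunny", "rainy"))

print(policy)                  # {'sunny': 'IO', 'rainy': 'PI'}
print(round(plan_value, 1))    # 73.2
```

Step 1 reproduces the expectations 28.8, 48, 25.2, and 12 of Figure 122; steps 2 and 3 reproduce the maximization of Figure 123 and the final sum 48 + 25.2 = 73.2 of Figure 124.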
FIGURE 122. Decision Tree after Marginalizing Out the Unobserved Variables. The combined expectations are: sunny/exec-S_PI: 28.8; sunny/exec-S_IO: 48; rainy/exec-S_PI: 25.2; rainy/exec-S_IO: 12.
FIGURE 123. Decision Maximization. Given a "sunny" forecast, exec-S_IO (48) dominates exec-S_PI (28.8); given a "rainy" forecast, exec-S_PI (25.2) dominates exec-S_IO (12).
FIGURE 124. Marginalization of Observable Variables. Summing the forecast branches gives 48 + 25.2 = 73.2.
FIGURE 125. Value of the contingent plan. A contingent plan can have an expected value that is larger than a non-contingent plan because execution of plan steps can be made contingent on information collected during execution. In the figure, the expected values of the two plan branches in Figure 120 ("Party-Outdoors" and "Party-Indoors") are plotted as a function of the probability of sun. The optimal policy maximizes the expected value of the plan given the observations seen thus far. If the forecast is observed prior to any other step in the plan, then the optimal policy is to execute the steps corresponding to "Party-Outdoors" if the forecast is for sun (or if there is no forecast), and to execute the steps in the "Party-Indoors" contingency if the forecast is for rain. When the forecast is for rain, the contingent plan has an expected value that is 30 greater than the "Party-Outdoors" plan. The expected value of the non-contingent plan shown in Figure 111 is 60. The expected value of the contingent plan is 73.2, an improvement of 13.2.
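The two expected values compared above can be checked directly. This is a quick numeric sketch, assuming the quantities shown in Figure 121: P{sun} = 0.6, P{sunny forecast | sun} = 0.8, P{sunny | rain} = 0.2, leaf values 100/0 for Party-Outdoors under sun/rain and 50/60 for Party-Indoors.

```python
p_sun = 0.6
p_sunny_given = {"sun": 0.8, "rain": 0.2}
value = {("IO", "sun"): 100, ("IO", "rain"): 0,   # Party-Outdoors
         ("PI", "sun"): 50,  ("PI", "rain"): 60}  # Party-Indoors

# Non-contingent "Party-Outdoors" plan:
ev_outdoors = (p_sun * value[("IO", "sun")]
               + (1 - p_sun) * value[("IO", "rain")])

# Contingent plan: outdoors on a sunny forecast, indoors on a rainy one.
ev_contingent = 0.0
for w, pw in [("sun", p_sun), ("rain", 1 - p_sun)]:
    for f, pf in [("sunny", p_sunny_given[w]), ("rainy", 1 - p_sunny_given[w])]:
        choice = "IO" if f == "sunny" else "PI"
        ev_contingent += pw * pf * value[(choice, w)]

print(ev_outdoors)              # 60.0
print(round(ev_contingent, 1))  # 73.2
```

The 13.2 difference is the value of observing the forecast before committing to a party location.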
5.8 Formal Properties

5.8.1 Soundness
UDTPOP is sound in a very strong sense: every step in a complete plan is effective, and the plan model is correct and consists only of those nodes that are necessary for computing expected value. DTPOP is sound in a weaker sense:

• In the first phase of plan construction, each step is guaranteed to be effective, at least for some step execution policies. DTPOP does not consider the effectiveness of steps during plan optimization, however. It may be optimal for a particular set of branches to select a series of steps that are ineffective. DTPOP plans are weakly effective (see Section B.4).
• The DTPOP evaluator described in Section 5.4.2 underestimates the utility of aborted plans. If at some point the evaluator discovers that every plan branch has negative value, it declares that the optimal policy is to abort plan execution and assigns a value of 0.0 (the worst possible outcome if we do nothing) to that particular decision policy. It is possible, however, for the aborted plan to leave the plan execution agent in a situation that has an expected value greater than zero. Because of this, DTPOP underestimates the overall utility of suboptimal plans and may compute some suboptimal decision policies.

I feel that the impact of these two issues is relatively minor. Neither of these problems will affect the optimal DTPOP plan and, since the evaluator underestimates utility when the plan is suboptimal, the optimal plan of bounded size will have the best expected value.

Theorem 22 (Soundness): A complete DTPOP plan is weakly effective and has correct causal structure for any admissible decision policy. For aborted traces, DTPOP can underestimate plan utility.
5.8.2 Completeness
The completeness proof for DTPOP (Section B.5.3) proves that DTPOP can identify the optimal plan of bounded size. The argument works as follows:

• Say that the optimal solution of length N to a contingent planning problem is a sequence of contingent steps Q = [Q_i], i = 1, …, N, with decision policies f_i.

• In Q, there may be steps that are supported by more than one causal link. This is illegal in DTPOP, so we need to "split" these "fused" links in order to construct a legal DTPOP exemplar.

• A clairvoyant completeness proof shows that DTPOP can always construct a plan equivalent to this new exemplar, and this plan has the same expected value as Q.

Theorem 23 (Completeness): Let Q be a sequence of contingent actions that is an optimal solution of length N to the planning problem D = ⟨…⟩. The cost functions for the possible actions in A are all greater than zero. DTPOP with the appropriate search control strategy can identify a plan P' that has the same expected value as Q.
5.9 Discussion

5.9.1 Related Work
The research that is most directly applicable to the design of UDTPOP and DTPOP has been presented elsewhere. In this section, I will focus the review on relevant research or topics that haven’t been reviewed elsewhere in this dissertation.
5.9.1.1 Causal Link Planners
We have already reviewed the links between DTPOP and Warplan-C [Warren, 76], CNLP [Peot+Smith, 92], C-Buridan [Draper, et al, 93], and Cassandra [Pryor+Collins, 93; 95; 96].
SENSp
SENSp [Etzioni, et al, 92] is a variant on SNLP [McAllester+Rosenblitt, 91] that was designed for the construction of information-collection plans. The specific problem addressed by SENSp is how to express and achieve specific information-collection goals without developing a plan that "sets" the environment in a way commensurate with the desired information goals. Suppose that we are trying to find a file named "dissertation.txt" in a Unix file system. One way to identify a file with the requisite properties is to find any file, rename that file to be "dissertation.txt" and return that file. This is clearly cheating. One of the contributions of SENSp is a mechanism for implementing information retrieval goals that avoids this problem. SENSp annotates effects to distinguish between effects that are observations and effects that are real. SENSp also annotates goals to distinguish between goals of achievement and information-collection goals. Goals can either be goals of satisfaction (satisfy ?p), goals of preservation (hands-off ?p) or information-collection goals (find-out ?p). Goals annotated with "satisfy" are similar to goals in other causal link planners. Preservation goals prevent SENSp from affecting the state of the preserved goal in any way using any action that executes prior to the action containing the "hands-off" goal. Information-collection goals (find-out) protect the state of the information that they are trying to collect unless there is no way to collect the information without changing it.

DTPOP uses the C-Buridan [Draper, et al, 93] technique for distinguishing between observable and non-observable effects. C-Buridan's separate annotation of the observable effects of an action (the DECs) and the real effects of actions is similar in a cursory way to SENSp's separation of observable and real effects. Unlike SENSp, however, C-Buridan cannot express explicit information-collection goals and has no analogue to the "hands-off" concept. All goals in C-Buridan are goals of attainment. Information "goals" arise in C-Buridan only via the branching operation for threat resolution.

Since DTPOP uses C-Buridan's technique for annotating observable and non-observable effects of actions, DTPOP never runs into a problem with confusing informational goals and goals of attainment. Like C-Buridan, DTPOP cannot explicitly formulate information-collection goals. Instead, DTPOP determines the uncertain events that are relevant to the expected value of the plan and uses probabilistic relevance to drive the construction of information-collection plans that provide direct or indirect evidence of the state of the original uncertainties. The "hands off" or "find out" annotations of SENSp are represented implicitly via the causal links in observation and primary plan branches.
If the expected value of the primary plan branch is dependent on a past uncertainty, then there will be a causal link between some action in the primary plan branch and that past uncertainty. This causal link prevents a parallel observation plan from interfering with the protected condition.
ε-Safe Planning [Goldman+Boddy, 94a] and Plinth [Goldman+Boddy, 94b]

The ε-safe planner is a variant of CNLP that uses a selective completion criterion to decide whether it should construct branches for "dangling else" clauses of low probability. The objective is to focus early planning on high-probability outcomes and to complete lower-probability outcomes as resources and time allow. The ε-safe planner also uses probabilistic relevance to drive the construction of models for plans. The ε-safe planner uses a CNLP-like planner to construct alternative branches and then uses AlterID [Breese, 93] to construct a belief network that models the events in that plan. DTPOP takes a fundamentally different approach: the probabilistic relevance criterion is used to drive the planning process more or less directly.

Plinth is a contingent total-order planner similar to the ε-safe planner and CNLP. Goldman+Boddy claim that the case for partial-order planning for contingent plans is less compelling than it is for non-contingent plans. I concur with this judgement, though for different reasons than the one given in the paper. The optimal policy for a DTPOP plan places some fairly significant limitations on the optimal partial orders for the plan (see the section on topological sorts on pg. 198). Generally speaking, steps that are adjacent in the optimal total order can only be swapped if they are unobservable or if they have mutually exclusive execution policies.
DRIPS [Haddawy+Suwandi, 94]
DRIPS (the Decision-Theoretic Refinement Planning System) is a refinement planner that represents sets of actions or sequences of actions as abstract actions whose outcomes are described by probability intervals. These probability (and utility) intervals bound the best- and worst-case performance of every concrete action that can be derived by refining the abstract action. DRIPS uses this abstract action representation to prove the dominance of one class of plans over another without the direct comparison of concrete plans from those classes. DRIPS, therefore, can prune many classes of plans from consideration without analyzing the individual plans.
There is no direct analogue between DTPOP and DRIPS. It would seem to be valuable to use the dominance pruning techniques of DRIPS for pruning provably dominated action sequences in DTPOP. This does not seem to be possible either for the abstraction of sequences or for the abstraction over refinements of an abstract operation. We might be able to apply operator graphs [Smith+Peot, 93] to identify subsequences that are more-or-less independent in the plan graph and use this independence information to power a DRIPS-like abstraction mechanism. Alternatively, we might consider developing a hierarchical version of DTPOP that does directly support Haddawy, et al’s abstraction calculus.
5.9.1.2 Markov Decision Processes
One of the more popular research areas for planning under uncertainty is research on Markov Decision Processes (MDPs) [Puterman, 94; Bellman, 57]. An MDP uses a discrete-time Markov process to model the time evolution of a dynamic system. The state of a dynamic system at one point in time (say T) is represented in terms of a joint probability distribution over the state of the system given the state of the system at time T-1. The MDP adds action decisions and a value function to this basic Markov process; the probability distribution over the state at time T is a function both of the state at T-1 and the action chosen at time T. The reward that the agent receives is either a function of the Markov state at one point in time or some function of the entire trajectory of states. The objective of an MDP-based planner is to determine a decision policy for each action decision as a function of time. This decision policy can be conditioned on various amounts of information:

• In the classic form of the Markov Decision Process, the agent has perfect knowledge of the last state in the Markov process.

• In a Partially-Observable Markov Decision Process (POMDP), the agent may only observe some, partially noisy, function of the state variables.
• It is also possible that the agent can make no observations about the world state, in which case the optimal plan is a fixed sequence of decisions.

The first type of MDP is typically the easiest to solve. The optimization problem for MDPs is known to be NP-complete [Littman, et al, 98]. (Strictly, the existence problem is NP-complete; the optimization problem is not a decision problem.) If there are no restrictions on the size of the solution for a POMDP, the plan existence problem is known to be PSPACE-complete [Littman, et al, 98]. The POMDP problem is more general than the problem attacked by DTPOP and UDTPOP. The objectives of DTPOP and UDTPOP resemble those of POMDP solvers because both are attempting to derive an optimal execution policy based on partial or no observability of the underlying state. POMDPs, however, are more general because they can represent processes that evolve independently from the actions of the planning agent (recall the no-spontaneous-action assumption of Section 2.2.5).

Although there has been a flurry of research into POMDPs and MDPs in the past few years, the basic problem formulation and solution techniques have been established for a long time. Bellman [57] proposed the MDP and the basic solution technique (value iteration) for MDPs and POMDPs. Howard [60] proposed the policy iteration technique for solving MDPs. Sondik [Smallwood+Sondik, 73] demonstrated that a partially-observable Markov decision process can be turned into a totally-observable MDP whose state is the belief state of the agent. Normally this isn't very helpful because the belief state of the agent is continuous, but Sondik showed that the value function over this continuous belief state can be represented efficiently as a (possibly exponentially large) set of vectors representing a piecewise-linear convex function.
The AI planning community has contributed to the POMDP and MDP community by developing techniques that allow the basic POMDP and MDP algorithms to work more efficiently. Some AI researchers [Boutilier, Dean, + Hanks, 95] argue that MDPs are the right representation for unifying all research into probabilistic and utility-based planning. While I feel some sympathy for this sentiment, I feel that there are some points for caution:

• While the best MDP solvers can solve problems with millions of states (corresponding to 20 or so binary variables) [Littman+Majercik, 98], the best POMDP solvers can solve problems with only a few states (approximately 10) [Michael Littman, personal communication, 1998].

• MDP and POMDP algorithms, by and large, work on the space of the joint distribution over the state at each time period. The plan construction and optimization phase of a probabilistic planner, on the other hand, constructs a factorable model. In some domains, this graphical model can be used directly to eliminate information predecessors or to prune dominated plans. For example, in order to guarantee the best possible solution to a general POMDP, we need to construct a plan whose size is exponential in the number of observations and decisions seen thus far. The size of the final solution ultimately bounds the complexity of the underlying algorithm: a POMDP algorithm must be exponential in the number of observations over all time because the size of the optimal plan is exponential in the number of observations. In a factorable model, we can use d-separation [Geiger, et al, 90] or the Bayes-Ball algorithm [Shachter, 98] to prune irrelevant information predecessors from downstream decisions, reducing the size of decision policies by a factor that is exponential in the number of eliminated information predecessors.

Here are some of the recent contributions to the MDP literature from the AI research community:
Envelope techniques: Envelope techniques solve MDPs by identifying a single basic plan and expanding an MDP around that plan. The basic idea is that if you have a pretty good plan for achieving a goal, an MDP can use this plan as a framework for focusing value or policy iteration. Drummond+Bresina [90] develop the initial solution to an MDP from a plan graph that describes what is, hopefully, an efficient solution to the MDP. Dean, Kaelbling, Kirman, and Nicholson [93] solve an MDP by identifying a "one state wide" solution to the MDP and then expanding the "envelope" of this solution to encompass all of the high-priority or high-utility states.

Factoring techniques: Boutilier and others [Boutilier, et al, 95] [Boutilier+Dearden, 94] [Boutilier+Poole, 95] have developed a series of techniques for using conditional independence to aggregate elements in the joint distributions of MDPs. The general idea of structured probabilistic inference (SPI) is to represent the conditional and joint distributions for a probability distribution in terms of a CPT tree. Boutilier+Dearden [94] developed a technique for performing dynamic programming updates in MDPs using this tree representation. Boutilier+Poole [95] developed a tree-based technique for updating the vectors describing the piecewise-linear convex value function used in POMDPs.

Dynamic programming: A lot of the effort in the AI community (and the POMDP community in general) has been aimed at reducing the number of vectors required to represent Sondik's piecewise-linear value function. Littman, Cassandra, and Kaelbling [96] developed the witness algorithm for POMDPs and demonstrated that the performance of this algorithm dominates the performance of algorithms by Monahan [92] and Cheng [88]. The incremental pruning algorithm [Zhang+Liu, 96] is, at this writing, the fastest general algorithm for the solution of POMDPs [Cassandra, et al, 97]. Several authors ([Parr+Russell, 95], [Zhang+Liu, 96]) have developed approximations to the value function, resulting in improved performance.
5.9.1.3 MAXPLAN
Recently, Littman and others [98] demonstrated that the plan existence problem for polynomially bounded probabilistic plans assembled from a succinct representation is NP^PP-complete. Selman and Kautz [96] proposed to encode NP-complete planning problems as NP-complete SAT problems and solve them using an efficient SAT solver. Inspired by the success of this approach, Majercik and Littman [98] have constructed an efficient solver for the canonical NP^PP-complete problem, E-MAJSAT, and use this solver to solve appropriately formulated probabilistic planning problems. The E-MAJSAT problem is the following: given a Boolean formula with choice variables (variables whose truth status can be arbitrarily set) and chance variables (variables whose truth status is determined by a set of independent probabilities), find the setting of the choice variables that maximizes the probability of a satisfying assignment with respect to the chance variables [Littman, et al, 98]. Majercik and Littman have developed a technique for direct solution of this satisfaction problem, drawing from the literature on constraint propagation [Davis, et al, 62]. MAXPLAN encodes UDTPOP-type problems as E-MAJSAT problems and solves them.

I feel that there are a number of important synergies that can be exploited between UDTPOP, DTPOP and MAXPLAN. One of the problems with MAXPLAN is that it doesn't have very much knowledge about the characteristics that tend to lead to one plan dominating another. As a result, MAXPLAN must pursue "to the bitter end" a number of plans that can be demonstrated to be dominated or infeasible using a relatively small number of operations. For example:
ately eliminate a plan by showing that, for some step, no effective conditional outcome is possible or relevant. • The probabilistic relevance criterion mentioned in Chapter 4.0 can be used to
eliminate irrelevant conditional effects from consideration. In addition, plans with irrelevant observations can be pruned.
242
• The DTPOP can be made more systematic by using the utility of plan branches
to control the expansion of plans [Sproull, 77]. It may be possible to build these constraints into MAXPLAN: if the utility of plan branches violates some “canonical order,” we can eliminate the plan from consideration because we know that we have explored a similar one elsewhere in the search space. I expect that this approach will be a very fruitful one to explore further.
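As a concreteness check on the E-MAJSAT definition above, here is a brute-force solver sketch: enumerate choice assignments and, for each, sum the probability of the satisfying chance assignments. The two-clause formula and its probabilities are invented for illustration:

```python
from itertools import product

def e_majsat(choice_vars, chance_probs, formula):
    """chance_probs: {var: P(var is True)}; formula(assignment) -> bool.
    Returns the choice assignment maximizing P(formula satisfied)."""
    chance_vars = list(chance_probs)
    best_choice, best_p = None, -1.0
    for cvals in product([False, True], repeat=len(choice_vars)):
        choice = dict(zip(choice_vars, cvals))
        p_sat = 0.0
        for rvals in product([False, True], repeat=len(chance_vars)):
            chance = dict(zip(chance_vars, rvals))
            p = 1.0
            for v, val in chance.items():
                p *= chance_probs[v] if val else 1.0 - chance_probs[v]
            if formula({**choice, **chance}):
                p_sat += p
        if p_sat > best_p:
            best_choice, best_p = choice, p_sat
    return best_choice, best_p

# (x or r1) and (not x or r2): setting x True makes success hinge on r2.
choice, p = e_majsat(
    ["x"], {"r1": 0.3, "r2": 0.9},
    lambda a: (a["x"] or a["r1"]) and (not a["x"] or a["r2"]))
```

MAXPLAN's solver replaces this doubly exponential enumeration with Davis-Putnam-style constraint propagation, but the quantity being maximized is the same.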
5.9.1.4 Knowledge-Based Model Construction
Research in knowledge-based model construction is very relevant to my research on contingent planning. The most relevant work is on the construction of influence diagrams and belief networks from declarative representations. AlterID [Breese 87; 92] and Frail3 [Goldman+Charniak, 93] both use a probabilistic relevance criterion to drive the construction of influence diagrams, including the construction of information-collection "plans." DTPOP uses the same probabilistic relevance techniques to identify information-collection plans. My research differs in two respects:
• DTPOP is a planner and, therefore, has to reason about threat resolution.50
• The probabilistic relevance mechanisms in Frail3 and AlterID are designed for uncertain models that are symmetric. DTPOP uses a probabilistic relevance mechanism that is designed for the highly asymmetric models of contingent plans.
5.9.2 Knowledge Preconditions
McCarthy+Hayes [69], Moore [85] and others [Morgenstern, 87; Davis, 94] have argued the case for knowledge preconditions for actions and plans. The knowledge preconditions for an action are the information that an agent requires in order to execute that action. The knowledge preconditions for a plan are the information required for an agent to execute a plan, including both the information that the agent has at the time that planning commences and the information gained during the course of execution [Davis, 93].

DTPOP does not use knowledge preconditions for actions. Actions are assumed to be executable regardless of the state of the world; the exact effect of an action in different world states is embedded in the definition of the action. For example, we might compose two types of move actions. move-1 is designed to move the agent from one specific location to another specific location. If the agent is not at the location designated in the precondition for the action, the agent's position is unchanged. move-1 implicitly assumes that the final decision concerning whether the action is executable rests with the agent: the agent is assumed to be able to sense the world to the extent that it needs to in order to execute the action. move-2, on the other hand, is designed to move the agent forward some number of feet, no matter where the agent is pointing. The exact effect of the action is a function of the agent's location in the world, the presence of obstacles, cliffs, etc. In this instance, less is assumed of the plan execution agent: the agent may have no ability to sense the world and therefore cannot collect knowledge preconditions.

DTPOP does not implement the "formal" definition of knowledge preconditions for plans, either. UDTPOP, the basis for DTPOP, allows an agent to execute any sequence of actions whether or not they are feasible. The formal definition of knowledge preconditions for plans usually makes some sort of guarantee that the plan will eventually execute successfully if the appropriate knowledge is available before plan execution. A DTPOP plan is not guaranteed to achieve a goal; the DTPOP planner only guarantees that the model for any DTPOP plan is correct and that its outcome isn't too awful (e.g. it is at least as good as the worst utility outcome possible).

50. David Smith (personal communication, 1990) points out that the principal distinction between theorem provers and planners is that the latter have to reason about threats between parallel threads of action and have special mechanisms for making threat detection and resolution efficient.
DTPOP does, however, allow the collection of information that lets the agent increase the probability that a goal will be achieved. In particular, DTPOP plans information collection actions that allow the agent to evaluate and select among contingent plan branches. Our approach is similar to the one espoused by Pryor [95]. There is an implicit decision after every observation: the agent can elect to pursue or abandon any number of plan branches subject to a set of mutual exclusion constraints.51

Pryor adds explicit decisions to plans and adds explicit knowledge goals (or preconditions) to those decisions that allow Cassandra [Pryor+Collins, 93; 95] to select a course of action that is guaranteed to accomplish the stated goal. The "policy" in these decisions is a set of condition-action rules derived from an analysis of how plan branches depend on past uncertain events. For example, suppose that we are trying to synthesize a plan to drive from one location to another in a domain that includes a number of drawbridges. We might derive a plan branch whose success is dependent on the state of one or more of these bridges. When Cassandra "builds" a decision policy to choose between two or more alternative plan branches, it adds knowledge preconditions to this decision that direct the planner to acquire information about the state of the drawbridges. Cassandra cannot construct information collection plans for these knowledge preconditions: it is assumed that when the agent needs to decide between different courses of action, it will have perfect knowledge of the state of the relevant knowledge preconditions.52 In our drawbridge example, it is assumed that the agent is able to perfectly observe the state of the drawbridges before deciding on a course of action.

Where DTPOP differs from Cassandra (and from much of the formal work derived from Moore [85]) is in the relatively rich (though propositional) knowledge representation scheme that it steals from C-Buridan [Draper, et al, 93; 94a; 94b]. C-Buridan makes a strong distinction between what is true in the world and what the agent believes about the
51. These mutual exclusion constraints, by the way, summarize limitations of the underlying model of the plan. If the mutual exclusion constraints are violated, the underlying model for the plan can no longer be guaranteed to be sound. The action sequences derived by violating the mutual exclusion constraints might, nonetheless, be quite good.
52. Cassandra does not have an inference engine that can compute the state of the world based on indirect observations.
world. The work of Moore, Pryor, and others [Moore, 85; Pryor, 95; Morgenstern, 87] all seems to have the form "if you need to know XXX in order to execute an action, then you should observe XXX directly." With this "impoverished" model of observations, regression on knowledge preconditions makes sense (see Section 5.9.4). With probabilistic inference, however, we can use very indirect observations to give us hints about the structure of the world state. This implies that we should be looking for plans that depend on uncertain events rather than looking for specific tests for uncertain events.

For example, say that Ted wants to ice skate on the pond behind his house but fears falling through the ice. He might choose to test the ice directly using some action such as boring a hole through the ice to measure its thickness. This is the sort of direct observation that one might use to satisfy a "thickness of the ice" precondition. Ted can also learn about the ice indirectly through the execution of a variety of plans that 1) depend on the thickness of the ice and 2) have an observable outcome. For example, he might do one of the following to reduce his uncertainty about the safety of the ice:
• Ted might throw a big rock onto the ice.
• Ted might encourage his sister to skate on the ice.
• Ted might encourage his St. Bernard to fetch a stick thrown out onto the ice.
• Ted might construct a model of the freezing process, the depth of the pond, and the temperature distribution over the recent past to determine the thickness of the ice.
There are an unbounded number of ways to satisfy the "ice thickness" precondition that all involve indirect observations of the results of plans that depend on the precondition. I will demonstrate in Section 5.9.4 that this behavior is a consequence of partial observability. I feel that the term knowledge precondition is, therefore, a bit deceptive. There is a requirement to have enough information prior to the execution of an action to make that action "ground" or executable, but the information that needs to be gathered in order to satisfy a knowledge precondition is not necessarily enumerable in advance. Instead, I
believe that the action should be defined in terms of the effects that that action would have as a function of preconditions on the world state (possibly unknown) and parameters set by the plan execution agent. The parameters set by the agent are decisions that are a function of information that is probabilistically relevant to the uncertain precondition parameters and to value nodes that are descendants of these decisions.
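In the spirit of the ice example, a single Bayes update shows how an indirect observation (the rock staying on the surface) bears on the unobserved precondition (the thickness of the ice). The sensor-model numbers here are illustrative assumptions, not values from the text:

```python
def posterior(prior, like_h, like_not_h):
    """P(H | e) from P(H), P(e | H), and P(e | not H)."""
    num = prior * like_h
    return num / (num + (1.0 - prior) * like_not_h)

p_thick = 0.5   # assumed prior that the ice is thick enough
# Hypothetical model of the "throw a rock" plan: the rock stays on the
# surface with probability 0.95 if the ice is thick, 0.2 if it is thin.
p_thick_given_rock = posterior(p_thick, 0.95, 0.2)
```

Any plan whose observable outcome is probabilistically relevant to the thickness of the ice (the sister, the St. Bernard) supports the same kind of update with a different likelihood pair.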
5.9.3 Observation Actions
I alluded to an interesting knowledge acquisition problem in Section 5.7 (Footnote 39 on page 219). If an observation action is uncertain and the uncertainty for that action is contained in the action, then it may be possible to use a large number of observation actions to build up what is, in essence, a perfect observation of the uncertain variable. Consider the following weather forecast action: if the weather is going to be sunny, we observe the otherwise empty "Sun" DEC with probability 0.8 and the "Rain" DEC with probability 0.2. If it is going to rain, we observe the two DECs with reversed probabilities, 0.2 for "Sun" and 0.8 for "Rain". Let's say that our prior on the weather is a 50% chance of sun and a 50% chance of rain. If we observe a forecast of rain, the likelihood ratio for rain vs. sun is 4:1, so the posterior probability of rain after one observation is 80%.

Now say that we watch the forecast a number of times. If we watch the forecast 20 times and it comes up rain 15 times, what is the probability that it will rain? Given our model, the odds ratio for rain vs. sun is 4^15 / 4^5 = 4^10 = 1048576. In other words, watching the forecast enough times will give us a more-or-less perfect assessment of the future state of the weather. This is clearly a problem.

The problem in this situation is that we are not being sensitive to the (lack of) independence of repeated copies of an observation action. In order to encode this observation action correctly, we need to factor out the correlations between duplicate
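The odds arithmetic above can be checked directly; exact fractions keep the ratios explicit:

```python
from fractions import Fraction

lr = Fraction(4, 1)             # likelihood ratio per rain forecast (0.8/0.2)

odds_one = lr                   # one rain forecast against an even prior
p_one = odds_one / (1 + odds_one)       # posterior P(rain) = 4/5

odds_twenty = lr**15 * (1 / lr)**5      # 15 rain forecasts, 5 sun forecasts
p_twenty = odds_twenty / (1 + odds_twenty)
```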
observation actions and leave in only the uncertainty that we feel is truly independent. Here is an example: we know that the individual forecasts for the local news stations are a function of the measurements provided by the National Weather Service and of the individual forecaster's ability to interpret the data. Once a forecaster has come up with a forecast, it doesn't change for some amount of time (let's say until the next day).
[Figure 126 shows a belief network: Unknown Weather and NWS Bias feed the NWS Forecast; the NWS Forecast, together with Forecaster 1 Bias and Forecaster 2 Bias, determines Forecast 1 and Forecast 2.]
FIGURE 126. Simple Forecast Example.
In this model, the watch-forecast operations themselves are completely deterministic. The uncertainty is all hidden in the unknown bias of the individual forecasters and the unknown bias of the National Weather Service forecast (Fig. 126). Watching multiple forecasts prepared by the same forecaster does not reveal any more information about the data provided by the NWS: given the conditioning information, Forecaster Bias and NWS Forecast, the forecast is completely determined. Watching the forecast from one forecaster multiple times during the day has the same effect as watching a video-taped recording of the forecast multiple times (Fig. 127). Watching forecasts from multiple forecasters eventually will provide perfect information about the NWS report, but doesn't provide perfect information about the weather. Essentially, multiple observations are removing the uncertainty introduced by the multiple forecasters' interpretations of the NWS forecast.53

53. I am assuming that the individual forecasters do not use their own data.
[Figure 127 shows a belief network: the NWS Forecast and Forecaster 1 Bias jointly determine the Radio Forecast at 5:00 pm, the TV Forecast at 6:00 pm, and the TV Forecast at 11:00 pm.]
FIGURE 127. Multiple Forecasts.
When engineering observation actions, we need to be particularly careful to consider the independence of multiple observation actions. In the original watch-forecast action, the individual forecasts did not have the degree of independence implied by the distribution and preconditions of the action. In order to repair this problem, we need to remove the uncertainty from the watch-forecast action itself and associate it with an external variable, either in the initial conditions or in any action that can affect the state of the weather or the quality of the forecast. An important question to ask when engineering an observation action is "Are the effects of multiple instances of this observation independent given the preconditions?" If the answer is "yes", the action is properly designed. If not, then we may have to pull the correlations between copies of the observation into an external "bias" or "noise" variable.
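The contrast between the naive encoding and the factored model of Figure 126 can be sketched by enumeration. The honest/contrarian bias model and all of the numbers below are illustrative assumptions:

```python
def post_naive(prior, acc, k):
    """P(rain | k rain forecasts) if every viewing were an independent
    noisy observation with accuracy acc."""
    num = prior * acc**k
    return num / (num + (1 - prior) * (1 - acc)**k)

def post_factored(prior, p_nws, p_honest):
    """P(rain | forecaster says rain) when the forecast is a deterministic
    function of the NWS forecast and an honest/contrarian bias."""
    num = den = 0.0
    for rain, pw in ((True, prior), (False, 1 - prior)):
        for nws_rain, pn in ((rain, p_nws), (not rain, 1 - p_nws)):
            for honest, pb in ((True, p_honest), (False, 1 - p_honest)):
                says_rain = nws_rain if honest else not nws_rain
                joint = pw * pn * pb
                if says_rain:          # condition on hearing "rain"
                    den += joint
                    if rain:
                        num += joint
    return num / den

one = post_naive(0.5, 0.8, 1)        # posterior after one viewing
two = post_naive(0.5, 0.8, 2)        # naive model keeps sharpening
fact = post_factored(0.5, 0.8, 0.9)  # factored model: one viewing's worth
```

Under the naive model the posterior keeps sharpening with every viewing; under the factored model the forecast is a deterministic function of its parents, so re-watching the same forecast conditions on the same variable twice and leaves the posterior at its one-viewing value.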
5.9.4 Classes of Plans and Independence
There is an interesting relationship between the kinds of plans that are constructed by DTPOP and the kinds of actions that are in the domain description. Let's consider the following four classes of domains:
• Deterministic domains: Action effects are determined by the values of preconditions. The effects of the initial conditions step are certain.
• Uncertain domains without observations: Actions and the initial conditions may contain effects that are uncertain, but unobservable.
• Uncertain domains with perfect observability: Actions and initial conditions may contain effects that are uncertain. Every effect is perfectly observable; that is, there is one distinct DEC for each distinct outcome of the action.
• Uncertain domains with partial observability: Actions and the initial conditions may contain effects that are uncertain. Some actions have observable effects.
I will show that indirect information plans are only generated for the last class of domain.
Deterministic plans
If all of the actions in a plan are deterministic, then DTPOP will never generate an open uncertainty. Recall that open preconditions are treated as deterministic. If every conditional effect in a plan is deterministic, then every node in the plan model is functionally determined and there will never be any open uncertainties. DTPOP will never call add-step-forward or add-link-forward; thus it never constructs any plan branches that are not terminated by a goal. In fact, if the plan construction phase of DTPOP ever constructs more than one plan branch, the optimizer will prune the plan: if the branches are deterministic, then one branch will always be worse than the other, and the execution policies for the dominated branch will always be false.54 The only type of plan that DTPOP can return for these domains is a single-branch plan consisting of functionally determined actions. This is the same sort of plan returned by a classical planner like SNLP [McAllester+Rosenblitt, 91].

54. I should add a restriction to add-branch that prevents add-branch from adding a new branch to a plan that consists of only one functionally determined branch.

Uncertain domains without observations

Once DTPOP begins to construct an observation plan, it can only remove the open uncertainties in the "terminal" steps of this observation plan if that terminal step is observable
(recall Figure 87). Since there are no observable actions, there is no way to terminate the expansion of an observation plan. DTPOP, therefore, cannot return a plan that is not terminated by a goal. Since there are no observations, the DTPOP optimizer will not be able to find observations that would allow it to optimize a multi-branch plan. Therefore, the only kind of plan that DTPOP can return for these domains is a single-branch plan consisting of uncertain actions. This is the same type of plan returned by Buridan [Kushmerick, et al, 94] or UDTPOP.
Uncertain domains with perfect observability
If every effect is observable, then no effect can be an open uncertainty, by definition (see pg. 211). The model construction algorithm makes all of the conditional effects of a step into observable nodes if the DECs for that conditional effect allow all of the conditional outcomes to be distinguished. Since all of the effects of each action are perfectly observable, this is the case. DTPOP can never call add-step-forward or add-link-forward in order to develop an indirect observation plan for these uncertainties. For these domains, DTPOP develops a multiple-branch plan consisting only of branches terminated by goal steps. This is the same kind of plan as that produced by Cassandra [Pryor+Collins, 93].
Uncertain domains with partial observability
What the previous three cases show is that indirect observation plans cannot be returned by DTPOP unless the domain is uncertain and partially observable.
5.10 Extensions and Conclusions

5.10.1 Experimental Validation

Cassandra [Cassandra, et al, 97; Cassandra 98] has assembled a database of domains for benchmarking the performance of POMDP algorithms.55 Clearly, benchmarking DTPOP against some of the more traditional POMDP algorithms would be informative. On one
hand, the early experiments with DTPOP seem to indicate that it is not efficient at all. The principal reason for this seems to be twofold:
• The size of the search space increases exponentially with the number of branches in the plan (discussed in 5.10.3), and the search space is highly nonsystematic.
• The number of observation plans for a given uncertainty is huge. Any plan that depends on this uncertainty and that also has an observable outcome is a valid observation plan for this uncertainty.
On the other hand, the POMDP algorithms also seem to suffer from severe efficiency problems. The ceiling on the number of states (not variables) for a POMDP seems to be in the vicinity of 10 to 20 states [Littman, personal communication, 98] [Zhang+Liu, 96]. These domains are smaller than the smallest of toy domains for normal AI planning; for example, a Blocks World domain with only 4 blocks has 85 states. My current belief is that POMDP-style approaches eventually will be more efficient than DTPOP because a POMDP does not need to distinguish between initial segments of a plan that result in the same belief and world state (this is the dynamic programming principle). I believe, however, that these POMDP algorithms could benefit from some sort of probabilistic relevance analysis to prune actions that cannot be relevant to a goal. It is an open question whether there is a tractable technique for doing this.
5.10.2 Evaluation of Partial Plans

One of the missing elements in DTPOP is an admissible evaluation function for partial plans. Sproull [77] suggests that in perfectly-observable domains, plan branches can be constructed in decreasing (***Check this) order of utility without compromising completeness. This implies that the expected value of the already constructed plan branches, conditioned on the observations that lead to those branches, is an upper bound on the expected value of the overall plan.

55. http://www.cs.brown.edu/people/arc/research/pomdp-examples.html
It is an open question whether a similar technique can be derived for DTPOP.
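Under one plausible reading of Sproull's suggestion, the bound works out as follows; the branch probabilities and utilities are invented for the sketch:

```python
# Branches constructed so far, in decreasing order of utility: each entry
# is (probability of reaching the branch, utility of the branch).
built = [(0.5, 10.0), (0.3, 6.0)]
mass_built = sum(p for p, _ in built)
remaining_mass = 1.0 - mass_built

# Expected utility conditioned on reaching a constructed branch.
cond_value = sum(p * u for p, u in built) / mass_built

# Any branch not yet constructed has utility <= the last constructed one
# (6.0 here), so the full plan's expected utility is at most:
full_plan_cap = sum(p * u for p, u in built) + remaining_mass * 6.0
```

Since every unconstructed branch is no better than the worst constructed branch, the cap never exceeds the conditional expected value, which is what would make the conditional expected value admissible as a bound.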
5.10.3 Multiplicative Growth in Search Space with Plan Branches

The size of the search space for DTPOP is exponential in the number of elements in the plan. In particular, the search space is exponential in the number of plan branches. It seems that we should be able to achieve some sort of factoring of the search workload with respect to the individual branches so that these search space sizes are additive instead of multiplicative. Dan Weld56 suggests that the right way to construct a contingent plan is to glue a lot of noncontingent plans together. This suggests the following architecture: use UDTPOP to construct individual noncontingent plan branches, then figure out some systematic way to "glue" these branches together to form a set of good contingent plans. Denise Draper57 points out, however, that the problem of "zipping" together independently derived plans is also known to be intractable [REFERENCE, ***].

One incomplete strategy for handling this problem is to develop some number of candidate plan branches using DTPOP, select one that achieves a high utility outcome with high probability, and use that plan as the seed for a new DTPOP plan. Instead of searching for the plan of highest overall utility, we identify one plan branch that has high utility conditioned on some value for the uncertainties and then attempt to robustify that plan by adding plan branches that accomplish the goal in those circumstances where the original plan branch fails. If this strategy is used, we sacrifice completeness but gain an exponential speedup in the time required to find a contingent plan.
56. personal communication, 1993.
57. personal communication, 1994.
5.11 Contributions

This chapter described DTPOP, a novel utility-directed contingent partial-order planning algorithm. The primary contributions of this chapter include:
• DTPOP is shown to be sound, in the sense that the causal link structure correctly models the utility of the individual plan branches and that all of the steps in each plan are effective given the appropriate choice of decision policy.
• DTPOP is shown to be complete. DTPOP is guaranteed to find the best possible solution for a planning problem. DTPOP is the first regression-based contingent planner for partially observable domains that has been shown to be complete.
• A new technique was developed to determine the relevance of events in asymmetric models. This technique was used to develop the novel open uncertainty mechanism in DTPOP.
6.0 Bibliography
[Allen, et al, 90] Allen, J.; Hendler, J.; and Tate, A. Readings in Planning. San Mateo, California: Morgan Kaufmann.
[Bellman, 57] Bellman, R. Dynamic Programming. Princeton University Press.
[Blythe, 94] Blythe, J. Planning with External Events. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, p. 94-101.
[Blythe, 96] Blythe, J. Event-Based Decompositions for Reasoning about External Change in Planners. Proceedings of The Third International Conference on Artificial Intelligence Planning Systems. Menlo Park: AAAI Press, p. 27-34.
[Boutilier+Dearden, 94] Boutilier, C. and Dearden, R. Using abstractions for decision-theoretic planning with time constraints. Proceedings of the Twelfth National Conference on Artificial Intelligence, p. 1016-22.
[Boutilier, et al, 95a] Boutilier, C.; Dearden, R.; and Goldszmidt, M. Exploiting structure in policy construction. AAAI Spring Symposium on Extending Theories of Action, Stanford.
[Boutilier, et al, 95b] Boutilier, C.; Dean, T.; and Hanks, S. Planning Under Uncertainty: Structural Assumptions and Computational Leverage. Proceedings of the Third European Workshop on Planning (EWSP'95), Assisi, Italy, September, 1995.
[Boutilier, et al, 96] Boutilier, C.; Friedman, N.; Goldszmidt, M.; and Koller, D. Context-Specific Independence in Bayesian Networks. Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, p. 115-23.
[Boutilier+Poole, 96] Boutilier, C. and Poole, D. Computing Optimal Policies for Partially Observable Decision Processes using Compact Representations. Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.
[Brandman, et al, 90] Brandman, Y.; Orlitsky, A.; and Hennessy, J. A Spectral Lower Bound Technique for the Size of Decision Trees and Two-Level AND/OR Circuits. IEEE Transactions on Computers, 39:2, p. 282-7.
[Breese, 87] Breese, J. Knowledge Representation and Inference in Intelligent Systems. Ph.D. Dissertation, Department of Engineering-Economic Systems, Stanford University.
[Breese, 92] Breese, J. Construction of belief and decision networks. Computational Intelligence, 8(4), p. 624-48.
[Bylander, 92] Bylander, T. Complexity results for extended planning. Proceedings of the First International Conference on AI Planning Systems, San Mateo, CA: Morgan Kaufmann, p. 20-7.
[Cassandra, et al, 94] Cassandra, A.; Kaelbling, L.; and Littman, M. Acting optimally in partially observable stochastic domains. Proceedings of the Twelfth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.
[Cassandra, et al, 97] Cassandra, A.; Littman, M.; and Zhang, N. L. Incremental Pruning: A Simple, Fast, Exact Algorithm for Partially Observable Markov Decision Processes. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann.
[Cassandra, 98] Cassandra, A. R. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. Ph.D. Thesis, Brown University, Providence, RI, 1998.
[Chapman, 87] Chapman, D. Planning for Conjunctive Goals. Artificial Intelligence 32, p. 333-77.
[Cheuk+Boutilier, 97] Cheuk, A. and Boutilier, C. Structured Arc Reversal and Simulation of Dynamic Probabilistic Networks. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, p. 72-9.
[Davis, 94] Davis, E. Knowledge Preconditions for Plans. Journal of Logic and Computation, 4(5):721-66.
[Davis, et al, 62] Davis, M.; Logemann, G.; and Loveland, D. A machine program for theorem proving. Communications of the ACM, 5:394-7.
[de Kleer, 86] de Kleer, J. An Assumption-based TMS. Artificial Intelligence 28, p. 127-62.
[Dean, et al, 93] Dean, T.; Kaelbling, L.; Kirman, J.; and Nicholson, A. Planning with deadlines in stochastic domains. Proceedings of the Eleventh National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press, p. 574-9.
[Dean+Wellman, 91] Dean, T. and Wellman, M. Planning and Control. San Mateo, CA: Morgan Kaufmann.
[Dean+Kanazawa, 89] Dean, T. and Kanazawa, K. A model for reasoning about persistence and causation. Computational Intelligence 5(3), p. 142-50.
[Dechter, 96] Dechter, R. Bucket elimination: A unifying framework for probabilistic inference. Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, p. 211-9.
[Doan, 96] Doan, A. Modeling Probabilistic Actions for Practical Decision-Theoretic Planning. Proceedings of The Third International Conference on Artificial Intelligence Planning Systems. Menlo Park: AAAI Press, p. 62-9.
[Doan+Haddawy, 95] Doan, A. H. and Haddawy, P. Decision-theoretic refinement planning: Principles and application. Technical Report TR 95-01-01, Department of Electrical Engineering and Computer Science, University of Wisconsin. Available via anonymous FTP from ftp.cs.uwm.edu/pub/tech_reports.
[Draper, et al, 93] Draper, D.; Hanks, S.; and Weld, D. Probabilistic Planning with Information Gathering and Contingent Execution. Technical Report 93-12-04, University of Washington.
[Draper, et al, 94a] Draper, D.; Hanks, S.; and Weld, D. A Probabilistic Model of Action for Least-Commitment Planning with Information Gathering. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Seattle, August 1994.
[Draper, et al, 94b] Draper, D.; Hanks, S.; and Weld, D. Probabilistic Planning with Information Gathering and Contingent Execution. Proceedings of the Second International Conference on Artificial Intelligence Planning Systems. Menlo Park: AAAI Press.
[Draper, 96] Draper, D. L. Localized Partial Evaluation of Belief Networks. Ph.D. Dissertation, Technical Report 96-02-02, Department of Computer Science and Engineering, University of Washington, Seattle.
[Drummond+Bresina, 90] Drummond, M. and Bresina, J. Anytime synthetic projection: Maximizing the probability of goal satisfaction. Proceedings of the Eighth National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press, p. 138-44.
[Einav, 91] Einav, D. Reasoning with Uncertainty and Resource Constraints. Ph.D. Dissertation, Department of Engineering-Economic Systems, Stanford University.
[Einav+Fehling, 90] Einav, D. and Fehling, M. R. Computationally-optimal real-resource strategies. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. Piscataway, NJ: IEEE, p. 581-6.
[Etzioni, et al, 92] Etzioni, O.; Hanks, S.; Weld, D.; Draper, D.; Lesh, N.; and Williamson, M. An Approach to Planning with Incomplete Information. Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning.
[Fikes+Nilsson, 71] Fikes, R. E. and Nilsson, N. J. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence 2(3/4), p. 189-208.
[Geiger, et al, 90] Geiger, D.; Verma, T.; and Pearl, J. Identifying independence in Bayesian networks. Networks 20, p. 507-34.
[Georgeff, 86] Georgeff, M. The Representation of Events in Multiagent Domains. Proceedings of the Fifth National Conference on Artificial Intelligence. Los Altos, CA: Morgan Kaufmann.
[Ginsberg+Smith, 88a] Ginsberg, M. L. and Smith, D. E. Reasoning about Action I: A Possible Worlds Approach. Artificial Intelligence 35, p. 165-95.
[Ginsberg+Smith, 88b] Ginsberg, M. L. and Smith, D. E. Reasoning about Action II: The Qualification Problem. Artificial Intelligence 35, p. 311-42.
[Golden, 97] Golden, K. Planning and Knowledge Representation for Softbots (Software Agents, Local Closed World, Sensing Actions, Incomplete Information). Ph.D. Dissertation, Department of Computer Science, University of Washington.
[Golden+Weld, 96] Golden, K. and Weld, D. Representing Sensing Actions: The Middle Ground Revisited. Proceedings of the Conference on Knowledge Representation (KR 96).
[Goldman+Boddy, 94a] Goldman, R. P. and Boddy, M. S. Epsilon-safe planning. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, p. 253-61, Seattle, July 1994.
[Goldman+Boddy, 94b] Goldman, R. P. and Boddy, M. S. Conditional linear planning. Proceedings of the Second International Conference on Artificial Intelligence Planning Systems. Menlo Park, CA: AAAI Press.
[Goldman+Boddy, 94c] Goldman, R. P. and Boddy, M. S. Representing uncertainty in simple planners. Principles of Knowledge Representation and Reasoning: Proceedings of the Fourth International Conference (KR94). San Mateo, CA: Morgan Kaufmann.
[Goldman+Boddy, 96] Goldman, R. P. and Boddy, M. S. Expressive Planning and Explicit Knowledge. Proceedings of The Third International Conference on Artificial Intelligence Planning Systems. Menlo Park: AAAI Press, p. 110-7.
[Goldman+Charniak, 93] Goldman, R. P. and Charniak, E. A language for construction of belief networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(3), p. 196-208.
[Goldsmith, et al, 96] Goldsmith, J.; Lusena, C.; and Mundhenk, M. The complexity of deterministically observable finite-horizon Markov decision processes. Technical Report 268-96, Department of Computer Science, University of Kentucky.
[Goldsmith, et al., 97] Goldsmith, J.; Littman, M.; and Mundhenk, M. The Complexity of Plan Existence and Evaluation in Probabilistic Domains. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. Morgan-Kaufmann.
[Haddawy, 91] Haddawy, P. Representing Plans Under Uncertainty: a Logic of Time, Chance and Action. Ph.D. Dissertation, University of Illinois at Urbana-Champaign.
[Haddawy, 94] Haddawy, P. Generating Bayesian networks from probability logic knowledge bases. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, p. 262-9.
[Haddawy, et al., 95] Haddawy, P.; Doan, A.; and Goodwin, R. Efficient Decision-Theoretic Planning: Techniques and Empirical Analysis. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, p. 229-36.
[Haddawy+Doan, 94] Haddawy, P. and Doan, A., Abstracting Probabilistic Action, Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann, p. 94-101.
[Haddawy+Suwandi, 94] Haddawy, P. and Suwandi, M. Decision-Theoretic Refinement Planning using Inheritance Abstraction. Proceedings of the Second International Conference on AI Planning Systems.
[Hanks, 90] Hanks, S. Practical Temporal Projection, Proceedings Eighth National Conference on Artificial Intelligence, Menlo Park, CA: The AAAI Press, p. 138-44.
[Hanks, 97] Hanks, S. Planning as Decomposition Search, unpublished manuscript.
[Hart, et al., 68] Hart, P.; Nilsson, N.; and Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on SSC, SSC-4, p. 100-7. A correction to this paper was published in: Hart, P.; Nilsson, N.; and Raphael, B. Correction to "A formal basis for the heuristic determination of minimum cost paths." SIGART Newsletter, 37, p. 28-9, 1972.
[Heckerman, et al., 89] Heckerman, D. E.; Horvitz, E. J.; Ng, K. C.; and Nathwani, B. N. The Pathfinder system, Proceedings: The Thirteenth Annual Symposium on Computer Applications in Medical Care, Washington, DC: IEEE Comput. Soc. Press, p. 952.
[Heckerman, et al., 92] Heckerman, D. E., Horvitz, E. J., and Nathwani, B. N. Toward normative expert systems. I. The Pathfinder Project, Methods of Information in Medicine, vol. 31, no. 2, p. 90-105.
[Heckerman+Nathwani, 92] Heckerman, D. E. and Nathwani, B. N. Toward normative expert systems. II. Probability-based representations for efficient knowledge acquisition and inference, Methods of Information in Medicine, vol. 31, no. 2, p. 106-16.
[Howard, 60] Howard, R. Dynamic Programming and Markov Processes. Cambridge, Massachusetts: MIT Press.
[Howard, 66] Howard, R. A., Information value theory, IEEE Transactions on Systems Science and Cybernetics, vol. SSC-2, p. 22-6.
[Howard, 68] Howard, R. A., The Foundations of Decision Analysis. IEEE Transactions on Systems Science and Cybernetics, Vol. SSC-4, No. 3, p. 211-19.
[Howard, 90] Howard, Ronald A., "From Influence to Relevance to Knowledge", Influence Diagrams, Belief Nets and Decision Analysis, p. 3-23, Eds. R. M. Oliver and J. Q. Smith, John Wiley & Sons Ltd.
[Howard+Matheson, 81] Howard, R. A. and Matheson, J. E. Influence diagrams. In Principles and Applications of Decision Analysis, Vol. 2 (1984). Menlo Park, CA: Strategic Decisions Group.
[Jensen et al., 90a] Jensen, F. V., Olesen, K. G., and Andersen, S. K. An Algebra of Bayesian Belief Universes for Knowledge-based Systems. Networks, 20(5), p. 637-59.
[Jensen et al., 90b] Jensen, F. V.; Lauritzen, S. L.; and Olesen, K. G. Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly, 4, p. 269-82.
[Jensen+Andersen, 90] Jensen, F. and Andersen, S. K. Approximations in Bayesian Belief Universes for Knowledge-based Systems. Proceedings of the Sixth Workshop on Uncertainty in Artificial Intelligence.
[Jensen et al., 94] Jensen, F.; Jensen, F. V.; and Dittmer, S. From Influence Diagrams to Junction Trees. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan-Kaufmann, p. 367-73.
[Joslin+Pollack, 94] Joslin, D. and Pollack, M. E. Least-cost flaw repair: A plan refinement strategy for partial-order planning, Proceedings of the Twelfth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI. p. 1004-9.
[Joslin+Pollack, 95] Joslin, D. and Pollack, M. E. Is early commitment in plan generation ever a good idea? Proceedings of the Thirteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI. p. 1188-93.
[Kambhampati, 91] Kambhampati, S. On the utility of systematicity: Understanding tradeoffs between redundancy and commitment during partial order planning. Proceedings of IJCAI-91.
[Kambhampati, 92] Kambhampati, S. Characterizing Multi-Contributor Causal Structures for Planning, Proceedings of the First International Conference on Artificial Intelligence Planning Systems. San Mateo, CA: Morgan-Kaufmann.
[Kambhampati, 94a] Kambhampati, S. Design Trade-offs in Partial Order (Plan Space) Planning, Proceedings of the Second International Conference on Artificial Intelligence Planning Systems. Menlo Park, CA: AAAI Press.
[Kambhampati, 94b] Kambhampati, S. Multi-Contributor Causal Structures for Planning: A Formalization and Evaluation. Artificial Intelligence, Vol. 69.
[Kanazawa+Dean, 89] Kanazawa, K. and Dean, T. A Model for Projection and Action, Proceedings of the International Joint Conference on Artificial Intelligence, Detroit, MI, p. 985-93.
[Kautz+Selman, 96] Kautz, H. and Selman, B. Pushing the Envelope: Planning, propositional logic, and stochastic search. Proceedings of the Thirteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press. p. 1194-201.
[Kushmerick, et al., 95] Kushmerick, N.; Hanks, S.; and Weld, D. An algorithm for probabilistic planning. Artificial Intelligence, 76:239-286. (An earlier version appeared as University of Washington Technical Report UW-CSE-93-06-03.)
[Kushmerick, et al., 94] Kushmerick, N.; Hanks, S.; and Weld, D. An algorithm for probabilistic least-commitment planning. In Proceedings AAAI-94, p. 1073-8.
[Littman, et al., 95a] Littman, M.; Cassandra, A.; and Kaelbling, L. Learning policies for partially observable environments: Scaling up. Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann. p. 362-70.
[Littman, et al., 95b] Littman, M.; Dean, T.; and Kaelbling, L. On the Complexity of Solving Markov Decision Problems. Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann.
[Littman, et al., 98] Littman, M.; Goldsmith, J.; and Mundhenk, M. The Computational Complexity of Probabilistic Planning, submitted to the Journal of Artificial Intelligence Research.
[Littman+Majercik, 97] Littman, M. and Majercik, S. Large-Scale Planning Under Uncertainty: A Survey. In Workshop on Planning and Scheduling for Space, pages 27:1-8, 1997.
[Lovejoy, 91] Lovejoy, W. S. A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28:47-66.
[McAllester+Rosenblitt, 91] McAllester, D. and Rosenblitt, D. Systematic Nonlinear Planning, Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim.
[McCarthy+Hayes, 69] McCarthy, J. and Hayes, P. Some philosophical problems from the study of artificial intelligence. In: Machine Intelligence, B. Meltzer & D. Michie (eds.), Volume 4, Edinburgh: Edinburgh University Press.
[McDermott, 82] McDermott, D. A temporal logic for reasoning about processes and plans. Cognitive Science 6, p. 101-55.
[McDermott, 91] McDermott, D. Regression Planning. International Journal of Intelligent Systems, 6, p. 357-416.
[Majercik+Littman, 98a] Majercik, S. and Littman, M. MAXPLAN: A new approach to probabilistic planning. To appear in AIPS-98.
[Majercik+Littman, 98b] Majercik, S. and Littman, M. Solving Larger Probabilistic Planning Problems by Forgetting. To appear in AAAI-98.
[Milani, 94] Milani, A. Splitting Multiple Situations in Conditional Planning. Proceedings of the Second International Conference on Artificial Intelligence Planning Systems. Menlo Park, CA: AAAI Press. p. 317-22.
[Moore, 85] Moore, R. C. A Formal Theory of Knowledge and Action. Formal Theories of the Commonsense World. Ablex.
[Morgenstern, 87] Morgenstern, L. Knowledge preconditions for actions and plans. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87), Milan, Italy. Morgan Kaufmann. p. 867-74.
[Papadimitriou, 94] Papadimitriou, C. H. Computational Complexity. Addison-Wesley, 1994.
[Parr+Russell, 95] Parr, R. and Russell, S. Approximating optimal policies for partially observable stochastic domains. Proceedings of the International Joint Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press.
[Pearl, 88] Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference. San Mateo, CA: Morgan Kaufmann Publishers.
[Pednault, 88a] Pednault, E. P. D. Extending conventional planning techniques to handle actions with context-dependent effects. Proceedings of the Seventh National Conference on Artificial Intelligence, Palo Alto, CA: Morgan Kaufmann. p. 55-9.
[Pednault, 88b] Pednault, E. P. D. Synthesizing plans that contain actions with context-dependent effects. Computational Intelligence. 4:4, p. 356-72.
[Pednault, 89] Pednault, E. P. D. ADL: Exploring the middle ground between STRIPS and the situation calculus. In Brachman, R. J. and Levesque, H. J., eds. Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, San Mateo, CA: Morgan Kaufmann. p. 324-32.
[Penberthy+Weld, 92] Penberthy, J. S. and Weld, D. UCPOP: A sound, complete, partial order planner for ADL. Proceedings Third International Conference on Principles of Knowledge Representation and Reasoning, p. 103-14.
[Peot+Smith, 92] Peot, M. and Smith, D. Conditional Nonlinear Planning. Proceedings of the 1st International Conference on AI Planning Systems. Menlo Park, CA: AAAI Press. p. 189-97.
[Peot+Smith, 93] Peot, M. A. and Smith, D. E. Threat-Removal Strategies for Partial Order Planning. Proceedings of the Eleventh National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press. p. 492-9.
[Pryor, 95] Pryor, L. Decisions, decisions: Knowledge goals in planning. In J. Hallam, ed. Hybrid Problems, Hybrid Solutions. Proceedings of The Tenth Biennial Conference on AI and Cognitive Science. Sheffield, England.
[Pryor+Collins, 93] Pryor, L. and Collins, G. Cassandra: Planning for contingencies. Technical Report 41, Northwestern University, The Institute for the Learning Sciences.
[Pryor+Collins, 96] Pryor, L. and Collins, G. Planning for contingencies: A decision-based approach. Journal of Artificial Intelligence Research. Vol. 4, p. 287-339.
[Puterman, 94] Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: John Wiley & Sons.
[Roth, 94] Roth, D. On the Hardness of Approximate Reasoning. Artificial Intelligence. 82(1-2):273-302.
[Russell+Norvig, 95] Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach, Upper Saddle River, NJ: Prentice Hall.
[Sani+Steel, 91] Sani, R. G. and Steel, S. Recursive Plans. Proceedings of the 1st European Workshop on Planning (EWSP 1991). Lecture Notes in Artificial Intelligence 522. Springer Verlag.
[Schoppers, 89] Schoppers, M. E., Representation and Automatic Synthesis of Reaction Plans, Report No. 89-1546, Department of Computer Science, University of Illinois at Urbana-Champaign, 1989.
[Shachter, 86] Shachter, R. D. Evaluating Influence Diagrams. Operations Research, 34 (November-December), p. 871-82.
[Shachter, 88] Shachter, R. D. Probabilistic Inference and Influence Diagrams. Operations Research, 36 (July-August), p. 589-605.
[Shachter, 90] Shachter, R. D. An Ordered Examination of Influence Diagrams, Networks 20, p. 535-63.
[Shachter, 97] Shachter, R. D. The Bayes-Ball Algorithm for Determining Irrelevancies and Requisite Information in Belief Networks and Influence Diagrams. Unpublished manuscript.
[Shachter+Peot, 92] Shachter, R. D. and Peot, M. A. Decision Making Using Probabilistic Inference Methods, Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan-Kaufmann, p. 276-83.
[Smallwood+Sondik, 73] Smallwood, R. D. and Sondik, E. J. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071-88.
[Smith+Peot, 93] Smith, D. E. and Peot, M. A. Postponing Threats in Partial-Order Planning. Proceedings of the Eleventh National Conference on Artificial Intelligence, Menlo Park, CA: AAAI Press. p. 500-6.
[Smith+Peot, 96] Smith, D. E. and Peot, M. A. Suspending Recursion in Causal-link Planning. Proceedings of The Third International Conference on Artificial Intelligence Planning Systems. Menlo Park: AAAI Press, p. 182-90.
[Smith+Williamson, 95] Smith, D. E. and Williamson, M. Representation and Evaluation of Plans with Loops. Working Notes of the AAAI Spring Symposium on Extended Theories of Action: Formal Theory and Practical Applications, Stanford, CA.
[Soderland+Weld, 91] Soderland, S. and Weld, D. S., Evaluating Nonlinear Planning, Technical Report 91-02-03, Department of Computer Science and Engineering, University of Washington.
[Sondik, 78] Sondik, E. J. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):85-93.
[Sproull, 77] Sproull, R. F. Strategy construction using a synthesis of heuristic and decision-theoretic methods, Ph.D. dissertation, Stanford University.
[Tatman+Shachter, 90] Tatman, J. A., and Shachter, R. D. Dynamic Programming and Influence Diagrams. IEEE Transactions on Systems, Man and Cybernetics. 20(2), p. 365-79.
[Waldinger, 77] Waldinger, R. J., Achieving Several Goals Simultaneously, Machine Intelligence 8, Chichester: Ellis Horwood Limited.
[Warren, 76] Warren, D. H. D., Generating Conditional Plans and Programs, in Proceedings of the Summer Conference on AI and Simulation of Behavior, Edinburgh.
[Warren, 74] Warren, D. H. D., Warplan: A System for Generating Plans, in Allen, J., Hendler, J., and Tate, A., eds., Readings in Planning, San Mateo, California: Morgan Kaufmann, 1990.
[Weld, 94] Weld, D. "An Introduction to Least Commitment Planning," AI Magazine, 15(4), p. 27-61.
[Wellman, 90] Wellman, M. Formulation of Trade-offs in Planning Under Uncertainty. San Mateo, CA: Morgan Kaufmann.
[Wellman+Doyle, 92] Wellman, M. and Doyle, J. Modular Utility Representation for Decision-Theoretic Planning. Proceedings of the First International Conference on Artificial Intelligence Planning Systems, p. 236-42.
[Wellman+Doyle, 91] Wellman, M. and Doyle, J. Preferential Semantics for Goals, Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim.
[Williamson+Hanks, 94] Williamson, M. and Hanks, S. Optimal planning with a goal-directed utility model. In Proceedings of the Second International Conference on Artificial Intelligence Planning Systems, Chicago, June 1994, p. 176-81.
[Williamson+Hanks, 96] Williamson, M. and Hanks, S. Flaw Selection Strategies for Value-Directed Planning. Proceedings of The Third International Conference on Artificial Intelligence Planning Systems. Menlo Park: AAAI Press, p. 237-44.
[Zhang+Liu, 96] Zhang, N. L. and Liu, W. Planning in Stochastic Domains: Problem Characteristics and Approximation. Technical Report HKUST-CS96-31, Department of Computer Science, Hong Kong University of Science and Technology.
[Zhang+Poole, 96] Zhang, N. L. and Poole, D. Exploiting Causal Independence in Bayesian Network Inference. Journal of AI Research, 5:301-28.
[Zhang+Yan, 97] Zhang, N. L. and Yan, L. Independence of Causal Influence and Clique Tree Propagation. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann, p. 481-8.
A. UDTPOP Proofs

A.1 Effective Support

Theorem 5 (Effective Support is Necessary for Plan Optimality): If plan P is optimal, then every action in P (except for S_IC and S_Goal) provides effective support.

Proof:
Suppose that some action S_u ∉ { S_IC, S_Goal } does not provide effective support. Then a plan P′ of strictly greater utility than P can be constructed by deleting S_u from the topological sort. The expected utility of the plan is the sum of the expectations of the individual subvalue nodes:

    V(P) = E(R) − Σ_{S ∈ P} E(C_S)

where

    E(C_{S_k}) = Σ_{X(S_k-)} C_{S_k}(X(S_k-)) P{ X(S_k-) }.

We can define these expectations over any topological sort { S_i }_{i=1}^n of the actions in the plan. Let P{ X(S_j+) | X(S_i+) } denote

    Σ_{X(S_{i+1}+), …, X(S_{j-1}+)} ∏_{k=i+1}^{j} P_k{ X(S_k+) | X(S_{k-1}+) }.

When k > u, then

    E(C_{S_k}) = Σ_{X(S_u+)} Σ_{X(S_{u-1}+)} C(X(S_k+)) P{ X(S_k+) | X(S_u+) } P{ X(S_u+) | X(S_{u-1}+) } P{ X(S_{u-1}+) }.
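Because the plan is totally ordered, these expectations can be evaluated by propagating the joint state distribution forward through the actions one step at a time. The sketch below is illustrative only: an "action" here is a dict holding a transition function and a state-cost function, data shapes that are assumptions for this example and not UDTPOP's actual structures.

```python
def propagate(dist, action):
    """Push a distribution over states through one action's transition model."""
    out = {}
    for state, p in dist.items():
        for nxt, q in action["transition"](state).items():
            out[nxt] = out.get(nxt, 0.0) + p * q
    return out

def plan_utility(initial_dist, plan, reward):
    """Compute V(P) = E(R) - sum_k E(C_{S_k}) for a totally ordered plan."""
    dist = dict(initial_dist)
    total_cost = 0.0
    for action in plan:
        # E(C_{S_k}) is the expectation of the action's cost over the
        # state distribution just before the action executes.
        total_cost += sum(p * action["cost"](s) for s, p in dist.items())
        dist = propagate(dist, action)
    expected_reward = sum(p * reward(s) for s, p in dist.items())
    return expected_reward - total_cost
```

An action that maps state x3 to { x1: 0.5, x2: 0.5 } at cost 0.1, evaluated against a reward of 1.0 on x1, yields V = 0.5 − 0.1 = 0.4 under this sketch.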
A conditional effect does not provide effective support if one of three cases holds:

1. the effect is not possible, e.g. P{ c } = 0;
2. the effect is possible, but is not effective; or
3. the effect is possible and effective, but is not pertinent.

We can rewrite the sum E(C_{S_k}) as a sum over the three subsets of X(S_u+) × X(S_{u-1}+) corresponding to these three cases (call them L_1, L_2, and L_3, respectively). We will show that the conditional effect of S_u can be replaced by a null effect without decreasing utility. That is, we will demonstrate that

    Σ_{L_1 ∪ L_2 ∪ L_3} C(X(S_k+)) P{ X(S_k+) | X(S_u+) } P{ X(S_u+) | X(S_{u-1}+) } P{ X(S_{u-1}+) }
      ≤ Σ_{L_1 ∪ L_2 ∪ L_3} C(X(S_k+)) P{ X(S_k+) | X(S_u+) } Δ{ X(S_u+) | X(S_{u-1}+) } P{ X(S_{u-1}+) }.

This relation holds for the terms in L_1 because these terms are all 0 (because P{ X(S_{u-1}+) } = 0). It also holds for the terms in L_2 because these terms just persist support from the previous action; passive conditional outcomes are null effects. The most interesting terms are those in L_3. A conditional outcome of S_u may be effective, but it cannot be an ancestor of cost function C_{S_k} (because it is not pertinent) and, therefore, cannot affect its expectation (because S_u is not in the N_π set of C_{S_k}). Thus, the expectation of C_{S_k} in P is unchanged whether S_u is present or is replaced by a null action.
The same reasoning applies to E(R), except for the terms in L_3. If some of the conditional effects of S_u are ancestors of R, then

    Σ_{L_3} R(X(S_{n-1}+)) P{ X(S_{n-1}+) | X(S_u+) } P{ X(S_u+) | X(S_{u-1}+) } P{ X(S_{u-1}+) }
      = R_min Σ_{L_3} P{ X(S_{n-1}+) | X(S_u+) } P{ X(S_u+) | X(S_{u-1}+) } P{ X(S_{u-1}+) }
      ≤ Σ_{L_3} R(X(S_{n-1}+)) P{ X(S_{n-1}+) | X(S_u+) } Δ{ X(S_u+) | X(S_{u-1}+) } P{ X(S_{u-1}+) }.

This inequality is true because the expectation of R is always greater than or equal to R_min.
Since the cost of S_u is strictly greater than zero (a UDTPOP domain restriction), the plan P′ = P − S_u has strictly greater utility than P.

A.2 Pertinence

Theorem 6 provides the justification for using relevant_P to select candidate actions for add-step and add-link. If a precondition is not relevant_P, then supporting it will not increase the utility of the plan.
Theorem 6 (Pertinent_P and Pertinence): If a plan is complete, relevant(e) ⇒ relevant_P(e) and possible(e) ⇒ possible_P(e).

Proof: possible(e) ⇒ possible_P(e) is obvious.

We will prove relevant(e) ⇒ relevant_P(e) by induction on a topological sort of the plan. We will show that if relevant(e) ⇒ relevant_P(e) holds for the preconditions of the last k actions in the plan, then it holds for the last k+1 actions. Let { S_i }_{i=1}^n be any topological sort of the plan.
Induction Basis (k = 0): Let e be a precondition of the goal action. If e is relevant, then P{ R > R_min, e } > 0 and e is relevant_P by definition.

Induction Hypothesis (k > 0): relevant(e) ⇒ relevant_P(e) is true for the preconditions of the last k actions in the topological sort.

Inductive Step: Let E = e be a precondition of S_{n-k}. If e is relevant, then either e is the precondition of a cost function (in which case it is relevant_P by definition) or e is the precondition of a conditional effect distribution that supports one or more causal links. In the latter case, for every causal link S_{n-k} →^V S_C there exists a conditional outcome such that V = v is relevant to S_C. (Why? If no such outcome had v pertinent to S_C, then P{ R > R_min, e } = 0.) Since relevant(v) ⇒ relevant_P(v) by hypothesis, relevant(e) ⇒ relevant_P(e), because there exists a conditional outcome such that V = v is relevant_P to S_C.

By induction, relevant(e) ⇒ relevant_P(e) for every precondition in the plan.
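The backward pass that this induction mirrors can be sketched as follows. The encoding is an assumption made for illustration: causal links are (producer, variable, consumer) triples, and each action maps to the set of precondition variables it needs. The sketch marks every precondition reachable backward from the goal, which is the set the induction argues is relevant_P.

```python
def relevant_preconditions(links, goal, preconds):
    """preconds: action -> set of precondition vars it needs.
    links: iterable of (producer, var, consumer) causal links.
    Returns the set of (action, var) pairs reachable backward from goal."""
    relevant = set()
    frontier = [goal]
    seen = set()
    while frontier:
        consumer = frontier.pop()
        if consumer in seen:
            continue
        seen.add(consumer)
        for var in preconds.get(consumer, ()):
            relevant.add((consumer, var))
            # Walk each link that establishes this precondition backward
            # to its producer, whose own preconditions become relevant.
            for (prod, v, cons) in links:
                if cons == consumer and v == var:
                    frontier.append(prod)
    return relevant
```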
A.3 Soundness

The following lemma is used to prove that UDTPOP plans have complete causal structure. A plan has complete causal structure if the belief network constructed using Model_CE contains all of the nodes that are pertinent to utility and the expected utility calculated from this belief network is identical to the expected utility realized from the Markov model for the plan.

Lemma 1 (The Plan Model and N_π): Say that one of the topological sorts of the n actions in the complete UDTPOP plan P is Q = { S_i }_{i=1}^n. Let M_m be the Markov model of Q. The plan model of P, M_p, is equivalent to

    N_π( R ∪ (∪_{i=1}^n C_i), M_m' ),

where M_m' is a modification of the Markov model that has the same distribution as M_m. Thus the utility calculated from M_p is identical to the utility calculated from M_m.

Proof:
A sketch of the proof:

• First, we will construct M_m' from the full Markov model M_m.

• Let M_π = N_π( R ∪ (∪_{i=1}^n C_i), M_m' ). By induction on the actions in Q, we will show that M_π is equivalent to M_p.

• Since these two belief networks are identical, it must be the case that the utility computed from M_p is equal to the utility computed from M_m'. This, in turn, implies that the utility calculated from M_p is identical to that computed from M_m.

First we will construct the graph M_m'. Let ⟨N_m, A_m⟩ denote the nodes and arcs in M_m, and let M_m' be the graph ⟨N_m, A_m'⟩. The arcs A_m' in M_m' are constructed from the arcs in M_m as follows. For each arc a in A_m:

• if arc a runs from X_i(S_{C-1}+) to P_{S_C, q}, and there is a causal link from S_E to S_C protecting X_i in P, then add an arc from X_i(S_E+) to P_{S_C, q} to M_m';

• otherwise, add a to M_m'.

The effect of this construction is illustrated in Figure 127. The predecessors of some conditional effect distributions are replaced by ones that are closer to their "source". This transformation allows some arcs to "bypass" an unnecessary string of persistence nodes (the Δ's).
The joint distributions of belief networks M_m and M_m' are identical. Since P is complete (no threats) and Q is a topological sort of P, there can be no action between S_E and S_C that has X_i in its postconditions. Thus the value of X_i remains unchanged from S_E+ through S_C-.

Now let M_π be the subgraph of M_m' that contains the nodes in N_π( R ∪ (∪_{i=1}^n C_i), M_m' ) and the arcs connecting those nodes. We will show that M_π is equivalent to M_p by induction. This rather complicated induction relies on the fact that both M_p and M_π consist of nodes that are either subvalue nodes or ancestors of subvalue nodes (the first, by design of the planning and plan modeling algorithms; the second, by the definition of N_π on networks with no evidence).
FIGURE 127. The transformation from M_m to M_m'.

Let M_π be the subgraph of M_m' derived by deleting all of the nodes from M_m' that are not in N_π( R ∪ (∪_{i=1}^n C_i), M_m' ). If arc a → b is in A_m' and both a and b are in N_π, then the arc is also in A_π.
We will show that M_p is equal to M_π by induction on the number of actions in Q. Let M_{p,j} and M_{π,j} denote the subgraphs of M_p and M_π that correspond to the last j actions of Q. We will also label the action j steps from the end of the plan S_j.

Induction Basis: When j = 0, both of these sets contain just the reward function R and are, therefore, equivalent.

Induction Hypothesis: M_{p,j} = M_{π,j}.

Inductive Step: We will show that M_{p,j+1} = M_{π,j+1}. In order to prove that two graphs are identical, we first show that the sets of nodes are identical and then show that the arcs are the same, too.

1. First we will show that N_{π,j+1} ⊆ N_{p,j+1}. The proof works as follows: if a node is in N_π, that implies facts about the structure of the plan, and these facts imply that UDTPOP will add a causal link from the node and Model_CE will model it.

If a node is in M_{π,j+1}, then it must be a cost function (CF), outcome variable (OV), or conditional effect distribution (CED). It cannot be a persistence variable Δ, because the persistence variable would have to be a predecessor of a node in M_{π,j}. Since M_{π,j} = M_{p,j} by hypothesis, Δ cannot itself be a predecessor of a persistence node (Model_CE cannot add them). This implies that it would have to be a predecessor of a modeled CED or CF. Since these nodes are modeled, there must be a causal link establishing their preconditions. This implies that Δ cannot be in M_{π,j+1}, since arcs from persistence variables to pertinent CEDs or CFs are replaced in M_m'.

a. CFs: Since S_{j+1} is in the plan, it must be effective. Therefore, it contributes a causal link to the last j actions in the topological sort. Model_CE explores all of the causal links that link to actions in the last j actions in the plan via depth-first search. When each action is explored for the first time, Model_CE adds all of its CFs; therefore every CF will be added to the plan model M_p.

b. OVs or CEDs: If an OV is in M_{π,j+1}, then it is a predecessor of some CED or CF in M_{π,j} (Theorem 9). Call this node P_k on action S_k. Since P_k is in M_{p,j}, it must either be a CF or a CED that is the source of a causal link. In either case, UDTPOP must establish the preconditions of P_k using causal links. Note that the OV must be the immediate successor of the CED P_{j+1} that is used to establish the causal link (by the construction of M_m'). When Model_CE models this causal link, both the CED and the OV are added to the model M_p if they weren't in it already. Thus N_{π,j+1} ⊆ N_{p,j+1}.

2. Now we will show that N_{p,j+1} ⊆ N_{π,j+1}. Say that z ∈ M_{p,j+1}; we will show that z ∈ M_{π,j+1}. There are three cases: 1) z can be a CF, 2) z can be an OV, or 3) z can be a CED.

a. CFs: all CFs are in M_π because they are all components of the utility function and are, therefore, pertinent to utility.

b. OVs: If z = X_i(S_E+) is in M_{p,j+1}, there must be a causal link from S_E to some S_C in the last j actions of Q. There is a link from z = X_i(S_E+) to a conditional effect P_{S_C} that must be in M_{π,j} = M_{p,j}. The predecessors of P_{S_C} are in M_π, so z ∈ M_π.

c. CEDs: If z is a conditional effect in M_{p,j+1}, it must condition one of the effect variables in case b.

By induction, N_π = N_p.

Now we will show that A_π = A_p. There is only one successor to each conditional effect node, and this arc is identified in the preceding section of the proof. We need to show that the successors of the OVs of M_π and M_p are the same.

Say that P_{S_C} is a successor (either a conditional effect or a subvalue node) of X_i(S_E+) in M_p. Then there must be a causal link between S_E and S_C protecting X_i. By construction, X_i(S_E+) → P_{S_C} is in M_m'. Both the head and the tail of this arc are nodes in N_π, so the arc is in A_π.

Say that P_{S_C} is a successor of X_i(S_E+) in M_π. X_i(S_E+) must be an effect variable of S_E, since X_i(S_E+) ∈ M_p. Since P_{S_C} ∈ M_p, there is some causal link from S_? to S_C protecting X_i. By construction, X_i(S_?+) → P_{S_C} is in M_m'. Since there can be only one predecessor of P_{S_C} corresponding to X_i, S_E = S_?. Therefore M_p = M_π.

The joint distribution of M_m is equal to the joint distribution of M_m'. M_π captures all of the structure required to compute the utility of M_m' by the definition of N_π. Thus the utility computed using M_p is the same as the utility computed using M_m.
A.4 Completeness

We will prove that UDTPOP is complete by using a "clairvoyant proof". A clairvoyant proof [McDermott, 91] uses a clairvoyantly-known exemplar plan to provide search control to the planning algorithm. This allows the planner to make all of the 'right' choices when duplicating the exemplar plan. Since the exemplar plan is an arbitrary plan satisfying the UDTPOP optimality criterion, UDTPOP is complete.

Before demonstrating completeness, we need to derive two "helper lemmas". The first lemma demonstrates that we can infer the causal structure of a sequence of actions. The second lemma introduces a new "clairvoyant pertinence" function that uses the internal structure derived through the use of the first lemma. We will rely on the clairvoyant pertinence function to select between add-support and persist-support during planning.
A.4.1 Identifying Causal Structure

Any sequence of actions can be "reverse-engineered" to reveal its underlying causal structure. The algorithm depicted in Figure 128 computes the set of causal influences on utility. In effect, it is constructing the M_π set of Lemma 1.

Discover_Links( Q = { S_i }_{i=1}^n, R ):
    L := ∅
    OCs := { ⟨V, S_Goal⟩ : V ∈ PreVars(R) }          // add the vars from R
    loop for i from n downto 1:                      // check for links from S_i
        OldOCs := ∅
        NewOCs := ∅
        loop for CE in CEs(S_i):
            loop for ⟨V, S_C⟩ in OCs:
                if V ∈ OutVars(CE) then:
                    // Found a match. Add the link and adjust OCs.
                    NewOCs := NewOCs ∪ { ⟨V_p, S_i⟩ : V_p ∈ PreVars(CE) }
                    OldOCs := OldOCs + ⟨V, S_C⟩
                    L := L + ( S_i →^V S_C )
        // Update the open conditions list.
        OCs := ( OCs \ OldOCs ) ∪ NewOCs
        // Add OCs for the action cost functions.
        OCs := OCs ∪ { ⟨V, S_i⟩ : V is a utility precondition of S_i }
    // Add links from the remaining OCs to the IC node.
    L := L ∪ { S_IC →^V S : ⟨V, S⟩ ∈ OCs }

FIGURE 128. Discover_Links: an algorithm for discovering the causal links pertinent to utility in Q.
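A runnable sketch of Discover_Links, under assumed data structures (these encodings are illustrative, not UDTPOP's own): each step is a (name, effects) pair whose conditional effects carry pre_vars and out_vars sets, and open conditions are tracked as (variable, consumer) pairs so that each discovered link records the step it supports.

```python
def discover_links(steps, reward_pre_vars, step_cost_pre_vars):
    """steps: list of (name, effects) in execution order; each effect is a
    dict with 'pre_vars' and 'out_vars' sets. Returns (producer, var,
    consumer) causal links, with unsatisfied conditions linked to 'IC'."""
    links = []
    ocs = {(v, "Goal") for v in reward_pre_vars}   # open conditions from R
    for i in range(len(steps) - 1, -1, -1):        # scan backward through Q
        name, effects = steps[i]
        matched, new_ocs = set(), set()
        for ce in effects:
            for (v, consumer) in ocs:
                if v in ce["out_vars"]:
                    # Found a match: record the link and open this CE's
                    # preconditions as new open conditions on this step.
                    new_ocs |= {(p, name) for p in ce["pre_vars"]}
                    matched.add((v, consumer))
                    links.append((name, v, consumer))
        ocs = (ocs - matched) | new_ocs
        ocs |= {(v, name) for v in step_cost_pre_vars.get(name, ())}
    links += [("IC", v, c) for (v, c) in ocs]      # remaining OCs from S_IC
    return links
```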
Lemma 2 (Identifying Causal Links): Say that Q is a sequence of actions { S_i }_{i=1}^n solving problem D. The Discover_Links algorithm discovers a set of causal links that, in conjunction with the actions in Q and the ordering constraints S_i < S_{i+1}, i ∈ { 1, …, n−1 }, specify a complete plan.

Proof: Discover_Links finds a causal link for every precondition of every cost function and every conditional effect that supports a causal link. Given the ordering constraints of the total order Q, there are also no threats: if an intervening action could threaten a link, Discover_Links would have discovered that fact and would have made it the establisher of the threatened link. There are no open preconditions and no threats in any topological sort of the plan; therefore this plan is complete.
A.4.2 The Clairvoyant Decision Policy and Pertinent_C

We will prove that UDTPOP is complete using a clairvoyant proof [McDermott, 91]. This proof style demonstrates completeness by showing that a clairvoyantly-known exemplar plan can be used to direct the planner to construct an equivalent plan. In this section, we introduce a new pertinence function pertinent_C and prove that if a precondition value is pertinent_C in the exemplar plan, then it will be pertinent_P when the planner is attempting to support this precondition. This allows us to assert that the planner can establish a link because the link is present in the exemplar plan. In order to do this, however, we need to restrict the ways that the planner can add links to the plan: if we use persist-support to establish a link that could have been established by add-link or add-step, then a precondition value that is pertinent_C in the exemplar may not be pertinent_P when the planner attempts to support the precondition.

First of all, we need to recognize that UDTPOP is not systematic. There are three ways to add a link to a plan: persist-support, add-step, and add-link. When actions have a combination of persistent and effective outcomes, there may be several ways to assemble the same plan by using various combinations of these three plan construction operations. Consider the two actions defined in Figure 129.
SA :
  Precondition:      X = x1   X = x2   X = x3
  Effect  X = x1:     1.0      1.0      0.5
          X = x2:     0.0      0.0      0.5
  Effect  Y = y1

SB :
  Precondition:      Z = z1   Z = z2   Z = z3
  Effect  X = x1:     1.0      0.0      0.0
          X = x2:     0.0      1.0      1.0

FIGURE 129. Action schemata for SA and SB.
Action S A has both persistent and effective outcomes that are pertinent to X and has an effective outcome that is pertinent to Y . Action S B has effective outcomes that are pertinent to X . Both X and Y are precondition variables of the goal action S Goal . The reward function of S Goal is 1.0 when ( X = x1 ) ∧ ( Y = y1 ) and is 0.0 otherwise. The action sequence [ …, S B, S A, S Goal ] can be assembled in two distinct ways.
Sequence 1:
1. Use add-step to add S A and the link S A →X S Goal .
2. Use add-step to add S B and the link S B →X S A .
3. Use add-link to add S A →Y S Goal .
Sequence 2:
1. Use add-step to add S B and the link S B →X S Goal .
2. Use add-step to add S A and the link S A →Y S Goal .
3. Use persist-support to resolve threat S A ⊗ ( S B →X S Goal ), resulting in two new links, S B →X S A and S A →X S Goal .
There is a problem with the second sequence of plan construction operations. If we insert S A into S B →X S Goal using persist-support, we can make more of the preconditions of S B pertinent P than were pertinent P before S A was spliced into S B →X S Goal . Immediately after S B is added to the plan, the set of pertinent values for precondition Z is { z1 } . After S A is spliced into S B →X S Goal , the set of values { z1, z2, z3 } is pertinent. If we attempt to resolve S B 's open condition before we use persist-support, then our clairvoyant algorithm cannot rely on the pertinent conditions of Z in the exemplar plan for the proper guidance for controlling action selection: the set of pertinent states in the plan may not be “large enough” at the time when we need to decide on the source for the causal link.
The solution to this problem is to use a special clairvoyant decision policy in conjunction with a new form of pertinence, pertinent C . pertinent P is insensitive to whether the effective outcomes that make a precondition pertinent are, themselves, possible. This means that many conditions are identified as pertinent P when they aren’t pertinent, that is, despite being pertinent P , they do not contribute to attaining the objectives of the plan. pertinent C is identical to pertinent P save that pertinence is only computed using the possible conditional outcomes of the final complete plan (the exemplar).
Definition 35 (PertinentC): Eliminate the conditional outcomes from a plan P if they are not possible in the exemplar plan P E . Call this modified plan P′ . pertinent C(V, S, P) ≡ pertinent P(V, S, P′) .
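Definition 35 can be read as a filter-then-evaluate computation. The sketch below uses hypothetical, simplified Step/Plan structures and a stand-in pertinent_p; it only illustrates the shape of the definition: prune the conditional outcomes that the exemplar makes impossible, then run the ordinary pertinence computation on the pruned copy.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    outcomes: list  # (precondition_value, effect_value) pairs

@dataclass
class Plan:
    steps: dict     # name -> Step
    def copy(self):
        return Plan({n: Step(s.name, list(s.outcomes))
                     for n, s in self.steps.items()})

def pertinent_p(step):
    # Stand-in for the real pertinent_P computation: a precondition value
    # is pertinent if some remaining conditional outcome mentions it.
    return {pre for (pre, _) in step.outcomes}

def pertinent_c(step_name, plan, is_possible):
    # Definition 35: delete conditional outcomes that are impossible in the
    # exemplar, producing P', then compute pertinent_P on P'.
    pruned = plan.copy()
    for s in pruned.steps.values():
        s.outcomes = [o for o in s.outcomes if is_possible(o)]
    return pertinent_p(pruned.steps[step_name])
```

Echoing the SB example above: if only the effect X = x1 is possible in the exemplar, pertinent_c shrinks SB's pertinent precondition set to { z1 } while pertinent_p would report all three values of Z.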
This clairvoyant decision policy guarantees that if a precondition value is pertinent C in the exemplar plan, then it will be pertinent P when the planner decides how to link to the precondition. The strategy:
• If any conditional outcome of a potential establisher S provides effective support to a pertinent C precondition, then use add-link or add-step to establish the causal link from S to the target precondition.
• Otherwise, use persist-support to resolve the threat that arises from not linking to S, and support the target precondition using an earlier action that satisfies the pertinence condition listed above.
The following theorem justifies our use of pertinent C in UDTPOP.
Lemma 3 (Clairvoyant Decision Policy): Assume that we make causal link decisions using the clairvoyant decision policy described above. The clairvoyantly-known exemplar plan that we use to guide search is P E . The partial plan that we are attempting to complete using this guidance is P R . If a precondition V = v is pertinent C in P E and there is an open condition in P R , then V = v is pertinent P in P R .
Proof
Define per C(V, S, P E) to be the set of values for V that are pertinent C in the exemplar plan. We will prove by induction on the trace of plan construction operations that per C(V, S, P E) ⊆ per P(V, S, P R) .
Let per C, k(V, S) be the set of pertinent C values for precondition V collected k plan construction steps after S is first added to the plan. per C, 0 is the set of values that are pertinent C when S is added to the plan. We will show that per C, k + 1(V, S) ⊆ per C, k(V, S) for all k ≥ 0 .
Pertinence is defined in terms of the actions that are descendants of S in the graph of causal links. Plan construction operations that affect actions that are non-descendants of S cannot affect per C .1 Thus, for most plan construction operations in the trace, per C, k + 1(V, S) = per C, k(V, S) . The only two operations that can affect per C are add-link
and persist-support.
Case 1: Add-link
A precondition is pertinent P when it is the precondition of a conditional outcome that is pertinent P to every outgoing causal link. The set of conditional outcomes that are pertinent P after a causal link is added is a subset of the original conditional outcomes because each of these outcomes has to be pertinent P to all of the other outgoing causal links in addition to the new one. Therefore, the set of preconditions that are pertinent P after add-link is necessarily a subset of the set that were pertinent P before the add-link: per C, k + 1 ⊆ per C, k , since ( A ⊆ B ) ⇒ ( A ∩ C ⊆ B ∩ C ) . Since the series of pertinence sets per C, k for this precondition is monotonically inclusive, the pertinence sets for causal predecessors2 of this action are monotonically inclusive as well.
Case 2:
Persist-Support
Say that persist-support is used to resolve the threat S A ⊗ ( S →V S E ) . Ultimately, the effective outcomes of S A that are pertinent to V will be impossible. Thus the set per C(V, S A) of S A 's pertinent values for V must be a subset of per C(V, S E) . Therefore per C, k + 1(P, S) ⊆ per C, k(P, S) for all of the preconditions of S . The preconditions of S A are also monotonically inclusive, for the same reason given in Case 1: for S A , persist-support is analogous to add-link.
1. Remember that pertinent P is a function only of nodes that are causal descendants of the target node. Remember also that the set of states that are possible is not a function of the plan construction operations; this is a value that is known clairvoyantly.
2. We will say that an action is a causal predecessor of another action if there is a directed chain of causal links from the first action to the second action.
At every point in time during plan construction, the sets of precondition values that are pertinent C are monotonically inclusive, i.e. per C, k + 1(V, S) ⊆ per C, k(V, S) . It is always the case that per C(V, S, P R) ⊆ per P, k(V, S, P R) where P R is a partial plan. By induction, it must also be the case that per C(V, S, P E) ⊆ per P(V, S, P R) . This implies that when a precondition is pertinent C in the exemplar plan, it is also pertinent P from the time that the first open condition is added to this precondition until the plan is complete.
A.4.3
Completeness
Now we can actually prove completeness.
Theorem 11 (Completeness): Let Q be a sequence of actions that is an optimal solution to the planning problem D. The cost functions for the possible actions in A are all greater than zero. UDTPOP with the appropriate search control strategy can identify a plan P′ that has the same topological sort Q.
Proof
This proof is modelled on Penberthy and Weld’s [92] excellent completeness proof for UCPOP. The proof proceeds by induction. Let IC k = Result({ S i } for i = 1, …, n – k + 1) and let D k be a series of planning domains with initial conditions given by IC k . Let S ICk be an initial conditions action with conditional effect IC k . We will show that UDTPOP can use the “planner execution trace” for the derivation of P k solving D k in order to generate a plan P k + 1 solving D k + 1 . Thus UDTPOP with the appropriate search control can find the plan P n = P′ solving D = D n that has the topological sort { S i } for i = 1, …, n .
Basis ( k = 2 ) : { S IC2, S n } is a plan that consists of only two actions: the initial conditions action S IC2 with conditional effect IC 2 , and the goal action S n = S Goal . Since there are no other actions in this plan, and an action cannot threaten itself, the only applicable plan construction operation is add-link. We can derive a two-action plan P 2 ′ that is identical to P 2 by adding links from S IC2 to S Goal for all of the open variables in S Goal .3
Inductive Hypothesis ( k > 2 ) : Given the totally ordered plan { S ICk, S n – k + 2, …, S n } , UDTPOP can find a plan P k solving D k .
Inductive Step: Given the inductive hypothesis and the totally ordered plan { S ICk + 1, S n – k + 1, …, S n } , we will show that UDTPOP can find a plan P′ k + 1 solving D k + 1 . P k is very close to the actual solution for D k + 1 except that the first action S ICk in P k serves the same purpose as the initial two actions S ICk + 1 and S n – k + 2 in P k + 1 . Our job is to derive a strategy for reassigning the links that originate in S ICk in plan P k to S ICk + 1 and S n – k + 2 .
First we will define notation for two properties of causal links:
R ICk + 1(S ICk →V S C) means that S ICk + 1 has an effective outcome that is pertinent C to variable V of action S C . We can determine whether the action is pertinent C because we can always infer the causal structure of the exemplar sequence Q using Lemma 2.
R n – k + 2(S ICk →V S C) means that S n – k + 2 has an effective outcome that is pertinent C to variable V of action S C .
V n – k + 2(S ICk →V S C) means that V is an outcome variable of S n – k + 2 .
3. S IC (and therefore S ICk ) has a CED over every outcome variable used in the plan.
Using this notation, we categorize the causal links into four mutually exclusive and collectively exhaustive categories4:
1. L 1 = { L | R n – k + 2(L) }
2. L 2 = { L | R ICk + 1(L) ∧ ¬V n – k + 2(L) }
3. L 3 = { L | R ICk + 1(L) ∧ V n – k + 2(L) ∧ ¬R n – k + 2(L) }
4. L 4 = { L | ¬R ICk + 1(L) ∧ ¬R n – k + 2(L) }
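The four categories amount to a simple dispatch on the three link predicates. The sketch below is illustrative: r_ic, r_n, and v_n stand in for R ICk+1, R n–k+2 and V n–k+2, with the footnoted assumption that r_n(L) implies v_n(L).

```python
def categorize(links, r_ic, r_n, v_n):
    """Partition causal links into the four mutually exclusive,
    collectively exhaustive categories from the completeness proof.
    Assumes r_n(L) implies v_n(L)."""
    cats = {1: [], 2: [], 3: [], 4: []}
    for L in links:
        if r_n(L):
            cats[1].append(L)      # relink to S_n-k+2 via add-step/add-link
        elif r_ic(L) and not v_n(L):
            cats[2].append(L)      # relink to S_ICk+1; nothing can threaten it
        elif r_ic(L):              # here v_n(L) holds and r_n(L) does not
            cats[3].append(L)      # relink to S_ICk+1, resolve via persist-support
        else:                      # neither r_ic(L) nor r_n(L): L4
            cats[4].append(L)      # provably empty in the proof
    return cats
```

Every link falls into exactly one bucket, which is what the reassignment strategy below relies on.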
First of all, note that there must be at least one link in L 1 . Q is the optimal plan; therefore S n – k + 2 must provide effective support for a pertinent condition of one of the actions that occurs after S n – k + 2 . Add-step will be called to add this action in order to add the first link in L 1 .
L 4 is empty. If neither S n – k + 2 nor S ICk + 1 is pertinent to the consumer of a link L , then S ICk won’t be pertinent to the consumer of L either. This means that the link L would not
have been added to P k in the first place.
Let T k be the execution trace of the planner. The execution trace is a record of the choices that the planner made while deriving P k . In order to construct P k + 1 , use this execution trace to construct the links, add actions, and resolve threats, except when add-link is called to add a link to S ICk . In this case, we will do one of the following:
1. If add-link would have added a link in the set L 1 , then we will add a link to S n – k + 2 instead. The first time that we add this link, we will call add-step to add S n – k + 2 to the plan. No other action can threaten this new link because the link was not threatened in P k . Let L k be the set of all of the links in P k . S n – k + 2 may threaten one of the links in L k – ( L 1 ∪ L 2 ∪ L 3 ) , but we know that we can immediately resolve these threats through demotion because S n – k + 2 can occur before the establisher of all of these threatened links (because S n – k + 2 precedes them in the topological sort).
2. If add-link would have added a link in the set L 2 , then we will add a link to S ICk + 1 instead, using the add-link operation. No action can threaten this link because the original link was not threatened in P k and the variable protected by the link is not an outcome variable of S n – k + 2 .
3. If add-link would have added a link in the set L 3 , then we will add a link L new to S ICk + 1 instead, using the add-link operation. S n – k + 2 threatens L new , but we know that we can resolve this threat by using persist-support to splice S n – k + 2 into L new at any point in time after both S n – k + 2 and L new have been added.
After the execution trace has executed as we have described, the only remaining work is to resolve the open conditions introduced by the addition of S n – k + 2 . add-link is used to add a link between the initial conditions action S ICk + 1 and S n – k + 2 for each open condition of S n – k + 2 . Let’s make sure that we can do this for each conditional effect CE of S n – k + 2 . There are two possibilities:
• Say that an outcome variable of CE is protected by a link added either by add-link or add-step. This means that at least one of the outcomes of CE is pertinent; therefore, there are pertinent preconditions for CE . Now by Theorem 6, if all of these pertinent preconditions are impossible, then P { R > R min } = 0.0 , contradicting one of the restrictions of the theorem. Therefore, there must be outcomes in S ICk + 1 that are pertinent (and pertinent C to CE ).
• Alternatively, CE may have been spliced into a pertinent link by persist-support. We know that persist-support must persist at least one pertinent outcome from S ICk + 1 , again because of Theorem 5. Thus, one of the persistent conditional effects of CE is both possible and pertinent. The preconditions of this conditional effect can be linked into S ICk + 1 .
4. Remember that R n – k + 2(L) ⇒ V n – k + 2(L) .
We have just shown that we can use P k with its topological sort to derive a plan for P k + 1 . We have shown that we can derive a plan for P 2 . By induction, we can derive a plan for P′ = P n ; therefore the theorem is true. UDTPOP is complete.
A.5
Admissible Upper Bound
This proof is presented out of order in the text. In order to verify that the final plan model is correct, we will rely on Lemma 1 and on the completeness proof.
Theorem 8 (Upper Bound): If there is only one source of support for every precondition in plan P , EvaluateUB( P ) can compute an upper bound UB(P) on expected utility, and this bound is admissible. When the plan is complete, the expected value of the plan V(P) and the upper bound UB(P) are equal.
Proof
Lemma 1 proves the second half of this theorem. When there are no open conditions, no interval probabilities are introduced into the plan model, and the plan model reduces to a normal (non-interval) belief network that exactly models the utility of the plan.
In order to prove the first part of the theorem, we will argue that the interval belief network M I constructed by Model_CE is a superset of an active set for a belief network M Q that would be constructed to model any completion of the plan that contains the same threats (possibly more) as the current plan, but no open conditions other than those that might be introduced by persist-support. Threat resolution on this plan can never increase the expected utility of the underlying plan model. Since the probability distribution over utility outcomes for the reward utility function and each cost function in M Q is bounded by the interval belief network M I , and the utility calculated using M Q is at least the utility of a “de-threated” completion of the plan, EvaluateUB is admissible.
The proof: We know that the only way that the pre-existing structure of the plan graph (the existing causal links) can be changed is through persist-support. We will, therefore, forget about using persist-support until the plan is otherwise complete. Say that our current partial plan is P . Given any completion of the plan P′ , we can find a quasi-complete plan Q′ that is complete except for threats that we plan to resolve via persist-support. Such a plan can be found as follows: if we need to add a causal link to support an open precondition in this graph, we will search backwards over the set of causal links S i →V S j that support it in the exemplar plan P′ to find the latest action that provides effective support for it. We know that there must be one step that provides this support because, at the very least, the initial conditions contain only effective conditional outcomes. All of the later actions in the causal link chain that provide only passive support will be linked via persist-support. This situation is illustrated in Figure 130.
FIGURE 130. Constructing Q. Using the exemplar plan (top), find the earliest step S E that provides effective support for S , that is, the step that could link to S via add-link or add-step. S 1 and S 2 will threaten link S E →V S , but we know that we can resolve these threats in the future via persist-support.
The plan model M Q for this quasi-complete plan has all of the arcs and conditional effect distributions of M I plus additional arcs and conditional effect distributions for the actions that support the open conditions in the plan. To simplify matters in the following proof, we will refer to the conditional effect distributions (non-vacuous distributions) as CEDs, vacuous likelihood distributions as LDs, and vacuous prior distributions as PDs. The CEDs and arcs of M I are all contained in M Q : arcs may only be removed by persist-support, and there is no way to remove a CED. Let M A consist only of the CEDs and the arcs between CEDs in M I . M A is an active set of M Q . We have to hang LDs and PDs off of this graph (forming graph M AI ) in order to evaluate it via LPE. We will show that each of the LDs and PDs in M AI is also in M I , the belief network constructed by Model_CE. First, we will prove the subset relationship for PDs. If there is a PD X pointing to node N in M AI , it must be because there is an inbound arc into N that is not in the active set M A . This must correspond to an open condition in P .5 Open conditions are modeled using PDs in Model_CE, so X is also in M I . Now we will show that all of the LDs in M AI are also in M I . If an LD X and arc ( N, X ) is in M AI , one or more of the outbound arcs of M Q must be missing from the active set M A . These arcs must run to belief network components that are rooted in belief network nodes that are the predecessors of either the utility node, a subvalue node that is in M A , or a subvalue node that hasn’t been added to M A .6 In any case, these missing nodes or arcs cannot be added to the plan model without using add-link or add-step from one of the open conditions in P . This means that the action that contains N must be executed before the action
5. Because if there were a link establishing the precondition corresponding to N , then Model_CE would have found a CED predecessor for N and this node (with its arc) would be in M A instead of the PD.
6. Because the missing nodes must be in the N π sets of the utility nodes.
containing that open condition. Model_CE will add an LD to any belief network node if it corresponds to an action that can be executed before an open condition; therefore LD X will be in M I . The belief network M I consists of M AI plus possibly some extra LDs. These LDs can only increase the bounds in M I , so all of the probability intervals for the nodes in M AI are contained in the equivalent bounds in M I . We also know that the probability distribution for any node in M Q is in the probability bound for the same node in M I (from the proof of correctness for LPE [Draper, 96]). We know that threat resolution actions like promote and demote have no impact on the utility calculated from the plan model. Now we will show that applying persist-support to plan Q can never increase the utility of the plan model. In order to do so, we will rely on Theorem 7: we can always find a threat S T ⊗ ( S E →V S C ) to resolve such that either
1. S E →V S T is in Q , or
2. there is no link S →V S T in Q (but neither is open).
Case 1: S E →V S T is contained in Q . The belief networks in Figure 131 represent the before and after situations. V + and V - represent the outcomes of steps S T and S E , respectively. These variables can influence the utility U indirectly (through delta functions Δ 1 and Δ 2 )7 or directly (via arc ( V -, U ) before persist-support and via arc ( V +, U ) after persist-support).
7. The delta functions are present to avoid the awkwardness of having two arcs connecting the same two variables.
FIGURE 131. Belief networks for Case 1. Dark circles represent groups of nodes.
The expression for the utility of the plan before persist-support is:

∑_{V -, V +, C, Δ 1, Δ 2, u} u P { u | V -, C, Δ 1, Δ 2 } P { V + | C, V - } P { Δ 1 | V - } P { Δ 2 | V + } P { V -, C }    (7)

After persist-support, the utility of the plan is:

∑_{V -, V +, C, Δ 1, Δ 2, u} u P { u | V +, C, Δ 1, Δ 2 } P { V + | C, V - } P { Δ 1 | V - } P { Δ 2 | V + } P { V -, C }    (8)
Term by term, the terms on the inside of the sum in Equation 8 are less than or equal to the terms in Equation 7. Divide the joint states of V + , V - and C into three non-overlapping subsets. In subset 1, P { V -, C } = 0 (the combination is not possible). In subset 2, P { V -, C } > 0 but P { u > u min | V +, C, Δ 1, Δ 2 } = 0 (the combination is possible, but not pertinent). In subset 3, P { V -, C } > 0 and P { u | V + } > 0 .
All of the terms in subset 1 are equal to zero for both Equations 7 and 8 because P { V -, C } = 0 .
For the terms in subset 2:

∑_{V -, V +, C, Δ 1, Δ 2, u} u P { u | V +, C, Δ 1, Δ 2 } P { V + | C, V - } P { Δ 1 | V - } P { Δ 2 | V + } P { V -, C }
= ∑_{V -, V +, C, Δ 1, Δ 2} u min P { V + | C, V - } P { Δ 1 | V - } P { Δ 2 | V + } P { V -, C }
≤ ∑_{V -, V +, C, Δ 1, Δ 2, u} u P { u | V -, C, Δ 1, Δ 2 } P { V + | C, V - } P { Δ 1 | V - } P { Δ 2 | V + } P { V -, C }
For the terms in subset 3, it must be the case that V + is equal to V - in Equation 8 (because S T cannot be effective), thus

∑_{V -, V +, C, Δ 1, Δ 2, u} u P { u | V +, C, Δ 1, Δ 2 } P { V + | C, V - } P { Δ 1 | V - } P { Δ 2 | V + } P { V -, C }
= ∑_{V -, V +, C, Δ 1, Δ 2, u} u P { u | V -, C, Δ 1, Δ 2 } P { V + | C, V - } P { Δ 1 | V - } P { Δ 2 | V + } P { V -, C } .
Case 2: The analysis for Case 2 is similar except that there is the possibility that more actions might be added in order to support the open precondition of S T . Adding actions always reduces utility (UDTPOP domain restriction). The conclusion: persist-support can never increase utility.
Putting it all together: after resolving all threats in a completion, the utility of the completion will be the same or lower than before the threats were resolved (plan model M Q ). The upper bound on the utility calculated from the model of the partial plan M I is greater than or equal to that calculated from M Q ; therefore the upper bound calculated from M I is greater than or equal to the utility of this arbitrary completion. Therefore, EvaluateUB is admissible.
B.
DTPOP Proofs
B.1
Notation
[ A i ]_{i = 1}^{M} = [ A ] and [ B ] are sequences. The concatenation of these two sequences is [ A, B ] .
f C denotes the step execution policy for S C . f C is a function from the literals (values) corresponding to the values of observations made prior to the execution of S C . These observations include values for all of the observable conditional effects in [ S 1, ..., S C-1 ] as well as the choice made for all step execution decisions prior to S C . If the argument to f C is a larger sequence of observations, the observations after S C-1 are ignored: if [ o i ]_{i = 1}^{N} is the sequence representing all possible observations in the plan, f C([ o i ]_{i = 1}^{N}) = f C([ o i ]_{i = 1}^{C – 1}) , where o C-1 is the last observation made before the step execution decision for S C .
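The truncation property above can be illustrated with a tiny sketch. The table-based policy representation and the function name are hypothetical; the point is only that f C keys its decision on the first C – 1 observations and therefore gives the same answer no matter how many later observations are appended.

```python
def make_policy(C, table):
    """Build a step execution policy f_C for step S_C. Per the definition,
    the policy may depend only on observations made before S_C executes,
    so any later observations in the argument are ignored."""
    def f_C(observations):
        # keep only the observations available before the decision for S_C
        return table[tuple(observations[:C - 1])]
    return f_C
```

Appending observations beyond o_{C-1} cannot change the decision, which is exactly the identity f_C([o_i]_{i=1}^{N}) = f_C([o_i]_{i=1}^{C-1}).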
B.2
Eliminating Decisions
Theorem 17 (No Intervening Observation): S j and S k are two actions in the topological sort T = [ S i ]_{i = 1}^{n} for plan P . S j and S k are linked by a causal link S j →V S k . If there are no observation actions in [ S j, S j+1, ..., S k-1 ] and the reasons for S j and S k are the same, then the execution policies for S j and S k are the same.
Proof
Suppose that the execution policies for S j and S k are f j and f k , respectively. Let O + be the set of observation combinations for the observable actions in the plan that have a probability greater than zero. Because of the single support restriction, ∀( o ∈ O + ), f k(o) → f j(o) . ∀o, f j(o) → f k(o) as well. If the reasons for the two steps are the same, then if it is impossible to achieve the goals in the reason, we should execute neither action. Not executing S j makes it impossible to achieve the goal in S k ’s reason; therefore the theorem follows.
Theorem 18 (Equivalence Classes): Let P be a partial order generated from topo_classes. The expected value of every plan model generated for every topological sort consistent with P is the same.
topo_classes only adds between compatible steps that are observable and between
observable steps and steps that are not observable. The only compatible steps that are not ordered are steps that are not observable. Swapping the order of decisions cannot change the expected value of a decision problem if there is no intervening observation.
B.3
Observation Relevance
Theorem 19 (Active Subgraphs on Information Relevance Networks): If A is probabilistically relevant to B given C and some value for the decisions D , then there is an active subgraph of the information relevance network supporting this relation that consists entirely of nodes that are mutually compatible.
Proof
Say that the minimal active subgraph has some node A that is not compatible with
some other node B. Given D, either A or B is functionally determined (to be na) and is therefore independent of its neighbors. Such a node cannot support an active path [Geiger, et al, 90], so the minimal active subgraph is not, in fact, minimal (a contradiction).
Theorem 19 (Observation Relevance is NP-complete): The following problem is NP-complete: given an information relevance network for value node V i , determine whether O is an open uncertainty.
Proof
We will show that observation relevance in information relevance networks is at least as hard as 3SAT by reduction. We can encode the disjunctive clauses of a 3SAT problem with the following gadget [Papadimitriou, 94]: each of the three propositions j = 1, 2, 3 in each 3SAT disjunction k is modeled by three steps, S1,k, S2,k, and S3,k, each of which has an observable outcome that has two preconditions and another “nuisance” precondition representing the literal. Two additional steps, S0,k-1 and S0,k, provide causal links to each of the Sj,k. For each of the literals l (variables) in the 3SAT problem, we add two steps, S¬l and Sl, related by a mutual exclusion constraint { S¬l, Sl }⊥ . If literal l is in disjunction k, then there is an arc from Sl to the “nuisance” precondition of one of the steps S1,k, S2,k, or S3,k. If literal ¬l is in disjunction k, then this arc originates from S¬l.
FIGURE 132. Gadget for Representing a Disjunct in 3SAT. The figure represents the disjunct a ∨ ¬b ∨ c .
FIGURE 133. Representation for 3SAT. The chain of gadgets encodes ( a ∨ ¬b ∨ c ) ∧ ( b ∨ ¬c ∨ z ) ∧ … ∧ ( l 1, n ∨ l 2, n ∨ l 3, n ) .
We will connect the gadgets for each disjunction through the steps S0,k as shown in Figure 133. Once we have constructed this structure, we can ask the question, “Is S0,0 relevant to S0,n?” The answer is true only if we can make choices for each of the literal mutex constraints that will enable an active path [Geiger, et al, 90] from S0,0 to S0,n. Therefore, Observation Relevance is NP-hard. We can show inclusion in NP by picking values (execution policies) for the literals and confirming the presence of an active path in linear time [Shachter, 1998].
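The NP-membership certificate check can be sketched as follows. The clause representation is a simplification of the gadget structure, not the network itself: a chosen execution policy for each Sl / S¬l pair plays the role of a truth assignment, and each gadget admits an active segment exactly when some literal of its clause is satisfied.

```python
def gadget_enabled(assignment, clause):
    """A clause is an iterable of (variable, required_truth_value) pairs;
    its gadget is enabled iff the chosen policies make some literal true."""
    return any(assignment[var] == val for var, val in clause)

def relevant(assignment, clauses):
    """S_0,0 is relevant to S_0,n iff every gadget along the chain is
    enabled, i.e. the assignment satisfies every clause (linear time)."""
    return all(gadget_enabled(assignment, c) for c in clauses)
```

Checking a candidate assignment against all clauses is linear in the size of the formula, which is the certificate check used for inclusion in NP.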
B.4
Soundness
DTPOP is sound, but it can produce some plans that are of relatively low quality. It is possible for DTPOP to generate optimized plans with
• irrelevant observations,
• “aborted” branches with underestimated utility, and
• ineffective steps.
We can change the completion criterion for DTPOP to ‘artificially’ prune these plans. DTPOP can generate plans with irrelevant observations because it cannot tell whether it is worth pursuing an observation until the plan optimization phase. At that point, it may discover that the outcome of the observation is not used to condition a decision policy. If the observation is terminal, it will have a non-contingent ¬execute policy.
I allow DTPOP to abort a plan if every branch of the plan seems to have an expected utility worse than the worst outcome of the reward utility function. If a plan is aborted, the reward function is assumed to be zero, but the plan may, coincidentally, have a pretty good outcome. It is relatively easy to enhance the utility modelling component of DTPOP to compute the utility of the plan at each point where the plan might be aborted, but this computation adds a lot of extra structure to the plan model, increasing the time required for evaluation. I think that this is unnecessary. The planning mechanism is guaranteed to find a good plan branch that correctly models the “aborted plan.” The best plan in the search space always has the correct utility.
Finally, DTPOP can develop plans that have ineffective steps. This is a consequence of delaying computation of the step execution policies. The planning algorithm essentially uses UDTPOP to craft each contingent branch. UDTPOP guarantees that each branch will be effective, at least some of the time. The mechanism cannot guarantee that each step will be effective after the plan is optimized.
Definition 36 (Admissible Decision Policy): The execution policy for the steps in a plan is said to be admissible iff the following holds:
• If S j →V S k is a causal link in the plan, then the execution policy for S k is ¬execute whenever the execution policy for S j is ¬execute,1 and
1. This is the same as the single support restriction mentioned on page 150.
• The execution policies for all of the steps in the plan are consistent with the
mutex constraints.
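Definition 36 reduces to two checks over a candidate policy. The sketch below uses a hypothetical representation (a dict from step name to an execute flag, links as producer/consumer pairs, mutexes as pairs); it is illustrative only.

```python
def admissible(policy, links, mutexes):
    """Definition 36, sketched: a policy mapping step names to True
    (execute) / False (not execute) is admissible iff
    (1) every consumer of a causal link is shut off whenever its
        producer is, and
    (2) no mutually exclusive pair of steps is jointly executed."""
    for j, k in links:                  # causal link S_j -> S_k
        if policy[k] and not policy[j]:
            return False
    for a, b in mutexes:                # mutual exclusion constraint
        if policy[a] and policy[b]:
            return False
    return True
```

The first loop is the single-support restriction mentioned in the footnote; the second enforces consistency with the mutex constraints.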
Definition 37 (Terminated Trace): A terminated trace is an admissible sequence of
step execution decisions that includes the decision to execute a goal step. Definition 38 (Aborted Trace): An aborted trace is an admissible sequence of step
execution decisions where every goal execution decision is ¬execute. Definition 39 (Supported Conditional Effect or Cost Function): If
there is a causal link for each precondition of a conditional effect or cost function, that conditional effect or cost function is supported.
Definition 40 (Correct Causal Structure): A contingent plan has correct causal structure for a given set of decisions, if for any topological sort, the probability distribution over the outcomes of supported conditional effects or cost functions is identical to that computed from the Markov model of the plan. Definition 41 (Weakly Effective): Every step in a contingent plan is effective given
some admissible decision policy.
Theorem 20 (Soundness of Plan Model Construction): A
complete DTPOP plan is weakly effective and has correct causal structure for any admissible decision policy. For aborted traces, DTPOP can underestimate plan utility.2
2. This criterion for completeness resembles that of Goldman+Boddy [94c]. Their criterion for soundness in Plinth is (in part) that every uncontingent branch in a contingent plan be “well-formed.” For DTPOP, well-formed means that the branch has “correct causal structure” and is “effective.” If the former property is true, we can show that DTPOP can correctly compute the probability of any state and, therefore, we can use the models for the branches in a decision-making procedure in order to come up with the optimal plan. For example, the Cooper procedure [Cooper, ***] for solving influence diagrams does not require that more than one branch be considered at one time; with Cooper’s transformation and algorithm, a probabilistic evaluation procedure of the same complexity as that used for UDTPOP is sufficient to build a full decision-theoretic evaluator.
Proof

(Weak Effectiveness) DTPOP uses the UDTPOP mechanism for maintaining the effectiveness of plan branches that achieve goals. If a step supports only an observation plan, it will always be effective because all of its outcomes are pertinent. One of the admissible execution policies is to unconditionally execute every step in a particular plan branch; therefore, every step is effective for some admissible execution policy.

(Correct Causal Structure) Given any set of execution decisions, every supported conditional effect and value function has satisfied preconditions and no threats to “inbound” causal links. Given any topological sort of the steps, we can show, by an induction similar to the proof of correct causal structure for UDTPOP, that the joint probability of the supported conditional effects in the plan is correct.

(Underestimate for Aborted Traces) An aborted trace corresponds to a UDTPOP plan without a goal reward function. Since all of the outcomes of the goal reward function are positive or zero, the utility estimated by DTPOP is an underestimate. If a goal reward function is executed for a particular trace, the utility of the goal is correctly computed and the estimate is exact.
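The underestimate claim can be checked on a toy example. Every number below is invented for illustration; the only property carried over from the proof is that all goal reward outcomes are nonnegative, so dropping the reward term can only lower the estimate:

```python
def expected_utility(step_cost, p_goal, goal_reward):
    # Expected utility = (expected execution cost) + P(goal) * goal reward.
    return step_cost + p_goal * goal_reward

# Hypothetical plan: total step cost -3, goal achieved with probability 0.5,
# reward 10 on success (all reward outcomes are >= 0, as the proof requires).
exact = expected_utility(-3.0, 0.5, 10.0)
aborted = expected_utility(-3.0, 0.5, 0.0)  # reward term dropped for an aborted trace

assert aborted <= exact
print(exact, aborted)  # 2.0 -3.0
```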
B.5 Completeness
There are two components to the proof of completeness. First, I will show that the optimal plan can be represented as a DTPOP plan. Second, I will show that this optimal DTPOP plan will always be found by DTPOP.
B.5.1 Identifying the Contingent Plan
The causal structure of any sequence of contingent actions can be “reverse-engineered” in order to extract the causal links. I will first identify all of the causal links associated with any contingent plan. This contingent plan may have more than one causal link satisfying a single precondition–multiple links come from steps with different execution policies. After we discover this non-single-support plan, I will split actions until the plan is a single-support plan that can be represented by DTPOP. Our ultimate goal will be to discover the best possible plan of a given length. In order to demonstrate that we can always do this, I need to demonstrate that the optimal plan can be represented as a DTPOP plan, that complete-plan can construct this plan, and that optimize can find the correct step execution policies.

First I need to show that we can discover all of the causal links in the clairvoyantly-known optimal exemplar. This plan is a sequence of plan steps with execution policies. Since the plan is the optimum, this is easy–all we need to do is:

• Determine the conditional effects that contribute causal links or observable outcomes that are used in later execution policies.
• Determine all of the sources of support for the conditional effects listed above.
The tool that I will use to discover causal links is similar to the Discover_Links algorithm described in Appendix A. The input to the algorithm is the desired goal reward function, the initial conditions, and a sequence of contingent steps. Let step return the exemplar step that contains a particular conditional effect. Let f(S_i, o) be the decision policy for step S_i.
Assume that the decision policies for Q have been simplified by dropping variables that are not essential for the decision policy; that is, a dropped variable can be omitted from the decision policy without decreasing the expected value of the plan.
Discover_Contingent_Links(toposort Q = {S_1, …, S_n}, reward R, S_IC) {
    Useful_CEs := { R }
    Useful_CLs := ∅
    O+ := the set of observation values such that
          for all o ∈ O+, P{o} > 0 and E[ R | o ] > 0
    for j from n down to 1 {
        for all CE_e in CEs(S_j) {
            // Check for observations
            if (an execution policy for [S_j+1, S_n] is a function of a DEC of CE_e) {
                Useful_CEs := Useful_CEs + CE_e
            }
            // Now check for effects that support future preconditions.
            for all ce_c ∈ Useful_CEs {
                for all V in PreVars(ce_c) {
                    if (there exists some combination of observations o ∈ O+ such that
                        1. CE_e has an outcome variable V that is the same as a
                           precondition variable in ce_c,
                        2. f(step(ce_c), o) = true,
                        3. f(S_j, o) = true, and
                        4. there does not exist a step S in [S_j+1, S_n] such that S has
                           an outcome variable V and f(S, o) = true) {
                        Useful_CEs := Useful_CEs + CE_e
                        Useful_CLs := Useful_CLs + { S_j →V→ step(ce_c) }
                    }
                }
            }
        }
    }
}
FIGURE 134. Discover_Contingent_Links.
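A minimal executable sketch of the backward sweep of Figure 134 follows, under heavy simplifying assumptions: a single feasible observation combination, every execution policy identically true (so conditions 2 and 3 hold trivially), and no observation-policy check, leaving only the precondition-support clause. The Step and CondEffect encodings are invented for this sketch and are not the dissertation's data structures:

```python
from dataclasses import dataclass, field

@dataclass
class CondEffect:
    name: str
    out_vars: frozenset                 # outcome variables this effect can set
    pre_vars: frozenset = frozenset()   # precondition variables of the effect

@dataclass
class Step:
    name: str
    ces: list = field(default_factory=list)

def discover_links(steps, reward):
    """Backward sweep collecting useful conditional effects and causal links.

    Simplification of Figure 134: one observation combination, all policies
    true, so only condition 4 (no later executing producer of V) is checked.
    """
    owner = {id(ce): j for j, s in enumerate(steps) for ce in s.ces}
    owner[id(reward)] = len(steps)      # the reward acts like a final goal step
    useful = {id(reward): reward}
    links = set()
    for j in range(len(steps) - 1, -1, -1):
        for ce in steps[j].ces:
            for uce in list(useful.values()):
                for v in uce.pre_vars & ce.out_vars:
                    later_producer = any(
                        v in c.out_vars
                        for s in steps[j + 1:owner[id(uce)]]
                        for c in s.ces)
                    if not later_producer:          # condition 4
                        useful[id(ce)] = ce
                        links.add((steps[j].name, v, uce.name))
    return links

# Toy chain: A sets p, B needs p and sets q, the reward needs q.
a = Step("A", [CondEffect("ce_a", frozenset({"p"}))])
b = Step("B", [CondEffect("ce_b", frozenset({"q"}), frozenset({"p"}))])
r = CondEffect("R", frozenset(), frozenset({"q"}))
print(sorted(discover_links([a, b], r)))  # [('A', 'p', 'ce_b'), ('B', 'q', 'R')]
```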
B.5.2 No Fusion
More than one causal link can support a single precondition in the “general contingent plan” analyzed by Discover_Contingent_Links. If more than one causal link supports a single precondition, we say that the plan fuses the effects of two or more steps in order to support a later step. This is illegal in DTPOP–every precondition has only one source of support. Fortunately, we can turn a general contingent plan into a DTPOP plan.
Definition 42 (Fusion): Let f_A(o) denote the step execution policy for any step S_A. Let O denote the set of all possible combinations of the observation values and decisions in plan P. If a plan P has a causal link S_E →V→ S_C and ∃(o ∈ O) such that f_C(o) = true and f_E(o) = false, then the support for precondition V of step S_C is the fusion of the results of two or more steps.
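Definition 42 is directly checkable. In this sketch (the encodings are invented for illustration), a policy is a predicate over an observation-combination dict and a causal link is a (producer, variable, consumer) triple:

```python
def is_fused(link, policies, observation_combos):
    """Definition 42: link S_E -V-> S_C exhibits fusion if some observation
    combination executes the consumer S_C without the producer S_E."""
    producer, _var, consumer = link
    f_e, f_c = policies[producer], policies[consumer]
    return any(f_c(o) and not f_e(o) for o in observation_combos)

policies = {
    "S_E": lambda o: o["sensor"] == "ok",  # producer runs only on a good reading
    "S_C": lambda o: True,                 # consumer always runs
}
combos = [{"sensor": "ok"}, {"sensor": "bad"}]

# When sensor == "bad", S_C executes without S_E, so its precondition must be
# fused from some other step's effects.
print(is_fused(("S_E", "v", "S_C"), policies, combos))  # True
```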
Lemma 2 (No Fusion): There is an unfused contingent plan with the same expected value as that of a given fused plan.

Proof
Consider the following algorithm for converting a topological sort of a general contingent plan into a DTPOP plan. Look for any instance of fusion in the plan using Definition 42. If one is found, at least two links S_E1 →V→ S_C and S_E2 →V→ S_C will support the same precondition V of some step S_C in the plan. Say that S_E1 < S_E2 in the total order. We can split S_C into two mutually exclusive steps S_C1 and S_C2, each inheriting just one of the fusing causal links. The execution policy for S_C1 is f_C(o) ∧ (f_E1(o) ∧ ¬f_E2(o)) and the execution policy for S_C2 is f_C(o) ∧ f_E2(o). If there is a causal link S_C →X→ S_?, we remove this link and add S_C1 →X→ S_? (S_C2 →X→ S_?) if the execution policy for the consumer S_? can be true when the new execution policy for S_C1 (S_C2) is true. If an observable DEC of S_C is used in a future execution policy, we need to replace any generalized proposition of the form S_C = xxx with either S_C1 = xxx or S_C2 = xxx.

Note that there is a one-to-one map between the feasible (P(o) > 0) observation combinations for steps up to and including S_C before and after the split. The combination of observations that allows S_C1 to execute is mutually exclusive of the combination of observations that allows S_C2 to execute. Since exactly one copy of S_C is executed in exactly the same circumstances as before, using the same support, the effect of splitting S_C is merely one of renaming S_C to S_C1 in some situations and to S_C2 in others. Expected value is not affected.

There can only be a finite number of fused links to split. Each split relies on mutually exclusive sets of observations. The number of combinations of observations is not affected by the splitting process. Therefore, the maximum number of splits for each step is bounded above by the number of feasible combinations of observations.
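The split can be checked mechanically. This sketch assumes (reading f_C as the original policy of the step being split, which the proof's notation suggests) the two new policies f_C1 = f_C ∧ f_E1 ∧ ¬f_E2 and f_C2 = f_C ∧ f_E2, and verifies the mutual-exclusivity and exactly-one-copy claims over a toy observation space:

```python
def split_policies(f_c, f_e1, f_e2):
    """Split policies from Lemma 2: S_C1 keeps the link from S_E1, S_C2 from S_E2."""
    f_c1 = lambda o: f_c(o) and f_e1(o) and not f_e2(o)
    f_c2 = lambda o: f_c(o) and f_e2(o)
    return f_c1, f_c2

# Enumerate a toy observation space to check the claims in the proof.
combos = [{"x": a, "y": b} for a in (True, False) for b in (True, False)]
f_c  = lambda o: o["x"] or o["y"]   # original policy for S_C
f_e1 = lambda o: o["x"]             # producer S_E1 executes on x
f_e2 = lambda o: o["y"]             # producer S_E2 executes on y
f_c1, f_c2 = split_policies(f_c, f_e1, f_e2)

for o in combos:
    # The two copies are mutually exclusive ...
    assert not (f_c1(o) and f_c2(o))
    # ... and whenever S_C executed with at least one supporter, exactly one
    # copy executes in its place.
    if f_c(o) and (f_e1(o) or f_e2(o)):
        assert bool(f_c1(o)) != bool(f_c2(o))
```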
B.5.3 Completeness
Given the optimal bounded-length action sequence, we can discover an equivalent DTPOP plan.
Theorem 11 (Completeness): Let Q be a sequence of contingent actions that is an optimal solution of length N to the planning problem D = ⟨IC, A, R⟩. The cost functions for the possible actions in A are all greater than zero. DTPOP with the appropriate search control strategy can identify a plan P′ that has the same expected value as Q.

Proof
This proof is an edit of the completeness proof for UDTPOP. The only difference is that we need to devise a construction order that identifies whether arcs are added to satisfy an open uncertainty or an open condition. Given any contingent plan Q, we can derive an equivalent DTPOP plan Q′ (Lemma 2).
The proof proceeds by induction. Let IC_k = Result({S_1, …, S_n−k+1}) and let D_k = ⟨IC_k, A, R⟩ be a series of planning domains with initial conditions given by IC_k. Let S_ICk be an initial-conditions action with conditional effect IC_k. We will show that DTPOP can use the “planner execution trace” for the derivation of P_k solving D_k = ⟨IC_k, A, R⟩ in order to generate a plan for P_k+1 solving D_k+1 = ⟨IC_k+1, A, R⟩. Thus DTPOP with the appropriate search control can find the plan P_n = P′ solving D = D_n = ⟨IC_n, A, R⟩ that has the topological sort {S_1, …, S_n}.
Open Condition or Open Uncertainty?

The tricky part of this proof is that there are now two different reasons to add causal links: open uncertainties and open conditions. In order to build our planner execution trace, we need to identify whether a link is added for an open uncertainty or an open condition. We will identify the ordering by establishing all of the links that we can through open conditions and then trying to discover links that must be added via an open uncertainty. One possible ordering can be derived by running the following algorithm. If the exemplar plan is the optimal plan, then all of the observable conditional effects that appear in step execution policies are relevant to value functions in the plan.

0. Set i to 0. Mark all of the goal reward functions.

1. Increment i. Repeat until no further conditional effects can be marked: mark conditional effects that support a marked conditional effect or the value functions of a step that contains a marked conditional effect. Call the causal links that support these marked conditional effects backward arcs and label them with ⟨i, backward⟩. This marked set represents the portion of the exemplar plan that can be identified using only the UDTPOP mechanisms. Mark the steps that contain these marked conditional effects and value functions.
2. If all of the observable, relevant conditional effects (ORCs) have been marked, stop. Otherwise, find an active path that links one of the ORCs to a value function on one of the marked steps. This active path must contain one ORC, CE_i, that is the descendent of a marked step in the plan. (Why? If there were no such active path, then the remaining steps could be deleted from the plan without decreasing its expected value.) The directed chain of arcs connecting the marked portion of the plan to CE_i is called the set of forward arcs. Label these arcs with ⟨i, forward⟩. Go to 1 and repeat.

This procedure must mark all of the steps in the optimal plan: if a step is not marked, then it is possible to increase the value of the plan by eliding the step. Every arc in the plan can be added by the normal UDTPOP mechanisms except for the forward arcs. We can reconstruct all of the causal links in the plan as follows:

1. Add all of the backward arcs labelled with ⟨1, backward⟩.
2. Add all of the forward arcs labelled with ⟨1, forward⟩.
3. Add all of the backward arcs labelled with ⟨2, backward⟩.
4. Add all of the forward arcs labelled with ⟨2, forward⟩.
5. Etc.
Call this labelling scheme for the causal links the clairvoyant link labelling scheme.
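The reconstruction schedule above is simply a lexicographic sort on the clairvoyant labels: index first, backward before forward within an index. A tiny sketch with placeholder arc names:

```python
def reconstruction_order(labelled_arcs):
    """Order arcs as: <1, backward>, <1, forward>, <2, backward>, <2, forward>, ..."""
    phase = {"backward": 0, "forward": 1}
    return sorted(labelled_arcs, key=lambda arc: (arc[0], phase[arc[1]]))

arcs = [(2, "forward", "L4"), (1, "forward", "L2"),
        (2, "backward", "L3"), (1, "backward", "L1")]
print([name for _i, _d, name in reconstruction_order(arcs)])  # ['L1', 'L2', 'L3', 'L4']
```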
Completeness
We can reconstruct the original exemplar sequence using the clairvoyant link labelling scheme together with the clairvoyant decision policy discussed in Section A.4.2.

Basis (k = 2): {S_IC2, S_n} is a plan that consists of only two actions, the initial-conditions action S_IC2 with conditional effect IC_2 and the goal action S_Goal. Since we know that there are no other actions in this plan, and that an action cannot threaten itself, the only applicable plan construction operation is add-link. We can derive a 2-action plan P_2′ that is identical to P_2 by adding links from S_IC2 to S_Goal for all of the open variables in S_Goal.3
Inductive Hypothesis (k > 2): Given the totally-ordered plan {S_ICk, S_n−k+2, …, S_n}, DTPOP can find a plan P_k solving D_k = ⟨IC_k, A, R⟩.

Inductive Step: Given the inductive hypothesis and the totally-ordered plan {S_ICk+1, S_n−k+1, …, S_n}, we will show that DTPOP can find a plan P′_k+1 solving D_k+1 = ⟨IC_k+1, A, R⟩.

P_k is very close to the actual solution for D_k+1, except that the first action S_ICk in P_k serves the same purpose as the initial two actions S_ICk+1 and S_n−k+2 in P_k+1. Our job is to derive a strategy for reassigning the links that originate in S_ICk in plan P_k to S_ICk+1 and S_n−k+2.
Let’s take any one link L that originates from S_ICk. One of the following cases holds:

Case 1: S_n−k+2 is not compatible with the consumer of L in the exemplar plan. In place of adding an arc from S_ICk to the consumer of L, replace the operation used to add the original L with the same operation adding a link from S_ICk+1 to the consumer of L. Any threats with S_n−k+2 can be resolved via branching. If S_n−k+2 is a goal step, we need an operation to add this goal step to the plan.

Case 2: S_n−k+2 does not have the outcome variable protected by L. In this case, use the same operation used to add L to add a new link from S_ICk+1 to the consumer of L.
3. S_IC (and therefore S_ICk) has a CED over every outcome variable used in the plan.
Case 3: S_n−k+2 has the outcome variable protected by L and L was added by add-step-forward. Use add-step-forward to add the new link from S_n−k+2 to the consumer of L.

Case 4: S_n−k+2 has an effective conditional outcome that is clairvoyantly pertinent (pertinent_C) to the variable protected by the causal link. Replace the old add-link operation with an add-step to add S_n−k+2 if the step has not been added already, or use add-link to add a link to a pre-existing S_n−k+2.

Case 5: S_n−k+2 has an outcome variable that is the same as that protected by L, but does not have an effective conditional outcome that is also pertinent_C. Replace the old add-link operation with an add-link to S_ICk+1. Resolve the threat with S_n−k+2 using persist-link.

After the execution trace has been modified as described, we have to decide how to add the links that connect S_n−k+2 to S_ICk+1. In this case, we have to pay careful attention to the clairvoyant link labelling scheme. Pick any link L.

Case 1: Link L is labelled ⟨i, forward⟩. An add-forward operation to add this link can be added after every add-backward operation that has an index less than i. This restriction ensures that there will be an open uncertainty available to serve as the root for this link. If the step has not been added yet, use add-step-forward to add this link. Otherwise, use add-link-forward.

Case 2: Link L is labelled ⟨i, backward⟩. We can add the causal link to establish L at any time after a precondition exists for L. Eventually one will be added, possibly after an add-forward operation adds all of the arcs of precedence i − 1. There are two possibilities for adding this link:

• Say that an outcome variable of CE is protected by a link added either by add-link or add-step. This means that at least one of the outcomes of CE is pertinent; therefore, there are pertinent preconditions for the CE. Now, by Theorem 6, if all of these pertinent preconditions are not possible, then P{R > R_min} = 0.0, contradicting one of the restrictions of the theorem. Therefore, there must be outcomes in S_ICk+1 that are pertinent (and pertinent_C to CE).

• Alternatively, the CE may have been spliced into a pertinent link by persist-support. We know that persist-support must persist at least one pertinent outcome from S_ICk+1, again because of Theorem 5. Thus, one of the persistent conditional effects of CE is both possible and pertinent. The preconditions of this conditional effect can be linked into S_ICk+1.

We have just shown that we can use P_k with its topological sort to derive a plan for P_k+1. We have shown that we can derive a plan for P_2. By induction, we can derive a plan for P′ = P_n; therefore, the theorem is true. DTPOP is complete.