Optimal Threshold Policy for Sequential Weapon Target Assignment

Krishnamoorthy Kalyanam, Sivakumar Rathinam, David Casbeer, Meir Pachter

InfoSciTex Corporation, a DCS Company, Dayton, OH 45431 USA ([email protected])
Mechanical Engineering Department, Texas A&M University, TX 77843 USA ([email protected])
Autonomous Control Branch, Air Force Research Laboratory, WPAFB, OH 45433 USA ([email protected])
Electrical & Computer Engineering Department, Air Force Institute of Technology, WPAFB, OH 45433 USA ([email protected])

Abstract: We investigate a variant of the classical Weapon Target Assignment (WTA) problem, wherein N targets are sequentially visited by a bomber carrying M homogeneous weapons, and we seek the optimal assignment of weapons to targets. A weapon launched by the bomber destroys the j-th target with probability p_j, and upon successful elimination the bomber earns a positive reward r_j. There is feedback, in that the bomber, upon deploying a weapon, is notified whether or not it successfully destroyed the target; it then decides whether to move on to the next target or to allocate an additional weapon to the current target. We provide a tractable solution method for computing the optimal closed-loop control policy that yields the maximal expected total reward. Moreover, we show that a threshold policy is optimal, wherein a weapon from an inventory of k weapons is dropped on the j-th target iff k exceeds a stage-dependent threshold value, κ(j).

Keywords: Weapon-Target Assignment, Dynamic Programming, Threshold Policy

1. INTRODUCTION

The operational scenario is the following. A bomber with M identical weapons travels along a designated route/path and sequentially encounters enemy (ground) targets numbered 1 to N. A weapon dropped on the j-th target will destroy it with probability p_j. Upon successful elimination, the bomber receives a positive reward r_j. Furthermore, upon releasing a weapon, the bomber is alerted as to whether or not the deployed weapon was successful. If the engagement was successful, the bomber moves on to the next target. If it was not, the bomber can either re-engage the current target or move on to the next target in the sequence. We compute the optimal feedback policy that results in the maximal total expected reward for the bomber.

2. WEAPON-TARGET ASSIGNMENT PROBLEM

Upon inspecting the scenario considered herein, it is clear that, if there were no feedback, the problem would collapse to the static Weapon-Target Assignment (WTA) problem. Indeed, in the absence of feedback information, the bomber might as well decide at the start of the mission how many weapons to drop on each target. Towards that end, let n_j be the number of weapons assigned to target j. The expected reward from this assignment is given by:

\bar{V}_j(n_j) = r_j (1 - q_j^{n_j}),   (1)

where q_j = 1 - p_j. So, the assignment problem is given by:

\max \sum_{j=1}^{N} \bar{V}_j(n_j)   (2)

subject to

\sum_{j=1}^{N} n_j = M \quad \text{and} \quad n_j \in \{0, \dots, M\}, \; \forall j.   (3)

Since \sum_{j=1}^{N} \bar{V}_j(n_j) = \sum_{j=1}^{N} r_j - \sum_{j=1}^{N} r_j q_j^{n_j} and the first sum is a constant, one can alternatively rewrite the above as a minimization problem:

\min \sum_{j=1}^{N} r_j q_j^{n_j}   (4)

subject to

\sum_{j=1}^{N} n_j = M \quad \text{and} \quad n_j \in \{0, \dots, M\}, \; \forall j.   (5)

The above problem is a special case of Flood's static Weapon-Target Assignment (WTA) problem (see Manne (1958)), and the optimal solution is obtained via the Maximum Marginal Return (MMR) algorithm (see denBroeder et al. (1959)).

Algorithm MMR
1. Initialize n_j = 0 and V_j = r_j, \forall j = 1, \dots, N
2. for i = 1, \dots, M
3.   Find k = \arg\max_{j=1,\dots,N} p_j V_j
4.   Update n_k = n_k + 1 and V_k = V_k q_k
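For concreteness, here is a minimal Python sketch of Algorithm MMR (the function name mmr and the heap-based bookkeeping are our choices, not the paper's):

```python
import heapq

def mmr(p, r, M):
    """Maximum Marginal Return: greedily assign M homogeneous weapons
    to N targets, where p[j] is the kill probability and r[j] the reward
    of target j (0-indexed). Returns the weapon counts n maximizing (2)."""
    N = len(p)
    n = [0] * N
    V = list(r)  # V[j] = r_j * q_j^{n_j}, the remaining value of target j
    # Max-heap on the marginal return p_j * V_j (negated for heapq's min-heap).
    heap = [(-p[j] * V[j], j) for j in range(N)]
    heapq.heapify(heap)
    for _ in range(M):
        _, k = heapq.heappop(heap)  # target with maximum marginal return
        n[k] += 1
        V[k] *= 1 - p[k]            # value decays by q_k after each assignment
        heapq.heappush(heap, (-p[k] * V[k], k))
    return n
```

Maintaining a max-heap over the marginal returns p_j V_j is one way to realize the complexity bound noted next.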

Algorithm MMR has a time complexity of O(N + M log(N)). Note that under the additional complication that the weapons are not homogeneous, i.e., if p_{ij} denotes the probability that target j is destroyed upon being hit by a weapon of type i, the resulting static assignment problem is NP-complete; see Lloyd and Witsenhausen (1986). Exact and heuristic algorithms to solve this version of the WTA problem are provided in Ahuja et al. (2007). An approximate algorithm for a dynamic WTA problem, wherein not all targets are known to the decision maker at the start of the mission, is provided in Murphey (1999). A geometry-based optimal weapon-target assignment for convoy protection is provided in Leboucher et al. (2013).

3. MODEL

We consider a dynamic variant of the WTA problem, wherein the targets are visited sequentially by the bomber. Furthermore, we also incorporate feedback, in that the bomber is informed about the success/failure of a weapon upon deployment. This allows for dynamic decision making, where a decision is made as to whether (a) an additional weapon is deployed on the current target, or (b) the bomber moves on to the next target in the sequence. In this regard, decision rules for a hunter pursuing targets whose rewards take values from a known (continuous-valued) distribution are provided in Sato (1997). The scenario considered in this paper is simpler in that the target values are deterministic and known a priori.

Let V(j, w) denote the optimal cumulative reward ("payoff-to-go") that can be achieved when the bomber arrives at the j-th target with w > 0 weapons in hand. It follows that V(j, w) must satisfy the Bellman recursion:

V(j, w) = \max_{u \in \{0,1\}} \{\, p_j (r_j + V(j+1, w-1)) + q_j V(j, w-1),\; V(j+1, w) \,\}, \quad j = 1, \dots, (N-1),   (6)

where the control action u ∈ {0, 1} indicates whether the bomber should stay and deploy a weapon (u = 0) or move on to the next target (u = 1). In (6), decision u = 0 results in the j-th target being destroyed with probability p_j, and u = 1 results in the bomber moving on to the next target in the sequence. If target j is destroyed, the bomber receives an immediate reward of r_j. The optimal policy is therefore given by:

\mu(j, w) = \arg\max_{u \in \{0,1\}} \{\, p_j (r_j + V(j+1, w-1)) + q_j V(j, w-1),\; V(j+1, w) \,\}, \quad j = 1, \dots, (N-1).   (7)

If the bomber runs out of ammunition, there is no more reward to be gained, i.e., the boundary condition is:

V(j, 0) = 0, \quad j = 1, \dots, N.   (8)

Furthermore, if the bomber is at the N-th target and still has weapons in hand, the expected reward is given by:

V(N, w) = r_N (1 - q_N^w), \quad w > 0.   (9)

In other words, q_N^w represents the probability that the N-th target is not destroyed by any of the w weapons. So, 1 - q_N^w is the probability that it gets destroyed, thereby yielding the reward in (9). The optimal policy can be obtained by the Backward Dynamic Programming (BDP) algorithm detailed below.

Algorithm BDP

1. Initialize V(j, 0) = 0, j = 1, \dots, N
2. Initialize V(N, k) = r_N (1 - q_N^k), k = 1, \dots, M
3. Initialize \mu(N, k) = 0, k = 1, \dots, M
4. for j = (N-1) down to 1
5.   for k = 1 to M
6.     Compute V(j, k) as per (6)
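The following is a minimal Python sketch of Algorithm BDP (0-indexed targets; the function name bdp and the tie-breaking in favor of dropping are our conventions):

```python
def bdp(p, r, M):
    """Backward Dynamic Programming for the sequential WTA problem.
    Returns V, mu with V[j][k] the optimal reward-to-go at target j
    (0-indexed) holding k weapons, and mu[j][k] the optimal action
    (0 = drop a weapon, 1 = move on), per (6)-(9)."""
    N = len(p)
    q = [1.0 - pj for pj in p]
    V = [[0.0] * (M + 1) for _ in range(N)]   # V[j][0] = 0 is boundary (8)
    mu = [[0] * (M + 1) for _ in range(N)]
    for k in range(1, M + 1):                 # boundary (9) at the last target
        V[N - 1][k] = r[N - 1] * (1.0 - q[N - 1] ** k)
    for j in range(N - 2, -1, -1):            # backward over targets
        for k in range(1, M + 1):
            stay = p[j] * (r[j] + V[j + 1][k - 1]) + q[j] * V[j][k - 1]
            move = V[j + 1][k]
            V[j][k] = max(stay, move)
            mu[j][k] = 0 if stay >= move else 1
    return V, mu
```

The inner loop ascends in k, so V[j][k-1] is already available when the stay action is evaluated; the two nested loops give the O(NM) complexity noted below.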

If there is only one weapon left, we have from (6):

V(j, 1) = \max \{p_j r_j,\; V(j+1, 1)\}, \quad j < N,   (10)

and V(N, 1) = p_N r_N. So, a greedy policy is optimal and

V(1, 1) = \max_{j=1,\dots,N} p_j r_j.   (11)
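As a quick sanity check of (11), assuming the bdp sketch above (the numbers are illustrative):

```python
p, r = [0.5, 0.9, 0.4], [10.0, 6.0, 12.0]  # p_j r_j = 5.0, 5.4, 4.8
V, _ = bdp(p, r, M=1)
print(V[0][1])  # 5.4 = max_j p_j r_j, as predicted by (11)
```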

The bomber, as expected, deploys its lone weapon on the target with the maximum expected reward. Note that Algorithm BDP has a time complexity of O(NM). For practical scenarios, one is unlikely to encounter values of N or M large enough to render the algorithm computationally infeasible. In any case, it is beneficial to understand the structure (if any) of the optimal policy. In the next section, we show that the optimal policy has a special structure: indeed, it is a threshold policy (or control limit policy; see Sec. 4.7.1 in Puterman (2005) for details). We show that a weapon from an inventory of k weapons is dropped on the j-th target iff k exceeds a threshold or control limit, κ(j).

4. THRESHOLD POLICY

Let \Delta_j(k) := V(j, k+1) - V(j, k) denote the marginal reward yielded by assigning an additional weapon, over and above k weapons, to the downstream targets numbered j to N.

Proposition 1. \Delta_j(k) is a monotonically decreasing function of k.

We shall prove the above proposition later. Notice, however, that the marginal reward yielded by the last (N-th) target in the sequence,

\Delta_N(k) = V(N, k+1) - V(N, k) = p_N r_N q_N^k,   (12)

is clearly a decreasing function of k, given that q_N < 1. Suppose Proposition 1 is true, i.e., \Delta_{j+1}(k) is a monotonically decreasing function of k. Then we can define

\kappa(j) = \min \{k = 0, 1, \dots : p_j r_j \ge \Delta_{j+1}(k)\}.

The following result shows that a threshold policy is optimal for the j-th target.

Lemma 2. If \Delta_{j+1}(k) is a monotonically decreasing function of k, then

\mu(j, k+1) = \begin{cases} 1, & k < \kappa(j), \\ 0, & \text{otherwise.} \end{cases}

Proof. From the Bellman recursion (6), we have V(j, k) \ge V(j+1, k). It follows that:

p_j (r_j + V(j+1, k)) + q_j V(j, k) \ge p_j r_j + V(j+1, k) \ge V(j+1, k+1), \quad \forall k \ge \kappa(j),   (13)

where the first inequality in (13) uses V(j, k) \ge V(j+1, k) and the second follows from the definition of \kappa(j). Recall the Bellman recursion (6):

V(j, k+1) = \max \{\, p_j (r_j + V(j+1, k)) + q_j V(j, k),\; V(j+1, k+1) \,\}
\Rightarrow V(j, k+1) = p_j (r_j + V(j+1, k)) + q_j V(j, k), \quad k \ge \kappa(j)   (14)
\Rightarrow \mu(j, k+1) = 0, \quad k \ge \kappa(j).

We shall prove the second part of the result, i.e., \mu(j, k+1) = 1 for k < \kappa(j), by induction on k. Recall the definition of \kappa(j), which gives us:

p_j r_j + V(j+1, k) < V(j+1, k+1), \quad k < \kappa(j).   (15)

If \kappa(j) \le 0, there is nothing left to prove. So, suppose \kappa(j) > 0. From the Bellman recursion (6) and the boundary condition (8), we have:

V(j, 1) = \max \{p_j r_j,\; V(j+1, 1)\} \Rightarrow V(j, 1) = V(j+1, 1),   (16)

where (16) follows from (15). Suppose V(j, t) = V(j+1, t) for some t < \kappa(j). Using the induction argument, the Bellman recursion (6) yields:

V(j, t+1) = \max \{\, p_j (r_j + V(j+1, t)) + q_j V(j, t),\; V(j+1, t+1) \,\}
          = \max \{\, p_j r_j + V(j+1, t),\; V(j+1, t+1) \,\}
          = V(j+1, t+1),   (17)

where (17) follows from (15). Hence,

V(j, k+1) = V(j+1, k+1), \quad k < \kappa(j) \Rightarrow \mu(j, k+1) = 1, \quad k < \kappa(j).   (18)
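Lemma 2 suggests computing the thresholds κ(j) directly from the marginals ∆_{j+1}(k). A minimal sketch under the same assumptions as before (it reuses the bdp function above; the cap at M is our choice, since an inventory never exceeds M weapons):

```python
def thresholds(p, r, M):
    """kappa[j] = min{k >= 0 : p_j r_j >= Delta_{j+1}(k)}, where
    Delta_{j+1}(k) = V(j+1, k+1) - V(j+1, k). Per Lemma 2, a weapon is
    dropped on target j from an inventory of w weapons iff w > kappa[j]."""
    V, _ = bdp(p, r, M)
    N = len(p)
    kappa = [0] * (N - 1)
    for j in range(N - 1):
        k = 0
        # Delta_{j+1} is decreasing in k (Theorem 3), so scan k upward
        # until the immediate expected reward dominates the marginal.
        while k < M and p[j] * r[j] < V[j + 1][k + 1] - V[j + 1][k]:
            k += 1
        kappa[j] = k
    return kappa
```

By Lemma 2, the optimal action can then be read off directly, without consulting the table mu: drop a weapon on target j iff the current inventory exceeds kappa[j].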

The above result tells us that one weapon out of the current inventory of (k+1) weapons is deployed on the current target j iff the immediate expected reward p_j r_j is no less than the marginal reward obtained by assigning an additional weapon, over and above k weapons, to the downstream targets. This is intuitively appealing and results in a threshold policy being optimal.

Theorem 3. For j = 1, \dots, (N-1), \Delta_j(k) is a monotonically decreasing function of k.

Proof. We prove the result by (backward) induction on the target index j. From (12), we know that \Delta_N(k) is a decreasing function of k. Let us suppose that \Delta_{j+1}(k) is also a decreasing function of k. Combining (14) and (18), we can write:

V(j, k+1) = \begin{cases} V(j+1, k+1), & k < \kappa(j), \\ p_j r_j + V(j+1, k), & k = \kappa(j), \\ p_j (r_j + V(j+1, k)) + q_j V(j, k), & k > \kappa(j). \end{cases}   (19)

So, the marginal reward is given by:

\Delta_j(k) = V(j, k+1) - V(j, k) = \begin{cases} \Delta_{j+1}(k), & k < \kappa(j), \\ p_j r_j, & k = \kappa(j). \end{cases}   (20)

For k \ge \kappa(j) and \ell = k - \kappa(j), we have by repeated application of (19):

V(j, k+1) = p_j r_j \sum_{i=0}^{\ell} q_j^i + q_j^{\ell} V(j+1, \kappa(j)) + p_j \sum_{i=0}^{\ell-1} q_j^i V(j+1, k-i),

\Rightarrow \Delta_j(k) = p_j \sum_{i=0}^{\ell-1} q_j^i \Delta_{j+1}(k-i-1) + p_j q_j^{\ell} r_j.   (21)

We proceed to show that \Delta_j(k), as prescribed in (20) and (21), is a decreasing function of k. By our induction argument, \Delta_j(k) = \Delta_{j+1}(k) decreases as k goes from 0 to \kappa(j) - 1. From the definition of \kappa(j), we have p_j r_j < \Delta_{j+1}(\kappa(j) - 1), so the decrease continues at k = \kappa(j), where \Delta_j(\kappa(j)) = p_j r_j. For k \ge \kappa(j) and \ell = k - \kappa(j),

\Delta_j(k+1) - \Delta_j(k) = p_j q_j^{\ell} (\Delta_{j+1}(\kappa(j)) - p_j r_j) + p_j \sum_{i=0}^{\ell-1} q_j^i (\Delta_{j+1}(k-i) - \Delta_{j+1}(k-i-1)) < 0,   (22)

since \Delta_{j+1}(k-i) < \Delta_{j+1}(k-i-1) per the induction argument, and \Delta_{j+1}(\kappa(j)) \le p_j r_j per the definition of \kappa(j). Hence, \Delta_j(k) is a decreasing function of k.

Corollary 4. If p_1 r_1 \ge \dots \ge p_N r_N, then \kappa(j) = 0 for j = 1, \dots, (N-1), and so \mu(j, k) = 0 for j = 1, \dots, (N-1) and k = 1, \dots, M.

Proof. p_{N-1} r_{N-1} \ge p_N r_N = \Delta_N(0) implies that \kappa(N-1) = 0. From (20), \Delta_{N-1}(0) = p_{N-1} r_{N-1} \le p_{N-2} r_{N-2}, and so \kappa(N-2) = 0. Proceeding in a similar fashion, we have \Delta_{j+1}(0) = p_{j+1} r_{j+1}, and so \kappa(j) = 0 for j = 1, \dots, (N-1).

In other words, if the targets happen to be sequenced in decreasing order of expected return, the optimal policy is to keep bombing the current target until one of two things happens: (a) the target is destroyed, or (b) the bomber runs out of ammunition.
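A quick numerical check of Corollary 4, assuming the thresholds sketch above (illustrative numbers, ordered so that p_1 r_1 ≥ p_2 r_2 ≥ p_3 r_3):

```python
p, r = [0.8, 0.7, 0.6], [10.0, 9.0, 8.0]  # p_j r_j = 8.0, 6.3, 4.8
print(thresholds(p, r, M=5))              # [0, 0]: always engage, per Corollary 4
```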

5. CONCLUSION

We consider a dynamic variant of the Weapon-Target Assignment (WTA) problem, wherein the targets are sequentially visited by a bomber equipped with homogeneous weapons. Feedback is available in the form of Battle Damage Assessment (BDA), i.e., the bomber is promptly informed about the failure or success of a deployed weapon. The dynamic weapon allocation problem can be solved by backward Dynamic Programming. The optimal policy is a threshold policy, wherein a weapon is deployed iff the current inventory level exceeds a certain threshold. In the future, we shall consider the added complexity wherein the targets can retaliate against the bomber with some probability of success.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge Prof. Pravin Varaiya, UC Berkeley, for the motivating problem formulation and for providing valuable insights on the solution.

REFERENCES

Ahuja, R.K., Kumar, A., Jha, K.C., and Orlin, J.B. (2007). Exact and heuristic algorithms for the weapon-target assignment problem. Operations Research, 55(6), 1136–1146.

denBroeder, G.G., Ellison, R.E., and Emerling, L. (1959). On optimum target assignments. Operations Research, 7, 322–326.

Leboucher, C., Le Menec, S., Kotenkoff, A., Shin, H.S., and Tsourdos, A. (2013). Optimal weapon target assignment based on an geometric approach. In 19th IFAC Symposium on Automatic Control in Aerospace, volume 19, 341–346. Wuerzburg, Germany.

Lloyd, S.P. and Witsenhausen, H.S. (1986). Weapons allocation is NP-complete. In Proceedings of the 1986 Summer Conference on Simulation. Reno, NV.

Manne, A.S. (1958). A target-assignment problem. Operations Research, 6, 346–351.

Murphey, R.A. (1999). An approximate algorithm for a weapon target assignment stochastic program. In Approximation and Complexity in Numerical Optimization: Continuous and Discrete Problems, 1–12. Kluwer Academic Publishers.

Puterman, M.L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley-Interscience.

Sato, M. (1997). On optimal ammunition usage when hunting fleeing targets. Probability in the Engineering and Informational Sciences, 11, 49–64.
