Likelihood Robust Optimization for Data-Driven Newsvendor Problems

Zizhuo Wang*    Peter W. Glynn†    Yinyu Ye‡

July 4, 2009

Abstract

In this paper, we introduce a new type of robust optimization, which we call likelihood robust optimization (LRO), and apply it to the well-known newsvendor problem. In contrast to previous work that studied robustness with respect to a fixed mean, variance, or support of the demand, we exploit the historical data to define the admissible distribution set: we consider the distributions under which the observed data achieve a certain level of likelihood. We then formulate the problem as one of maximizing the expected profit under the worst-case distribution in that set when the number of scenarios is finite. We show that this model is convex and can be solved efficiently with modern optimization methods. We then present a detailed study of the statistical meaning of the likelihood constraint: we establish its relationship with a corresponding chance constraint and with a mean constraint, and show that the likelihood set is asymptotically a Gaussian-type region. The analysis is based on Bayesian statistics and empirical likelihood theory. We also extend our approach to the multi-item newsvendor problem and show that the multi-item problem remains highly tractable under this model. We then present an empirical likelihood band approach for the extension to continuous state spaces. Finally, we justify our model by comparing the performance of our approach with the classical Scarf's model. The numerical results show that our approach is indeed robust yet more practical.

* Department of Management Science and Engineering, Stanford University, Stanford, CA 94305. Email: [email protected]. Research supported in part by Boeing.
† Department of Management Science and Engineering, Stanford University, Stanford, CA 94305. Email: [email protected].
‡ Department of Management Science and Engineering, Stanford University, Stanford, CA 94305. Email: [email protected]. Research supported in part by Boeing.

1 Introduction

In this paper, we consider the newsvendor problem, which can be described as follows. A newsvendor sells several types of products but has to decide how many units of each product to stock before observing the demand. The newsvendor faces an overage cost if he orders too much and an underage cost if he orders too little. The newsvendor problem is therefore to decide the order size for each product so that the profit is maximized.

The newsvendor problem is one of the most fundamental problems in inventory management. It is simple, yet it serves as a building block for various more complicated inventory and management problems [11, 12, 15, 23]. In particular, the robust version of this problem, which studies the maximin approach, has arisen recently in many papers [3, 5, 9, 17, 20, 24]. The first study of the maximin approach for the stochastic newsvendor problem was done by Scarf [17], and it was later extended by Gallego and Moon [9]. In [17] and [9], the authors derived the order quantity that maximizes the worst-case expected profit over all distributions with a fixed mean and variance. They gave an explicit formula for the optimal order quantity and showed that, for any given order quantity, the worst-case distribution is a two-point distribution. Although these results are elegant in theory, the model has two practical drawbacks. First, the mean and the variance are usually not known exactly; more often, we are given a set of historical observations rather than the exact mean and variance. Second, the demand usually does not follow a two-point distribution, which makes the worst-case distribution in their model unrealistic and overly conservative.

Besides maximizing the expected profit, another line of research focuses on minimizing the maximum regret. In [20], Yue et al. minimize the worst-case absolute regret over all distributions with a given mean and variance. Perakis and Roels [14] give a thorough discussion of the regret model under various conditions. Later, Zhu et al. [24] extended this theory by minimizing the relative regret when either mean and variance information or mean and support information is available. The regret model is reasonable and can be useful in some cases; this paper, however, focuses on maximizing the expected profit of the newsvendor.


To overcome the conservativeness of the maximin model, several other robust optimization models have been proposed. Ben-Tal and Nemirovski [1] introduced several uncertainty sets from which the worst-case scenario is chosen and showed that the robust counterparts are convex and tractable. Later, Bertsimas and Sim [2] introduced the concept of the "price of robustness" and derived a robust counterpart that is a linear program. Both papers analyzed the statistical meaning of the uncertainty set. Furthermore, Bertsimas and Thiele [3, 5] applied this idea to newsvendor and other inventory problems. However, these approaches mainly target general multidimensional robust optimization and become trivial when dealing with the single-item newsvendor problem. There are also data-driven approaches for stochastic optimization, e.g., [3, 4, 18], which inject historical data directly into the problem formulation; however, robustness is not well studied for these approaches. In [21] and [22], Nilim and El Ghaoui and Iyengar studied a robust version of the Markov decision process (MDP). Both papers introduced an uncertainty set called the likelihood region, studied some of its properties, and applied it to the MDP. In this paper, we investigate this kind of uncertainty set further.

We develop a robust optimization technique, the likelihood robust optimization (LRO), and apply it to the newsvendor problem. The LRO has several advantages over previous models. First, it fully exploits the historical data and can also use information about the mean and the variance of the underlying distribution. Second, it is easily solvable even for the multi-item newsvendor problem, which many previous methods cannot handle. Third, it has a clear statistical meaning, which allows decision makers to adjust the robustness level according to their needs. The LRO could be applied to many other problems and is thus also of independent interest for future study.

The paper is organized as follows. In Section 2, we present our first LRO model for the single-item newsvendor problem: we maximize the expected profit while maintaining a certain level of likelihood of the data. In Section 3, we justify the LRO model using Bayesian statistics and empirical likelihood theory, showing that the likelihood constraint can be viewed as a certain type of chance constraint and establishing the connection between the likelihood constraint and a mean constraint; we also show several asymptotic properties of the likelihood robust region. In Section 4, we generalize the LRO to multi-item newsvendor problems. In Section 5, we generalize the LRO to a continuous state space; using an empirical likelihood band, we show that the problem can be transformed into a linear program, which is highly tractable. In Section 6, we present numerical results showing that our model is indeed robust and is less conservative than Scarf's model. We conclude in Section 7.

2 The Likelihood Robust Optimization

2.1 Introduction to the Likelihood Robust Model

We start with the single-item newsvendor problem, which can be formulated as

    min_x  G_f(x) = b E(d − x)^+ + h E(x − d)^+        (1)

In (1), x is the decision variable (the order quantity), d is the random demand with probability distribution f, and b, h > 0 are the underage and overage costs, respectively. Here s^+ = max(s, 0) denotes the positive part of a number. The goal is to choose an order quantity x that achieves the minimum cost under certain assumptions on the demand distribution.

In this section, we assume that the support of the demand is S = {0, 1, ..., n}, i.e., the demand takes integer values between 0 and n. Then any distribution on the demand space can be represented by a vector p = (p_0, p_1, ..., p_n) with the constraints Σ_i p_i = 1 and p_i ≥ 0.

Now assume we are given historical observations of the demand drawn from its underlying distribution. We use an (n+1)-dimensional vector Ψ = (N_0, N_1, ..., N_n) to denote the number of occurrences of each demand value in the observed data, with total number of observations N = Σ_i N_i. For any given distribution p, the likelihood of the data is

    Π_{i=0}^{n} p_i^{N_i}

and the maximum likelihood estimate (MLE) of the underlying distribution is

    p̄_i = N_i / (N_0 + N_1 + ... + N_n) = N_i / N.

Now we consider a robust version of the newsvendor problem. Generally, a robust version of the newsvendor problem has the form

    min_x  max_{p ∈ H, Σ_i p_i = 1, p_i ≥ 0}  Σ_i p_i [ b(d_i − x)^+ + h(x − d_i)^+ ]        (2)

The inner problem of (2) finds the worst-case distribution for any given decision. Typically, H is a set that restricts the mean and variance of the demand distribution, i.e.,

    H = { p ≥ 0 | µ_1 ≤ Σ_i d_i p_i ≤ µ_2,  v_1 ≤ Σ_i d_i^2 p_i ≤ v_2 }.

However, this approach does not fully exploit the historical data. In our approach, instead of restricting the mean and variance, we guarantee a certain likelihood level for the data. We solve:

    min_x  max_p   Σ_i p_i [ b(d_i − x)^+ + h(x − d_i)^+ ]
    s.t.           Σ_i N_i log p_i ≥ γ                                                    (3)
                   Σ_i p_i = 1
                   p_i ≥ 0

We call the model in (3) the likelihood robust optimization (LRO) model. Note that the first constraint in (3) is equivalent to Π_i p_i^{N_i} ≥ e^γ. Therefore this model minimizes the worst-case cost over the set of all distributions under which the observed data attain a certain level of likelihood. Note that the maximum likelihood of the data is

    γ̄ = Π_i p̄_i^{N_i} = Π_i (N_i / N)^{N_i}.

We can then choose γ in (3) to be log γ̄ − log s in order to admit all distributions achieving at least 1/s of the maximum likelihood. This level is adjustable for different robustness requirements, and we analyze it in detail in Section 3.

Remark 1. There is another way to interpret the likelihood constraint. Subtracting both sides of the likelihood constraint from the constant Σ_i N_i log(N_i/N) and dividing by N, we obtain

    Σ_i (N_i/N) log( (N_i/N) / p_i ) ≤ γ̃,   with  γ̃ = ( Σ_i N_i log(N_i/N) − γ ) / N.

The left-hand side of this inequality is the Kullback-Leibler divergence from the distribution p to the empirical distribution (N_0/N, ..., N_n/N). Therefore, the likelihood constraint is equivalent to restricting the K-L divergence between the underlying distribution and the empirical distribution.
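To make the choice of γ concrete, the following minimal Python sketch (using hypothetical demand counts; numpy is assumed available) computes the counts N_i, the maximum log-likelihood log γ̄, a γ corresponding to 1/s of the maximum likelihood, and verifies the equivalent K-L form of Remark 1 on a candidate distribution.

    import numpy as np

    # Hypothetical demand observations on the support {0, 1, ..., n}.
    demands = np.array([3, 5, 5, 2, 7, 5, 3, 6, 4, 5])
    n = 10

    # Counts N_i and the empirical distribution p_bar.
    N_i = np.bincount(demands, minlength=n + 1).astype(float)
    N = N_i.sum()
    p_bar = N_i / N

    # Maximum log-likelihood log(gamma_bar) = sum_i N_i log(N_i / N), with 0*log 0 := 0.
    obs = N_i > 0
    log_gamma_bar = np.sum(N_i[obs] * np.log(p_bar[obs]))

    # Admit distributions achieving at least 1/s of the maximum likelihood.
    s = 100.0
    gamma = log_gamma_bar - np.log(s)

    # Equivalent K-L form (Remark 1): p is admissible iff KL(empirical || p) <= (log_gamma_bar - gamma)/N.
    def kl_to_empirical(p):
        return np.sum(p_bar[obs] * np.log(p_bar[obs] / p[obs]))

    p_candidate = np.full(n + 1, 1.0 / (n + 1))          # a uniform candidate distribution
    loglik = np.sum(N_i * np.log(p_candidate))
    print(loglik >= gamma, kl_to_empirical(p_candidate) <= (log_gamma_bar - gamma) / N)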

Now we investigate how to solve (3). Note that the inner problem of (3) is a convex program; we can write down its Lagrangian dual:

    min_{λ,µ}   µ + ( Σ_i N_i log N_i − Σ_i N_i − γ ) · λ + Σ_i ( N_i λ log λ − N_i λ log(µ − c_i) )
    s.t.        λ ≥ 0                                                                     (4)
                c_i = b(d_i − x)^+ + h(x − d_i)^+
                µ ≥ c_i

Then, combining the two 'min' operators, we can rewrite (3) as:

    min_{λ,µ,x,y}   µ + ( Σ_i N_i log N_i − Σ_i N_i − γ ) · λ + Σ_i ( N_i λ log λ − N_i λ log(y_i) )
    s.t.            b(d_i − x) + y_i ≤ µ,  ∀i                                             (5)
                    h(x − d_i) + y_i ≤ µ,  ∀i
                    λ ≥ 0,  y ≥ 0

Note that (5) is a convex program with linear inequality constraints. One can solve it efficiently with a standard interior point method [6]; the computational effort is almost the same as for a linear program. Solving (5) yields an optimal solution (x*, λ*, µ*, y_i*). To find the worst-case distribution, one could plug x* back into (3) and solve the inner maximization. However, this is not necessary; we have the following simple method.

Lemma 2. If (x, λ, µ) is optimal for (5), then

    p_i = λ N_i / (µ − c_i)

is the corresponding worst-case distribution, where c_i = b(d_i − x)^+ + h(x − d_i)^+.

Proof. The KKT conditions for the inner problem of (3) are as follows:

    p_i = λ N_i / (µ − c_i)
    Σ_i N_i log p_i ≥ γ                                                                   (6)
    λ · ( Σ_i N_i log p_i − γ ) = 0
    Σ_i p_i = 1
    p_i ≥ 0

Since (λ, µ) is optimal for (5), the claim holds.
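Instead of solving the dual (5), one can also solve the inner maximization of (3) directly for each candidate x with an off-the-shelf convex solver and search over x, since the worst-case cost is convex in x. The sketch below is illustrative only: it assumes cvxpy is available, uses hypothetical counts and cost parameters, and shows how optional mean bounds (linear moment information, as in (7) below) can be appended.

    import numpy as np
    import cvxpy as cp

    def worst_case_cost(x, d, N_i, gamma, b=1.0, h=1.0, mean_bounds=None):
        """Inner problem of (3): worst-case expected cost over the likelihood set."""
        c = b * np.maximum(d - x, 0) + h * np.maximum(x - d, 0)
        p = cp.Variable(len(d), nonneg=True)
        idx = np.where(N_i > 0)[0]
        constraints = [cp.sum(p) == 1,
                       cp.sum(cp.multiply(N_i[idx], cp.log(p[idx]))) >= gamma]
        if mean_bounds is not None:            # optional moment information, as in (7)
            mu1, mu2 = mean_bounds
            constraints += [d @ p >= mu1, d @ p <= mu2]
        prob = cp.Problem(cp.Maximize(c @ p), constraints)
        prob.solve()
        return prob.value, p.value

    # Hypothetical data: support {0, ..., 20}, counts from 30 observations.
    rng = np.random.default_rng(0)
    d = np.arange(21)
    N_i = rng.multinomial(30, np.ones(21) / 21).astype(float)
    log_gamma_bar = np.sum(N_i[N_i > 0] * np.log(N_i[N_i > 0] / N_i.sum()))
    gamma = log_gamma_bar - np.log(100.0)      # admit 1/100 of the maximum likelihood

    # The worst-case cost is convex in x, so a simple search over candidate orders suffices.
    costs = [worst_case_cost(x, d, N_i, gamma)[0] for x in d]
    print("LRO order quantity:", d[int(np.argmin(costs))])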

An important advantage of this approach is that we can easily integrate the mean, variance, and any other information that is linear in p into the model. In that case, (3) can be generalized as follows:

    min_x  max_p   Σ_i p_i [ b(d_i − x)^+ + h(x − d_i)^+ ]
    s.t.           Σ_i N_i log p_i ≥ γ                                                    (7)
                   A p = β
                   p_i ≥ 0

where A p = β collects all the linear constraints on p; these could encode any available moment information. By the same method, we can transform (7) into

    min_{λ,µ,x,y}   β^T µ + ( Σ_i N_i log N_i − Σ_i N_i − γ ) · λ + Σ_i ( N_i λ log λ − N_i λ log(y_i) )
    s.t.            b(d_i − x) + y_i ≤ a_i^T µ,  ∀i                                       (8)
                    h(x − d_i) + y_i ≤ a_i^T µ,  ∀i
                    λ ≥ 0,  y ≥ 0

which is again a convex program and readily solvable. The same method also handles linear inequality constraints on p.

In the next subsection, we introduce a variant of the above model. Instead of imposing the likelihood as a constraint, we put it into the objective; in doing so, we impose only an implicit constraint on the likelihood.

2.2 Likelihood Information as a Barrier Function

In the previous subsection, we restricted the distribution by adding a constraint on the likelihood of the historical data. In this subsection, we show a similar approach that enforces only a 'soft' constraint on the likelihood. Just as barrier functions are used when solving convex programs, we now use the likelihood of the data as a barrier that rules out unlikely distributions. We will show that this approach can be explained from a game-theoretic point of view. We continue to use the notation introduced above. Instead of putting the likelihood information in the constraints, we put it in the objective function:


    min_x  max_p   Σ_i p_i [ b(d_i − x)^+ + h(x − d_i)^+ ] + Ω Σ_i (N_i/N) log p_i
    s.t.           A p = β                                                                (9)
                   e^T p = 1
                   p_i ≥ 0

In this formulation, Ω is the "budget of robustness" ([2], [1]). The smaller Ω is, the more robust the model is. When Ω = 0, the model reduces to the original robust newsvendor problem without taking the likelihood into account. A p = β includes all the available constraints on p, such as the mean, standard deviation, etc. Note that by subtracting the constant Ω Σ_i (N_i/N) log(N_i/N) from the objective function, we obtain the equivalent objective

    min_x  max_p   Σ_i p_i [ b(d_i − x)^+ + h(x − d_i)^+ ] + Ω Σ_i (N_i/N) log( p_i N / N_i )
      = min_x ( max_{p ∈ {p ≥ 0, Σ_i p_i = 1, A p = β}}  Σ_i p_i [ b(d_i − x)^+ + h(x − d_i)^+ ] − Ω D_KL(h || p) )        (10)

where h = (N_0/N, ..., N_n/N) is the empirical distribution and D_KL(h || p) is the Kullback-Leibler divergence from p to h, which is minimized when p = h. Therefore, (10) can be interpreted as choosing x such that the ordering cost is low for a distribution chosen by an adversary. This resembles a two-person game in which the adversary incurs a cost Ω D_KL(h || p) when he chooses a distribution that differs from the observed one. The parameter Ω controls how heavy a penalty is imposed when the adversary deviates from the observed distribution.

Remark 3. It is worth noting that (9) can also be viewed as a risk minimization problem in which a convex risk measure is employed [8]. The inner problem of (9) defines a continuum of convex risk measures parameterized by Ω, which represents the willingness of the newsvendor to accept a probability distribution different from the empirical one.

Like the model in the previous subsection, (9) is easily solvable by taking the dual of the inner maximization and combining the two 'min' operators. We obtain the following dual problem of (9):

    min_{y,x,λ}   λ − Ω Σ_i (N_i/N) log(s_i)
    s.t.          A^T y + λ e − s ≥ c                                                     (11)
                  c_i = b(d_i − x)^+ + h(x − d_i)^+
                  x ≥ 0,  y ≤ 0,  s ≥ 0

To write it in a standard form, we have

    min_{y,x,λ}   λ − Ω Σ_i (N_i/N) log(s_i)
    s.t.          (A^T y + λ e)_i + b · x − t_{i1} = b · d_i                              (12)
                  (A^T y + λ e)_i − h · x − t_{i2} = −h · d_i
                  s_i ≤ t_{i1},  s_i ≤ t_{i2}
                  x ≥ 0,  y ≤ 0,  s, t ≥ 0

To obtain the optimal order quantity, we solve (12) and read off the optimal x.
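For illustration, the inner maximization of (9) can also be solved directly with a convex solver for each candidate x. The following minimal sketch (hypothetical data, cvxpy assumed, no extra constraints A p = β) shows how the order quantity varies with the robustness budget Ω: smaller Ω means a weaker penalty on the adversary and hence a more robust solution.

    import numpy as np
    import cvxpy as cp

    def soft_worst_case(x, d, N_i, Omega, b=1.0, h=1.0):
        """Inner maximization of (9), without additional moment constraints A p = beta."""
        w = N_i / N_i.sum()                               # empirical weights N_i / N
        c = b * np.maximum(d - x, 0) + h * np.maximum(x - d, 0)
        p = cp.Variable(len(d), nonneg=True)
        idx = np.where(N_i > 0)[0]
        objective = c @ p + Omega * cp.sum(cp.multiply(w[idx], cp.log(p[idx])))
        prob = cp.Problem(cp.Maximize(objective), [cp.sum(p) == 1])
        prob.solve()
        return prob.value

    # Hypothetical counts on support {0, ..., 20}.
    rng = np.random.default_rng(0)
    d = np.arange(21)
    N_i = rng.multinomial(50, np.ones(21) / 21).astype(float)
    for Omega in [0.1, 1.0, 10.0]:
        costs = [soft_worst_case(x, d, N_i, Omega) for x in d]
        print("Omega =", Omega, " order quantity =", d[int(np.argmin(costs))])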

3 Statistical Illustration of the Likelihood Robust Optimization

In this section, we study the theory behind the likelihood robust optimization. Specifically, we investigate the statistical meaning of the likelihood robust region and establish the relationship between the LRO and robust optimization with respect to uncertain means. Furthermore, we study the asymptotic behavior of this approach: we show that, asymptotically, the distribution set converges to the true distribution and the rate of convergence is O(N^{-1/2}). These studies give important insights for understanding the LRO and applying it to practical problems.

3.1 Bayesian statistics illustration

Consider a random variable taking values in S = {0, 1, ..., n}. Assume we observe historical data Ψ = (N_0, N_1, ..., N_n), where N_k is the number of occurrences of the event that the random variable takes value k. By a simple calculation, the maximum likelihood estimate of the distribution p = (p_0, p_1, ..., p_n) is given by

    p̄_i = N_i / N,

where N = Σ_i N_i is the total number of observations, and we define

    L* = Π_i (N_i / N)^{N_i}

to be the maximum likelihood value.

Now we want to examine the class of distributions p such that the likelihood of Ψ under p exceeds a certain threshold. We use concepts drawn from Bayesian statistics [10]. That is, instead of thinking of the probability as given and the data as random, we define a random vector X = (p_0, p_1, ..., p_n) taking values on the n-dimensional simplex with probability measure P*, and the probability density of X at a point p = (p_0, p_1, ..., p_n) is proportional to the likelihood

    Π_{i=0}^{n} p_i^{N_i}.

In statistics, this distribution is called the Dirichlet distribution. A Dirichlet distribution with parameter Ψ (denoted by Dir(Ψ)) has the density function

    f(p; Ψ) = (1 / B(Ψ)) Π_{i=0}^{n} p_i^{N_i − 1}.

The normalizing factor is

    B(Ψ) = ∫_{p ∈ ∆} Π_{i=0}^{n} p_i^{N_i − 1} dp_0 dp_1 ... dp_n = Π_{i=0}^{n} Γ(N_i) / Γ( Σ_{i=0}^{n} N_i ),

where ∆ = {p | p_0 + p_1 + ... + p_n = 1, p ≥ 0} is the n-dimensional simplex. The Dirichlet distribution is used to estimate the hidden parameters of a discrete probability distribution given a collection of samples. Intuitively, if the prior is Dir(α), then Dir(α + β) is the posterior after a sequence of observations with histogram β. For a detailed discussion of the Dirichlet distribution, we refer the reader to [10].

In the LRO, we assume that every demand value is equally likely before we observe historical data, i.e., the prior is Dir(e), where e is the vector of all ones. After observing the historical demands Ψ, the posterior is Dir(Ψ + e). Note that this process can be updated adaptively as we observe more and more data.

Now we turn to the likelihood constraint. We proceed by considering the probability P*( Π_{i=0}^{n} p_i^{N_i} ≥ γL* ). By definition, this value is given by the integral

    (1 / B(Ψ + e)) ∫_{ {p ∈ ∆ : Π_{i=0}^{n} p_i^{N_i} ≥ γL*} }  Π_{i=0}^{n} p_i^{N_i} dp_0 dp_1 ... dp_n.

The denominator has the closed form

    B(Ψ + e) = ∫_{p ∈ ∆} Π_{i=0}^{n} p_i^{N_i} dp_0 dp_1 ... dp_n = N_0! N_1! ... N_n! / ( Σ_i N_i + n )!,

and what we need to do is to estimate

    ∫_{p ∈ ∆} Π_{i=0}^{n} p_i^{N_i} · I( Π_{i=0}^{n} p_i^{N_i} ≥ γL* ) dp_0 dp_1 ... dp_n.        (13)

We want to choose the minimum threshold γ such that P*( Π_i p_i^{N_i} ≥ γL* ) ≥ 1 − α, where α is a robustness level determined beforehand. Note that the two extreme cases are α = 1 and α = 0, which

correspond to γ = 1 and γ = 0, respectively. For example, if we choose α = 0.05, then, with the corresponding γ, the probability that the likelihood achieves γL* is greater than 95%. Unfortunately, it is hard to compute this integral explicitly. However, we have the following estimate:

Lemma 4. Assume γ < (1/L*) · α (n + 1)! · B(Ψ + e). Then

    P*( Π_i p_i^{N_i} ≥ γL* ) ≥ 1 − α.

Proof.

    P*( Π_i p_i^{N_i} ≥ γL* )
      = (1 / B(Ψ + e)) ∫_{ {p ∈ ∆ : Π_{i=0}^{n} p_i^{N_i} ≥ γL*} } Π_{i=0}^{n} p_i^{N_i} dp_0 dp_1 ... dp_n
      = (1 / B(Ψ + e)) ( B(Ψ + e) − ∫_{ {p ∈ ∆ : Π_{i=0}^{n} p_i^{N_i} ≤ γL*} } Π_{i=0}^{n} p_i^{N_i} dp_0 dp_1 ... dp_n )        (14)
      ≥ 1 − γL* / ( (n + 1)! B(Ψ + e) )
      ≥ 1 − α.

Letting γ̃ = − log γ, we have

    P*( Σ_i N_i · log p_i ≥ log L* − γ̃ ) ≥ 1 − α.        (15)

However, the above estimate is not tight. In practice, we can use a Monte Carlo method to estimate the integral and determine the corresponding robustness level. We illustrate the above discussion in the following simple example.

Example 1: Consider the demand data shown in Figure 1. The maximum likelihood L* is 3.06 × 10^{-34}. The bound computed by Lemma 4 is 1.20 × 10^{-41}, i.e., γ̃ ≈ 7.4. If we use Monte Carlo sampling with 100,000 samples, the approximate bound is 5.00 × 10^{-39}, i.e., γ̃ ≈ 4.8.

Figure 1: The histogram of demand (number of occurrences for each demand value).
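A minimal sketch of this Monte Carlo calibration is given below (hypothetical counts; numpy assumed): distributions p are sampled from the posterior Dir(Ψ + e), and γ is taken as the α-quantile of the data log-likelihood, which yields γ̃ as in (15).

    import numpy as np

    # Hypothetical demand counts Psi = (N_0, ..., N_n) on support {0, ..., 20}.
    rng = np.random.default_rng(0)
    N_i = rng.multinomial(40, np.ones(21) / 21).astype(float)
    N = N_i.sum()
    alpha = 0.05

    obs = N_i > 0
    log_Lstar = np.sum(N_i[obs] * np.log(N_i[obs] / N))   # log of the maximum likelihood L*

    # Sample distributions p from the posterior Dir(Psi + e) and record the
    # log-likelihood of the data under each sampled p.
    samples = rng.dirichlet(N_i + 1.0, size=100_000)
    loglik = (np.log(samples[:, obs]) * N_i[obs]).sum(axis=1)

    # Choose gamma so that P*( sum_i N_i log p_i >= gamma ) >= 1 - alpha,
    # i.e., gamma is the alpha-quantile of the sampled log-likelihoods.
    gamma = np.quantile(loglik, alpha)
    gamma_tilde = log_Lstar - gamma                        # as in (15)
    print("gamma_tilde =", gamma_tilde)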

3.2 Empirical likelihood illustration

Another way to explain the likelihood constraint is through the idea of empirical likelihood, discussed in [13]. In traditional parametric inference, one usually considers a family of distributions parameterized by η. Given the observed data X_1, X_2, ..., X_N, we compute the maximum likelihood estimate η̂ with likelihood L(η̂). We then reject the hypothesis H_0: η_0 = η if the ratio between L(η) and L(η̂) is too small, and exclude it from the confidence interval for η_0.

In contrast to parametric maximum likelihood inference, empirical likelihood provides statistical inference when one does not know which parametric family the underlying distribution belongs to. Given observed data X_1, X_2, ..., X_N and a statistical hypothesis H_0: X ∼ F, one defines the likelihood ratio in the nonparametric form

    R(F) = L(F) / L(F_N),        (16)

where F_N is the empirical distribution given by

    F_N(t) = (1/N) Σ_{i=1}^{N} I(X_i ≤ t).

Note that the empirical distribution is also the maximum likelihood estimate, so R(F) ≤ 1 for any F. In the same way, we reject H_0 if this ratio is too small and exclude F from the confidence region of the true distribution. The likelihood constraint in (3) says precisely that the likelihood under the candidate distribution should not be too small compared with the empirical likelihood, which coincides with the idea above.

3.3 The relation between the LRO and mean robust optimization

We now continue the above discussion and show an important relationship between the LRO and mean robust optimization, again using the empirical likelihood idea. Generally speaking, suppose we are interested in a parameter θ = T(F) for some functional T of the distribution, where T can be the mean, the variance, etc. We can define the profile likelihood ratio function [13]

    R(θ) = sup{ R(F) | T(F) = θ },        (17)

where R(F) is defined by (16). The empirical likelihood test rejects the hypothesis H_0: T(F) = θ_0 if R(θ_0) < r_0, where r_0 is a predetermined threshold, and the empirical likelihood confidence region is {θ | R(θ) ≥ r_0}.

Specifically, we consider the case where T is the expectation operator, i.e., T(F) = E[F]. In this case, the profile likelihood ratio function is

    R(θ) = sup{ R(F) | E[F] = θ }.        (18)

We have the following univariate empirical likelihood theorem (ELT); see Theorem 2.2 in [13].

Theorem 5. (Univariate ELT) Let X_1, ..., X_N be independent random variables with common distribution F_0. Let µ_0 = E[X_1], and suppose that 0 < Var(X_1) < ∞. Then −2 log R(µ_0) converges in distribution to χ²_(1) as N → ∞.

The relationship between the likelihood ratio and the mean is established by this theorem. In terms of probability, the theorem says that

    Pr( 2 log R(µ_0) ≤ −χ²_{(1), 1−α} ) → α.        (19)

To compute R(µ) for each µ, we solve the following optimization problem:

    z = max   Σ_i N_i log p_i
        s.t.  Σ_i p_i = 1                                                                 (20)
              Σ_i d_i p_i = µ
              p ≥ 0

and let R(µ) = e^z / L*. Then, for each threshold α, we can find the corresponding 1 − α confidence interval for the mean by bisection on µ. In this way, we can interpret the likelihood robust constraint as a mean robust constraint. In fact, from the above discussion, we see that mean robust optimization is a relaxation of the corresponding LRO. This gives important insight for applying the likelihood robust optimization.

To conclude this subsection, we revisit Example 1 and show how the likelihood robust constraint relates to the mean.

Example 1 Revisited: We use the method illustrated above to compute the empirical likelihood confidence region of the mean. We take α = 0.05, so χ²_{(1), 1−α} = 3.84. Therefore we need to find the critical values µ_0 such that

    R(µ_0) = e^{−1.92} ≈ 0.1466.

By solving (20) iteratively, we find that the 95% confidence region for µ is [8.7, 12.2], while the sample mean is 10.5.
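The following sketch (hypothetical counts, cvxpy assumed) computes the profile likelihood ratio R(µ) by solving (20) and recovers an approximate 95% empirical likelihood interval for the mean; a grid search over µ is used here instead of the bisection described above, purely for simplicity.

    import numpy as np
    import cvxpy as cp

    def log_R(mu, d, N_i):
        """log of the profile likelihood ratio R(mu): optimal value of (20) minus log L*."""
        p = cp.Variable(len(d), nonneg=True)
        obs = np.where(N_i > 0)[0]
        obj = cp.sum(cp.multiply(N_i[obs], cp.log(p[obs])))
        prob = cp.Problem(cp.Maximize(obj), [cp.sum(p) == 1, d @ p == mu])
        prob.solve()
        if prob.value is None:
            return -np.inf
        log_Lstar = np.sum(N_i[obs] * np.log(N_i[obs] / N_i.sum()))
        return prob.value - log_Lstar

    # Hypothetical counts on support {0, ..., 20}; 95% interval: -2 log R(mu) <= 3.84.
    rng = np.random.default_rng(1)
    d = np.arange(21)
    N_i = rng.multinomial(30, np.ones(21) / 21).astype(float)
    mus = np.linspace(d[0] + 0.5, d[-1] - 0.5, 200)
    inside = [mu for mu in mus if -2 * log_R(mu, d, N_i) <= 3.84]
    print("approx. 95% EL interval for the mean:", min(inside), max(inside))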

3.4 Asymptotic Behavior

In this subsection, we investigate the asymptotic behavior of the likelihood robust region. Assume that the true distribution of the data is p̄ = (p̄_0, p̄_1, ..., p̄_n) and that the observed counts of each scenario are N_0, N_1, ..., N_n with Σ_i N_i = N. Define the likelihood robust distribution set by {p | Π_{i=0}^{n} p_i^{N_i/N} ≥ γ}, where γ is chosen such that P( Π_{i=0}^{n} p_i^{N_i/N} ≥ γ ) ≥ 1 − α, and define

    γ_N = sup{ γ | P( Π_{i=0}^{n} p_i^{N_i/N} ≥ γ ) ≥ 1 − α }.

We first study the asymptotic behavior of the likelihood level γ_N and then that of the likelihood robust distribution set. We start by showing the following:

1. γ_N converges to Π_{i=0}^{n} p̄_i^{p̄_i} almost surely;

2. γ_N = Ω( Π_{i=0}^{n} p̄_i^{p̄_i} · e^{−C/√N} ).

Formally, we prove:

Proposition 6. With probability 1, the credible region converges to the single point p̄, i.e., for any ε > 0,

    P{ (1/N) Σ_i N_i log p_i ≥ Σ_i p̄_i log p̄_i − ε } → 1,  a.s.

Proposition 7. The convergence rate is of order √N; namely, for any α > 0, there exists C (C < 0) such that

    lim_{N→∞} P{ √N ( (1/N) Σ_i N_i log p_i − Σ_i p̄_i log p̄_i ) ≥ C } ≥ 1 − α,  a.s.

The notation here needs to be treated carefully. In the above two propositions, the probabilities P are themselves random: they change with the observed data and are probabilities over the simplex of distributions, defined as in the Bayesian and empirical likelihood setup above. The almost-sure convergence is with respect to the probability space of the observed data, governed by the underlying true distribution.

We now prove Proposition 6.

Proof. It suffices to prove that for any α < 1,

    lim_{N→∞}  [ ∫ Π_i p_i^{N_i} I( Π_i p_i^{N_i/N} ≥ αL_N ) dp_0 dp_1 ... dp_n ] / [ ∫ Π_i p_i^{N_i} dp_0 dp_1 ... dp_n ] = 1  a.s.,        (21)

where L_N = Π_i (N_i/N)^{N_i/N}. Since L_N → L = Π_i p̄_i^{p̄_i}, the assertion will then follow.

To show (21), it suffices to show that for any α < 1 and ε > 0,

    lim_{N→∞}  [ ∫ Π_i p_i^{N_i} I( Π_i p_i^{N_i/N} ≥ αL_N ) dp_0 dp_1 ... dp_n ] / [ ∫ Π_i p_i^{N_i} dp_0 dp_1 ... dp_n ] ≥ 1 − ε  a.s.        (22)

We divide both the numerator and the denominator of (21) by (αL_N)^N and use A_N to denote the transformed numerator. Then the left-hand side of (22) can be written as

    A_N / ( A_N + ∫ Π_i ( p_i^{N_i/N} / (αL_N) )^N I( Π_i p_i^{N_i/N} < αL_N ) dp_0 dp_1 ... dp_n ).

Note that A_N is greater than ∫ I( Π_i p_i^{N_i/N} ≥ αL_N ) dp_0 dp_1 ... dp_n, which converges to the constant Ā = ∫ I( Π_i p_i^{p̄_i} ≥ αL ) dp_0 dp_1 ... dp_n with probability 1. Therefore, almost surely, there exists M_1 large enough such that A_N > Ā for N > M_1. Using the same argument, with probability 1, we can choose δ sufficiently small such that there exists M_2 with, for N > M_2,

    ∫ Π_i ( p_i^{N_i/N} / (αL_N) )^N I( (1 − δ)αL_N < Π_i p_i^{N_i/N} < αL_N ) dp_0 dp_1 ... dp_n < εĀ/4 < εA_N/4.

We can also take M_3 such that, for N > M_3,

    ∫ Π_i ( p_i^{N_i/N} / (αL_N) )^N I( Π_i p_i^{N_i/N} < (1 − δ)αL_N ) dp_0 dp_1 ... dp_n < (1 − δ)^N / (n + 1)! < εĀ/4 < εA_N/4.

Therefore, with probability 1, (22) holds, and the proposition follows.

Next we prove Proposition 7, via the following two lemmas.

Lemma 8. For any α > 0, there exists C < 0 such that

    lim_{N→∞} P{ √N ( (1/N) Σ_i N_i log(N_i/N) − Σ_i p̄_i log p̄_i ) ≥ C } ≥ 1 − α,  a.s.

Proof. Define f(x) = x log x. Then

    Σ_i √N ( f(N_i/N) − f(p̄_i) ) = Σ_i f'(ξ_i) √N ( N_i/N − p̄_i ).

By the central limit theorem,

    √N ( N_i/N − p̄_i ) ⇒ N( 0, p̄_i (1 − p̄_i) ),

and by the law of large numbers ξ_i → p̄_i. Therefore, there exists C < 0 such that the claim holds.

Lemma 9. For any α > 0 and C < 0,

    lim_{N→∞} P{ √N ( (1/N) Σ_i N_i log p_i − (1/N) Σ_i N_i log(N_i/N) ) ≥ C } = 1,  a.s.

Proof. We have

    √N ( (1/N) Σ_i N_i log p_i − (1/N) Σ_i N_i log(N_i/N) ) = log( Π_i p_i^{N_i/√N} / L_N^{√N} ).

Then it suffices to show that

    log( Π_i p_i^{N_i/√N} / L_N^{√N} ) → 0  a.s.

To show this, we use techniques similar to those used in the proof of Proposition 6. It suffices to show that for any α < 1 and ε > 0,

    lim_{N→∞}  [ ∫ Π_i p_i^{N_i} I( Π_i p_i^{N_i/√N} ≥ αL_N^{√N} ) dp_0 dp_1 ... dp_n ] / [ ∫ Π_i p_i^{N_i} dp_0 dp_1 ... dp_n ] ≥ 1 − ε  a.s.        (23)

We use the same technique. Define

    A_N = ∫ Π_i ( p_i^{N_i/√N} / (αL_N^{√N}) )^{√N} I( Π_i p_i^{N_i/√N} ≥ αL_N^{√N} ) dp_0 dp_1 ... dp_n.

Then the left-hand side of (23) can be written as

    A_N / ( A_N + ∫ Π_i ( p_i^{N_i/√N} / (αL_N^{√N}) )^{√N} I( Π_i p_i^{N_i/√N} < αL_N^{√N} ) dp_0 dp_1 ... dp_n ).

We can again split the second term into

    ∫ Π_i ( p_i^{N_i/√N} / (αL_N^{√N}) )^{√N} [ I( (1 − δ)αL_N^{√N} < Π_i p_i^{N_i/√N} < αL_N^{√N} ) + I( Π_i p_i^{N_i/√N} < (1 − δ)αL_N^{√N} ) ] dp_0 dp_1 ... dp_n.

Each term goes to zero as N goes to infinity, while A_N converges to a positive constant. Therefore (23) holds and the claim follows.

Combining Lemma 8 and Lemma 9, Proposition 7 follows. We further have the following proposition regarding the asymptotic behavior of the likelihood robust region.

Proposition 10. For any p ∈ {p | Π_i p_i^{N_i/N} ≥ γ},

    √N ( p_1 − p̄_1, ..., p_d − p̄_d ) ⇒ N(0, Σ),

where Σ is the covariance matrix of a random variable taking values in {1, 2, ..., d} with probabilities (p̄_1, ..., p̄_d), namely

    Σ_ii = p̄_i (1 − p̄_i)  ∀i,   Σ_ij = −p̄_i p̄_j  ∀i ≠ j.        (24)

Proof. In Lemma 9, we showed that, within the likelihood confidence set,

    √N ( (1/N) Σ_i ( N_i log p_i − N_i log(N_i/N) ) ) → 0  a.s.,

which further implies

    √N ( p_i − N_i/N ) → 0  a.s.,  ∀i.

On the other hand, by the multivariate central limit theorem,

    √N ( N_1/N − p̄_1, ..., N_d/N − p̄_d ) ⇒ N(0, Σ),

where Σ is defined by (24). Therefore,

    √N ( p_1 − p̄_1, ..., p_d − p̄_d ) ⇒ N(0, Σ).

Remark 11. The above propositions show the following behavior of the robust region: almost surely, the robust region converges to a single-point set; moreover, this point is asymptotically Gaussian with mean p̄.
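The following small simulation (hypothetical true distribution, numpy assumed) illustrates, but does not prove, the behavior described above: as N grows, the normalized maximum log-likelihood approaches Σ_i p̄_i log p̄_i at roughly the √N rate, and posterior draws concentrate around p̄.

    import numpy as np

    rng = np.random.default_rng(0)
    pbar = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # hypothetical true distribution
    entropy_term = np.sum(pbar * np.log(pbar))

    for N in [100, 1_000, 10_000, 100_000]:
        N_i = rng.multinomial(N, pbar).astype(float)
        obs = N_i > 0
        # Gap between (1/N) sum_i N_i log(N_i/N) and sum_i pbar_i log pbar_i.
        gap = np.sum(N_i[obs] / N * np.log(N_i[obs] / N)) - entropy_term
        # Root-mean-square distance of posterior draws Dir(Psi + e) from pbar.
        draws = rng.dirichlet(N_i + 1.0, size=2_000)
        spread = np.sqrt(np.mean(np.sum((draws - pbar) ** 2, axis=1)))
        print(f"N={N:>6}  sqrt(N)*gap={np.sqrt(N) * gap:+.3f}  rms dist to pbar={spread:.4f}")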

3.5 The Distribution on the Objective Function

In the previous subsections, we minimized the objective function under the worst-case distribution in the set {p | Σ_i N_i log p_i ≥ γ}. We showed that we can choose γ so that this set contains a 95% empirical likelihood confidence region; in other words, we imposed a chance constraint on the empirical likelihood of the distribution. In fact, a set of observed data directly induces a probability distribution on the quantity of interest, and in what follows we consider the problem from this perspective.

Again, suppose we observe N samples with scenario counts N⃗ = (N_0, N_1, ..., N_n), Σ_i N_i = N. These data define an empirical likelihood probability density on p by

    f(p; N⃗) = (1 / B(N⃗)) Π_{i=0}^{n} p_i^{N_i}.

The objective function of the original problem can be represented as Σ_i c_i p_i, where c_i is a function of x (in the newsvendor example, c_i = b(d_i − x)^+ + h(x − d_i)^+). The observed data then define a distribution F(z) on the objective value by

    F(z) = P(Z ≤ z) = (1 / B(N⃗)) ∫_{p ∈ ∆} I( Σ_i p_i c_i ≤ z ) Π_i p_i^{N_i} dp_0 dp_1 ... dp_n,        (25)

where B(N⃗) is the normalizing factor. Based on this observation, we can optimize the objective value directly.

1. Minimize the Expected Value. The expected value of the objective under the distribution defined by (25) is

    EZ = (1 / B(N⃗)) ∫_{p ∈ ∆} ( Σ_{i=1}^{d} p_i c_i ) Π_{i=1}^{d} p_i^{N_i} dp_0 dp_1 ... dp_n = (1 / (N + n − 1)) Σ_i N_i c_i.        (26)

Therefore, taking the expectation is equivalent (up to a constant factor) to simply using the empirical distribution as the true distribution. This analysis supplies another interpretation for using the empirical distribution (sampling) in stochastic optimization.

2. Minimize a Percentile. Besides the expected value, another widely used criterion is to minimize a percentile. For example, we can minimize the 95th percentile of this distribution:

    minimize_{x,β}   β
    s.t.             (1 / B(N⃗)) ∫_{p ∈ ∆} I( Σ_i p_i c_i ≤ β ) Π_i p_i^{N_i} dp_0 dp_1 ... dp_n ≥ 0.95        (27)
                     x ∈ X

It is not yet clear whether (27) is a convex program. However, we can show that this percentile optimization problem has a nice asymptotic behavior.

Proposition 12. With probability 1,

    Nβ / ( Σ_i N_i c_i ) → 1   as N → ∞.

Remark: This proposition simply says that, asymptotically, the percentile optimization problem is equivalent to the expectation optimization problem. This is because the probability distribution defined in (25) places very dense mass around the empirical probability.

Proof. It suffices to show that for any ε > 0,

    (1 / B(N⃗)) ∫_{p ∈ ∆} I( (1 − ε) Σ_i (N_i/N) c_i ≤ Σ_i p_i c_i ≤ (1 + ε) Σ_i (N_i/N) c_i ) Π_i p_i^{N_i} dp_0 dp_1 ... dp_n → 1  a.s.

To show this, we consider

    ∫_{p ∈ ∆} I( Σ_i p_i c_i ≥ (1 + ε) Σ_i (N_i/N) c_i ) Π_i p_i^{N_i} dp_0 dp_1 ... dp_n.

If Σ_i p_i c_i ≥ (1 + ε) Σ_i (N_i/N) c_i, then, since c_i is bounded (we assume the demand distribution is bounded), there must exist some i such that p_i ≥ (1 + ε) N_i/N. Therefore, with probability 1, there exists η < 1 such that

    { p | Σ_i p_i c_i ≥ (1 + ε) Σ_i (N_i/N) c_i } ⊂ { p | Π_i p_i^{N_i} < η^N Π_i (N_i/N)^{N_i} }   for N sufficiently large,

since Π_i p_i^{N_i} has its unique maximizer at p_i = N_i/N and N_i/N → p̄_i a.s. However, as shown in the proof of Proposition 6, for any η < 1,

    lim_{N→∞} (1 / B(N⃗)) ∫ Π_i p_i^{N_i} I( Π_i p_i^{N_i/N} ≤ η L_N ) dp_0 dp_1 ... dp_n = 0  a.s.

Therefore,

    (1 / B(N⃗)) ∫_{p ∈ ∆} I( Σ_i p_i c_i ≥ (1 + ε) Σ_i (N_i/N) c_i ) Π_i p_i^{N_i} dp_0 dp_1 ... dp_n → 0  a.s.

The same method shows that

    (1 / B(N⃗)) ∫_{p ∈ ∆} I( Σ_i p_i c_i ≤ (1 − ε) Σ_i (N_i/N) c_i ) Π_i p_i^{N_i} dp_0 dp_1 ... dp_n → 0  a.s.

Therefore, the proposition holds.
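A quick Monte Carlo illustration of (27) and Proposition 12 (hypothetical data, numpy assumed): for a fixed x, we estimate the 95th percentile β of Z = Σ_i p_i c_i under Dirichlet draws and watch the ratio Nβ / Σ_i N_i c_i approach 1 as N grows.

    import numpy as np

    rng = np.random.default_rng(0)
    b, h, x = 1.0, 1.0, 9
    d = np.arange(21)
    c = b * np.maximum(d - x, 0) + h * np.maximum(x - d, 0)

    for N in [50, 500, 5_000]:
        N_i = rng.multinomial(N, np.ones(21) / 21).astype(float)
        draws = rng.dirichlet(N_i + 1.0, size=50_000)     # p sampled as in (25)
        Z = draws @ c
        beta = np.percentile(Z, 95)                        # inner value of (27) for this x
        print(f"N={N:>5}  beta={beta:.3f}  N*beta/sum(N_i*c_i)={N * beta / (N_i @ c):.3f}")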

4 Likelihood Robust Optimization for Multi-Item Newsvendor Problems

In this section, we generalize the LRO to the multi-item newsvendor problem. The multi-item newsvendor problem has not been treated as thoroughly as the single-item problem, due to the intractability of high-dimensional distributions and the 'curse of dimensionality'. In the multi-item newsvendor problem, the decision variable is a K-dimensional vector x = (x_1, x_2, ..., x_K). Each component of x is the order quantity of one item and is usually subject to some linear budget constraints. Without loss of generality, we assume that the demand for each item varies from 1 to n, i.e., the demand vector d lies in the set Ξ = {1, ..., n}^K. Thus, there are n^K possible values in Ξ. If we assign each of them a probability, the decision space becomes very large. However, we will see below that, if we apply the LRO with a polynomial-size data set, the problem remains polynomial in size.

In the sequel, we first extend the original LRO model to this case and show that the problem is still highly tractable. We then show another approach that uses the marginal distribution of each item as intermediate variables and applies robust optimization with respect to the marginal probabilities. We show that both approaches lead to the same formulation. This LRO approach uses only the historical data and thus preserves the potential relationships (covariances) between different items. In the following, we use σ to denote a demand profile, δ_σ to denote the number of times profile σ appears in the historical data, and N to denote the total number of historical observations.

Recalling the model (3) for the single-item case, a direct extension of (3) is

    min_x  max_p   Σ_σ p_σ · Σ_{i=1}^{K} [ b(d_{σ(i)} − x_i)^+ + h(x_i − d_{σ(i)})^+ ]
    s.t.           Σ_σ δ_σ log p_σ ≥ γ                                                     (28)
                   Σ_σ p_σ = 1
                   p_σ ≥ 0
                   A x ≤ β

In (28), p_σ is the probability assigned to the demand profile σ, σ(i) is the demand of the i-th item in profile σ, and x_i is the decision variable for the i-th item. The first constraint of (28) is exactly the extension of the likelihood constraint introduced before. The last constraint collects all the potential constraints on x, such as a limit on the total budget or certain quantity requirements. Taking the dual of the inner problem, we get:

    min_{λ,µ}   µ + ( Σ_σ δ_σ log δ_σ − Σ_σ δ_σ − γ ) · λ + Σ_σ ( δ_σ λ log λ − δ_σ λ log(µ − c_σ) )
    s.t.        λ ≥ 0                                                                      (29)
                c_σ = Σ_{i=1}^{K} ( b(d_{σ(i)} − x_i)^+ + h(x_i − d_{σ(i)})^+ )
                µ ≥ c_σ

Writing out c_σ explicitly and combining the two 'min' operators, we can rewrite (28) as follows:

    min_{λ,µ,x,y,c}   µ + ( Σ_σ δ_σ log δ_σ − Σ_σ δ_σ − γ ) · λ + Σ_σ ( δ_σ λ log λ − δ_σ λ log(y_σ) )
    s.t.              µ ≥ y_σ + Σ_{i=1}^{K} c_{σi},  ∀σ
                      c_{σi} ≥ b(d_{σ(i)} − x_i),  ∀σ, i = 1, 2, ..., K                     (30)
                      c_{σi} ≥ h(x_i − d_{σ(i)}),  ∀σ, i = 1, 2, ..., K
                      λ ≥ 0,  y ≥ 0
                      A x ≤ β

It appears that there are exponentially many variables in this model, since the number of profiles σ is exponential. An important observation is that only the σ's with δ_σ > 0 contribute to the objective function; we define this set to be Ψ. For σ's not in Ψ, the corresponding terms are automatically 0. Also, although the constraints µ ≥ c_σ, ∀σ appear to be exponentially many, for each x they are equivalent to

    µ ≥ Σ_{j=1}^{K} c_j
    c_j ≥ b(n − x_j),  j = 1, 2, ..., K                                                    (31)
    c_j ≥ h(x_j − 1),  j = 1, 2, ..., K

Therefore, (30) is equivalent to

    min_{λ,µ,x,y,c}   µ + ( Σ_{σ∈Ψ} δ_σ log δ_σ − Σ_{σ∈Ψ} δ_σ − γ ) · λ + Σ_{σ∈Ψ} ( δ_σ λ log λ − δ_σ λ log(y_σ) )
    s.t.              µ ≥ y_σ + Σ_{i=1}^{K} c_{σi},  ∀σ ∈ Ψ
                      c_{σi} ≥ b(d_{σ(i)} − x_i),  ∀σ ∈ Ψ, i = 1, 2, ..., K
                      c_{σi} ≥ h(x_i − d_{σ(i)}),  ∀σ ∈ Ψ, i = 1, 2, ..., K
                      µ ≥ Σ_{j=1}^{K} c_j                                                  (32)
                      c_j ≥ b(n − x_j),  j = 1, 2, ..., K
                      c_j ≥ h(x_j − 1),  j = 1, 2, ..., K
                      λ ≥ 0,  y ≥ 0
                      A x ≤ β

Note that |Ψ| ≤ N, so (32) is a convex program with at most NK + N + 2K + 2 variables. The program is easily solvable for medium-sized instances and shares the same properties as the single-item case in Section 2.

Next we show another approach, which leads to the same model as above. Here we use the marginal probabilities as intermediate variables and consider the robustness of the model with respect to the marginal probabilities. We use p_ij to denote the marginal probability of the demand of each item, i.e.,

    p_ij = P{ the demand of item i is j }.

Note that p_ij has the following relationship with the p_σ defined above:

    p_ij = Σ_{σ ∈ Ξ, σ(i) = j} p_σ,   ∀ i = 1, 2, ..., K,  j = 1, 2, ..., n.

The key observation is that, for the multi-item newsvendor problem, the objective function is linear in the marginal probabilities p_ij. Therefore, we can use the marginal probabilities as intermediate variables connecting the historical data and the underlying distribution. Following the idea of the LRO, we minimize the expected cost while maintaining a certain level of 'likelihood' of the marginal probabilities, where this 'likelihood' is defined using the profile likelihood tool introduced in Section 3.3. In this case, the parameters in (17) are the marginal probabilities, and the profile likelihood ratio is defined by

    R({p_ij}) = sup{ R(F) | the marginal probabilities of F are {p_ij} },        (33)

where

    R(F) = Π_σ p_σ^{δ_σ} / L*,        (34)

and L* is the maximum likelihood, which is a constant. We then define the following likelihood robust optimization for the multi-item newsvendor problem:

    min_x  max_p   Σ_{i=1}^{K} Σ_{j=1}^{n} p_ij · [ b(j − x_i)^+ + h(x_i − j)^+ ]
    s.t.           R({p_ij}) ≥ γ̃                                                           (35)
                   Σ_j p_ij = 1,  ∀i
                   p ≥ 0
                   A x ≤ β

Substituting the first constraint using (34), we obtain the following equivalent form of (35):

    min_x  max_p   Σ_{i=1}^{K} Σ_{j=1}^{n} p_ij · [ b(j − x_i)^+ + h(x_i − j)^+ ]
    s.t.           Σ_σ δ_σ log p_σ ≥ γ
                   Σ_{σ ∈ Ξ, σ(i) = j} p_σ ≤ p_ij                                          (36)
                   Σ_j p_ij = 1,  ∀i
                   p ≥ 0
                   A x ≤ β

where γ = log γ̃ + log L*. Note that in (36) the top two constraints together are automatically equivalent to R({p_ij}) ≥ γ̃. By the same observation as in (30), we only need to consider the variables corresponding to the observed scenarios, so we can simplify (36) to the following:

    min_x  max_p   Σ_{i=1}^{K} Σ_{j=1}^{n} p_ij · [ b(j − x_i)^+ + h(x_i − j)^+ ]
    s.t.           Σ_{σ ∈ Ψ} δ_σ log p_σ ≥ γ
                   Σ_{σ ∈ Ψ, σ(i) = j} p_σ ≤ p_ij                                          (37)
                   Σ_j p_ij = 1,  ∀i
                   p ≥ 0
                   A x ≤ β

Now we can see that (37) is essentially the same as (30). This means that we have two interpretations of the LRO approach in the multi-item case: one is the direct extension of the LRO from the single-item case, and the other goes through the marginal probabilities as intermediaries. The above model also shares the same advantage as the single-item model: we can easily integrate mean, variance, or covariance information. In fact, we can add any linear constraints that involve only polynomially many p_σ while maintaining the tractability of the problem.
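As a small illustration of why the reduced programs (32) and (37) stay polynomial in size, the sketch below (hypothetical multi-item history; numpy assumed) builds the observed-profile set Ψ, the counts δ_σ, and the per-profile costs c_σ(x) that enter those programs.

    import numpy as np
    from collections import Counter

    # Hypothetical multi-item demand history (rows = observations, columns = items).
    history = np.array([[3, 7, 2],
                        [3, 7, 2],
                        [5, 1, 4],
                        [2, 7, 2],
                        [5, 1, 4]])           # N = 5 observations, K = 3 items

    delta = Counter(map(tuple, history))      # delta[sigma] = number of occurrences of profile sigma
    Psi = list(delta.keys())
    print("|Psi| =", len(Psi), " counts:", dict(delta))   # |Psi| <= N

    # Per-profile costs c_sigma(x) = sum_i [ b(d_sigma(i) - x_i)^+ + h(x_i - d_sigma(i))^+ ].
    b, h = 1.0, 1.0
    x = np.array([4, 5, 3])
    for sigma in Psi:
        d = np.array(sigma)
        c_sigma = np.sum(b * np.maximum(d - x, 0) + h * np.maximum(x - d, 0))
        print(sigma, "->", c_sigma)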

5 Likelihood Robust Optimization in Continuous State Space

In the above discussion, we assumed that the probability space is discrete. In this section, we extend the approach to a continuous state space. It is tempting to extend the previous likelihood region directly to the continuous case; however, the likelihood constraint would then be defined on finitely many points of the probability density function and would therefore place no restriction on the distribution at all. This suggests that we should not use the density function to define the likelihood region. In what follows, we propose an empirical likelihood band approach for the continuous case: we define the likelihood region by restricting certain points of the cumulative distribution function, i.e., by finding a band for the cumulative distribution function. Then, by applying a duality theorem for semi-infinite programming, we obtain a linear program for the robust problem.

First, we introduce the empirical likelihood band. Define the profile likelihood by [13]

    R(p, q) = max{ Π_{i=0}^{d+1} n w_i  |  Σ_{i=0}^{d+1} w_i Z_i(p, q) = 0,  w_i ≥ 0,  Σ_{i=0}^{d+1} w_i = 1 },

with Z_i(p, q) = 1{X_i ≤ q} − p, X_0 = −∞, X_{d+1} = ∞. Intuitively, R(p, q) is the maximum likelihood ratio of having F(q) = p. Then, by [13], the band (L(x), U(x)) with

    L(x) = min{ p | − log R(p, x) ≤ c_n^{1−α} }
    U(x) = max{ p | − log R(p, x) ≤ c_n^{1−α} }        (38)

is a 100(1 − α)% empirical confidence band for F(x). Using this method, we can construct a set of constraints on the cumulative distribution F(x_i). The constraints are

    0 ≤ F(x_i) ≤ F(x_{i+1}) ≤ 1   for i = 0, 1, ..., d
    L(x_i) ≤ F(x_i) ≤ U(x_i)      for i = 0, 1, ..., d + 1        (39)

where L(x_i) and U(x_i) are computable [13]. For simplicity, we denote L(x_i) and U(x_i) by L_i and U_i, respectively, in what follows. Now the robust program becomes:

    min_{x ∈ X}  max_F   E_F( h(x, ξ) )
                 s.t.    L_i ≤ F(x_i) ≤ U_i,   ∀i        (40)

We can write this as a semi-infinite program. Assume f(·) is the density function; then (40) can be rewritten as

    min_{x ∈ X}  max    ∫_{−∞}^{∞} h(x, ξ) f(ξ) dξ
                 s.t.   L_i ≤ ∫_{−∞}^{x_i} f(ξ) dξ ≤ U_i,   ∀i        (41)
                        ∫_{−∞}^{∞} f(ξ) dξ = 1

By the duality theorem, the dual of the inner problem can be written as

    min   Σ_i y_i U_i − Σ_i z_i L_i + λ
    s.t.  h(x, ξ) − Σ_{i=1}^{k} y_i + Σ_{i=1}^{k} z_i − λ ≤ 0,   ∀ξ ∈ [x_{k−1}, x_k],  k = 1, 2, ..., d + 1        (42)
          y_i ≥ 0,  z_i ≥ 0,  λ ≥ 0

However, the first constraint in (42) simply means that

    sup_{ξ ∈ [x_{k−1}, x_k]} h(x, ξ) ≤ Σ_{i=1}^{k} y_i − Σ_{i=1}^{k} z_i + λ.        (43)

For the newsvendor problem, h(x, ξ) is simply b(x − ξ)^+ + h(ξ − x)^+, hence (43) is equivalent to

    b(x − x_{k−1}) ≤ Σ_{i=1}^{k} y_i − Σ_{i=1}^{k} z_i + λ
    h(x_k − x)     ≤ Σ_{i=1}^{k} y_i − Σ_{i=1}^{k} z_i + λ        (44)

Therefore, in this case, the likelihood robust problem can be written as

    min_{x, y, z, λ}   Σ_i y_i U_i − Σ_i z_i L_i + λ
    s.t.               b(x − x_{k−1}) ≤ Σ_{i=1}^{k} y_i − Σ_{i=1}^{k} z_i + λ,   ∀k
                       h(x_k − x)     ≤ Σ_{i=1}^{k} y_i − Σ_{i=1}^{k} z_i + λ,   ∀k        (45)
                       y_i ≥ 0,  z_i ≥ 0,  λ ≥ 0
                       x ∈ X

This is a linear program, which is very easy to solve.


Figure 2: The histogram of demand

6 Numerical Results

In this section, we provide numerical results to illustrate our LRO model. We test the performance of our approach on the benchmark problem introduced by Silver and Peterson [16], which has been cited and compared in many papers [9, 20, 24]. In this problem, the unit production cost is $35.10, the unit selling price is $50.30, and the unit salvage value is $25.00. After a proper transformation, the problem is equivalent to problem (1) with underage and overage costs b = $15.20 and h = $10.10, respectively. In the original problem, the mean and the standard deviation of the demand are assumed to be 900 and 122, respectively. We generate 100 historical observations from N(900, 122²) for our LRO approach; the data are shown in Figure 2. We assume that the demand ranges between 600 and 1200, which covers about 99% of the normal distribution's mass. Our LRO approach can be used either with or without assuming that the mean and the standard deviation are known; we consider both cases separately below.

First, based on these data, we determine the likelihood level such that the probability that the data achieve this level exceeds 1 − α. We choose α = 0.05. Using the Monte Carlo method described in Section 3, we find that γ̃ in (15) is approximately 205.
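For reference, the cost transformation and the two baseline order quantities can be reproduced with the short sketch below (numpy assumed; the demand sample is synthetic). The closed-form Scarf quantity is stated here as commonly given in [9] and, with these parameters, comes out near 925, consistent with q*_s in Table 1.

    import numpy as np

    # Cost transformation: underage cost b = price - cost, overage cost h = cost - salvage.
    price, cost, salvage = 50.30, 35.10, 25.00
    b = price - cost        # 15.20
    h = cost - salvage      # 10.10

    # 100 synthetic historical demands from N(900, 122^2), truncated to [600, 1200].
    rng = np.random.default_rng(0)
    demand = np.clip(rng.normal(900, 122, size=100), 600, 1200)

    # Sample-average (gamma_tilde = 0) solution: the empirical critical-fractile quantile.
    q_data = np.quantile(demand, b / (b + h))

    # Scarf's closed-form order quantity for known mean/std (formula as given in [9];
    # treated as an assumption of this sketch).
    mu, sigma = 900.0, 122.0
    q_scarf = mu + 0.5 * sigma * (np.sqrt(b / h) - np.sqrt(h / b))

    print(f"b={b:.2f}, h={h:.2f}, empirical-fractile order={q_data:.1f}, Scarf order={q_scarf:.1f}")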

27

Figure 3: Comparison between the LRO with γ̃ = 205 and Scarf's model with fixed mean and variance (worst-case expected cost vs. order quantity).

Table 1: Comparison between the LRO and Scarf's model when the mean and variance are known

                    quantity    worst-case cost    G·(q*_s)
    q*_{LRO,205}    924         1495.16            1495.22
    q*_{LRO,100}    922         1467.09            1467.51
    q*_{LRO,50}     921         1427.20            1427.91
    q*_{LRO,10}     920         1333.10            1334.13
    q*_{LRO,0}      928         1209.37            1210.19
    q*_s            925         1511.62            1511.62
    q*_N            931         1192               1193.34

Note that this bound is intuitively rather loose: it says that we admit any distribution under which the data achieve at least e^{−205} of the maximum likelihood. However, we will see below that even such a loose constraint reduces the worst-case cost somewhat. We will also show how the solutions change with the chosen likelihood level. First, we assume that the exact mean and variance are known. In this case, the only difference between the LRO and Scarf's model is that the LRO adds a likelihood constraint; therefore, the LRO model is less conservative. The comparison between Scarf's model and the LRO is shown in Figure 3. We then study the effect of different likelihood levels by testing values of γ̃ from 0 to 205. The optimal solutions, with their objective values under the worst-case distribution, are shown in Table 1.


In Table 1, q*_{LRO,β} denotes the optimal solution of the LRO with likelihood level γ̃ = β, while q*_s and q*_N denote the optimal solution under Scarf's rule and the solution when the underlying distribution is normal, respectively. The column G·(q*_s) is the expected cost of using Scarf's decision under the corresponding model. From Table 1, we see that in this case the optimal LRO solution under the given mean and variance is not far from the optimal solution of Scarf's rule. However, the worst-case cost decreases as γ̃ decreases, and it is always smaller than under Scarf's rule. This is because the smaller γ̃ is, the smaller the admissible distribution set is; when γ̃ = 0, the model becomes the pure data-driven approach, i.e., the sample average approach, in which we assume the true distribution is exactly the empirical distribution and choose the best decision accordingly. Therefore, the LRO is a relaxation of the pure data-driven approach. On the other hand, Scarf's model corresponds to removing the likelihood constraint altogether (γ̃ → ∞), so the LRO is a tightening of Scarf's model. Hence the LRO lies between these two approaches, and since it rules out unrealistic distributions while maintaining robustness, it helps decision makers better estimate the worst-case profit.

Now we look at the worst-case distribution in each case with order quantity 925. The probability mass functions are plotted in Figure 4; the first panel is for Scarf's model and the last shows the empirical distribution of the data. The smaller γ̃ is, the less freedom the distribution has. Therefore, as γ̃ becomes larger, the worst-case distribution becomes more like a two-point distribution, while as γ̃ becomes smaller, the worst-case distribution becomes closer to the empirical distribution of the observed data. From these figures, one can also see that it is reasonable to use a γ̃ considerably smaller than 205.

Next we consider the case where we do not assume knowledge of the mean and the variance. We use the same set of γ̃ values as above; the results are shown in Table 2. From Table 2, we see that, without assuming ranges for µ and σ, the model is very conservative. This is because, in the worst case, the mean and the standard deviation can be far from the observed ones. This is appropriate when one really wants to protect against potential losses. If one wants to avoid this over-conservativeness, one can either tighten the likelihood level by decreasing γ̃ or specify the mean and variance as we did above; the proper choice depends on the required level of robustness.

2. Problems with non-normal data

Figure 4: Worst-case distributions for different likelihood levels (panels: Scarf's model, γ̃ = 205, 50, 10, 2, 0; probability vs. demand).

Table 2: Comparison between the LRO and Scarf's model when the mean and variance are not known

                    quantity    worst-case cost    worst-case profit    G·(q*_s)
    q*_{LRO,205}    956         3310.61            10369                3708.39
    q*_{LRO,100}    957         2846.38            10834                3121.04
    q*_{LRO,50}     956         2353.05            11327                2506.57
    q*_{LRO,10}     946         1664.27            12016                1689.63
    q*_{LRO,2}      936         1399.71            12280                1406.37
    q*_{LRO,0}      928         1209.37            12471                1210.19
    q*_s            925         1511.62            12168                1511.62
    q*_N            931         1192               12488                1193.34


Figure 5: The histogram of demand for the non-normal example (number of occurrences vs. demand in units).

In the above example, the likelihood robust solution is very close to the solution given by Scarf's model when the mean and variance are given. This is mainly because the underlying distribution is close to normal. The following example shows that, if the underlying distribution is not normal, the two approaches can indeed differ. In this example, the unit production cost is $6, the unit selling price is $5, and the unit salvage value is $4; after a proper transformation, the problem is equivalent to problem (1) with underage and overage costs both equal to $1. We assume the underlying distribution is close to an exponential distribution with parameter λ = 20, and we have only 10 historical observations; the data histogram is shown in Figure 5. By Scarf's rule, the optimal order quantity equals the mean of the data when b = h (regardless of how the historical data are distributed); in this case, Scarf's optimal order quantity is 18.7. However, if we compute the 95% confidence LRO solution, we find that the optimal order quantity is 11, as shown in Figure 6. This solution is much closer to the solution obtained by using the exponential distribution as the underlying demand distribution. In this problem, the LRO shows its advantage by optimizing robustly with respect to the historical data.


Figure 6: Worst-case cost vs. order quantity for the non-normal example.

7 Conclusion

In this paper, we introduced a new robust optimization approach, likelihood robust optimization, and applied it to the newsvendor problem. This approach exploits the historical data and is robust with respect to the distributions that plausibly support the observed data. We showed that the approach is tractable and gave a statistical justification of the likelihood constraint based on Bayesian statistics and empirical likelihood theory. We also showed that the approach extends to the multi-item newsvendor problem, and we compared its performance with other approaches. Beyond the newsvendor application, the methodology itself could be of independent interest. Future research includes deeper theory for this approach as well as further applications; in fact, we are currently applying this method to the portfolio selection problem and have obtained some preliminary results. The relationship between this model and other robust approaches is also an appealing direction.

8 Acknowledgement

The authors thank Erick Delage, Dongdong Ge, and Zhisu Zhu for valuable insights and discussions.

References

[1] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1-13, 1999.
[2] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52:35-53, 2004.
[3] D. Bertsimas and A. Thiele. Robust and data-driven optimization: modern decision-making under uncertainty. Technical report, Massachusetts Institute of Technology, 2006.
[4] D. Bertsimas and A. Thiele. A data-driven approach to newsvendor problems. Technical report, Massachusetts Institute of Technology, 2006.
[5] D. Bertsimas and A. Thiele. A robust optimization approach to inventory theory. Operations Research, 54(1):150-168, 2006.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. To appear in Operations Research, 2008.
[8] H. Follmer and A. Schied. Convex measures of risk and trading constraints. Finance and Stochastics, 6, 2002.
[9] G. Gallego and I. Moon. The distribution free newsboy problem: review and extensions. Journal of the Operational Research Society, 44:825-834, 1993.
[10] A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, 1995.
[11] M. Hariga and M. Ben-Daya. Some stochastic inventory models with deterministic variable lead time. European Journal of Operational Research, 113:42-51, 1999.
[12] M. Khouja. The single period newsvendor problem: literature review and suggestions for future research. Omega: International Journal of Management Science, 27:537-553, 1999.
[13] A. B. Owen. Empirical Likelihood. Monographs on Statistics and Applied Probability, Chapman & Hall, 2001.
[14] G. Perakis and G. Roels. Regret in the newsvendor model with partial information. Operations Research, 56:188-203, 2008.
[15] K. Shang and J. S. Song. Newsvendor bounds and heuristic for optimal policies in serial supply chains. Management Science, 49:618-638, 2002.
[16] E. Silver and R. Peterson. Decision Systems for Inventory Management and Production Planning. 2nd ed., John Wiley, New York, 1985.
[17] H. Scarf. A min-max solution of an inventory problem. In Studies in The Mathematical Theory of Inventory and Production, 201-209, Stanford University Press, Stanford, 1958.
[18] H. Scarf. Bayes solutions of the statistical inventory problem. Annals of Mathematical Statistics, 30(2):490-508, 1959.
[19] G. L. Vairaktarakis. Robust multi-item newsboy models with a budget constraint. International Journal of Production Economics, 66:213-226, 2000.
[20] J. Yue, B. Chen and M. Wang. Expected value of distribution information for the newsvendor problem. Operations Research, 54:1128-1136, 2006.
[21] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53:780-798, 2005.
[22] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30:257-280, 2005.
[23] Y. S. Zheng. On properties of stochastic inventory systems. Management Science, 38:87-103, 1992.
[24] Z. Zhu, J. Zhang and Y. Ye. Newsvendor optimization with limited distribution information. Technical report, 2006.

