THE LIDSKII THEOREM FOR THE DIRECTIONAL DERIVATIVES OF THE EIGENVALUES OF SYMMETRIC MATRICES

A Thesis Presented to The Faculty of Graduate Studies of The University of Guelph

by DAVID E. FIELDHOUSE

In partial fulfilment of requirements for the degree of Master of Science December, 2006

© David E. Fieldhouse, 2006

ABSTRACT

THE LIDSKII THEOREM FOR THE DIRECTIONAL DERIVATIVES OF THE EIGENVALUES OF SYMMETRIC MATRICES

David E. Fieldhouse University of Guelph, 2006

Advisor(s): Dr. H.S. Sendov Dr. H. Bauschke

Matrix perturbation theory is a topic that continues to receive considerable attention in the mathematical literature. We present much of the classic symmetric matrix perturbation theory. Our interest lies in a majorization theorem by Lidskii (1950) relating the eigenvalues of a sum of symmetric matrices to those of its summands, which remains a predominant perturbation tool. We present three proofs of Lidskii's theorem, each of which utilizes a different result: Wielandt's minimax principle, the Courant-Fischer theorem, and the Lebourg mean value theorem as presented by Lewis (1999). We give seven equivalent statements of majorization and outline classical eigenvalue relationships between symmetric matrices. Hiriart-Urruty and Ye (1995) explicitly state the directional derivative of the eigenvalue function. We offer a simple generalization of Schur's theorem. Without the formula we show that partition majorization holds over blocks of equal eigenvalues under two conditions. Finally, using the formula, we show that the result holds without any condition.

Acknowledgments

I will reflect back on this period of my life with a deep sense of pride. Never have I been challenged the way I was in this program. It is only when one overcomes obstacles that one truly appreciates the journey. I chose my graduate degree in mathematics in order to work under Dr. Hristo S. Sendov. Dr. Sendov, you were an excellent advisor with tremendous energy and passion. Additionally, you have been a phenomenal mentor. Each week you taught me something new. Professionally this consisted of background material I was lacking, new theory, and how to organize my thoughts and work. I hope your lessons about writing well, working hard, and being disciplined last me forever. It was an honour to work under you. To my co-advisor Dr. Heinz Bauschke, I greatly appreciate your comments and suggestions. You were extremely thorough in your readings and your comments were valuable and insightful. You are also an excellent teacher. If it were not for my success and enjoyment in your Linear Algebra classes I would not have chosen to pursue my B.A. in Mathematics. Dr. Pal Fischer, I want to thank you for serving on my committee. I am very appreciative of your comments and your support. You are one of the nicest and most knowledgeable men I know. I am thankful to the department for funding me and to the secretaries for handling my administrative concerns. To my friends, you have all contributed greatly to my development and I have learned from each of you. Among others, I deeply appreciate the support of Becky, Bryan, Chris, Daniella, Justin, Kei, Kristen, Mike, Nathan, Nina, Ryan, and Sarah. I look forward to seeing more of you than I have recently.

Chantale, you are wonderful to be with and have been so supportive and understanding in this process. I am lucky to have you. To my family: Robert, your best success is still to come and I look forward to enjoying it with you. Mom, you keep the household together, which is no easy task with two sons in master's programs. Dad, I have never appreciated your accomplishments as much as I do now. Your encouragement over the years has been most valuable. I could not ask for more in a father.


Table of Contents

1 Introduction

2 Linear operators
   2.1 Linear maps and matrix representations
   2.2 Eigenvalues and eigenvectors of linear operators
   2.3 Inner-product spaces and orthonormal sets
   2.4 The adjoint of a linear map
   2.5 Self-adjoint operators
   2.6 Orthogonal operators
   2.7 The trace of an operator
   2.8 Compressions

3 Convex and nonsmooth analysis
   3.1 Convex sets
   3.2 Sublinear functions
   3.3 Lipschitz continuity of convex functions
   3.4 The Clarke subdifferential
   3.5 The Michel-Penot directional derivative

4 Majorization
   4.1 Basic definitions
   4.2 Majorizations of vectors

5 Theorems of Courant-Fischer, Schur, Fan & Lidskii
   5.1 Inequalities of Courant-Fischer, Schur & Fan
   5.2 Lidskii's theorem via Wielandt's minimax principle
   5.3 Lidskii's theorem via the Courant-Fischer theorem

6 Spectral functions
   6.1 Derivatives of spectral functions
   6.2 Lidskii's theorem via nonsmooth analysis

7 Lidskii theorem for spectral directional derivatives
   7.1 Partition majorization
   7.2 Perturbations of the spectral directional derivatives

8 Conclusions

Bibliography

A Proof of Wielandt's minimax principle
   A.1 Supporting propositions and definitions
   A.2 Proof of Wielandt's minimax principle


Chapter 1 Introduction

Consider the vectors x = (1/2, 1/2)^T and y = (1, 0)^T. Notice that vector x is less “spread out” than vector y or, as we will define precisely later, vector x is majorized by vector y. The study of majorization began in the early twentieth century with Muirhead (1903), and applications of majorization were seen by Lorenz (1905) when he studied income inequalities. Schur (1923) showed that the vector of diagonal elements of a given real symmetric matrix is majorized by the vector of its eigenvalues. Hardy, Littlewood and Pólya (1929) found that majorization can be characterized by convex functions and doubly-stochastic transformations between vectors. The same three authors wrote the first systematic treatment [8] in 1934. Horn (1954) showed that majorization can be characterized by orthostochastic transformations, a subset of the doubly-stochastic transformations. The book [18] from 1979 presents a much more detailed and exhaustive work on the subject and includes six equivalent statements of majorization. More results are presented in Chapter 4. Majorization is one of the central notions in this work. It finds many applications ranging from Risk Aversion theory in Economics [19, Chapter 6] to Quantum Computing [23]. This thesis is almost entirely self-contained but due to space limitations

some results are stated without proofs. Before we embark to study majorization we provide substantial background in Operator Theory and Convex & Nonsmooth Analysis. Chapter 2 begins with vector spaces and Euclidean spaces. Although most of this work deals with matrices it will be important to understand the connection between matrices and operators. Chapter 3 introduces the convex analysis notion of the subdifferential of a convex function, followed by the notion of the Clarke subdifferential of a Lipschitz function. This required us to develop a deeper understanding of sublinear functions, locally Lipschitz functions, convex sets and their support functions. We give a proof of the Lebourg mean value theorem for Lipschitz functions and a result concerning the equality between the Michel-Penot, Clarke, and the regular directional derivatives. Self-adjoint transformations can be described solely by their eigenvalues and eigenvectors, thus the latter are of great importance. Calculating eigenvalues may be quite difficult, but often it is sufficient to know that the eigenvalues lie in some specific intervals. Denote by α, β, γ ∈ R^n the vectors of the eigenvalues (ordered nonincreasingly) of the n × n symmetric matrices A, B, C, respectively, where C = A + B. It has been an outstanding problem for a long time to describe how γ is related to α and β. One aspect of the importance of this relationship is simple to describe: if B has a small norm then A + B is a small perturbation of A. The twentieth century has seen great attempts at solving this problem, which became known as Horn's conjecture (1962). He conjectured that a certain set of inequalities completely captures the relationship between α, β and γ. Particular inequalities, among those conjectured by Horn, have been proved to be necessary relationships. Fischer (1905) developed the first minimax theorem involving eigenvalues, which directly leads to a perturbation theorem. Weyl (1912) was able to provide the foundations for a theorem

on interlacing eigenvalues. That is, if A is an n × n symmetric matrix we can construct a new symmetric (n − 1) × (n − 1) matrix B by removing the ith row and the ith column of A. Then the eigenvalues interlace, since the ith largest eigenvalue of B lies between the ith and (i + 1)th largest eigenvalues of A for i = 1, . . . , n − 1. Schur's theorem (1923) is another addition to that theory. Fan (1949) proved that the sum of the k largest coordinates of γ is at most the sum of the k largest elements of α plus the sum of the k largest elements of β. Lidskii (1950) showed γ is in the convex hull formed by α added to all possible permutations of β. Wielandt did not understand Lidskii's proof, but showed the equivalent statement that γ − α is majorized by β [2, pg. 79]. A generalization of this theorem, capturing both Lidskii's theorem and Weyl's theorem, was found by Thompson and Freede (1971). Horn's conjecture remained unsolved until Knutson and Tao (1999) proved that the set of inequalities, proposed by Horn, between the coordinates of α, β and γ are necessary and sufficient for there to exist three matrices A, B and C with C = A + B and having vectors of eigenvalues α, β and γ respectively. Lidskii's theorem is still one of the more powerful perturbation theorems connecting the eigenvalues of a matrix and its perturbations. Three different proofs of Lidskii's theorem are presented in Chapters 5 and 6. Wielandt's theorem is proved in the Appendix. Chapter 6 states a formula for the directional derivatives of the eigenvalues of a symmetric matrix derived by Hiriart-Urruty and Ye (1995). Subsequent work by Lewis (1996) provides us with a formula for the Clarke subdifferential of spectral functions on symmetric matrices. These two formulas are the only results that we state without a proof. Lewis in [15] offers a nonsmooth variational proof of Lidskii's theorem, which we present. The first section of Chapter 7 generalizes the notion of majorization, referred to as partition majorization or π-majorization. With this language, we rephrase a

result by Lewis [14] and make it clear that it is a generalization of Schur's theorem (1923). All results in the second section are original and they concern Lidskii-type inequalities for the directional derivatives of the vector of eigenvalues, when the direction is perturbed. Without using Hiriart-Urruty and Ye's formula we show that Lidskii's theorem holds for directional derivatives of eigenvalues. We provide three proofs that a partition majorization analogue of Lidskii's theorem holds under certain conditions. Finally, using their formula, we prove that there is in fact a partition majorization analogue of Lidskii's theorem for the directional derivatives of eigenvalues.
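As a purely numerical illustration of the two relations at the heart of this chapter (not part of the thesis's development; it assumes NumPy is available and uses a small helper is_majorized defined here), the following sketch checks that x = (1/2, 1/2) is majorized by y = (1, 0), and that for randomly generated symmetric A and B with C = A + B the vector γ − α is majorized by β, as Lidskii's theorem asserts.

```python
import numpy as np

def is_majorized(x, y, tol=1e-9):
    """Check whether x is majorized by y: equal total sums, and every
    partial sum of the decreasingly sorted entries of x is at most the
    corresponding partial sum for y."""
    x, y = np.sort(x)[::-1], np.sort(y)[::-1]
    if abs(x.sum() - y.sum()) > tol:
        return False
    return bool(np.all(np.cumsum(x) <= np.cumsum(y) + tol))

print(is_majorized([0.5, 0.5], [1.0, 0.0]))          # True

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n)); A = (M + M.T) / 2   # random symmetric matrices
M = rng.standard_normal((n, n)); B = (M + M.T) / 2
alpha = np.linalg.eigvalsh(A)
beta = np.linalg.eigvalsh(B)
gamma = np.linalg.eigvalsh(A + B)
print(is_majorized(gamma - alpha, beta))             # True, consistent with Lidskii's theorem
```

The helper sorts both vectors itself, so the order in which gamma − alpha is supplied does not matter; this matches the definition of majorization used later in Chapter 4.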


Chapter 2 Linear operators

2.1 Linear maps and matrix representations

Let X, Y and Z be finite dimensional, real vector spaces. It is assumed that

the reader has some familiarity with basic notions from Matrix Theory such as rank, determinant, inverse of a matrix, etc. Definition 2.1.1. A map A : X → Y is linear if for all x, y ∈ X and c ∈ R, we have

A(x + y) = Ax + Ay and A(cx) = cAx.

Denote the zero vector by 0. Definition 2.1.2. Let A : X → Y be a linear operator. The range of A is the linear subspace of Y: R(A) := {Ax | x ∈ X}, and the nullspace is the linear subspace of X:

N (A) := {x ∈ X | Ax = 0}.

Definition 2.1.3. The linear map A is called singular if Ax = 0 for some nonzero vector x.

Proposition 2.1.1. If the nullspace of the linear operator A : X → Y is {0} then A is nonsingular.

Definition 2.1.4. If {x_i}_{i=1}^n and {y_i}_{i=1}^m are bases for X and Y respectively then the matrix representation of A : X → Y is the m × n matrix defined by

A x_i = ∑_{j=1}^{m} A_{ji} y_j.

It is easy to see that if the coordinates of x ∈ X with respect to the basis {x_i}_{i=1}^n are (α_1, . . . , α_n)^T then the coordinates of Ax with respect to the basis {y_i}_{i=1}^m are A(α_1, . . . , α_n)^T.

Proposition 2.1.2. Let A : X → Y and B : Y → Z be two linear maps and let {x_i}_{i=1}^n, {y_i}_{i=1}^n and {z_i}_{i=1}^k be bases for X, Y and Z respectively. Suppose A and B are the matrix representations of A and B; then the composition BA : X → Z has the matrix representation BA.

Definition 2.1.5. A linear map A : X → X is called a linear operator on X. The following proposition is Exercise 4.7.18 in [20].

Proposition 2.1.3. For any linear operator A : X → X the following statements are equivalent: (i) A is invertible; (ii) A is a one-to-one map; (iii) The nullspace of A is {0};

(iv) A is an onto map.

Proof. We show the chain of implications (i) ⇒ (ii) ⇒ (iii) ⇒ (iv) ⇒ (ii). Suppose that A^{-1} exists. If Ax = Ax' for x, x' ∈ X, then A^{-1}(Ax − Ax') = x − x' = 0 showing that x = x'. Thus, A is one-to-one. To show (ii) ⇒ (iii) suppose that Ax = Ax' for x, x' ∈ X implies x = x'. If there is a nonzero x ∈ X such that Ax = 0 then, since Ax = A0, we reach a contradiction. To show (iii) ⇒ (iv) suppose that the nullspace of A is {0}. A well-known result is dim R(A) = dim X − dim N(A). Hence the dimension of R(A) is equal to the dimension of X, implying that A is an onto map. To show (iv) ⇒ (ii), notice that if there were x, y ∈ X with x ≠ y such that Ax = Ay then A(x − y) = 0, or N(A) ≠ {0}, so the dimension of R(A) is less than the dimension of X and the map is not onto. Finally, we show that (ii) combined with (iv) implies (i). Let y ∈ X; then by (iv) there exists an x such that Ax = y. If there is another x' ∈ X such that Ax' = y then by (ii) we must have x' = x. It follows that A is invertible.

Definition 2.1.6. The identity operator Id on X is defined by Id x = x for all x ∈ X.

Definition 2.1.7. Denote by I_k the k × k identity matrix. For simplicity denote by I the matrix I_n. By (X, {x_i}_{i=1}^n) we denote an n dimensional linear space X with a fixed basis {x_i}_{i=1}^n. The matrix representation of Id : (X, {x_i}_{i=1}^n) → (X, {x_i}_{i=1}^n) is I.

Proposition 2.1.4. Let the nonsingular operator A : (X, {x_i}_{i=1}^n) → (X, {x_i}_{i=1}^n) have a matrix representation A. Then A^{-1} is the matrix representation of A^{-1}.

Proof.

Since AA−1 = Id and the matrix representation of Id is I, if the matrix

representation of A−1 is B then by Proposition 2.1.2 AB = I. Therefore B = A−1 .

Corollary 2.1.5. The operator is invertible if and only if every matrix representation is invertible.

Corollary 2.1.6. For two different bases {x_i}_{i=1}^n and {y_i}_{i=1}^n of X the matrix representation of Id : (X, {x_i}_{i=1}^n) → (X, {y_i}_{i=1}^n) is the matrix B defined by x_i = ∑_{j=1}^{n} B_{ji} y_j. The matrix representation of Id : (X, {y_i}_{i=1}^n) → (X, {x_i}_{i=1}^n) is B^{-1}.

Proposition 2.1.7. Let {x_i}_{i=1}^n and {y_i}_{i=1}^n be two bases for X. Let A_x and A_y be the matrix representations of A : (X, {x_i}_{i=1}^n) → (X, {x_i}_{i=1}^n) and A : (X, {y_i}_{i=1}^n) → (X, {y_i}_{i=1}^n) respectively. Then A_x = B A_y B^{-1} where B is defined in Corollary 2.1.6.

Proof. Observe that the operator A : (X, {x_i}_{i=1}^n) → (X, {x_i}_{i=1}^n) can be represented as the composition

(X, {x_i}_{i=1}^n) -Id-> (X, {y_i}_{i=1}^n) -A-> (X, {y_i}_{i=1}^n) -Id-> (X, {x_i}_{i=1}^n).

Then by Proposition 2.1.2 and Corollary 2.1.6 we must have A_x = B A_y B^{-1}.

2.2 Eigenvalues and eigenvectors of linear operators

Definition 2.2.1. Let A be a linear operator on X. Then λ ∈ R is an eigenvalue of A if Ax = λx for some nonzero vector x ∈ X. Let λ be an eigenvalue of A. (a) Any nonzero vector x such that Ax = λx is called an eigenvector of A corresponding to the eigenvalue λ. (b) The set of all eigenvectors corresponding to λ and {0} is called the eigenspace associated with λ. It is easy to check that an eigenspace is indeed a subspace of X.

Definition 2.2.2. The geometric multiplicity of λ is the dimension of its eigenspace.

Proposition 2.2.1. Let A be a linear operator on X with matrix representation A. Then the following are equivalent: (i) λ is an eigenvalue of A; (ii) the operator (A − λId) is not invertible; (iii) det(A − λI) = 0.

Proof. To prove (i) ⇒ (ii) suppose that λ is an eigenvalue of A. Then there exists a nonzero vector x such that Ax = λx or (A − λId)x = 0. The operator (A − λId) is not invertible since it is not one-to-one. To prove (ii) ⇒ (i) suppose that (A − λId) is not invertible. Then there is a nonzero x belonging to its nullspace: (A − λId)x = 0, or Ax = λx. Finally, to show (ii) ⇔ (iii) notice by Corollary 2.1.5 that the operator A − λId is not invertible if and only if its matrix representation A − λI is not invertible, which is equivalent to det(A − λI) = 0.

Definition 2.2.3. The algebraic multiplicity of λ is the number of times λ is repeated as a root of det(A − xI) = 0.

Proposition 2.2.2. For any nonsingular matrix X, the matrices A and X^{-1}AX have the same eigenvalues counting multiplicities.

Proof. Since det(X^{-1}) = 1/det(X), by the product rule of determinants we have

det(X −1 AX − λI) = det(X −1 ) det(A − λI) det(X) = det(A − λI).

Therefore, A and X −1 AX have the same eigenvalues counting algebraic multiplicity. Assume that A and X are n×n matrices and that u ∈ Rn is an eigenvector for X −1 AX

corresponding to eigenvalue λ. Since X^{-1}AXu = λu is equivalent to AXu = λXu, then Xu is an eigenvector of A corresponding to eigenvalue λ. Since X is nonsingular, the eigenspace corresponding to λ and its image under X have the same dimension. Therefore the eigenvalues of A and X^{-1}AX have the same geometric multiplicity.
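A quick numerical sanity check of Proposition 2.2.2 (an illustration only, assuming NumPy; the random matrix X is made diagonally dominant so that it is nonsingular):

```python
import numpy as np

# A and X^{-1} A X share eigenvalues, counting multiplicities.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # diagonally dominant, hence nonsingular
similar = np.linalg.inv(X) @ A @ X

eig_A = np.sort_complex(np.linalg.eigvals(A))
eig_S = np.sort_complex(np.linalg.eigvals(similar))
print(np.allclose(eig_A, eig_S))                  # True up to round-off
```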

2.3 Inner-product spaces and orthonormal sets

Definition 2.3.1. A map (x, y) ∈ X × X → ⟨x, y⟩ ∈ R is called an inner product if

(a) ⟨x, x⟩ ≥ 0 for every x ∈ X and ⟨x, x⟩ = 0 if and only if x = 0,
(b) ⟨x, αy⟩ = α⟨x, y⟩ for all α ∈ R and x, y ∈ X,
(c) ⟨x, y + z⟩ = ⟨x, y⟩ + ⟨x, z⟩ for all x, y, z ∈ X,
(d) ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ X.

Definition 2.3.2. A Euclidean space is a vector space endowed with an inner product. Let E be a Euclidean space. The inner product defines a norm ‖·‖ on E by

‖x‖ := √⟨x, x⟩, ∀x ∈ E.

Proposition 2.3.1 (Cauchy-Schwarz). Let x, y ∈ E, then

|hx, yi| ≤ kxkkyk.

Equality holds if and only if ykxk2 = hx, yix.

Proof. When x = 0 the result follows immediately, so assume x is nonzero. Let α := ⟨x, y⟩/‖x‖². Notice that

0 ≤ ⟨αx − y, αx − y⟩ = α²‖x‖² − 2α⟨x, y⟩ + ‖y‖²
  = |⟨x, y⟩|²/‖x‖² − 2|⟨x, y⟩|²/‖x‖² + ‖y‖²
  = (−|⟨x, y⟩|² + ‖y‖²‖x‖²)/‖x‖²
  = (‖x‖²‖y‖² − |⟨x, y⟩|²)/‖x‖²,

which implies |⟨x, y⟩| ≤ ‖y‖‖x‖. We have equality if and only if αx − y = 0, or y‖x‖² = ⟨x, y⟩x.

Definition 2.3.3. Define the Kronecker delta symbol

δ_{ij} := 1 if i is equal to j, and δ_{ij} := 0 if i is not equal to j.

Definition 2.3.4. The subset {u_1, u_2, . . . , u_k} ⊆ E is called orthonormal if ⟨u_i, u_j⟩ = δ_{ij} for all 1 ≤ i, j ≤ k.

Definition 2.3.5. An orthonormal basis for E is a basis for E that is also an orthonormal set. The following proposition and proof can be found in [20, pg. 307].

Proposition 2.3.2 (Gram-Schmidt sequence). If {x_i}_{i=1}^n is a basis for E, then the sequence {u_i}_{i=1}^n defined inductively by

u_1 = x_1/‖x_1‖   and   u_k = (x_k − ∑_{i=1}^{k−1} ⟨u_i, x_k⟩u_i) / ‖x_k − ∑_{i=1}^{k−1} ⟨u_i, x_k⟩u_i‖   for k = 2, . . . , n

is an orthonormal basis for E.

Lemma 2.3.3. Let {x_i}_{i=1}^k be an orthonormal subset of E. If y belongs to the span of {x_1, x_2, . . . , x_k}, then

y = ∑_{i=1}^{k} ⟨y, x_i⟩x_i.

Proof. Since y = ∑_{i=1}^{k} a_i x_i for some scalars a_1, . . . , a_k, then for j = 1, . . . , k we have

⟨y, x_j⟩ = ⟨∑_{i=1}^{k} a_i x_i, x_j⟩ = ∑_{i=1}^{k} a_i⟨x_i, x_j⟩ = a_j⟨x_j, x_j⟩ = a_j.

The result follows.

Proposition 2.3.4. An orthonormal subset {x_1, x_2, . . . , x_k} is linearly independent.

Proof. Suppose that ∑_{i=1}^{k} a_i x_i = 0. Setting y = 0 in Lemma 2.3.3 implies that a_j = ⟨0, x_j⟩ = 0 for j = 1, . . . , k.

Let M be another Euclidean space.

Proposition 2.3.5. Let {x_i}_{i=1}^n and {y_i}_{i=1}^m be orthonormal bases of E and M respectively; then the matrix representation A of A : (E, {x_i}_{i=1}^n) → (M, {y_i}_{i=1}^m) satisfies

A_{ij} = ⟨Ax_j, y_i⟩.

Proof. By Lemma 2.3.3 we have Ax_j = ∑_{i=1}^{m} A_{ij} y_i = ∑_{i=1}^{m} ⟨Ax_j, y_i⟩y_i.
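The Gram-Schmidt procedure of Proposition 2.3.2 is easy to run numerically. The sketch below (an illustration assuming NumPy; gram_schmidt is a helper defined here, not notation from the thesis) orthonormalizes the columns of a random matrix and checks the defining property of an orthonormal set.

```python
import numpy as np

def gram_schmidt(X):
    """Return an orthonormal set built from the columns of X
    (assumed linearly independent), following Proposition 2.3.2."""
    U = []
    for k in range(X.shape[1]):
        v = X[:, k] - sum((u @ X[:, k]) * u for u in U)   # remove components along earlier u_i
        U.append(v / np.linalg.norm(v))
    return np.column_stack(U)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))          # three (almost surely) independent vectors in R^5
U = gram_schmidt(X)
print(np.allclose(U.T @ U, np.eye(3)))   # True: <u_i, u_j> = delta_ij
```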


2.4 The adjoint of a linear map

Proposition 2.4.1. Let g : E → R be a linear function. Then there exists a unique vector y ∈ E such that g(x) = ⟨x, y⟩ for all x ∈ E.

Proof. Let {x_i}_{i=1}^n be an orthonormal basis of E. Such a basis exists by Proposition 2.3.2. Define y := ∑_{i=1}^{n} g(x_i)x_i and a linear function h : E → R by h(x_j) = ⟨x_j, y⟩. Notice that for j = 1, . . . , n we have

h(x_j) = ⟨x_j, y⟩ = ⟨x_j, ∑_{i=1}^{n} g(x_i)x_i⟩ = ∑_{i=1}^{n} g(x_i)⟨x_j, x_i⟩ = ∑_{i=1}^{n} g(x_i)δ_{ji} = g(x_j).

Thus, g and h are equal on the basis {xi }ni=1 and by linearity they are equal on E. To show that y is unique, suppose that g(x) = hx, y 0 i for all x ∈ E. Then hx, yi = hx, y 0 i for all x ∈ E which results in y = y 0 . Proposition 2.4.2. Let A : E → M be a linear map, then there exists a unique linear map A∗ : M → E such that

hAx, yiM = hx, A∗ yiE ∀x ∈ E, ∀y ∈ M.

The map A∗ is called the adjoint map of A. Proof. First we show that the adjoint map exists. Fix y ∈ M. Then

x ∈ E 7→ hAx, yiM ∈ R

is a linear function in x. By Proposition 2.4.1 there exists a unique v ∈ E such that

hAx, yiM = hx, viE .

(2.1)

Define A∗ : M → E by A∗ y := v. Next, we show that A∗ is linear. For any y1 , y2 ∈ M and a scalar α using Equation (2.1) twice we have

hx, A∗ (y1 + αy2 )iE = hAx, (y1 + αy2 )iM = hAx, y1 iM + hAx, αy2 iM = hAx, y1 iM + αhAx, y2 iM = hx, A∗ y1 iE + αhx, A∗ y2 iE = hx, (A∗ y1 + αA∗ y2 )iE .

Since this holds for every x ∈ E it implies that A*(y_1 + αy_2) = A*y_1 + αA*y_2. Finally, we show that the adjoint map is unique. Suppose that

hAx, yiM = hx, A∗1 yiE ∀x ∈ E, ∀y ∈ M and hAx, yiM = hx, A∗2 yiE ∀x ∈ E, ∀y ∈ M.

By subtracting the bottom equation from the top we see that

0 = ⟨x, (A_1^* − A_2^*)y⟩_E ∀x ∈ E, ∀y ∈ M  ⇔  0 = (A_1^* − A_2^*)y ∀y ∈ M  ⇔  0 = A_1^* − A_2^*.

Thus, A_1^* = A_2^* and the result is fully established.

Proposition 2.4.3. Let {x_i}_{i=1}^n and {y_i}_{i=1}^m be orthonormal bases for E and M respectively. If A : (E, {x_i}_{i=1}^n) → (M, {y_i}_{i=1}^m) has matrix representation A then the matrix representation of A* : (M, {y_i}_{i=1}^m) → (E, {x_i}_{i=1}^n) is A^T.

Proof. Let B be the matrix representation of the adjoint map A∗ : (M, {yj }m j=1 ) → (E, {xi }ni=1 ). Then Aij = hAxj , yi i = hxj , A∗ yi i = B ji , by Proposition 2.3.5. Thus, AT = B. Since A and B are operators it is evident that (A + B)∗ = A∗ + B ∗ . Let H be a Euclidean space. Proposition 2.4.4. Any linear maps A : E → M and B : M → H satisfy

(BA)∗ = A∗ B ∗ .

Proof. For any x ∈ E and y ∈ H we have

hx, (BA)∗ yiE = hBAx, yiH = hAx, B ∗ yiM = hx, A∗ B ∗ yiE .

The result follows. Proposition 2.4.5. For any linear map A : E → M we have (A∗ )∗ = A. Proof. Clearly, h(A∗ )∗ x, yiM = hx, A∗ yiE = hAx, yiM .
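In coordinates, Proposition 2.4.3 says that with the standard inner product the adjoint of x → Ax is y → A^T y. A one-line numerical check (illustration only, assuming NumPy):

```python
import numpy as np

# <Ax, y> = <x, A^T y> for a random rectangular A and random x, y.
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))                  # a map from R^6 to R^4
x, y = rng.standard_normal(6), rng.standard_normal(4)
print(np.isclose((A @ x) @ y, x @ (A.T @ y)))    # True
```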


2.5 Self-adjoint operators

Definition 2.5.1. A linear operator A on a Euclidean space is self-adjoint if A = A∗ . It is evident that self-adjoint operators form a linear space. By Proposition 2.4.3 we immediately have the next result. Proposition 2.5.1. If A is a self-adjoint operator then its matrix representation with respect to an orthonormal basis is a symmetric matrix. Denote the set of all n × n real symmetric matrices by Sn . The following lemma and proof are modified from [1, Lemma 7.11]. Lemma 2.5.2. Suppose the operator A : E → E is self-adjoint. If α, β ∈ R are such that α2 < 4β, then A2 + αA + βId

(2.2)

is invertible. Proof. Let x ∈ E be a nonzero vector. By Proposition 2.3.1,

⟨(A² + αA + βId)x, x⟩ = ⟨A²x, x⟩ + ⟨αAx, x⟩ + β‖x‖²
 = ⟨Ax, Ax⟩ + α⟨Ax, x⟩ + β‖x‖²
 ≥ ‖Ax‖² − |α|‖Ax‖‖x‖ + β‖x‖²
 = (‖Ax‖ − |α|‖x‖/2)² + (β − α²/4)‖x‖² > 0.

This shows that N (A2 + αA + βId) = {0} and by Proposition 2.1.3 A2 + αA + βId is invertible. The following lemma and proof are modified from [1, Lemma 7.12].

17 Lemma 2.5.3. Suppose the operator A : E → E is self-adjoint. Then A has an eigenvector in E. Proof. Let n := dim E and let v ∈ E be a nonzero vector. Let k be the smallest non-negative integer for which the set

{v, Av, A2 v, . . . , Ak v}

is linearly dependent. Then there exist real numbers a0 , . . . , ak , not all zero, such that 0 = a0 v + a1 Av + · · · + ak Ak v.

(2.3)

Choosing k this way ensures that 1 ≤ k ≤ n and that ak 6= 0. Any polynomial of degree k with real coefficients can be factored into linear and quadratic terms as:

a0 + a1 x + · · · + ak xk = ak (x2 + α1 x + β1 ) · · · (x2 + αM x + βM )(x − λ1 ) · · · (x − λm ),

where λi is real for i = 1, . . . , m and αj and βj are real numbers such that αj2 < 4βj for j = 1, . . . , M . We can factor (2.3) the same way if each polynomial above is replaced with a corresponding operator to get

0 = ak (A2 + α1 A + β1 Id) · · · (A2 + αM A + βM Id)(A − λ1 Id) · · · (A − λm Id)v.

By Lemma 2.5.2 each A2 + αi A + βi Id is invertible, because A is self-adjoint and

18 αj2 < 4βj . Then the above equation implies that

0 = (A − λ1 Id) · · · (A − λm Id)v.

This shows that there is an l ∈ {1, . . . , m} such that w := ∏_{i=l+1}^{m} (A − λ_i Id)v ≠ 0, but (A − λ_l Id)w = 0. Thus, w is an eigenvector corresponding to λ_l.

Proposition 2.5.4. Let A be a self-adjoint operator on E. Then E has an orthonormal basis of eigenvectors of A.

Proof. Let n := dim E. By Lemma 2.5.3 the operator A has an eigenvector x_1 ∈ E with some corresponding eigenvalue λ. Without loss of generality we may assume that ‖x_1‖ = 1. Define the subspace of E:

E_1 := {x ∈ E | ⟨x, x_1⟩ = 0}.

We show that E_1 is invariant under A. This is equivalent to showing ⟨Ax, x_1⟩ = 0 for all x ∈ E_1:

hAx, x1 i = hx, A∗ x1 i = hx, Ax1 i = hx, λx1 i = λhx, x1 i = 0.

By Lemma 2.5.3, the operator A has an eigenvector x2 in E1 , of length one. Furthermore x2 is orthogonal to x1 . Repeating the process define the subspace E2 ⊆ E1 :

E2 := {x ∈ E1 | hx, xi i = 0, i = 1, 2}.

An analogous argument shows that E_2 is invariant under A. By the lemma there is an eigenvector x_3 ∈ E_2, of length one, orthogonal to x_1 and x_2. Repeat this process until we have an orthonormal set of eigenvectors {x_1, . . . , x_n}. By Proposition 2.3.4 this set of eigenvectors is linearly independent and thus forms a basis for E.

Definition 2.5.2. The set of real n × n orthogonal matrices is denoted by O_n; that is, U ∈ O_n if and only if U^T = U^{-1}.

Lemma 2.5.5. Let {x_i}_{i=1}^n and {y_i}_{i=1}^n be two orthonormal bases of E. Then the matrix representation B of Id : (E, {x_i}_{i=1}^n) → (E, {y_i}_{i=1}^n) is orthogonal.

Proof. By Proposition 2.3.5 we have B_{ij} = ⟨Id x_j, y_i⟩ = ⟨x_j, y_i⟩. To see that this is an orthogonal matrix notice that

∑_{j=1}^{n} B_{ij} B_{kj} = ∑_{j=1}^{n} ⟨x_j, y_i⟩⟨x_j, y_k⟩ = ⟨∑_{j=1}^{n} x_j⟨x_j, y_k⟩, y_i⟩ = ⟨y_k, y_i⟩ = δ_{ik},

where the third equality holds by Lemma 2.3.3 with y = y_k.

Definition 2.5.3. The standard Euclidean space of ordered n-tuples of real numbers with the inner product ⟨x, y⟩ := ∑_{i=1}^{n} x_i y_i will be indicated by R^n.

Definition 2.5.4. Let the standard basis for Rn be {e1 , . . . , en } where ei is the vector with ith entry equal to 1 and zeros everywhere else. Definition 2.5.5. For a vector x ∈ Rn denote by Diag x the n × n diagonal matrix with x on the main diagonal. For an n × n matrix X denote by diag X the vector of diagonal elements in Rn . Proposition 2.5.6. For any A ∈ Sn there exists a vector a ∈ Rn and U ∈ On such that A = U T (Diag a)U.

Proof. Matrix A defines a self-adjoint operator A : (R^n, {e_i}_{i=1}^n) → (R^n, {e_i}_{i=1}^n) by

A e_i = ∑_{j=1}^{n} A_{ji} e_j.

By Proposition 2.5.4 with E = R^n, there exists an orthonormal basis of eigenvectors {x_i}_{i=1}^n of A in R^n. Notice that A can be represented as the composition

(R^n, {e_i}_{i=1}^n) -Id-> (R^n, {x_i}_{i=1}^n) -A-> (R^n, {x_i}_{i=1}^n) -Id-> (R^n, {e_i}_{i=1}^n).

By Lemma 2.5.5 there is a matrix U ∈ On for the map Id : (Rn , {ei }ni=1 ) → (Rn , {xi }ni=1 ). By Corollary 2.1.6 the map Id : (Rn , {xi }ni=1 ) → (Rn , {ei }ni=1 ) has matrix representation U −1 which equals U T , since U ∈ On . Notice that the map A : (Rn , {xi }ni=1 ) → (Rn , {xi }ni=1 ) has matrix representation Diag a for some a ∈ Rn composed of eigenvalues, since {xi }ni=1 is a orthonormal basis of eigenvectors of A. Combining these observations with Proposition 2.1.2 shows A = U T (Diag a)U . Corollary 2.5.7. Let A ∈ Sn be a matrix representation of a self-adjoint linear operator A : E → E. Then the degree-n polynomial x 7→ det(A − xI) has n real roots. Proof.

By Proposition 2.5.6 there is an a ∈ R^n and U ∈ O_n such that A = U^T(Diag a)U. Using basic properties of determinants, we obtain

det(A − xI) = det(U^T) det(Diag a − xI) det(U) = ∏_{i=1}^{n} (a_i − x).

This shows that the coordinates of a are the roots of det(A − xI) = 0. Proposition 2.5.8. Let λ be an eigenvalue of a self-adjoint linear operator A : E → E. Then the geometric multiplicity of λ is equal to its algebraic multiplicity.

Proof. Let A ∈ S^n be the matrix representation of A. Suppose that λ has an algebraic multiplicity of k. By Proposition 2.5.6 and Corollary 2.5.7 there is an orthogonal matrix U such that

U^T A U = [ λI_k   0 ]
          [  0     B ],

where λ is not an element of the diagonal matrix B and k is the algebraic multiplicity of λ. Consequently,

rank(A − λI) = rank [ 0        0        ]
                    [ 0   B − λI_{n−k}  ]  = n − k.

Since dim N(A − λI) = n − rank(A − λI) = k the geometric multiplicity is k.
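Proposition 2.5.6 is exactly what numerical eigensolvers for symmetric matrices return. A small sketch (illustration only, assuming NumPy; the convention A = U^T (Diag a) U from the thesis is recovered by transposing the eigenvector matrix that NumPy produces):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                              # a random symmetric matrix
a, V = np.linalg.eigh(A)                       # columns of V are orthonormal eigenvectors
U = V.T                                        # so that A = U^T Diag(a) U
print(np.allclose(A, U.T @ np.diag(a) @ U))    # True
print(np.allclose(U @ U.T, np.eye(5)))         # True: U is orthogonal
```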

2.6 Orthogonal operators

Definition 2.6.1. An operator U : E → E is orthogonal if

hx, yi = hUx, Uyi ∀x, y ∈ E.

Proposition 2.6.1. An orthogonal operator U : E → E is nonsingular and U^{-1} = U*.

Proof. First we show that U is nonsingular. Assume that {x_i}_{i=1}^n is an orthonormal basis of E. Then ⟨Ux_i, Ux_j⟩ = ⟨x_i, x_j⟩ = δ_{ij}, showing that {Ux_i}_{i=1}^n is an orthonormal basis of E. Thus, U is onto and by Proposition 2.1.3 nonsingular. Next, U^{-1} = U*, since

hU −1 x, yi = hUU −1 x, Uyi = hx, Uyi = hU ∗ x, yi, ∀x, y ∈ E.
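A small numerical illustration of Definition 2.6.1 and Proposition 2.6.1 (not part of the thesis; it assumes NumPy and takes the orthogonal matrix from a QR factorization of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # U has orthonormal columns
x, y = rng.standard_normal(4), rng.standard_normal(4)
print(np.isclose(x @ y, (U @ x) @ (U @ y)))        # <x, y> = <Ux, Uy>
print(np.allclose(np.linalg.inv(U), U.T))          # U^{-1} = U^T = U^*
```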

2.7 The trace of an operator

Definition 2.7.1. The trace of an n × n matrix A is defined by

tr(A) := ∑_{i=1}^{n} A_{ii}.

The following proposition is proved employing the definition. Proposition 2.7.1. The trace has the commutative property tr (AB) = tr (BA). Definition 2.7.2. Let the matrix representation of an operator A : (X, {xi }ni=1 ) → (X, {xi }ni=1 ) be Ax . Then the trace of the operator A is defined by

tr (A) := tr (Ax ).

Proposition 2.7.2. The trace of an operator is independent of the basis chosen in Definition 2.7.2. Proof. Let Ay be the matrix representation of A : (X, {yi }ni=1 ) → (X, {yi }ni=1 ). Then from Proposition 2.1.7 we know that Ax = BAy B −1 for some nonsingular matrix B, so tr (Ax ) = tr (BAy B −1 ) = tr (Ay B −1 B) = tr (Ay ), by Proposition 2.7.1.

Proposition 2.7.3. Let {x_i}_{i=1}^n be an orthonormal basis of E and A be an operator on E. Then

tr(A) = ∑_{i=1}^{n} ⟨x_i, Ax_i⟩.

Proof. If A is the matrix representation of A : (E, {x_i}_{i=1}^n) → (E, {x_i}_{i=1}^n), then by Proposition 2.3.5 we have tr(A) = ∑_{i=1}^{n} A_{ii} = ∑_{i=1}^{n} ⟨x_i, Ax_i⟩.

Definition 2.7.3. Let A : E → E be a self-adjoint linear operator. Then A is called positive definite if ⟨x, Ax⟩ > 0 for all nonzero x ∈ E. Furthermore, A is called positive semi-definite if ⟨x, Ax⟩ ≥ 0 for all x ∈ E.

Proposition 2.7.4. For any operator A the operators AA* and A*A are both self-adjoint and positive semi-definite.

Proof. Proposition 2.4.4 and Proposition 2.4.5 show that (AA*)* = AA*. Additionally,

⟨x, AA*x⟩ = ⟨A*x, A*x⟩ ≥ 0.

The proof for A*A is analogous.

Definition 2.7.4. Define an inner product between two linear maps A : E → M and B : E → M by

⟨A, B⟩ := tr(A*B).

Proposition 2.7.5. Equation (2.4) defines an inner product. Proof. Let n := dim E and let A be a matrix representation A, then



T

tr (A A) = tr (A A) =

n X i=1

T

ii

(A A) =

n X n X i=1 j=1

T ij

T ji

(A ) (A ) =

n X n X i=1 j=1

(Aij )2 ≥ 0.

24 Thus, tr (A∗ A) = 0 if and only if A = 0, since the zero matrix characterizes the zero map. The trace is a linear function, thus conditions (b) and (c) of Definition 2.3.1 are satisfied. The last condition in Definition 2.3.1 holds by Proposition 2.7.1.

2.8

Compressions In this section we assume that M is a linear subspace of the Euclidean space

E. The inner product in M is the one induced by the inner product on E. Definition 2.8.1. The orthogonal projection of E onto M is a linear map P : E → M such that hx − Px, yi = 0 ∀x ∈ E and ∀y ∈ M.

(2.5)

Definition 2.8.2. The linear map J : M → E defined by J x = x for all x ∈ M is called an injection. Proposition 2.8.1. The orthogonal projection map from E onto M exists and is unique. In fact, it is the adjoint J ∗ of the injection of M into E. Proof. Let J : M → E be the injection map. Notice that for all x ∈ E and y ∈ M

hx, yi = hx, J yi = hJ ∗ x, yi.

Thus, hx − J ∗ x, yi = 0 ∀x ∈ E, ∀y ∈ M with J ∗ satisfying Equation (2.5). If P1 and P2 are both projections of E onto M then hP1 x, yi = hx, yi = hP2 x, yi for all x ∈ E and all y ∈ M. That is, h(P1 − P2 )x, yi = 0 so P1 = P2 . Proposition 2.8.2. Let P be the orthogonal projection of E onto M then Px = x for all x ∈ M.

25 Proof. By definition, hx − Px, yi = 0 ∀x, y ∈ M implies x − Px = 0 ∀x ∈ M. Corollary 2.8.3. Let P be the orthogonal projection of E onto M then P 2 x = Px for all x ∈ E. Proof.

Since Px ∈ M for any x ∈ E, then clearly P 2 x = P(Px) = Px = x, by

Proposition 2.8.2. Proposition 2.8.4. The adjoint P ∗ of the projection from E to M is the injection J of M into E. Proof. Let x, y ∈ M, then hP ∗ x, yi = hx, Pyi = hx, yi. Since this holds for every y ∈ M we get P ∗ x = x for all x ∈ M. Thus, P ∗ = J since the injection is unique. Definition 2.8.3. The restriction of the operator A : E → E to M is denoted by A|M : M → E. Let P be the orthogonal projection from E to M then the composition PA|M is called the compression of A to M. Clearly, the compression of A to M is equal to the composition PAJ , where P is the projection of E onto M and J is the injection of M into E. Proposition 2.8.5. Let P be the orthogonal projection from E to M and let {xi }ni=1 be an orthonormal basis for E such that {xi }ki=1 is an orthonormal basis for M. The matrix representation of the compression PA|M : (M, {xi }ki=1 ) → (M, {xi }ki=1 ) is the upper-left k × k submatrix of the matrix representation of A : (E, {xi }ni=1 ) → (E, {xi }ni=1 ). Proof. Let A, P, A|M be the matrix representations of A, P and A|M with respect to the bases {xi }ni=1 and {xi }ki=1 . For any j = 1, . . . , k the projection of A|M xj = Axj =

Pn

i=1

Aij xi onto M is PAxj =

submatrix of A.

Pk

i=1

Aij xi . Thus, P A|M is the upper-left k × k

26 Proposition 2.8.6. Let A : E → E be a linear operator and P be the orthogonal projection of E onto M. If A is self-adjoint then the compression PA|M : M → M is self-adjoint. Proof. By the comment after Definition 2.8.3, we have PA|M = PAJ , where P is the projection of E onto M and J is the injection of M into E. By Proposition 2.4.4, we have (PA|M )∗ = (PAJ )∗ = J ∗ A∗ P ∗ = PAJ = PA|M , where in the penultimate equality, we used that A is self-adjoint together with Proposition 2.8.1 and Proposition 2.8.4.

27

Chapter 3 Convex and nonsmooth analysis

The development of this chapter utilizes ideas from [3, Chapter 6], [4, Chapter 2], [22, Section 3.1], and [24, Chapter 5].

3.1

Convex sets Let X be a real vector space.

Definition 3.1.1. Let x, y be any vectors in X. (a) The closed line segment between x and y is the set

[x, y] := {tx + (1 − t)y | t ∈ [0, 1]}.

(b) The open line segment between distinct x and y is the set

(x, y) := {tx + (1 − t)y | t ∈ (0, 1)}.

We define (x, x) := ∅.

28 Definition 3.1.2. A subset C of X is convex if for any two vectors x, y ∈ C, the vector αx + (1 − α)y is in C, for all α ∈ [0, 1]. Definition 3.1.3. A convex combination of x1 , . . . , xn ∈ X is a linear combination Pn

i=1

αi xi where the coefficients αi ∈ [0, 1] with

Pn

i=1

αi = 1.

Definition 3.1.4. The convex hull of D, denoted by conv (D), is the smallest convex set containing D. In other words, conv (D) =

T

{C | D ⊆ C, C is a convex set}. The set

conv (D) consists exactly of all convex combinations of finite collections of elements of D. Let E be a Euclidean space. We consider extended real-valued functions on E that can take values in R ∪ {+∞}. We adopt the convention

0 · (+∞) = 0.

Definition 3.1.5. The domain of a function f : E → R ∪ {+∞} is the set

dom f := {x ∈ E | f (x) < +∞}.

Definition 3.1.6. A function f : E → R ∪ {+∞} is called convex if for any x, y ∈ E and any t ∈ [0, 1] f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y). Notice that the domain of a convex function is a convex set.

29 Definition 3.1.7. Denote the open ball in E centered at x with radius r by

B(x, r) := {y ∈ E | kx − yk < r}.

The closed ball is denoted by

¯ r) := {y ∈ E | kx − yk ≤ r}. B(x,

The following two propositions are well-know results from real analysis. For their proofs see Theorem 1.1.2 and Proposition 1.1.3 in [3]. Proposition 3.1.1 (Bolzano-Weierstrass). Every bounded infinite sequence in E has a convergent subsequence. Proposition 3.1.2 (Weierstrass). Suppose that the set D ⊂ E is nonempty, compact and f : D → R is a continuous function. Then the function f has a minimizer on D. Proposition 3.1.3 (Closest point). Let D be a closed nonempty subset of E and let y∈ / D. Then there exists an x∗ ∈ D such that ky − x∗ k ≤ ky − xk for all x ∈ D. ˆ := B(y, ¯ kz − yk) ∩ D. The set D ˆ is nonempty since Proof. Fix z ∈ D and let D z ∈ D and compact since it is the intersection of a closed ball with the closed set D. ˆ 7→ kx − yk ∈ R is continuous so it must have a minimizer x∗ on The function x ∈ D ˆ by Proposition 3.1.2, which is a point in D closest to y. D Proposition 3.1.4. Let C be a closed convex subset of E and let y ∈ / C. Then the closest point x∗ in C to y is unique and is characterized by the property

hy − x∗ , x − x∗ i ≤ 0, x∗ ∈ C, ∀x ∈ C.

(3.1)

30 Proof. Suppose that x∗ ∈ C satisfies (3.1) and define f (x) := ky − xk2 for x ∈ E. It is easy to see that f is a convex function. For x∗ ∈ C and x ∈ / C with t ∈ (0, 1] we have tf (x) + (1 − t)f (x∗ ) ≥ f (x∗ + t(x − x∗ )) or equivalently

f (x) − f (x∗ ) ≥

f (x∗ + t(x − x∗ )) − f (x∗ ) . t

Notice that f (x) is differentiable with ∇f (x) = −2(y − x). Taking the limit as t approaches 0+ we obtain f (x) − f (x∗ ) ≥ h∇f (x∗ ), x − x∗ i. Consequently,

f (x) ≥ f (x∗ ) + h∇f (x∗ ), x − x∗ i = f (x∗ ) − 2hy − x∗ , x − x∗ i ≥ f (x∗ ),

by Inequality (3.1). Thus, x∗ is the closest point in C to y. In the opposite direction, suppose that x∗ is the closest point in C to y. Fix a point x ∈ C. Define the function φ(t) := ky − (x∗ + t(x − x∗ ))k2 for all t ∈ [0, 1] and notice that φ0 (t) = −2hy − (x∗ + t(x − x∗ )), x − x∗ i. Since C is convex the point x∗ + t(x − x∗ ) belongs to C for all x ∈ C. Because x∗ is the closest point to y the function φ(t) achieves a minimum at t = 0. Thus,

φ(t) − φ(0) ≥ 0, t

for all t ∈ (0, 1]. Taking the limit as t approaches 0+ implies

−2hy − x∗ , x − x∗ i = φ0 (0) ≥ 0,

31 showing hy − x∗ , x − x∗ i ≤ 0. Finally, suppose that z ∗ ∈ C is another point that is closest to y in C. Then by (3.1)

0 ≥ hy − x∗ , z ∗ − x∗ i = hy, z ∗ i − hx∗ , z ∗ i − hy, x∗ i + hx∗ , x∗ i, 0 ≥ hy − z ∗ , x∗ − z ∗ i = hy, x∗ i − hz ∗ , x∗ i − hy, z ∗ i + hz ∗ , z ∗ i.

Adding these inequalities implies that

0 ≥ hx, x∗ i − 2hz ∗ , x∗ i + hz ∗ , z ∗ i = hx∗ − z ∗ , x∗ − z ∗ i = kx∗ − z ∗ k2 ,

thus x∗ = z ∗ . Proposition 3.1.5 (Basic separation theorem). Suppose that C is a closed convex set in E and that y is a vector in E that is not in C. Then there is a nonzero vector a ∈ E and a real number α such that ha, xi ≤ α < ha, yi for all x ∈ C. Proof. Let x∗ be the closest vector in C to y. By Proposition 3.1.3

hy − x∗ , x − x∗ i ≤ 0, ∀x ∈ C.

Denote by a the nonzero vector y − x∗ . With that notation the above inequality becomes ha, xi ≤ ha, x∗ i for all x ∈ C. Let α := ha, x∗ i and compute

ha, yi − α = ha, yi − ha, x∗ i = ha, y − x∗ i = ha, ai = kak2 > 0.

32 Hence ha, xi ≤ α < ha, yi for all x ∈ C. Definition 3.1.8. The point x is an interior point of the set D ⊆ E if there exists t > 0 such that x + tB(0, 1) ⊆ D. Definition 3.1.9. We denote by int D the set of all interior points of D. When x ∈ int D then D is called a neighborhood of x. ¯ := E \ int (E \ D). Definition 3.1.10. The closure of the set D ⊆ E is defined by D Corollary 3.1.6. Let C and D be closed and convex sets in E. Then C ⊆ D if and only if sup{hφ, ci | c ∈ C} ≤ sup{hφ, di | d ∈ D}, ∀φ ∈ E.

(3.2)

Proof. The fact that C ⊆ D implies sup{hφ, ci | c ∈ C} ≤ sup{hφ, di | d ∈ D} for all φ ∈ E is immediate. For the opposite direction assume (3.2). Suppose that C were not a subset of D, then there exists c in C such that c ∈ / D. By Proposition 3.1.5 there is a nonzero a ∈ E and α ∈ R such that ha, di < α −  ≤ ha, ci for sufficiently small  > 0 and all d ∈ D. Letting φ := a, we have

sup{ha, di | d ∈ D} ≤ α −  < α ≤ sup{ha, ci | c ∈ C} ≤ sup{ha, di | d ∈ D}.

which is a contradiction. Proposition 3.1.7. Let C ⊆ E be a convex set and let z be a boundary point of C. Then there is a nonzero vector a ∈ E such that ha, xi ≤ ha, zi for all x ∈ C. Proof. Since z is a boundary point of C we can choose a sequence {zn } of vectors not in C¯ converging to z. Proposition 3.1.5 guarantees the existence of a nonzero an ∈ E such that han , xi < han , zn i for all x ∈ C¯ and all n. Without loss of generality

33 assume that an is of length one. Proposition 3.1.1 implies that some subsequence {ank } converges to some a with kak = 1. Lastly, notice that for any x ∈ C

ha, xi = limhank , xi ≤ limhank , znk i = ha, zi. k

k

This is what we wanted to prove.

3.2

Sublinear functions Define the strictly positive orthant by Rn++ := {x ∈ Rn | xi > 0, i = 1, . . . , n}

and the nonnegative orthant by Rn+ := {x ∈ Rn | xi ≥ 0, i = 1, . . . , n}. Definition 3.2.1. A function f : E → R ∪ {+∞} is (a) positively homogenous if f (αx) = αf (x), ∀α ∈ R+ and ∀x ∈ E, (b) subadditive if f (x + y) ≤ f (x) + f (y), ∀x, y ∈ E, (c) sublinear if f (αx + βy) ≤ αf (x) + βf (y), ∀x, y ∈ E and ∀α, β ∈ R+ . Proposition 3.2.1 (Jensen’s inequality). Let f : E → R ∪ {+∞} be a convex function. Then for any x1 , . . . , xm ∈ E and any α1 , . . . , αm ∈ [0, 1] with

Pm

i=1

αi = 1 we

have f

m X i=1

Proof.



αi xi ≤

m X

αi f (xi ).

i=1

When m = 1 there is nothing to show and m = 2 holds by definition of

a convex function. Assume the result is true for m ≥ 2 and we prove the result by induction on m. If α1 = 1 then α2 = α3 = · · · = αm+1 = 0 and the inequality is

34 trivial. Assume that 1 − α1 > 0 and let βi := m+1 X

αi 1−α1

αi xi = α1 x1 + (1 − α1 )

i=1

Since

Pm+1 i=2

f

∈ R+ for i = 2, . . . , m + 1. Then m+1 X

βi xi .

i=2

βi = 1 we can can apply the inductive assumption to get  m+1 X





αi xi = f α1 x1 + (1 − α1 )

i=1

m+1 X

βi xi



i=2

≤ α1 f (x1 ) + (1 − α1 )f

 m+1 X

 m+1 X βi xi ≤ αi f (xi ).

i=2

i=1

Corollary 3.2.2. Let f : E → R ∪ {+∞} be a convex function. For any points x1 , . . . , xm ∈ dom f define C := conv {x1 , . . . , xm } then

sup f (x) = max f (xi ). i=1,...,m

x∈C

Proof. Pm

i=1

Let x ∈ C then x = α1 x1 + · · · + αm xm for αi ∈ R+ , i = 1, . . . , m with

αi = 1. By Proposition 3.2.1 we have

f (x) = f

m X i=1



αi xi ≤

m X i=1

αi f (xi ) ≤ max f (xi ). i=1,...,m

It follows that supx∈C f (x) ≤ maxi=1,...,m f (xi ) ≤ supx∈C f (x). Definition 3.2.2. The epigraph of a function f : E → R ∪ {+∞} is the set

epi f := {(x, r) ∈ E × R | f (x) ≤ r}.

Proposition 3.2.3. A function f : E → R ∪ {+∞} is convex if and only if epi f is

35 a convex set in E × R. Proof.

Let α ∈ [0, 1]. For (x1 , r1 ), (x2 , r2 ) ∈ epi f if f is a convex function then

(αx1 + (1 − α)x2 , αr1 + (1 − α)r2 ) belongs to epi f , since

αr1 + (1 − α)r2 ≥ αf (x1 ) + (1 − α)f (x2 ) ≥ f (αx1 + (1 − α)x2 ).

Conversely, if epi f is convex then (αx1 + (1 − α)x2 , αf (x1 ) + (1 − α)f (x2 )) ∈ epi f for any (x1 , f (x1 )), (x2 , f (x2 )) ∈ epi f and any α ∈ [0, 1], thus

f (αx1 + (1 − α)x2 ) ≤ αf (x1 ) + (1 − α)f (x2 ).

(3.3)

If x1 and/or x2 are not in dom f then (3.3) is trivial, since 0 · (+∞) = 0. Proposition 3.2.4 (Sublinearity). A function f : E → R ∪ {+∞} is sublinear if and only if it is positively homogenous and subadditive. Proof.

If f is positively homogenous and subadditive then immediately we have

f (αx + βy) ≤ f (αx) + f (βy) ≤ αf (x) + βf (y) for all α, β ∈ R+ and all x, y ∈ E. Now suppose that f is sublinear, letting α := β := 1 in f (αx + βy) ≤ αf (x) + βf (y) shows subadditivity. Letting β := 0 shows that

f (αx) = f (αx + βy) ≤ αf (x) + βf (x) = αf (x).

Thus, f (αx) ≤ αf (x). To show the opposite inequality we must consider two cases. If α 6= 0 then

1 α

∈ R+ and by above f ( α1 x) ≤

1 f (x). α

Substituting x0 =

1 x α

we see

that αf (x0 ) ≤ f (αx0 ) for all x0 ∈ E. If α = 0 then f (0) ≤ 0. However, subadditivity

36 implies f (0) ≤ f (0 + 0) ≤ 2f (0) or 0 ≤ f (0), so f (0) = 0. Thus, f (αx) = αf (x). Definition 3.2.3. We say that f : E → R ∪ {+∞} is directionally differentiable at x ∈ dom f in the direction h ∈ E if the following limit exists in R ∪ {−∞, +∞}:

lim t↓0

f (x + th) − f (x) . t

When it exists it is denoted by f 0 (x; h). We say that f is directionally differentiable at x if it is directionally differentiable at x in every direction h. Lemma 3.2.5. Let f : E → R ∪ {+∞} be a convex function and x¯ ∈ int (dom f ). Then g(t) :=

f (¯ x + th) − f (¯ x) , t

(3.4)

defined for all t 6= 0, is a non-decreasing function of t. Proof.

For any nonzero t1 < t2 we have to show that g(t1 ) ≤ g(t2 ). We consider

three cases: 0 < t1 < t2 , t1 < 0 < t2 and t1 < t2 < 0. First, assume that 0 < t1 < t2 and define y := x¯ + t2 h. Then for every α ∈ (0, 1] we have x¯ + α(t2 h) = (1 − α)¯ x + αy. By the convexity of f we have

f (¯ x + α(t2 h)) = f ((1 − α)¯ x + αy) ≤ (1 − α)f (¯ x) + αf (y) = f (¯ x) − αf (¯ x) + αf (y).

Subtracting f (¯ x) from both sides and dividing by α we get

f (¯ x + α(t2 h)) − f (¯ x) ≤ f (y) − f (¯ x) for all α ∈ (0, 1]. α

37 Substitute α with

t1 t2

∈ (0, 1] and y with x¯ + t2 h to obtain

g(t1 ) =

f (¯ x + t1 h) − f (¯ x) f (¯ x + t2 h) − f (¯ x) ≤ = g(t2 ). t1 t2

Second, assume that t1 < 0 < t2 and define x := x¯ + t1 h and y := x¯ + t2 h. Then for every number α ∈ (0, 1] we have x + α(t2 − t1 )h = (1 − α)x + αy. Using the fact that f is a convex function we get

f (x + α((t2 − t1 )h)) = f ((1 − α)x + αy) ≤ (1 − α)f (x) + αf (y).

Choose α =

−t1 t2 −t1

∈ (0, 1), then 1 − α =

t2 . t2 −t1

For this α we have x + α(t2 − t1 )h = x¯.

Substituting into the above inequality we get

f (¯ x) ≤

t1 t2 f (x) − f (y). t2 − t1 t2 − t1

Multiplying both sides by t2 − t1 > 0 and regrouping gives

t2 (f (¯ x) − f (x)) ≤ (−t1 )(f (y) − f (¯ x)).

Dividing by (−t1 )t2 > 0 we obtain

g(t1 ) =

f (y) − f (¯ x) f (x) − f (¯ x) ≤ = g(t2 ), t1 t2

by the definitions of x and y. The third case is analogous to the first case. Proposition 3.2.6 (Directional derivatives of convex functions exist). Let f : E →

38 R ∪ {+∞} be a convex function. Then for x¯ ∈ int (dom f ) the directional derivative f 0 (¯ x; h) exists for all h ∈ E and is finite. Proof. Define g(t) :=

f (¯ x + th) − f (¯ x) for nonzero t ∈ R. t

For t ∈ R++ sufficiently small, both x¯ + th and x¯ − th belong to dom f and hence f (¯ x + th) and f (¯ x − th) are finite along with f (¯ x). By Lemma 3.2.5 we have g(−t) ≤ g() ≤ g(t) for all −t < 0 <  < t. Since g() is monotone and finite in , the limit as  converges to 0+ , exists and is finite. Proposition 3.2.7. Let f : E → R ∪ {+∞} be directionally differentiable at x¯. Then the function h 7→ f 0 (¯ x; h) is positively homogeneous. Proof. For any direction h ∈ E and any α ≥ 0 we have

f (¯ x + αth) − f (¯ x) f (¯ x + αth) − f (¯ x) = α lim t↓0 t↓0 t αt f (¯ x + t0 h) − f (¯ x) = α lim = αf 0 (¯ x; h), 0 0 t ↓0 t

f 0 (¯ x; αh) = lim

where we let t0 := αt. Definition 3.2.4. A vector φ ∈ E is a subgradient of a function f : E → R ∪ {+∞} at x ∈ dom f if f (y) ≥ f (x) + hφ, y − xi for all y ∈ E. The subdifferential of f at x is the set ∂f (x) := {φ ∈ E | φ is a subgradient of f at x}. If x ∈ / dom f then we define ∂f (x) := ∅. Proposition 3.2.8. Let f : E → R∪{+∞} be a convex function and x¯ ∈ int (dom f )

39 then ∂f (¯ x) = {φ ∈ E | hφ, hi ≤ f 0 (¯ x; h) ∀h ∈ E}. Proof. Suppose that φ ∈ ∂f (¯ x) then hφ, x − x¯i + f (¯ x) ≤ f (x) for all x ∈ E. Let x := x¯ + th then for all small enough t > 0

hφ, thi ≤ f (¯ x + th) − f (¯ x).

Divide both sides by t and take the limit as t ↓ 0 to see hφ, hi ≤ f 0 (¯ x; h). Now suppose that hφ, hi ≤ f 0 (¯ x; h) for all h ∈ E and some φ ∈ E. By Lemma 3.2.5 for every h = x − x¯ we have

hφ, x − x¯i ≤ f 0 (¯ x; x − x¯) = lim t↓0

f (¯ x + t(x − x¯)) − f (¯ x) f (¯ x + x − x¯) − f (¯ x) ≤ . t 1

Then hφ, x − x¯i + f (¯ x) ≤ f (x) for all x ∈ E, showing that φ ∈ ∂f (¯ x). Proposition 3.2.9. If f : E → R ∪ {+∞} is a convex function and x¯ ∈ int (dom f ) then ∂f (¯ x) is nonempty. Proof. Notice if (¯ x + h) ∈ / dom f then any φ ∈ E satisfies hφ, hi ≤ f (¯ x + h) − f (¯ x). Then consider D = {h ∈ E | x¯ + h ∈ dom f }. It is easy to see that (¯ x, f (¯ x)) is a boundary point of epi f . By Proposition 3.1.7 there exists a nonzero vector (φ, u), with φ ∈ E and u ∈ R, satisfying

hφ, x¯ + hi + uα ≤ hφ, x¯i + uf (¯ x), ∀h ∈ D, α ∈ R with (¯ x + h, α) ∈ epi f.

If u > 0 then we can choose α sufficiently large that the inequality fails, so we conclude

40 that u ≤ 0. If u = 0 then hφ, x¯ + hi ≤ hφ, x¯i. For n sufficiently large and letting h :=

φ n

∈ D (here we use the fact that x¯ ∈ int (dom f )) we obtain that

hφ, x¯i +

1 hφ, φi ≤ hφ, x¯i, n

or hφ, φi ≤ 0. Thus, φ must be zero and we have a contradiction (φ, u) = (0, 0). We conclude that u < 0. Subtract hφ, x¯i from each side and divided both sides of the inequality by (−u):

1 hφ, hi − α ≤ −f (¯ x), ∀h ∈ D with (¯ x + h, α) ∈ epi f. −u

The inequality must also hold for α = f (¯ x + h):

1 hφ, hi − f (¯ x + h) ≤ −f (¯ x), ∀h ∈ D. −u

Together with the first line of the proof this shows that (1/(−u))φ ∈ ∂f(x̄).

3.3 Lipschitz continuity of convex functions

Definition 3.3.1. A function f : E → R ∪ {+∞} is said to be Lipschitz at x¯, with a Lipschitz constant K if there is a neighborhood U ∈ dom f of x¯ such that

|f (x0 ) − f (x00 )| ≤ Kkx0 − x00 k, ∀x0 , x00 ∈ U.

Definition 3.3.2. A function f : E → R ∪ {+∞} is locally upper bounded at x¯ ∈ E

41 if there exists an r > 0 and an M ∈ R such that f (x) ≤ M for all x ∈ B(¯ x, r). Lemma 3.3.1. Let f : E → R ∪ {+∞} be convex and x¯ ∈ int (dom f ). Then f is locally upper bounded at x¯. Proof. Let {e0i }ni=1 be an orthonormal basis for E. Since x¯ ∈ int (dom f ) there exists  > 0 such that x¯ + e0i ∈ int (dom f ) for i = 1, . . . , n. Define C := conv {¯ x ± e0i | i = 1, . . . , n}. We show that B(¯ x, ¯) ⊂ C where ¯ := If x ∈ B(¯ x, ¯) then x = x¯ + pPn

i=1

Pn

√ . n

hi e0i for some reals h1 , . . . , hn such that

i=1

h2i < ¯. Without loss of generality assume that hi ≥ 0, because C is symmetric

with respect to x¯ and we can substitute e0i for −e0i if necessary. By Proposition 3.2.1 we have n X h 2 i

i=1

n

n

1X 2 h. n i=1 i



Taking the square root of both sides and multiplying by n results in

n X

v u n X √ u √ hi ≤ nt h2i < n¯ = .

i=1

i=1

Then

x = x¯ +

n X

hi e0i



= 1−

i=1

Pn

i=1



hi 

x¯ +

n X hi i=1



(¯ x + e0i ).

This expresses x as a convex combination of x¯ and x¯ + e0i , i = 1, ..., n, therefore x ∈ C. Since B(¯ x, ¯) ⊂ C using Corollary 3.2.2 we have

sup f (x) = max f (x) ≤ max f (¯ x ± e0i ) =: M. x∈B(¯ x,¯ )

x∈C

i=1,...,n

42 This shows that f is locally upper bounded at x¯. Lemma 3.3.2. If g : E → R ∪ {+∞} is a convex function that is upper bounded on B(0, 2r) with g(0) = 0 then it is Lipschitz on B(0, r) for r > 0. Proof. Fix distinct points x, y ∈ B(0, r) and define

z := x +

r (x − y) where α := kx − yk. α

¯ r) = B(0, 2r) we get g(z) ≤ M . Solving for x in the definition Since z ∈ B(0, r)+ B(0, of z we obtain x=

r 1 α z + y. 1 + αr 1 + αr

Since g is convex g(x) ≤

r 1 α g(z) + g(y) 1 + αr 1 + αr

and subtracting g(y) from both sides leads to

g(x) − g(y) ≤

1 1 g(y). r g(z) − 1+ α 1 + αr

Notice that −g(y) ≤ g(−y). Indeed,

0 = g(0) = g

1

 1 1 1 y + (−y) ≤ g(y) + g(−y). 2 2 2 2

Since −y ∈ B(0, r) then −g(y) ≤ g(−y) ≤ M and using Inequality (3.5) we get

g(x) − g(y) ≤

1 1 2 2α 2 M = M kx − yk. r M + r M = r M ≤ 1+ α 1+ α 1+ α r r

(3.5)

43 Exchanging the places of x and y in the above arguments we can see that

2 g(y) − g(x) ≤ M kx − yk. r

Thus, g is Lipschitz on B(0, r). Theorem 3.3.3. If f : E → R ∪ {+∞} is convex with x¯ ∈ int (dom f ) then f is Lipschitz at x¯. Proof.

By Lemma 3.3.1 there is a constant M such that f (x) ≤ M ∈ R for all

x ∈ B(¯ x, 2r), for some r > 0. Define the convex function g(z) := f (¯ x + z) − f (¯ x) and notice that 0 ∈ int (dom g) and that it is upper bounded on B(0, 2r), because

¯. g(z) = f (z + x¯) − f (¯ x) ≤ M − f (¯ x) =: M

Lemma 3.3.2 implies g is Lipschitz on B(0, r) with some constant K. For x0 , x00 close to x¯ we have

|f (x0 )−f (x00 )| = |f ((x0 −¯ x)+¯ x)−f ((x00 −¯ x)+¯ x)| = |g(x0 −¯ x)−g(x00 −¯ x)| ≤ Kkx0 −x00 k,

so f is Lipschitz at x¯.
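Theorem 3.3.3 guarantees a local Lipschitz constant but does not name one; for simple convex functions a constant is visible directly. The sketch below (illustration only, assuming NumPy) tests the convex function f(x) = max_i x_i, which satisfies |f(x') − f(x'')| ≤ ‖x' − x''‖ on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(10)
f = lambda x: np.max(x)                     # convex, Lipschitz with constant 1 in the Euclidean norm
ok = True
for _ in range(1000):
    x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
    ok &= abs(f(x1) - f(x2)) <= np.linalg.norm(x1 - x2) + 1e-12
print(bool(ok))                             # True
```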


3.4 The Clarke subdifferential

Definition 3.4.1. Let f : E → R ∪ {+∞} be Lipschitz at x̄ and let h be a vector in E. The Clarke directional derivative of f at x̄ in the direction h is defined by:

f°(x̄; h) := lim sup_{y→x̄, t↓0} (f(y + th) − f(y)) / t.

Proposition 3.4.1. Let f : E → R ∪ {+∞} be Lipschitz at x¯ then the function h 7→ f ◦ (¯ x; h) is sublinear and satisfies |f ◦ (¯ x; h)| ≤ Kkhk for all h ∈ E. Proof. Let f be Lipschitz at x¯ with positive constant K. For any h ∈ E we have

f (y + th) − f (y) ≤ lim sup |f (y + th) − f (y)| |f (¯ x; h)| = lim sup t t y→¯ x,t↓0 y→¯ x,t↓0 ◦

≤ K lim sup y→¯ x,t↓0

ky + th − yk = Kkhk. t

This also shows that the function h → f ◦ (¯ x; h) is everywhere finite. Next, we show that h 7→ f ◦ (¯ x; h) is positively homogeneous. For α = 0,

f ◦ (¯ x; 0h) = lim sup y→¯ x,t↓0

f (y) − f (y) = 0 = 0f ◦ (¯ x; h), t

since f ◦ (¯ x; h) is finite. If α > 0 then

f ◦ (¯ x; αh) = lim sup y→¯ x,t↓0

f (y + αth) − f (y) t

= α lim sup y→¯ x,t↓0

= α lim sup y→¯ x,t0 ↓0

f (y + αth) − f (y) αt f (y + t0 h) − f (y) t0

45

= αf ◦ (¯ x; h),

where in the third equality above we made the substitution t0 := αt. Now we show that h 7→ f ◦ (¯ x; h) is subadditive. Indeed,

f ◦ (¯ x; h1 + h2 ) = lim sup y→¯ x,t↓0

= lim sup y→¯ x,t↓0

≤ lim sup y→¯ x,t↓0

= lim sup z→¯ x,t↓0

f (y + th1 + th2 ) − f (y) t f (y + th1 + th2 ) − f (y + th2 ) + f (y + th2 ) − f (y) t f (y + th1 + th2 ) − f (y + th2 ) f (y + th2 ) − f (y) + lim sup t t y→¯ x,t↓0 f (z + th1 ) − f (z) f (y + th2 ) − f (y) + lim sup t t y→¯ x,t↓0

= f ◦ (¯ x; h1 ) + f ◦ (¯ x; h2 ),

where z = y + th2 . By Proposition 3.2.4, h 7→ f ◦ (¯ x; h) is sublinear. Proposition 3.4.2. Let f : E → R ∪ {+∞} be Lipschitz at x¯, then

f ◦ (¯ x; −h) = (−f )◦ (¯ x; h), ∀h ∈ E.

Proof. We calculate

f ◦ (¯ x; −h) = lim sup y→¯ x,t↓0

= lim sup y→¯ x,t↓0

= lim sup u→¯ x,t↓0

f (y − th) − f (y) t (−f )(y) − (−f )(y − th) t (−f )(u + th) − (−f )(u) t

= (−f )◦ (¯ x; h), as stated.

where u := y − th
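To see how the Clarke directional derivative can differ from the ordinary one, consider f(x) = −|x| at x̄ = 0 with h = 1: the one-sided derivative is −1, while the lim sup in Definition 3.4.1, which also ranges over base points y near 0, equals +1. The sketch below (illustration only, assuming NumPy; the lim sup is crudely approximated by sampling y and t) makes this visible numerically:

```python
import numpy as np

f = lambda x: -abs(x)
h = 1.0
ordinary = (f(1e-8 * h) - f(0.0)) / 1e-8                      # ordinary directional derivative at 0
ys = np.linspace(-1e-3, 1e-3, 2001)                           # base points near 0
ts = np.logspace(-8, -4, 50)                                  # small positive step sizes
clarke = max((f(y + t * h) - f(y)) / t for y in ys for t in ts)
print(round(ordinary, 6), round(clarke, 6))                   # approximately -1.0 and 1.0
```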

46 Definition 3.4.2. Let f : E → R ∪ {+∞} be Lipschitz at x¯. The Clarke subdifferential of f at x¯ is defined by

∂ ◦ f (¯ x) := {φ ∈ E | hφ, hi ≤ f ◦ (¯ x; h), ∀ h ∈ E}.

Proposition 3.4.3 (Scalar multiples). Let f : E → R ∪ {+∞} be Lipschitz at x̄, then ∂°(αf )(x̄) = α ∂°f (x̄), ∀α ∈ R.

Proof. The function αf is also Lipschitz at x̄, since for all x′ and x′′ close to x̄ we have |αf (x′) − αf (x′′)| ≤ |α||f (x′) − f (x′′)| ≤ K|α|‖x′ − x′′‖. Suppose that α > 0, then

∂°(αf )(x̄) = {φ ∈ E | ⟨φ, h⟩ ≤ α f°(x̄; h) ∀h ∈ E}
           = {φ ∈ E | ⟨φ/α, h⟩ ≤ f°(x̄; h) ∀h ∈ E}
           = {αφ′ ∈ E | ⟨φ′, h⟩ ≤ f°(x̄; h) ∀h ∈ E}     where φ′ = φ/α
           = α{φ′ ∈ E | ⟨φ′, h⟩ ≤ f°(x̄; h) ∀h ∈ E} = α ∂°f (x̄).

To show the result for strictly negative α, it is enough to show it for α = −1:

∂°(−f )(x̄) = {φ ∈ E | ⟨φ, h⟩ ≤ (−f )°(x̄; h) ∀h ∈ E}
            = {φ ∈ E | ⟨φ, h⟩ ≤ f°(x̄; −h) ∀h ∈ E}     by Proposition 3.4.2
            = {φ ∈ E | ⟨φ, −h′⟩ ≤ f°(x̄; h′) ∀h′ ∈ E}     where h′ = −h
            = {φ ∈ E | ⟨−φ, h′⟩ ≤ f°(x̄; h′) ∀h′ ∈ E} = −∂°f (x̄).

Finally, if α = 0 then

∂°(0f )(x̄) = {φ ∈ E | ⟨φ, h⟩ ≤ 0, ∀h ∈ E} = {0} = 0 ∂°f (x̄).

The proof is complete. Proposition 3.4.4 (Local extrema). Let f : E → R ∪ {+∞} be Lipschitz at x¯, where x¯ is a local extremum of f , then 0 ∈ ∂ ◦ f (¯ x). Proof. It is enough to show the result in the case when x¯ is a local minimum, since

0 ∈ ∂ ◦ f (¯ x) ⇔ 0 ∈ −∂ ◦ f (¯ x) ⇔ 0 ∈ ∂ ◦ (−f )(¯ x),

by Proposition 3.4.3. Suppose x¯ is a local minimum and fix an h ∈ E. Then for sufficiently small t ∈ R+ we have f (¯ x + th) ≥ f (¯ x) and

f°(x̄; h) = lim sup_{y→x̄, t↓0} (f (y + th) − f (y)) / t ≥ lim sup_{t↓0} (f (x̄ + th) − f (x̄)) / t ≥ 0.

This implies f ◦ (¯ x; h) ≥ h0, hi for all h ∈ E. Therefore 0 ∈ ∂ ◦ f (¯ x). Definition 3.4.3. For D ⊆ E, the indicator function of D is

δ_D(x) = 0 if x ∈ D, and δ_D(x) = +∞ if x ∉ D.

Lemma 3.4.5. Let f : E → R ∪ {+∞} be Lipschitz at x̄, then

δ_{∂°f(x̄)}(φ) = sup_{h∈E} {⟨φ, h⟩ − f°(x̄; h)}, ∀φ ∈ E.

Proof. If φ ∈ ∂°f (x̄) then ⟨φ, h⟩ ≤ f°(x̄; h) for all h ∈ E by definition. The supremum is 0 because f°(x̄; 0) = 0. On the other hand, if φ ∉ ∂°f (x̄) then there exists an h0 ∈ E such that ⟨φ, h0⟩ > f°(x̄; h0). By Proposition 3.4.1, f°(x̄; ·) is positively homogeneous so we have

sup_{t≥0} sup_{h∈E} {⟨φ, th⟩ − f°(x̄; th)} ≥ sup_{t∈R+} {t (⟨φ, h0⟩ − f°(x̄; h0))} = +∞

and the proof is complete. Proposition 3.4.6. Let f : E → R ∪ {+∞} be Lipschitz at x¯ ∈ E with Lipschitz constant K, then the Clarke subdifferential is a compact and convex set. Moreover,

∂ ◦ f (¯ x) ⊂ KB(0, 1).

Proof. By Proposition 3.4.1, h 7→ f ◦ (¯ x; h) is finite. For any φ1 , φ2 ∈ ∂ ◦ f (¯ x) we have

hφ1 , hi ≤ f ◦ (¯ x; h), ∀ h ∈ E hφ2 , hi ≤ f ◦ (¯ x; h), ∀ h ∈ E.

Multiplying the first inequality by α ∈ [0, 1] and the second by (1 − α) and adding the results gives hαφ1 + (1 − α)φ2 , hi ≤ f ◦ (¯ x; h) ∀ h ∈ E. Therefore αφ1 + (1 − α)φ2 ∈ ∂ ◦ f (¯ x), showing that ∂ ◦ f (¯ x) is a convex set. Let {φn } be a sequence of subgradients of f at x¯, converging to φ. Since the inner product is continuous, taking the limit in

hφn , hi ≤ f ◦ (¯ x; h), ∀ h ∈ E

49 shows that hφ, hi ≤ f ◦ (¯ x; h). Thus, φ belongs to ∂ ◦ f (¯ x) showing that ∂ ◦ f (¯ x) is closed. To see that ∂ ◦ f (¯ x) is bounded take a φ in ∂ ◦ f (¯ x) and apply Proposition 3.4.1: hφ, hi ≤ f ◦ (¯ x, h) ≤ Kkhk, ∀ h ∈ E. Choosing h = φ, gives kφk2 ≤ Kkφk or kφk ≤ K. Theorem 3.4.7 (Nonsmooth max formula). Let f : E → R ∪ {+∞} be Lipschitz at x¯, then ∂ ◦ f (¯ x) is a nonempty set and

f°(x̄; h) = max{⟨φ, h⟩ | φ ∈ ∂°f (x̄)}, ∀h ∈ E.    (3.6)

Proof. By Proposition 3.4.1, h ↦ f°(x̄; h) is finite and everywhere sublinear

so f ◦ (¯ x; ·) is convex with dom f ◦ (¯ x; ·) = E. Definition 3.4.2 immediately implies f ◦ (¯ x; h) ≥ sup{hφ, hi | φ ∈ ∂ ◦ f (¯ x)}. Define the convex function g(h) := f ◦ (¯ x; h) and observe that it has a nonempty subdifferential at every h by Proposition 3.2.9. Lemma 3.4.5 implies

sup{⟨φ, h⟩ | φ ∈ ∂°f (x̄)} = sup_{φ∈E} {⟨φ, h⟩ − δ_{∂°f(x̄)}(φ)}
                          = sup_{φ∈E} {⟨φ, h⟩ − sup_{y∈E} {⟨φ, y⟩ − f°(x̄; y)}}
                          = sup_{φ∈E} inf_{y∈E} {⟨φ, h⟩ − ⟨φ, y⟩ + f°(x̄; y)}
                          ≥ sup_{φ∈E} inf_{y∈E} {⟨φ, h − y⟩ + f°(x̄; h) + ⟨ϕ, y − h⟩}
                          = f°(x̄; h) + sup_{φ∈E} inf_{y∈E} {⟨φ − ϕ, h − y⟩}
                          = f°(x̄; h),

where the inequality holds for all ϕ ∈ ∂g(h); the last equality holds because

inf_{y∈E} {⟨φ − ϕ, h − y⟩} = 0 if φ = ϕ, and −∞ if φ ≠ ϕ.

Thus, we showed that

f°(x̄; h) = sup{⟨φ, h⟩ | φ ∈ ∂°f (x̄)}, ∀h ∈ E.

Since f ◦ (¯ x; h) is finite, we conclude that the set ∂ ◦ f (¯ x) is nonempty. Since, by Proposition 3.4.6, ∂ ◦ f (¯ x) is a compact set the supremum is attained. Proposition 3.4.8 (Nonsmooth sum rule). Let fi : E → R ∪ {+∞} be Lipschitz at x¯ ∈ E, i = 1, . . . , n. Then

∂°(Σ_{i=1}^{n} fi)(x̄) ⊆ Σ_{i=1}^{n} ∂°fi (x̄).

Proof. The right-hand side is the set of all points of the form Σ_{i=1}^{n} φi, where each φi belongs to ∂°fi (x̄). It suffices to prove the formula for n = 2, since the general case follows by induction. Thus, assume that n = 2. The left-hand side is a compact set by Theorem 3.4.7. The right-hand side is a sum of two compact sets and is therefore closed. By Corollary 3.1.6 the inclusion holds if and only if

max{⟨φ, h⟩ | φ ∈ ∂°(f1 + f2)(x̄)} ≤ max{⟨φ, h⟩ | φ ∈ ∂°f1 (x̄) + ∂°f2 (x̄)}.

By Theorem 3.4.7 we need to show that

(f1 + f2)°(x̄; h) ≤ f1°(x̄; h) + f2°(x̄; h).

This follows from

(f1 + f2)°(x̄; h) = lim sup_{y→x̄, t↓0} ((f1 + f2)(y + th) − (f1 + f2)(y)) / t
                 = lim sup_{y→x̄, t↓0} (f1(y + th) + f2(y + th) − f1(y) − f2(y)) / t
                 ≤ lim sup_{y→x̄, t↓0} (f1(y + th) − f1(y)) / t + lim sup_{y→x̄, t↓0} (f2(y + th) − f2(y)) / t
                 = f1°(x̄; h) + f2°(x̄; h).

Our goal is to prove a nonsmooth chain rule formula.

Lemma 3.4.9. Let x, y ∈ E be distinct and suppose that f : E → R is Lipschitz on an open set containing [x, y]. The function g : [0, 1] → R defined by g(t) := f (x + t(y − x)) is Lipschitz on (0, 1) and

∂ ◦ g(t) ⊆ h∂ ◦ f (x + t(y − x)), y − xi.

Proof. We first show that g is Lipschitz. Indeed, for any t1 , t2 ∈ (0, 1)

|g(t1 ) − g(t2 )| = |f (x + t1 (y − x)) − f (x + t2 (y − x))| ≤ Kk(t1 − t2 )(y − x)k ≤ |t1 − t2 |Kky − xk ≤ Kky − xk,

where K is the Lipschitz constant of f . Since ∂ ◦ g(t) and h∂ ◦ f (x + t(y − x)), y − xi are nonempty convex intervals in R, by Corollary 3.1.6 we need to show that for each

h ∈ {−1, +1} we have

max{∂°g(t) h} ≤ max{⟨∂°f (x + t(y − x)), y − x⟩ h}.

By Theorem 3.4.7 the left-hand side is just

g°(t; h) = lim sup_{s→t, α↓0} (g(s + αh) − g(s)) / α
         = lim sup_{s→t, α↓0} (f (x + (s + αh)(y − x)) − f (x + s(y − x))) / α
         = lim sup_{s→t, α↓0} (f (x + s(y − x) + αh(y − x)) − f (x + s(y − x))) / α.

Let y′ := x + s(y − x) and note y′ converges to x + t(y − x) if and only if s converges to t. We continue by observing that after the substitution, the limit supremum is taken over a larger set of sequences, thus

g°(t; h) ≤ lim sup_{y′→x+t(y−x), α↓0} (f (y′ + αh(y − x)) − f (y′)) / α
         = f°(x + t(y − x); h(y − x)) = max{⟨∂°f (x + t(y − x)), y − x⟩ h},

where in the last equality we used Theorem 3.4.7. Theorem 3.4.10 (Lebourg mean value theorem). Let x, y ∈ E and let f : E → R ∪ {+∞} be Lipschitz on an open set containing [x, y]. Then there exists a point u in (x, y) such that f (y) − f (x) ∈ h∂ ◦ f (u), y − xi.

Proof. Let xt := x + t(y − x) for t ∈ [0, 1] and define g : [0, 1] → R by g(t) := f (xt). Consider the function θ on [0, 1] defined by

θ(t) = f (xt) + t(f (x) − f (y)).

Since θ(0) = θ(1) = f (x) and θ is continuous on [0, 1], there must be a point t ∈ (0, 1) at which θ attains a local extremum. By Proposition 3.4.4,

0 ∈ ∂°θ(t) = ∂°_t (f (xt) + t(f (x) − f (y)))     by the definition of θ(t)
           = ∂°_t (g(t) + t(f (x) − f (y)))     by the definition of g(t)
           ⊆ ∂°g(t) + ∂°_t (t(f (x) − f (y)))     by Proposition 3.4.8
           = ∂°g(t) + (f (x) − f (y)) ∂°t     by Proposition 3.4.3
           = ∂°g(t) + f (x) − f (y)     since ∂°t = {1}
           ⊆ ⟨∂°f (xt), y − x⟩ + f (x) − f (y)     by Lemma 3.4.9.

Letting u := xt shows that f (y) − f (x) ∈ ⟨∂°f (u), y − x⟩.
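To make Theorem 3.4.10 concrete, take f(x) = |x| on E = R with x = −1 and y = 2, so f(y) − f(x) = 1 and y − x = 3. Since ∂°f(u) = {sign u} for u ≠ 0 and ∂°f(0) = [−1, 1], the only u ∈ (x, y) with 1 ∈ ⟨∂°f(u), y − x⟩ = 3 ∂°f(u) is u = 0 (take φ = 1/3). The following hypothetical NumPy sketch is an illustrative aside, not part of the thesis; it simply scans the segment and reports the points that satisfy the inclusion.

import numpy as np

def clarke_subdiff_abs(u, tol=1e-12):
    # Clarke subdifferential of |.| on R: {-1} for u < 0, {+1} for u > 0, [-1, 1] at 0.
    if u > tol:
        return (1.0, 1.0)
    if u < -tol:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

x, y = -1.0, 2.0
target = abs(y) - abs(x)        # f(y) - f(x) = 1
for u in np.linspace(x, y, 31):
    lo, hi = clarke_subdiff_abs(u)
    # <∂°f(u), y - x> is the interval [lo*(y - x), hi*(y - x)] since y - x > 0
    if lo * (y - x) - 1e-9 <= target <= hi * (y - x) + 1e-9:
        print("mean value point u ≈", round(float(u), 3))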

3.5

The Michel-Penot directional derivative

Definition 3.5.1. Let f : E → R ∪ {+∞} be Lipschitz at x̄ ∈ int (dom f ) and let h be a direction in E. The Michel-Penot directional derivative of f at x̄ in the direction h is defined by

f⋄(x̄; h) := sup_{u∈E} lim sup_{t↓0} (f (x̄ + th + tu) − f (x̄ + tu)) / t.

Proposition 3.5.1. Suppose f : E → R ∪ {+∞} is convex. If the point x̄ ∈ int (dom f ) then f′(x̄; h) = f⋄(x̄; h) = f°(x̄; h).

Proof. If h = 0 then the statement is trivial. Fix h ∈ E and assume initially that ‖h‖ = 1. Clearly f′(x̄; h) ≤ f⋄(x̄; h). The result follows if f⋄(x̄; h) ≤ f°(x̄; h) and f°(x̄; h) ≤ f′(x̄; h). For each fixed u ∈ E the point x̄ + tu converges to x̄ as t ↓ 0. Thus,

lim sup_{t↓0} (f (x̄ + th + tu) − f (x̄ + tu)) / t ≤ lim sup_{y→x̄, t↓0} (f (y + th) − f (y)) / t = f°(x̄; h).

Taking the supremum of the left-hand side over all u ∈ E shows f⋄(x̄; h) ≤ f°(x̄; h). To show the second inequality, let ε > 0 be small enough so that f is Lipschitz on B(x̄, 2ε) with constant K. Such an ε exists by Theorem 3.3.3. If t ∈ (0, ε) then for every y such that ‖y − x̄‖ ≤ εt, we have y + th ∈ B(x̄, 2ε) along with x̄ + th ∈ B(x̄, 2ε). Then by Lemma 3.2.5 we have

(f (y + th) − f (y)) / t = (f (y + th) − f (x̄ + th)) / t + (f (x̄ + th) − f (x̄)) / t + (f (x̄) − f (y)) / t
                        ≤ Kε + (f (x̄ + th) − f (x̄)) / t + Kε = (f (x̄ + th) − f (x̄)) / t + 2Kε.

Taking the limit supremum as y → x̄ and t ↓ 0 on both sides shows

f°(x̄; h) ≤ f′(x̄; h) + 2Kε.    (3.7)

Since ε can be made arbitrarily small we obtain that f°(x̄; h) ≤ f′(x̄; h) for every h ∈ E with ‖h‖ = 1.

Finally, if h ∈ E is an arbitrary nonzero vector, then by the above we get

f°(x̄; h) = ‖h‖ f°(x̄; h/‖h‖) ≤ ‖h‖ f′(x̄; h/‖h‖) = f′(x̄; h),

where we used that both directional derivatives are positively homogeneous in the direction, by Proposition 3.2.7 and Proposition 3.4.1.


Chapter 4 Majorization

This chapter follows the development in [2, Chapter 1] and [23, Chapter 2].

4.1

Basic definitions

Denote by e the all-one vector in Rn and let [·] : Rn → Rn be the map permuting the coordinates of a vector into nonincreasing order. In other words, for any x ∈ Rn, [x]1 ≥ [x]2 ≥ . . . ≥ [x]n. A cone is a subset C of a vector space X such that 0 ∈ C and for any α ∈ R++ the relationship αC = C holds. Denote by Rn↓ := {x ∈ Rn | x = [x]} the set of all vectors in Rn that are ordered non-increasingly. It is easy to see that Rn↓ is a cone. A permutation matrix is an n × n real matrix with entries 0 or 1 and exactly one 1 in each row and in each column. The set of all n × n permutation matrices is denoted by Pn. The Hadamard or entrywise product of two m × n matrices A and B, denoted by A ◦ B, is the m × n matrix defined by (A ◦ B)ij = Aij Bij for all i, j.


4.2

Majorizations of vectors

Definition 4.2.1. For x, y ∈ Rn we say x is majorized by y, denoted by x ≺ y, if

Σ_{i=1}^{k} [x]i ≤ Σ_{i=1}^{k} [y]i, for all k = 1, . . . , n − 1, and Σ_{i=1}^{n} [x]i = Σ_{i=1}^{n} [y]i.

Subtracting the inequalities from the equality we obtain an alternative characterization: x ≺ y if and only if

Σ_{i=n−k+1}^{n} [y]i ≤ Σ_{i=n−k+1}^{n} [x]i, for all k = 1, . . . , n − 1, and Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi.
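Since majorization is used throughout the rest of this work, it is convenient to have a direct computational test of Definition 4.2.1. The following NumPy sketch is an illustrative aside, not part of the thesis; the function name is arbitrary and the tolerance is ad hoc.

import numpy as np

def majorized(x, y, tol=1e-9):
    # Returns True when x ≺ y: compare partial sums of the nonincreasing
    # rearrangements [x] and [y], and require equal total sums.
    xs = np.sort(np.asarray(x, dtype=float))[::-1]   # [x]
    ys = np.sort(np.asarray(y, dtype=float))[::-1]   # [y]
    if xs.shape != ys.shape:
        return False
    partial_ok = np.all(np.cumsum(xs)[:-1] <= np.cumsum(ys)[:-1] + tol)
    totals_ok = abs(xs.sum() - ys.sum()) <= tol
    return bool(partial_ok and totals_ok)

# A probability vector lies between the uniform vector and a vertex of the simplex
# (compare Example 4.2.1 below).
n = 5
p = np.array([0.4, 0.25, 0.2, 0.1, 0.05])
print(majorized(np.full(n, 1.0 / n), p))   # True
print(majorized(p, np.eye(n)[0]))          # True
print(majorized(np.eye(n)[0], p))          # False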

Notice that for any P1 , P2 ∈ Pn we have x ≺ y if and only if P1 x ≺ P2 y. Definition 4.2.2. A partial order on a set P is a binary relation, denoted by ≤, that satisfies the following three properties: (a) Reflexivity: a ≤ a for all a ∈ P ; (b) Antisymmetry: If a ≤ b and b ≤ a for some a, b ∈ P then a = b; (c) Transitivity: If a ≤ b and b ≤ c for some a, b, c ∈ P then a ≤ c. It is immediate to see that majorization is reflexive and transitive. We will see later on that if x ≺ y and y ≺ x then x = P y, for some P ∈ Pn . Thus, majorization is not a partial order, since it is not antisymmetric. Example 4.2.1. Let Ω := {x ∈ Rn+ |

Σ_{i=1}^{n} xi = 1}, then for all x ∈ Ω we have

(1/n, 1/n, . . . , 1/n) ≺ x ≺ (1, 0, . . . , 0).

Indeed, it is easy to see that Σ_{i=1}^{k} [x]i ≤ 1, so x ≺ (1, 0, . . . , 0). Suppose that there is an index k such that Σ_{i=1}^{k} [x]i < k/n. Since some of the largest k coordinates of x are strictly less than 1/n, then all of the remaining coordinates are strictly less than 1/n. This implies that

Σ_{i=1}^{k} [x]i + Σ_{i=k+1}^{n} [x]i < k/n + (n − k)/n = 1.

This is a contradiction with the fact that Σ_{i=1}^{n} xi = 1.

A doubly stochastic matrix is an n × n matrix with nonnegative entries, such that the entries in each row and column sum to 1. The proof of the following proposition can be found in [2, Theorem II.2.3]. Proposition 4.2.1 (Birkhoff). A matrix is doubly stochastic if and only if it is a convex combination of permutation matrices. Lemma 4.2.2. For any y ∈ Rn the set {x ∈ Rn | x ≺ y} is convex. Proof.

Let x, x′ ∈ Rn be two vectors that are majorized by y. Without loss of generality assume x = [x] and y = [y]. For any µ ∈ [0, 1] and k = 1, . . . , n − 1 we have

µ Σ_{i=1}^{k} xi + (1 − µ) Σ_{i=1}^{k} x′i ≤ µ Σ_{i=1}^{k} yi + (1 − µ) Σ_{i=1}^{k} yi = Σ_{i=1}^{k} yi.

Clearly, the inequality holds with equality when k = n. Proposition 4.2.3. A matrix D is doubly stochastic if and only if Dx ≺ x for all vectors x ∈ Rn . Proof.

Let Dx ≺ x for all x ∈ Rn. Denote the ith column of D by di := Dei. Since Dei ≺ ei we have

[di]1 + · · · + [di]n−1 ≤ 1    (4.1)
[di]1 + · · · + [di]n = 1.    (4.2)

Equation (4.2) guarantees the ith column of D sums to 1. Subtracting (4.1) from (4.2) we see that [di]n ≥ 0, implying that 1 ≥ [di]1 ≥ . . . ≥ [di]n ≥ 0. Denote r := De, that is, ri is the sum of the entries in the ith row of D. Since De ≺ e we have

[r]1 ≤ 1, . . . , [r]1 + · · · + [r]n−1 ≤ n − 1    (4.3)
[r]1 + · · · + [r]n = n.    (4.4)

Taking the difference between (4.4) and (4.3), we see that 1 ≥ [r]1 ≥ . . . ≥ [r]n ≥ 1. This shows that D is doubly stochastic. Now, suppose that D is a doubly stochastic matrix. We want to show that Dx ≺ x for all x ∈ Rn. By Proposition 4.2.1 the matrix D is a convex combination of permutation matrices, thus Dx = µ1P1x + · · · + µnPnx for some Pi ∈ Pn and µi ∈ [0, 1] with Σ_{i=1}^{n} µi = 1. By Lemma 4.2.2, since Pix ≺ x, the convex combination Dx must be majorized by x.

A T-transform is a linear map T : Rn → Rn of the form tI + (1 − t)P where P ∈ Pn and t ∈ [0, 1]. Define the following set of linear maps

Tj,k := {tI + (1 − t)Pj,k | t ∈ [0, 1]}

where Pj,k ∈ Pn is the matrix corresponding to the transposition of numbers j and k. For example, for any y ∈ Rn if T ∈ Tj,k then

T y = (y1 , . . . , yj−1 , tyj + (1 − t)yk , yj+1 , . . . , yk−1 , (1 − t)yj + tyk , yk+1 , . . . , yn )T .

Lemma 4.2.2 implies that T y ≺ y for all y. Proposition 4.2.4. For x, y ∈ Rn , the following statements are equivalent: (i) x ≺ y; (ii) x is obtained from y by a finite number of T-transforms; (iii) x ∈ conv {P y | P ∈ P n }; (iv) x = Dy for some doubly stochastic matrix D. Proof. We show the implication (i) ⇒ (ii) by induction on dimension n. Since x is obtained from [x] by a finite number of T -transforms, we may assume without loss of generality that x = [x] and y = [y]. When n = 2, (i) implies that

x1 ≤ y1    (4.5)
x1 + x2 = y1 + y2.    (4.6)

Subtracting (4.5) from (4.6) we see that y1 ≥ x1 ≥ x2 ≥ y2. Thus, there exists a

t ∈ [0, 1] such that x1 = ty1 + (1 − t)y2. Subtracting this representation of x1 from (4.6), we obtain x2 = (1 − t)y1 + ty2, showing that x = T y where T ∈ T1,2. Next, assume that implication (i) ⇒ (ii) holds for vectors of dimension up to n − 1. That is, for any x′, y′ ∈ Rn−1, if x′ ≺ y′ then x′ = Tr . . . T1 y′ for some T-transforms acting on vectors in Rn−1. Let x, y ∈ Rn; if x ≺ y, then x1 ≤ y1 and yn ≤ xn. Combine these two inequalities to get yn ≤ x1 ≤ y1. Thus, there must be an index k such that yk ≤ x1 ≤ yk−1. Let t ∈ [0, 1] be such that x1 = ty1 + (1 − t)yk and observe that T := tI + (1 − t)P1,k belongs to T1,k and the first coordinate of T y is x1. Focusing on the last n − 1 entries of x and T y we let

x′ := (x2, . . . , xn)T and y′ := (y2, . . . , yk−1, (1 − t)y1 + tyk, yk+1, . . . , yn)T.    (4.7)

Because of the choice of the index k we have x0 = [x0 ] and y 0 = [y 0 ]. We now show that x0 ≺ y 0 . Indeed, since

y1 ≥ . . . ≥ yk−1 ≥ x1 ≥ x2 ≥ . . . ≥ xn ≥ yn ,

we have for 1 ≤ m ≤ k − 2

Σ_{j=1}^{m} x′j = Σ_{j=2}^{m+1} xj ≤ Σ_{j=2}^{m+1} yj = Σ_{j=1}^{m} y′j.

For k − 1 ≤ m ≤ n − 1,

Σ_{j=1}^{m} y′j = Σ_{j=2}^{k−1} yj + ((1 − t)y1 + tyk) + Σ_{j=k+1}^{m+1} yj = Σ_{j=1}^{k} yj − ty1 − yk + tyk + Σ_{j=k+1}^{m+1} yj
              = Σ_{j=1}^{m+1} yj − ty1 − (1 − t)yk = Σ_{j=1}^{m+1} yj − x1 ≥ Σ_{j=2}^{m+1} xj = Σ_{j=1}^{m} x′j,

where the inequality follows from x ≺ y. This shows that x′ ≺ y′. By the induction hypothesis x′ = Tr · · · T1 y′ for some T-transforms Tr, . . . , T1 on Rn−1. Letting

T′i := [1  0; 0  Ti]  (in block-matrix notation), for i = 1, . . . , r,

we have x = T′r · · · T′1 T y, completely proving the implication (i) ⇒ (ii).

The implication (ii) ⇒ (iii) follows from the fact that the product of T-transforms is a convex combination of permutation matrices. Let T1 := t1I + (1 − t1)P1 and T2 := t2I + (1 − t2)P2 for P1, P2 ∈ Pn and t1, t2 ∈ [0, 1]. Then

T1T2 = (t1I + (1 − t1)P1)(t2I + (1 − t2)P2) = t1t2 I + t1(1 − t2)P2 + (1 − t1)t2 P1 + (1 − t1)(1 − t2)P1P2.

This is a convex combination of permutation matrices since the product of permutation matrices is a permutation matrix. It is now easy to see that the product of more than two T-transforms is a convex combination of permutation matrices. The implication (iii) ⇒ (iv) follows by Proposition 4.2.1.

Implication (iv) ⇒ (i) is a consequence of Proposition 4.2.3.

Corollary 4.2.5. For any x, y ∈ Rn, x ≺ y if and only if −x ≺ −y.

Definition 4.2.3. The matrix D is orthostochastic if there exists an orthogonal ma-

trix U such that D = U ◦ U. Every orthostochastic matrix is doubly stochastic. The converse is true only when n ≤ 2. Indeed, the case n = 1 is trivial. When n = 2, a doubly stochastic matrix can be expressed as

D = [r  1−r; 1−r  r],

for some 0 ≤ r ≤ 1. Let θ be such that r = cos²θ and then clearly

D = [cos θ  sin θ; −sin θ  cos θ] ◦ [cos θ  sin θ; −sin θ  cos θ].

For n ≥ 3 there are doubly stochastic matrices that are not orthostochastic. Indeed, we will use the example cited by Horn [10, pg. 622]. Consider the doubly stochastic matrix

D = (1/2) [1 1 0; 1 0 1; 0 1 1].

Suppose there exists an orthogonal U such that D = U ◦ U; then U would have to be of the form

U = [a b 0; c 0 d; 0 e f].

Since U is orthogonal we have ac = 0. This is a contradiction because if a = 0 then 1/2 = D11 = (U11)² = 0, and analogously if c = 0.

Proposition 4.2.4, part (iv) has an unexpected strengthening due to A. Horn. The proof presented below is not the original one; it is modified from [23].

Proposition 4.2.6. For x, y ∈ Rn, x ≺ y if and only if there exists an orthostochastic matrix D such that x = Dy.

Proof. One direction is clear by Proposition 4.2.4, since every orthostochastic matrix is doubly stochastic. To prove the opposite direction suppose that x ≺ y. We may assume that x = P1x = [x] and y = P2y = [y] for some P1, P2 ∈ Pn, since this implies x = D̂y for D̂ := P1⁻¹DP2, which is also orthostochastic. We show the existence of an orthostochastic D such that x = Dy by induction on the dimension n. First let n := 2. By Proposition 4.2.4 (ii) we know that

x = Dy, where D := [t  1−t; 1−t  t],

for some t ∈ [0, 1]. Define the orthogonal matrix

U := [√t  −√(1−t); √(1−t)  √t].

We see that Dij = (Uij)², showing that D is orthostochastic. Suppose that for any x′, y′ ∈ Rn−1 with x′ ≺ y′ there exists an (n − 1) × (n − 1) orthostochastic matrix ∆ such that x′ = ∆y′. Let x, y ∈ Rn with x ≺ y. In proving the implication (i) ⇒ (ii) in Proposition 4.2.4 we showed that there exist T-transforms T1, . . . , Tr on Rn−1 and

a T-transform T ∈ T1,k on Rn such that

x = [1  0; 0  Tr] · · · [1  0; 0  T1] T y.    (4.8)

Define x′ and y′ as in Equation (4.7) and recall that x′ ≺ y′. Applying the induction hypothesis, there must be an orthostochastic ∆ of dimension (n − 1) × (n − 1) such that x′ = ∆y′. Then we can write Equation (4.8) as

x = DT y,

where

D = [1  0; 0  ∆].

Our goal is to show that DT is an orthostochastic matrix. Without loss of generality we may suppose that k = 2, that is T ∈ T1,2, allowing us to write

T = [t  1−t  0; 1−t  t  0; 0  0  In−2].

Then

DT = [t  1−t  0; (1−t)δ  tδ  ∆̃],

where δ ∈ Rn−1 is the first column of ∆ and the (n−1)×(n−2) matrix ∆̃ is the matrix of the remaining n − 2 columns. Since ∆ is orthostochastic there is a U ∈ On−1 such that ∆ij = (Uij)². Let u ∈ Rn−1 be the first column of U and let the (n − 1) × (n − 2) matrix Ũ be the matrix of the remaining n − 2 columns. Define an n × n matrix V by

V := [√t  −√(1−t)  0; √(1−t)u  √t u  Ũ].

It is easy to see that V is orthogonal and (DT)ij = (Vij)².

Theorem 4.2.7 (Hardy-Littlewood-Polya). Any vectors x, y ∈ Rn satisfy

xT y ≤ [x]T [y].

Equality holds if and only if there is a P ∈ Pn such that [x] = P x and [y] = P y. Proof. Assume without loss of generality that y = [y], since the inequality does not change if we simultaneously exchange x with P x and y with P y. If x = [x] then there is nothing to prove. Thus, suppose that xm ≤ xm+1 for some m ∈ {1, . . . , n − 1} and denote x¯ := (x1 , . . . , xm−1 , xm+1 , xm , xm+2 , . . . , xn ). We want to show that xT y ≤ x¯T y, indeed

x¯T y − xT y = xm+1 ym + xm ym+1 − (xm ym + xm+1 ym+1 ) = (xm+1 − xm )ym + (xm − xm+1 )ym+1 = (xm+1 − xm )(ym − ym+1 ) ≥ 0.

Thus, with each such swap the value of the dot product xT y does not decrease. We can then swap coordinates in x if xm ≤ xm+1 until x = [x]. This proves the inequality xT y ≤ [x]T [y].

67 Equality holds if and only if we had equality in each swapping process. Then when xm < xm+1 equality holds only if ym = ym+1 . If P is the permutation matrix that transposes xm and xm+1 then P y = y. Letting P ∈ Pn be the permutation matrix that is the product of these transposition matrices that order x non-increasingly it satisfies P y = y = [y] and P x = [x]. Corollary 4.2.8. For any x, y ∈ Rn , x ≺ y if and only if

z T x ≤ [z]T [y], ∀z ∈ Rn .

Proof. Suppose that x ≺ y. By Proposition 4.2.4 (iii) x is in the convex hull of all vectors obtained by permuting the coordinates of y. That is, x = µ1P1y + . . . + µnPny for some µi ∈ [0, 1] with Σ_{i=1}^{n} µi = 1. Then by Theorem 4.2.7

z T x = z T (µ1 P1 y + . . . + µn Pn y) ≤ µ1 [z]T [P1 y] + . . . + µn [z]T [Pn y] = µ1 [z]T [y] + . . . + µn [z]T [y] = [z]T [y].

In the opposite direction, suppose that z T x ≤ [z]T [y] for all z ∈ Rn . Let indices i1 , . . . , in be such that xi1 = [x]1 , . . . , xin = [x]n . Then for k = 1, . . . , n

[x]1 + . . . + [x]k = (ei1 + . . . + eik )T x ≤ [ei1 + . . . + eik ]T [y] = [y]1 + . . . + [y]k .

The last inequality (k = n) holds with equality since

−[x]1 − . . . − [x]n = (−e)T x ≤ [−e]T [y] = −[y]1 − . . . − [y]n .

68 Multiplying by (−1), we get [x]1 + . . . + [x]n ≥ [y]1 + . . . + [y]n . Thus, x ≺ y. Corollary 4.2.9. For a, b ∈ Rn we have a + b ≺ [a] + [b]. Proof. For every z ∈ Rn we have

z T (a + b) = z T a + z T b ≤ [z]T [a] + [z]T [b] = [z]T ([a] + [b]),

by Theorem 4.2.7.
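The statements of this section are easy to test numerically. The NumPy sketch below is an illustrative aside (not part of the thesis): it builds a doubly stochastic matrix as a convex combination of random permutation matrices, verifies Dx ≺ x as in Proposition 4.2.3, and checks Corollary 4.2.9 on random vectors; the small `majorized` helper repeats the partial-sum test of Definition 4.2.1.

import numpy as np

def majorized(x, y, tol=1e-9):
    xs, ys = np.sort(x)[::-1], np.sort(y)[::-1]
    return (np.all(np.cumsum(xs)[:-1] <= np.cumsum(ys)[:-1] + tol)
            and abs(xs.sum() - ys.sum()) <= tol)

rng = np.random.default_rng(1)
n = 6

# Birkhoff: a convex combination of permutation matrices is doubly stochastic.
w = rng.random(4); w /= w.sum()
D = sum(wi * np.eye(n)[rng.permutation(n)] for wi in w)
print(np.allclose(D.sum(axis=0), 1.0), np.allclose(D.sum(axis=1), 1.0))

x = rng.standard_normal(n)
print(majorized(D @ x, x))                                      # Proposition 4.2.3: Dx ≺ x

a, b = rng.standard_normal(n), rng.standard_normal(n)
print(majorized(a + b, np.sort(a)[::-1] + np.sort(b)[::-1]))    # Corollary 4.2.9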


Chapter 5 Theorems of Courant-Fischer, Schur, Fan & Lidskii

The developments in this chapter are based on [20] and [2]. Throughout this chapter we assume that M is a subspace of Rn .

5.1

Inequalities of Courant-Fischer, Schur & Fan

Define the map λ : Sn → Rn↓ by λ(A) := (λ1(A), . . . , λn(A))T for all A ∈ Sn, where λi(A) is the ith largest eigenvalue of A, i = 1, . . . , n. This map is called the eigenvalue map.

Proposition 5.1.1. Let M1 and M2 be subspaces of the n-dimensional real vector space X. If dim M1 + dim M2 > n then M1 ∩ M2 contains a non-zero vector.

Proof. Assume that dim M1 = j and dim M2 > n − j. If dim(M1 ∩ M2) = 0 then

dim(M1 + M2 ) = dim M1 + dim M2 > n.

This is a contradiction with the fact that M1 + M2 is a subspace of X

70 Given a subspace M of Rn , by SM we denote the unit sphere in M:

SM := {y ∈ M | kyk2 = 1}.

Proposition 5.1.2 (Courant-Fischer). Let A ∈ Sn. Then

λi(A) = max_{dim M = i} min_{x ∈ SM} xTAx and    (5.1)
λi(A) = min_{dim M = n−i+1} max_{x ∈ SM} xTAx.    (5.2)

Proof. By Proposition 2.5.6 there exists a U ∈ On such that A = UT(Diag λ(A))U. We first show Equality (5.1). Let y := Ux and observe that ‖x‖ = ‖y‖ since U is orthogonal. By Proposition 2.2.2, with D := Diag λ(A), we need to establish that

λi(A) = sup_{dim M = i} min_{y ∈ SM} yTDy.

Let M ⊆ Rn be any subspace of dimension i and let F := span{ei, . . . , en}. By Proposition 5.1.1 the subspace M ∩ F has dimension at least one. Thus, SM∩F is nonempty and vectors y ∈ SM∩F are of the form y = (0, . . . , 0, yi, . . . , yn)T with Σ_{j=i}^{n} y²j = 1. So, we have

yTDy = Σ_{j=i}^{n} λj(A) y²j ≤ λi(A) Σ_{j=i}^{n} y²j = λi(A) for all y ∈ SM∩F.

Since SM∩F ⊆ SM, it follows that

min_{y ∈ SM} yTDy ≤ min_{y ∈ SM∩F} yTDy ≤ λi(A).

These inequalities hold for any subspace M of dimension i, hence

sup_{dim M = i} min_{y ∈ SM} yTDy ≤ λi(A).    (5.3)

For the opposite inequality let F̃ := span{e1, . . . , ei}. Then every y ∈ SF̃ has the form y = (y1, . . . , yi, 0, . . . , 0)T with Σ_{j=1}^{i} y²j = 1 and hence

yTDy = Σ_{j=1}^{i} λj(A) y²j ≥ λi(A) Σ_{j=1}^{i} y²j = λi(A) for all y ∈ SF̃.

So

sup_{dim M = i} min_{y ∈ SM} yTDy ≥ min_{y ∈ SF̃} yTDy ≥ λi(A).    (5.4)

Combining (5.3) and (5.4) we obtain (5.1). Notice that the supremum is attained by the subspace F̃. We prove Equation (5.2) by establishing

λi(A) = inf_{dim M = n−i+1} max_{y ∈ SM} yTDy.    (5.5)

Let M be any subspace of dimension n − i + 1 and let F := span{e1, . . . , ei}. By Proposition 5.1.1 the space M ∩ F has dimension at least one. Thus, the set SM∩F is nonempty and vectors y ∈ SM∩F are of the form y = (y1, . . . , yi, 0, . . . , 0)T with

Σ_{j=1}^{i} y²j = 1. So we have

yTDy = Σ_{j=1}^{i} λj(A) y²j ≥ λi(A) Σ_{j=1}^{i} y²j = λi(A) for all y ∈ SM∩F.

Since SM∩F ⊆ SM, it follows

max_{y ∈ SM} yTDy ≥ max_{y ∈ SM∩F} yTDy ≥ λi(A).

These inequalities hold for every subspace M of dimension n − i + 1, hence

inf_{dim M = n−i+1} max_{y ∈ SM} yTDy ≥ λi(A).    (5.6)

We now prove the reverse inequality. Let F̃ := span{ei, . . . , en}; then every y ∈ SF̃ has the form y = (0, . . . , 0, yi, . . . , yn)T and hence

yTDy = Σ_{j=i}^{n} λj(A) y²j ≤ λi(A) Σ_{j=i}^{n} y²j = λi(A) for all y ∈ SF̃.

So

inf_{dim M = n−i+1} max_{y ∈ SM} yTDy ≤ max_{y ∈ SF̃} yTDy ≤ λi(A).    (5.7)

Combining (5.6) and (5.7) we obtain (5.5), proving (5.2). Notice that the infimum is attained by the subspace F̃.

Corollary 5.1.3. Let A ∈ Sn, then

λ1(A) = max_{x ∈ SRn} xTAx and    (5.8)
λn(A) = min_{x ∈ SRn} xTAx.    (5.9)

Proof. Set i = 1 in Equation (5.2) to obtain Equation (5.8). Similarly, set i = n in Equation (5.1) for the result in Equation (5.9).

Proposition 5.1.4 (Interlacing eigenvalues). For A ∈ Sn, c ∈ Rn, α ∈ R let

B := [A  c; cT  α] ∈ Sn+1.

Then

λ1(B) ≥ λ1(A) ≥ λ2(B) ≥ λ2(A) ≥ . . . ≥ λn(B) ≥ λn(A) ≥ λn+1(B).    (5.10)

Proof. By Proposition 2.5.6 there exists a U ∈ On such that A = UT(Diag λ(A))U. Let

V := [U  0; 0  1] ∈ On+1.

By Proposition 2.2.2 the eigenvalues of B are identical to those of B̃ := V BV T. Let F := span{e1, . . . , ei} ⊂ Rn+1. For any x ∈ SF we have

xTB̃x = Σ_{j=1}^{i} λj(A) x²j ≥ λi(A) Σ_{j=1}^{i} x²j = λi(A).

Applying Equation (5.1) to B̃ yields

λi(B) = max_{dim M = i} min_{x ∈ SM} xTB̃x ≥ min_{x ∈ SF} xTB̃x ≥ λi(A).

Now let F := span{ei−1, . . . , en} ⊂ Rn+1. For any x ∈ SF we have

xTB̃x = Σ_{j=i−1}^{n} λj(A) x²j ≤ λi−1(A) Σ_{j=i−1}^{n} x²j = λi−1(A),

where we used that xn+1 = 0. Applying Equation (5.2) we have

λi(B) = min_{dim M = n−i+2} max_{x ∈ SM} xTB̃x ≤ max_{x ∈ SF} xTB̃x ≤ λi−1(A).

The proof is complete.
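Proposition 5.1.4 is easy to confirm numerically: border a random symmetric A with a row and column and compare the sorted spectra. The NumPy sketch below is an illustrative aside, not part of the thesis.

import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n)); A = (M + M.T) / 2            # A in S^n
c = rng.standard_normal(n); alpha = rng.standard_normal()
B = np.block([[A, c[:, None]], [c[None, :], np.array([[alpha]])]])   # bordered matrix

lam = lambda S: np.sort(np.linalg.eigvalsh(S))[::-1]          # eigenvalues, nonincreasing
lA, lB = lam(A), lam(B)
# Interlacing (5.10): λ_i(B) ≥ λ_i(A) ≥ λ_{i+1}(B) for i = 1, ..., n.
print(np.all(lB[:n] >= lA - 1e-9) and np.all(lA >= lB[1:] - 1e-9))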

If A is any n×n principal submatrix of B ∈ Sn+1, then (5.10) holds, because each principal submatrix can be brought to the upper-left-hand corner by a transformation PTBP for some P ∈ Pn+1. Thus the eigenvalues of a matrix belonging to Sn+1 can be interlaced with the eigenvalues of each of its n×n principal submatrices. Proposition 5.1.5 (Schur). Any A ∈ Sn satisfies diag (A) ≺ λ(A). Proof. By Proposition 2.5.6 there exists a U ∈ On such that A = UT(Diag λ(A))U. Denote by ui the ith column of U, then Aij = uiT(Diag λ(A))uj for i, j = 1, . . . , n. Thus, we have the following relationships

A11 = λ1(A)(U11)² + λ2(A)(U21)² + · · · + λn(A)(Un1)²
⋮
Ann = λ1(A)(U1n)² + λ2(A)(U2n)² + · · · + λn(A)(Unn)².

This shows that diag (A) = (U ◦ U)Tλ(A). Since U ◦ U is a doubly stochastic matrix, Proposition 4.2.4 implies that diag (A) ≺ λ(A).

Proposition 5.1.6 (Ky Fan maximum principle). Any A ∈ Sn satisfies

Σ_{i=1}^{k} λi(A) = max Σ_{i=1}^{k} xiTAxi, for k = 1, . . . , n.

The maximum is taken over all orthonormal k-tuples of vectors {x1, . . . , xk} in Rn.

Proof. By Proposition 2.5.6 there exists U ∈ On such that A = UT(Diag λ(A))U. The ith row ui of U is an eigenvector of A corresponding to the eigenvalue λi(A). Since {u1, . . . , uk} is an orthonormal k-tuple,

max Σ_{i=1}^{k} xiTAxi ≥ Σ_{i=1}^{k} uiTAui = Σ_{i=1}^{k} λi(A),

where the maximum is taken over all orthonormal k-tuples of vectors {x1, . . . , xk}. We now show the opposite inequality. Let {x1, . . . , xk} be an orthonormal k-tuple. Complete it to an orthonormal basis and let X := [x1, . . . , xn] ∈ On. By Proposition 5.1.5 we have

Σ_{i=1}^{k} xiTAxi = Σ_{i=1}^{k} [diag (XTAX)]i ≤ Σ_{i=1}^{k} λi(XTAX) = Σ_{i=1}^{k} λi(A),

where we also used Proposition 2.2.2. Taking the maximum over all orthonormal k-tuples {x1, . . . , xk} gives

max Σ_{i=1}^{k} xiTAxi ≤ Σ_{i=1}^{k} λi(A).    (5.11)

The result follows.
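Both Schur's majorization and the Ky Fan maximum principle can be sanity-checked numerically. The sketch below (hypothetical NumPy code, an illustrative aside) verifies diag(A) ≺ λ(A) and checks that random orthonormal k-tuples never exceed Σ_{i≤k} λi(A), while the top-k eigenvectors attain it.

import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3
M = rng.standard_normal((n, n)); A = (M + M.T) / 2
lam = np.sort(np.linalg.eigvalsh(A))[::-1]                      # λ(A), nonincreasing

# Schur (Proposition 5.1.5): diag(A) ≺ λ(A).
d = np.sort(np.diag(A))[::-1]
print(np.all(np.cumsum(d)[:-1] <= np.cumsum(lam)[:-1] + 1e-9),
      np.isclose(d.sum(), lam.sum()))

# Ky Fan (Proposition 5.1.6): a random orthonormal k-tuple never beats the top-k sum,
# and the eigenvectors for the k largest eigenvalues attain it.
top = np.cumsum(lam)[k - 1]
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))                # random orthonormal k-tuple
print(np.trace(Q.T @ A @ Q) <= top + 1e-9)
V = np.linalg.eigh(A)[1][:, ::-1][:, :k]                        # top-k eigenvectors
print(np.isclose(np.trace(V.T @ A @ V), top))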

Observe that setting k = 1 in Proposition 5.1.6 gives the first statement of Corollary 5.1.3.

Corollary 5.1.7. Any A, B ∈ Sn satisfy

Σ_{j=1}^{k} λj(A + B) ≤ Σ_{j=1}^{k} λj(A) + Σ_{j=1}^{k} λj(B) for k = 1, . . . , n.    (5.12)

Proof. By Proposition 5.1.6 we have

Σ_{j=1}^{k} λj(A + B) = max Σ_{j=1}^{k} xjT(A + B)xj ≤ max Σ_{j=1}^{k} xjTAxj + max Σ_{j=1}^{k} xjTBxj = Σ_{j=1}^{k} λj(A) + Σ_{j=1}^{k} λj(B),

where all maxima are over all orthonormal k-tuples of vectors {x1, . . . , xk} in Rn.

Corollary 5.1.8. Any A, B ∈ Sn satisfy λ(A + B) ≺ λ(A) + λ(B).

Proof. The linearity of the trace implies tr (A + B) = tr (A) + tr (B), which combined with Inequality (5.12) shows the result.

Corollary 5.1.9. The function σk : Sn → R defined by σk(A) := Σ_{i=1}^{k} λi(A) is sublinear for k ≤ n.

Proof. Corollary 5.1.7 shows that σk is subadditive. The sublinearity follows by the fact that σk is positively homogeneous.

In particular, Corollary 5.1.9 shows that σk is a convex function. Since dom σk = Sn, it is Lipschitz continuous at any A ∈ Sn by Theorem 3.3.3. Since λi = σi − σi−1 we obtain that every λi is Lipschitz continuous at any A ∈ Sn.

sublinear for k ≤ n and any A ∈ Sn . Proof. Corollary 5.1.7 shows that σk is subadditive. The sublinearity follows by the fact that σk is positively homogeneous. In particular, Corollary 5.1.9 shows that σk is a convex function. Since dom σk = Sn , it is Lipschitz continuous at any A ∈ Sn by Theorem 3.3.3. Since λi = σi − σi−1 we obtain that every λi is Lipschitz continuous at any A ∈ Sn .


5.2

Lidskii's theorem via Wielandt's minimax principle

The next theorem is proved in the Appendix, see Theorem A.2.1 there. In what follows, by Ni, Mi we denote subspaces of Rn, i = 1, . . . , k. The abbreviation 'o.n.' stands for 'orthonormal'.

Theorem 5.2.1 (Wielandt's minimax principle). Any A ∈ Sn and any indices 1 ≤ i1 < · · · < ik ≤ n satisfy

Σ_{j=1}^{k} λij(A) = max_{M1 ⊂ ··· ⊂ Mk, dim Mj = ij} min_{xj ∈ Mj, {xj} o.n.} Σ_{j=1}^{k} xjTAxj    (5.13)

Σ_{j=1}^{k} λij(A) = min_{N1 ⊃ ··· ⊃ Nk, dim Nj = n−ij+1} max_{xj ∈ Nj, {xj} o.n.} Σ_{j=1}^{k} xjTAxj.    (5.14)

Proposition 5.2.2. Any A, B ∈ Sn and any indices 1 ≤ i1 < · · · < ik ≤ n satisfy

Σ_{j=1}^{k} λij(A + B) ≤ Σ_{j=1}^{k} λij(A) + Σ_{j=1}^{k} λj(B).

Proof. Equation (5.13) guarantees the existence of a chain of subspaces M1 ⊂ · · · ⊂ Mk with dim Mj = ij such that

Σ_{j=1}^{k} λij(A + B) = min_{xj ∈ Mj, {xj} o.n.} Σ_{j=1}^{k} xjT(A + B)xj = min_{xj ∈ Mj, {xj} o.n.} Σ_{j=1}^{k} (xjTAxj + xjTBxj)
                       ≤ min_{xj ∈ Mj, {xj} o.n.} Σ_{j=1}^{k} (xjTAxj + λj(B)),

by Proposition 5.1.6. Since the eigenvalues λj(B) are independent of the choice of the orthonormal k-tuple {x1, . . . , xk} we have

Σ_{j=1}^{k} λij(A + B) ≤ min_{xj ∈ Mj, {xj} o.n.} Σ_{j=1}^{k} xjTAxj + Σ_{j=1}^{k} λj(B)
                       ≤ max_{M1 ⊂ ··· ⊂ Mk, dim Mj = ij} min_{xj ∈ Mj, {xj} o.n.} Σ_{j=1}^{k} xjTAxj + Σ_{j=1}^{k} λj(B)
                       = Σ_{j=1}^{k} λij(A) + Σ_{j=1}^{k} λj(B),

by Equation (5.13) again.

Notice that Corollary 5.1.7 is a particular case of Proposition 5.2.2 when ij = j for all j = 1, . . . , k.

Theorem 5.2.3 (Lidskii's Theorem - 1st Proof). Any A, B ∈ Sn satisfy

λ(A + B) − λ(A) ≺ λ(B).

Proof. Fix k ∈ {1, . . . , n}. Choose indices 1 ≤ i1 < . . . < ik ≤ n such that

Σ_{j=1}^{k} (λ(A + B) − λ(A))ij = Σ_{j=1}^{k} [λ(A + B) − λ(A)]j.

By Proposition 5.2.2

Σ_{j=1}^{k} [λ(A + B) − λ(A)]j = Σ_{j=1}^{k} λij(A + B) − Σ_{j=1}^{k} λij(A) ≤ Σ_{j=1}^{k} λj(B).

The inequality holds with equality when k = n, since the trace is a linear function.

Theorem 5.2.3 is not a trivial consequence of Corollary 5.1.8 because for

79 vectors a, b, c ∈ Rn , the relationship [c] ≺ [a] + [b] does not imply [c] − [b] ≺ [a]. To see this, consider the vectors a = (1, 0)T , b = (3, −3)T and c = (1, 0)T . Then [c] − [b] = (−2, 3)T and [a] + [b] = (4, −3) so [c] ≺ [a] + [b], but [c] − [b] ⊀ [a]. Corollary 5.2.4. Any a, b ∈ Rn satisfy [a + b] − [b] ≺ [a]. Proof.

Let A := Diag a and B := Diag b, then λ(A) = [a], λ(B) = [b] and

λ(A + B) = [a + b]. Applying Theorem 5.2.3 produces the result. Corollary 5.2.5. Any A, B ∈ Sn satisfy λ(A + B) ∈ conv {λ(A) + P λ(B) | P ∈ P n }. Proof.

The result is immediate, since Theorem 5.2.3 and Proposition 4.2.4 (iii)

imply λ(A + B) − λ(A) ∈ conv {P λ(B) | P ∈ Pn }.
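Lidskii's theorem itself is straightforward to test numerically on random symmetric matrices. The NumPy sketch below is an illustrative aside (not part of the thesis); it checks λ(A+B) − λ(A) ≺ λ(B), including the equal-total-sum condition coming from the trace.

import numpy as np

def majorized(x, y, tol=1e-8):
    xs, ys = np.sort(x)[::-1], np.sort(y)[::-1]
    return (np.all(np.cumsum(xs)[:-1] <= np.cumsum(ys)[:-1] + tol)
            and abs(xs.sum() - ys.sum()) <= tol)

rng = np.random.default_rng(3)
n = 7
sym = lambda M: (M + M.T) / 2
lam = lambda S: np.sort(np.linalg.eigvalsh(S))[::-1]

for _ in range(100):
    A, B = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))
    assert majorized(lam(A + B) - lam(A), lam(B))    # Theorem 5.2.3
print("Lidskii majorization verified on 100 random pairs")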

5.3

Lidskii's theorem via the Courant-Fischer theorem

For any A, B ∈ Sn we write A ⪰ 0 when A is positive semi-definite and A ⪰ B if A − B ⪰ 0.

Lemma 5.3.1. For A, B ∈ Sn, if A ⪰ B then λi(A) ≥ λi(B) for all i = 1, . . . , n.

Proof. By Equation (5.1) we have

λi(A) = max_{dim M = i} min_{x ∈ SM} (xTBx + xT(A − B)x) ≥ max_{dim M = i} min_{x ∈ SM} xTBx = λi(B),

since A − B is positive semi-definite and xT(A − B)x ≥ 0. Thus, λi(A) ≥ λi(B). Chapter 6 presents another proof of Lemma 5.3.1 using Theorem 3.4.10.

Theorem 5.3.2 (Lidskii's theorem - 2nd proof). Any A, B ∈ Sn satisfy

λ(A + B) − λ(B) ≺ λ(A).

Proof. By Proposition 2.5.6 there is a U ∈ On such that A = UT(Diag λ(A))U. Fix an integer k ∈ {1, . . . , n} and define

A+ := UT Diag(max{λ1(A) − λk(A), 0}, . . . , max{λn(A) − λk(A), 0}) U ⪰ 0.

Notice that A+ ⪰ A − λk(A)I, which is equivalent to A+ + B ⪰ A − λk(A)I + B. Lemma 5.3.1 implies that λi(A+ + B) − λi(B) ≥ λi(A − λk(A)I + B) − λi(B) for all i = 1, . . . , n. It follows that

Σ_{i=1}^{k} [λ(A + B) − λ(B)]i = Σ_{i=1}^{k} [λ(A + B) − λ(B) − λk(A)e]i + kλk(A)
                              = Σ_{i=1}^{k} [λ(A − λk(A)I + B) − λ(B)]i + kλk(A)
                              ≤ Σ_{i=1}^{k} [λ(A+ + B) − λ(B)]i + kλk(A).

Since A+ + B ⪰ B, Lemma 5.3.1 implies λi(A+ + B) − λi(B) ≥ 0 for all i = 1, . . . , n, so we continue

Σ_{i=1}^{k} [λ(A+ + B) − λ(B)]i + kλk(A) ≤ Σ_{i=1}^{n} [λ(A+ + B) − λ(B)]i + kλk(A) = tr (A+ + B) − tr (B) + kλk(A) = Σ_{i=1}^{k} λi(A),

since tr (A+) = Σ_{i=1}^{k} λi(A) − kλk(A). Combining the two steps and using the linearity of the trace (when k = n) completes the proof.


Chapter 6 Spectral functions

6.1

Derivatives of spectral functions

Definition 6.1.1. A function F : Sn → R ∪ {+∞} is called orthogonally invariant or a spectral function if F (U T AU ) = F (A) for all A ∈ dom F and for all U ∈ On . For example, the function A ∈ Sn 7→ − log det(A) is orthogonally invariant, since det(U T AU ) = det(A). Definition 6.1.2. The function f : Rn → R ∪ {+∞} is called symmetric if f (P a) = f (a) for all a ∈ dom f and for all permutation matrices P ∈ Pn . Lemma 6.1.1. If f is symmetric and Lipschitz at a then f is also Lipschitz at [a]. Proof. Let a0 , a00 ∈ Rn be two points in a neighborhood of [a]. Choose P ∈ Pn such that P [a] = a. Then P a0 and P a00 are in a neighborhood of a. So

|f (a0 ) − f (a00 )| = |f (P a0 ) − f (P a00 )| ≤ KkP a0 − P a00 k = Kka0 − a00 k,

82 where K is the Lipschitz constant of f at a. We want to prove an analogous statement for spectral functions and for that we need a few preparatory results. Proposition 6.1.2 (Fan). Any A, B ∈ Sn satisfy the inequality

tr (AB) ≤ λ(A)T λ(B).

Proof. First, suppose A = Diag a for some a ∈ Rn. By Proposition 5.1.5 and Corollary 4.2.8 it follows that aT(diag B) ≤ [a]Tλ(B), and then

tr (AB) = aT(diag B) ≤ [a]Tλ(B) = λ(A)Tλ(B).

If A ∈ Sn is arbitrary, by Proposition 2.5.6 there is some U ∈ On such that A = UT(Diag λ(A))U. Then

tr (AB) = tr (UT(Diag λ(A))U B) = tr ((Diag λ(A)) U BUT) ≤ λ(A)Tλ(U BUT) = λ(A)Tλ(B),

by the first case. The next corollary shows the eigenvalue map is Lipschitz with constant 1. Notice that λ(A)T λ(A) = kAk2 for all A ∈ S n . Indeed, let U ∈ On be such that A = U T (Diag λ(A))U

⟨λ(A), λ(A)⟩ = ⟨Diag λ(A), Diag λ(A)⟩ = ⟨UT(Diag λ(A))U, UT(Diag λ(A))U⟩ = ⟨A, A⟩.
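Fan's inequality and the resulting Lipschitz property of the eigenvalue map (Corollary 6.1.3 below) are easy to confirm on random matrices; the following NumPy sketch is an illustrative aside, not part of the thesis, and uses the Frobenius norm as the matrix norm ‖·‖.

import numpy as np

rng = np.random.default_rng(6)
n = 6
sym = lambda M: (M + M.T) / 2
lam = lambda S: np.sort(np.linalg.eigvalsh(S))[::-1]

for _ in range(100):
    A, B = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))
    assert np.trace(A @ B) <= lam(A) @ lam(B) + 1e-9              # Proposition 6.1.2
    assert np.linalg.norm(lam(A) - lam(B)) <= np.linalg.norm(A - B) + 1e-9
print("Fan's inequality and the 1-Lipschitz eigenvalue map verified")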

Corollary 6.1.3. Any A, B ∈ Sn satisfy kλ(A) − λ(B)k ≤ kA − Bk. Proof. By Proposition 6.1.2,

‖λ(A) − λ(B)‖² = λ(A)Tλ(A) + λ(B)Tλ(B) − 2λ(A)Tλ(B) ≤ ‖A‖² + ‖B‖² − 2tr (AB) = ‖A − B‖².

In the first inequality we used that λ(A)T λ(A) = kAk2 . Proposition 6.1.4. If F : Sn → R is a spectral function then f (a) := F (Diag a) is symmetric and F = f ◦ λ. Also, f is Lipschitz at a ∈ Rn with constant K if and only if F is Lipschitz at A = U T (Diag a)U ∈ Sn for some U ∈ On with constant K. Proof. For any P ∈ Pn and a ∈ Rn we have P T (Diag a)P = Diag P T a. Then,

f (a) = F (Diag a) = F (P T (Diag a)P ) = F (Diag P T a) = f (P T a).

Notice that f (λ(A)) = F (Diag λ(A)) = F (U T (Diag λ(A))U ) = F (A) so F = f ◦ λ. Suppose that f is Lipschitz at a with constant K. Let A0 and A00 be close to A, then λ(A0 ) and λ(A00 ) are close to [a] by the fact that the eigenvalue map is continuous. Then by Lemma 6.1.1 and Corollary 6.1.3 we have

|F (A0 ) − F (A00 )| = |f (λ(A0 )) − f (λ(A00 ))| ≤ Kkλ(A0 ) − λ(A00 )k ≤ KkA0 − A00 k.

84 In the opposite direction assume that F is Lipschitz at A with constant K. Then for a0 and a00 sufficiently close to a, U T (Diag a0 )U and U T (Diag a00 )U are close to A. Then

kf (a0 ) − f (a00 )k = kF (Diag a0 ) − F (Diag a00 )k = kF (U T (Diag a0 )U ) − F (U T (Diag a00 )U k ≤ KkU T (Diag a0 )U − U T (Diag a00 )U k = KkDiag a0 − Diag a00 k = Kka0 − a00 k,

which proves the result. For any A, H ∈ Sn the directional derivative of λi at A in the direction H is

λ′i(A; H) := lim_{t↓0} (λi(A + tH) − λi(A)) / t.

Proposition 6.1.5. For any A, H ∈ Sn the directional derivative of λi at A in the direction H exists and is finite.

Proof. By Corollary 5.1.9, λi(A) is the difference of two sublinear functions: λi(A) = σi(A) − σi−1(A). Since dom σi = Sn, Proposition 3.2.6 implies σ′i(A; H) exists and is finite for any A, H ∈ Sn.

For any A ∈ Sn, the directional derivative satisfies the following basic properties:

• λ′(A; I) = (1, . . . , 1)T. Indeed,

λ′i(A; I) = lim_{t↓0} (λi(A + tI) − λi(A)) / t = lim_{t↓0} (λi(A) + t − λi(A)) / t = 1.

• λ′(I; A) = λ(A). Indeed,

λ′i(I; A) = lim_{t↓0} (λi(I + tA) − λi(I)) / t = lim_{t↓0} (1 + tλi(A) − 1) / t = λi(A).

• λ′(A; A) = λ(A). Indeed,

λ′i(A; A) = lim_{t↓0} (λi(A + tA) − λi(A)) / t = lim_{t↓0} ((1 + t)λi(A) − λi(A)) / t = λi(A).

The directional differentiability of the eigenvalues can be expressed as

λ(A + tH) = λ(A) + tλ′(A; H) + o(t),

where o(t) is a function such that o(t)/t converges to 0 as t converges to 0. The next proposition strengthens this fact by showing the little-o term is uniform in the direction H.

Proposition 6.1.6. The directional derivative at A ∈ Sn in any direction H ∈ Sn satisfies λ(A + H) = λ(A) + λ′(A; H) + o(H), where o(H) is a function such that o(H)/‖H‖ converges to 0 as H converges to 0.

Proof. Suppose that the stated equation is not true for coordinate i of the vector λ. That is, there is an ε > 0 and a sequence of matrices {Hm} approaching 0 such that

|λi(A + Hm) − λi(A) − λ′i(A; Hm)| > ε‖Hm‖.

By taking a subsequence, we may assume that ‖Hm‖ ≤ 1/m and by taking a further subsequence that Hm/‖Hm‖ converges to H. Let tm := ‖Hm‖ and observe that

ε‖Hm‖ < |λi(A + Hm) − λi(A) − λ′i(A; Hm)|
       ≤ |λi(A + Hm) − λi(A + tmH)| + |λi(A + tmH) − λi(A) − tmλ′i(A; H)| + |tmλ′i(A; H) − λ′i(A; Hm)|.

Proposition 3.4.1 together with Proposition 3.5.1 imply that the function H ↦ λ′i(A; H) is positively homogeneous and convex. Then, by Theorem 3.3.3 we conclude that it is Lipschitz at 0 with Lipschitz constant K. Then for large m Corollary 6.1.3 implies

ε‖Hm‖ < ‖Hm − tmH‖ + |λi(A + tmH) − λi(A) − tmλ′i(A; H)| + K‖Hm − tmH‖.

Dividing both sides by tm and letting m approach +∞ we reach a contradiction, because the left-hand side equals ε while the right-hand side converges to zero. Our goal is to state a formula for λ′(A; H) in terms of A and H. For that purpose we need to look deeper into the structure of the vector λ(A). That is, we have to take into account the fact that the eigenvalues of A may have multiplicities greater than 1. We assume that A has r distinct eigenvalues with the following multiplicities:

λ1 (A) = · · · = λk1 (A) > λk1 +1 (A) = · · · = λk2 (A) > · · · > λkr−1 +1 (A) = · · · = λkr (A). (6.1)

From now on we assume that the eigenvalues of A satisfy (6.1). Later it will be convenient to define k0 := 0.

Proposition 6.1.7. For any H ∈ Sn, the vector (λ′kj−1+1(A; H), . . . , λ′kj(A; H))T is ordered non-increasingly, j = 1, . . . , r.

Proof. Fix two indices p, q ∈ {kj−1 + 1, . . . , kj} with p < q. For any H ∈ Sn and t > 0 we have

λp(A + tH) = λp(A) + tλ′p(A; H) + o(t) and λq(A + tH) = λq(A) + tλ′q(A; H) + o(t).

Since λp(A + tH) ≥ λq(A + tH) and λp(A) = λq(A), subtracting the two equations, dividing both sides by t and taking the limit as t approaches 0+ shows that λ′p(A; H) ≥ λ′q(A; H).

Define the matrices Xl := [ekl−1+1, . . . , ekl] for l = 1, . . . , r. The proof of the following theorem can be found in [9, Theorem 3.12].

Theorem 6.1.8. For A = Diag a with a ∈ Rn↓ and any H ∈ Sn we have

λ′(A; H) = (λ(X1THX1)T, λ(X2THX2)T, . . . , λ(XrTHXr)T)T.
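Theorem 6.1.8 can be checked with finite differences: for A = Diag a with repeated entries, the one-sided quotient (λ(A + tH) − λ(A))/t for small t should be close to the block-wise eigenvalues λ(XlTHXl). The NumPy sketch below is an illustrative aside; the step size and tolerance are ad hoc.

import numpy as np

rng = np.random.default_rng(4)
a = np.array([3.0, 3.0, 3.0, 1.0, 1.0, -2.0])        # a in R^n_↓ with blocks {1,2,3}, {4,5}, {6}
n = a.size
A = np.diag(a)
M = rng.standard_normal((n, n)); H = (M + M.T) / 2

lam = lambda S: np.sort(np.linalg.eigvalsh(S))[::-1]

# Block-wise formula of Theorem 6.1.8.
blocks = [np.flatnonzero(a == v) for v in np.unique(a)[::-1]]
formula = np.concatenate([lam(H[np.ix_(I, I)]) for I in blocks])

# One-sided finite-difference approximation of λ'(A; H).
t = 1e-6
fd = (lam(A + t * H) - lam(A)) / t

print(np.max(np.abs(fd - formula)))                  # small, up to O(t) error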

The general formula for λ0 (A; H) is now easy to derive. Theorem 6.1.9. For any A ∈ Sn let A := U T (Diag λ(A))U for some orthogonal U ∈ On . The directional derivative of λ at A in the direction H is

λ0 (A; H) = (λ(X1T U HU T X1 )T , λ(X2T U HU T X2 )T , . . . , λ(XrT U HU T Xr )T )T .

Proof. Directly from the definition of the directional derivative, we obtain

λ′(A; H) = lim_{t↓0} (λ(A + tH) − λ(A)) / t = lim_{t↓0} (λ(UT(Diag λ(A))U + tH) − λ(A)) / t
         = lim_{t↓0} (λ(Diag λ(A) + tU HUT) − λ(Diag λ(A))) / t = λ′(Diag λ(A); U HUT).

Applying Theorem 6.1.8 we have the result. The following theorem, stated here without a proof, is found in [13, Theorem 1.4]. Theorem 6.1.10. Let F : Sn → R be a spectral function with corresponding symmetric function f : Rn → R. If F is Lipschitz at A then

∂°F (A) = {UT(Diag ∂°f (λ(A)))U | U ∈ On, such that A = UT(Diag λ(A))U}.

6.2

Lidskii's theorem via nonsmooth analysis

In this section we present a third proof of Lidskii's theorem following the ideas in [14]. We begin by presenting an alternative proof of Corollary 5.2.4 that does not use Lidskii's theorem.

Lemma 6.2.1. Any a, b ∈ Rn satisfy [a + b] − [b] ≺ [a].

Proof. We may express

Σ_{j=1}^{k} [a]j = max Σ_{j=1}^{k} aij, where we take the maximum over the indices 1 ≤ i1 < · · · < ik ≤ n. Then,

Σ_{j=1}^{k} [a + b]j − Σ_{j=1}^{k} [b]j = max_{i1<···<ik} Σ_{j=1}^{k} (a + b)ij − max_{i1<···<ik} Σ_{j=1}^{k} bij
                                       ≤ max_{i1<···<ik} Σ_{j=1}^{k} aij + max_{i1<···<ik} Σ_{j=1}^{k} bij − max_{i1<···<ik} Σ_{j=1}^{k} bij
                                       = Σ_{j=1}^{k} [a]j.
Lemma 6.2.2. Let ω ∈ Rn . Then the function f (a) := ω T [a] is Lipschitz. Proof. Observe that

2

k[a] − [b]k = ≤

n X

2

([a]1 − [b]1 ) =

i=1 n X

n X

[a]2i

− 2[a]i [b]i +

[b]2i

i=1



=

n X

 a2i + b2i − 2[a]T [b]

i=1

 a2i + b2i − 2aT b = ka − bk2 .

i=1

(The inequality uses Theorem 4.2.7.) Then by the Cauchy-Schwarz inequality we get

|f (a) − f (b)| = |ω T [a] − ω T [b]| = |ω T ([a] − [b])| ≤ kωkk[a] − [b]k ≤ kωkka − bk,

showing that f is Lipschitz. Lemma 6.2.3. Let ω ∈ Rn and f (a) := ω T [a] then

∂ ◦ f (a) ⊂ conv {P ω | P ∈ Pn }.

Proof.

(6.2)

The right-hand side is a closed and convex set, while the left-hand side is

such by Proposition 3.4.6 and Lemma 6.2.2. By Corollary 3.1.6 it is sufficient to show that for all v ∈ Rn

sup{hv, c1 i | c1 ∈ ∂ ◦ f (a)} ≤ sup{hv, c2 i | c2 ∈ conv {P ω | P ∈ Pn }}.

(6.3)

90 By Theorem 3.4.7 the left-hand side of (6.3) is f ◦ (a; v), while the right-hand side is easily seen to be [ω]T [v] by Theorem 4.2.7. Therefore we just need to show f ◦ (a; v) ≤ [ω]T [v] for all v ∈ Rn . Indeed, by Lemma 6.2.1 for positive t we have [b+tv]−[b] ≺ t[v], since [tv] = t[v]. Then, Corollary 4.2.8 implies that ω T [b + tv] − ω T [b] ≤ t[ω]T [v] and

f ◦ (a; v) = lim sup b→a,t↓0

ω T [b + tv] − ω T [b] ≤ [ω]T [v] t

completes the proof.

Theorem 6.2.4 (Lidskii’s theorem - 3rd proof). Any A, B ∈ Sn satisfy

λ(A + B) − λ(A) ≺ λ(B).

Proof. By Corollary 4.2.8, Lidskii’s theorem is equivalent to

ω T (λ(A + B) − λ(A)) ≤ [ω]T λ(B) for all ω ∈ Rn .

(6.4)

Fix ω ∈ Rn and define the function f (a) := ω T [a]. Notice that f is symmetric and

(f ◦ λ)(X) = ω T λ(X) for all X ∈ Sn .

By Lemma 6.2.2, f (a) is Lipschitz on Rn , so by Proposition 6.1.4 f ◦ λ is Lipschitz on Sn . By Theorem 3.4.10, there is an X ∈ (A, A + B) satisfying

ω T λ(A + B) − ω T λ(A) ∈ h∂ ◦ (f ◦ λ)(X), Bi.

(6.5)

91 That is, there is a V ∈ ∂ ◦ (f ◦ λ)(X) such that

ω T λ(A + B) − ω T λ(A) = hV, Bi = tr (V B) ≤ λ(V )T λ(B),

(6.6)

by Proposition 6.1.2. By Theorem 6.1.10, V = U T (Diag v)U for some U ∈ On and some v ∈ ∂f ◦ (λ(X)). Then using Lemma 6.2.3 we see that

λ(V ) = [v] ∈ conv {P ω | P ∈ Pn },

or equivalently λ(V ) ≺ ω, by Proposition 4.2.4. Then, by Corollary 4.2.8,

λ(V )T λ(B) ≤ [ω]T λ(B).

(6.7)

Combining Inequalities (6.6) and (6.7) we get (6.4). Using similar ideas we give a second, nonsmooth proof of Lemma 5.3.1. Lemma 6.2.5. For A, B ∈ Sn if A  B then λi (A) ≥ λi (B) for all i = 1, . . . , n. Proof.

Suppose that A  B and the function f (a) := eTi [a], which implies that

(f ◦ λ)(A) = λi (A). By Theorem 3.4.10 there exists a X ∈ (A, B) that satisfies

eTi λ(A) − eTi λ(B) = h∂ ◦ (f ◦ λ)(X), A − Bi.

(6.8)

By Theorem 6.1.10, there is a V in ∂(f ◦ λ)(X), that is, V = U T (Diag v)U for some U ∈ On and v ∈ ∂ ◦ f (λ(X)). Then Equation (6.8) becomes

λi (A) − λi (B) = hV, A − Bi.

(6.9)

92 By Lemma 6.2.3, we can express v as

Pn!

j=1

µj Pj ei for µj ∈ [0, 1] with

Pn!

j=1

µj = 1,

where Pj ∈ Pn for j = 1, . . . , n!. Since every permutation of ei is another vector in the standard basis we can express this sum as

Pn

j=1

γj ej for some positive γj , j = 1, . . . , n.

Then Equation (6.9) becomes

λi (A) − λi (B) = hU

T



Diag

n X



γj ej U, A − Bi =

j=1

=

n X

n X

γj tr (U T (Diag ej )U (A − B))

j=1

γj tr ((Diag ej )U (A − B)U T ).

j=1

Letting W := U (A − B)U T , it is easy to see that W is positive semi-definite. Also

λi (A) − λi (B) =

n X

γj W jj .

j=1

Since W jj = eTj W ej ≥ 0 for j = 1, . . . , n, we conclude that λi (A) − λi (B) ≥ 0.

93

Chapter 7 Lidskii theorem for spectral directional derivatives

7.1

Partition majorization The results in this subsection are modified from [15, Lemma 3, Theorem 5.4]

and [13, Lemma 2.2] A partition of a set X is a set of nonempty subsets of X such that every element x in X is in exactly one of these subsets. We call each subset a block. We denote by π a partition of the set Nn := {1, 2, . . . , n}. For example, the following are partitions of N6 :

{{1}, {2}, {3}, {4}, {5}, {6}}, {{1, 2, 3, 4, 5, 6}}, {{1, 2}, {3, 4}, {5, 6}}.

We say that a map g : Nn → Nn preserves the blocks of a partition π if i and g(i) belong to the same block for all i = 1, . . . , n. Define

Pπn = {P ∈ Pn | the permutation corresponding to P preserves the blocks of π}.

94 It is easy to see that Pπn is a subgroup of Pn . Given a partition π we say that x is π−majorised by y, denoted x ≺π y, when x ∈ conv {P y | P ∈ Pπn }. If π1 and π2 are two partitions then it is easy to see that Pπn1 is a subgroup of Pπn2 if and only if whenever i and j belong to the same block of π1 they belong to the same block of π2 . Proposition 7.1.1. For x, y ∈ Rn , (i) x ≺ y is equivalent to x ≺π y with π = {{1, . . . , n}}, (ii) x = y is equivalent to x ≺π y with π = {{1}, {2}, . . . , {n}}. Proof.

(i) If π = {{1, . . . , n}} then any P ∈ Pn preserves the blocks of π, so

Pπn = Pn , which is equivalent to x ≺ y by Proposition 4.2.4. (ii) If π = {{1}, {2}, . . . , {n}} then each block is a singleton, so Pπn contains only I. Therefore x ∈ conv {P y | P = I} = {y}. Lemma 7.1.2. For w ∈ Rn↓ , the function wT λ : S n → R is convex and any x ∈ Rn↓ satisfies Diag w ∈ ∂(wT λ)(Diag x). Proof.

Fix a real α ∈ [0, 1] and matrices A, B ∈ Sn . By Corollary 5.1.8, λ(αA +

(1 − α)B) ≺ λ(αA) + λ((1 − α)B). Convexity follows by Corollary 4.2.8, since

wT λ(αA + (1 − α)B) ≤ wT (λ(αA) + λ((1 − α)B)) = αwT λ(A) + (1 − α)wT λ(B).

By definition Diag w ∈ ∂(wT λ)(Diag x) if and only if for all A ∈ S n we have

wT λ(Diag x) + hDiag w, A − Diag xi ≤ wT λ(A).

This reduces to tr ((Diag w)A) ≤ wT λ(A) for all A ∈ Sn , true by Proposition 6.1.2.

95 For each x ∈ Rn , let π(x) be the partition of the set {1, 2, . . . , n} defined by: indices i and j belong to the same block if and only if xi = xj . Let x, v ∈ Rn . Observe n n . Since is a subgroup of Pπ(v) that ∀i, j(xi = xj implies vi = vj ) is equivalent to Pπ(x)

the eigenvalues of A ∈ Sn satisfy (6.1), then π(λ(A)) has blocks Il = {kl−1 + 1, . . . , kl } for l = 1, . . . , r. n n Lemma 7.1.3. For v ∈ Rn and x ∈ R↓n , if Pπ(x) is a subgroup of Pπ(v) then v T λ is

differentiable at Diag x with (v T λ)0 (Diag x) = Diag v. Proof. We have to show that

 v T λ(Diag x + H) − v T λ(Diag x) − tr (Diag v)H lim = 0. H→0 kHk

By Proposition 6.1.6, this is equivalent to showing that

 v T λ0 (Diag x; H) − tr (Diag v)H v T λ0 (Diag x; H) − v T (diag H) 0 = lim = lim . H→0 H→0 kHk kHk

Suppose that the blocks of the partition π(x) are I1 , . . . , Ir and define the matrices n n Xl := [ei | i ∈ Il ], l = 1, ..., r. The fact that Pπ(x) is a subgroup of Pπ(v) means that

vi = vj whenever i, j ∈ Il for some l ∈ {1, ..., r}. Let this common value of the entries of v in each block Il be called vl , l = 1, ..., r. Then, by Theorem 6.1.8 the numerator of the last differential quotient is equal to

v

T

(λ(X1T HX1 )T , . . . , λ(XrT HXr )T )T

T

− v (diag H) = =

r X l=1 r X l=1

vl tr (XlT HXl ) − v T (diag H) vl tr (XlT HXl ) −

n X i=1

vi H ii

96

= 0,

since tr (XlT HXl ) is the sum of the diagonal elements of H with indexes in Il . n , x ∈ Rn↓ and A ∈ Sn we have Lemma 7.1.4. For any P ∈ Pπ(x)

λ0 (Diag x; A) = λ0 (Diag x; P T AP ).

Proof.

The proof is analogous to the proof of Theorem 6.1.9, using the fact that

P ∈ On and that P preserves the blocks of π(x). The following theorem first appeared in [15]. The rather direct proof given there uses Lemma 7.1.2, Lemma 7.1.3 and techniques from convex analysis. Our proof shows that it is just another consequence of the formula for the directional derivative given in Theorem 6.1.8. Theorem 7.1.5 (Schur’s theorem for directional derivatives). Any x ∈ Rn↓ and A ∈ Sn satisfy diag A ≺π(x) λ0 (Diag x; A). Proof.

Let I1 , . . . , Ir be the blocks of the partition π(x). Define the matrices

Xl := [ei | i ∈ Il ], l = 1, ..., r. By Theorem 6.1.8 we have

λ0 (Diag x; A) = (λ(X1T AX1 )T , . . . , λ(XrT AXr )T )T .

By Schur’s theorem, Proposition 5.1.5, we have

(Aii | i ∈ Il )T ≺ λ(XlT AXl )

97 for all l = 1, ..., r. The statement of the theorem follows, by the definition of πmajorization. The next corollary shows that Theorem 7.1.5 is a generalization of Schur’s theorem, see Proposition 5.1.5. Corollary 7.1.6. Any A ∈ Sn satisfies diag A ≺ λ(A). Proof.

Since e ∈ Rn↓ then diag A ≺π(e) λ0 (Diag e; A) by Theorem 7.1.5. Since

π(e) = {{1, . . . , n}}, by Proposition 7.1.1 we have

diag A ≺ λ0 (Diag e; A) = λ0 (I; A) = λ(A),

by the directional derivative properties beginning on page 84. The next corollary shows that Theorem 7.1.5 captures another important property of the directional derivatives. Corollary 7.1.7. For any A ∈ Sn , if x ∈ Rn↓ has distinct coordinates then

λ0 (Diag x; A) = diag A.

Proof. Theorem 7.1.5 implies that

diag A ≺π(x) λ0 (Diag x; A).

Since π(x) = {{1}, . . . , {n}}, Proposition 7.1.1, part (i) now shows that diag A = λ0 (Diag x; A).

98

7.2

Perturbations of the spectral directional derivatives We begin with a simple observation that generalizes Schur’s theorem.

Proposition 7.2.1. Any A, H and V in Sn satisfy

λ0 (A; H + V ) − λ0 (A; V ) ≺ λ(H).

Proof. By the definition of the directional derivative we get

λ(A + t(H + V )) − λ(A) − λ(A + tV ) + λ(A) t↓0 t λ(A + t(H + V )) − λ(A + tV ) = lim t↓0 t λ(tH) ≺ lim t↓0 t

λ0 (A; H + V ) − λ0 (A; V ) = lim

= λ(H),

where the majorization follows from Lidskii’s theorem. The above proposition is a generalization of Schur’s theorem (Proposition 5.1.5), since letting A := Diag a for some vector a ∈ Rn↓ with distinct coordinates and using Corollary 7.1.7 we get

diag H = diag (H + V ) − diag V = λ0 (A; H + V ) − λ0 (A; V ) ≺ λ(H).

The main result in this subsection is Theorem 7.2.5. Before we prove it, we give direct proofs of two particular cases of it.

99 Proposition 7.2.2. For A = Diag ae where a ∈ R and any H, V in Sn , we have

λ0 (A; H + V ) − λ0 (A; V ) ≺π(a) λ0 (A; H).

Proof. Since π(a) = {{1, 2, . . . , n}}, by Proposition 7.1.1, part (ii) it is enough to show that λ0 (A; H + V ) − λ0 (A; V ) ≺ λ0 (A; H). Indeed,

λ(A + t(H + V )) − λ(A + tV ) t↓0 t λ(tH) ≺ lim t↓0 t λ(A + tH) − λ(A) = lim t↓0 t

λ0 (A; H + V ) − λ0 (A; V ) = lim

= λ0 (A; H),

where the majorization follows from Lidskii’s theorem. Proposition 7.2.3 (Block equality). Let A, H and V be in Sn and suppose that the eigenvalues of A satisfy (6.1). Then for any s ∈ {0, 1, . . . , r − 1} we have ks+1

X i=ks +1

ks+1

λ0i (A; H

+V)−

λ0i (A; V

)=

X

λ0i (A; H).

(7.1)

i=ks +1

Proof. Suppose the blocks of π(λ(A)) are I1 , . . . , Ir . Fix an index s ∈ {1, . . . , r − 1}

100 and let v ∈ Rn be such that

vi =

   1,

if i ∈ Is

  0,

otherwise.

n n . By Lemma 7.1.3 we have that is a subgroup of Pπ(v) Clearly, Pπ(a)

tr ((Diag v)(H + V )) = v T λ0 (A; H + V ) and tr ((Diag v)V ) = v T λ0 (A; V ).

(7.2) (7.3)

Subtracting (7.3) from (7.2) and applying Lemma 7.1.3 again, gives us

v T λ0 (A; H) = tr ((Diag v)H) = v T (λ0 (A; H + V ) − λ0 (A; V )).

This shows the claim. In the next proposition, for a subset I ⊆ {1, ..., n} by λ0I we denote the subvector of λ0 with indices in I. Proposition 7.2.4. Any A, H and V in Sn satisfy

λ0 (A; H + V ) − λ0 (A; V ) ≺π(λ(A)) λ0 (A; H),

|I |

whenever λ0Ip (A; H + V ) − λ0Ip (A; V ) ∈ R↓ p for each block Ip of the partition π(λ(A)). Proof.

Let the blocks of the partition π(λ(A)) be I1 , ..., Ir . Fix an index p ∈

{1, ..., r}. In view of Proposition 7.2.3 we need to show the inequalities required for

101 majorization within each block. The definition of the directional derivative implies

λi (A + t(H + V )) − λi (A) − λi (A + tV ) + λi (A) t↓0 t λi (A + t(H + V )) − λi (A + tV ) . = lim t↓0 t

λ0i (A; H + V ) − λ0i (A; V ) = lim

For all integers l satisfying 1 ≤ l ≤ kp − kp−1 : kp−1 +l

X

λ0i (A; H

+V)−

λ0i (A; V

i=1

kp−1 +l X  λi (A + t(H + V )) − λi (A + tV ) ) = lim t↓0 t i=1

kp−1 +l

= lim t↓0

X λi (A + t(H + V )) − λi (A + tV ) t i=1

σkp−1 +l (A + t(H + V )) − σkp−1 +l (A + tV ) t↓0 t σk +l (A + t(H + V )) − σkp−1 +l (A + tV ) ≤ sup lim sup p−1 t t↓0 V ∈Sn = lim

= σkp−1 +l (A; H) = σk0 p−1 +l (A; H),

by Proposition 3.5.1 and the fact Corollary 5.1.9 implies σkp−1 +l is a convex. Thus, kp−1 +l

X

λ0i (A; H

+V)−

λ0i (A; V

kp−1 +l X  ) ≤ λ0i (A; H).

(7.4)

i=1

i=1

Subtracting Equation (7.1) for each s ∈ {1, ..., p−1} from Inequality (7.4) we conclude kp−1 +l

X i=kp−1 +1

kp−1 +l

λ0i (A; H

+V)−

λ0i (A; V

 ) ≤

X

λi (A; H).

i=kp−1 +1

Since l ∈ {1, ..., kp − kp−1 } is arbitrary, using Equation (7.1) with s = p and the fact

102 |I |

that λ0Ip (A; H + V ) − λ0Ip (A; V ) ∈ R↓ p , we deduce the majorization result. Theorem 7.2.5. Any A, H and V in Sn satisfy

λ0 (A; H + V ) − λ0 (A; V ) ≺π(λ(A)) λ0 (A; H).

Proof.

(7.5)

Without loss of generality let A := Diag a for a ∈ Rn↓ then π(λ(A)) =

π(a). Suppose that the blocks of the partition π(a) are I1 , . . . , Ir . Define matrices Xl := [ei | i ∈ Il ], l = 1, ..., r. By the formula for the directional derivatives of the eigenvalues for each block we have

λ0Il (A; H + V ) = λ(XlT (H + V )Xl ) = λ(XlT HXl + XlT V Xl ) and λ0Il (A; V ) = λ(XlT V Xl ).

Thus, by Lidskii's theorem we conclude that

λ′_{I_l}(A; H + V) − λ′_{I_l}(A; V) = λ(X_l^T H X_l + X_l^T V X_l) − λ(X_l^T V X_l) ≺ λ(X_l^T H X_l) = λ′_{I_l}(A; H).

This establishes the result.

Theorem 7.2.5 is a generalization of Lidskii's theorem. Indeed, if A = I, then by the directional derivative properties beginning on page 84 we have

λ′(A; H + V) − λ′(A; V) = λ(H + V) − λ(V)    and    λ′(A; H) = λ(H).

By Proposition 7.1.1, part (i), Theorem 7.2.5 becomes

λ(H + V) − λ(V) ≺ λ(H)    for all H, V ∈ S^n.

This is precisely Lidskii's theorem.
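
Both conclusions are easy to probe numerically. The following Python sketch (our own illustration, assuming NumPy is available; the helper names eigvals_desc, majorizes and rand_sym are not part of the text above) checks Lidskii's majorization λ(H + V) − λ(V) ≺ λ(H) for a random pair of symmetric matrices, and compares the block formula λ′_{I_l}(A; H) = λ(X_l^T H X_l) with a finite-difference approximation of the directional derivative at a diagonal A having repeated eigenvalues.

    # Numerical sketch (not part of the thesis): check Lidskii's majorization and
    # the block directional-derivative formula at a diagonal matrix with repeated
    # eigenvalues, using a small finite-difference step.
    import numpy as np

    def eigvals_desc(M):
        """Eigenvalues of a symmetric matrix, ordered nonincreasingly."""
        return np.sort(np.linalg.eigvalsh(M))[::-1]

    def majorizes(y, x, tol=1e-9):
        """True if x is majorized by y: equal sums, and the partial sums of the
        nonincreasing rearrangement of x never exceed those of y."""
        xs, ys = np.sort(x)[::-1], np.sort(y)[::-1]
        return (abs(xs.sum() - ys.sum()) < tol
                and np.all(np.cumsum(xs) <= np.cumsum(ys) + tol))

    rng = np.random.default_rng(0)
    n = 6

    def rand_sym(n):
        B = rng.standard_normal((n, n))
        return (B + B.T) / 2

    # (i) Lidskii: lambda(H + V) - lambda(V) is majorized by lambda(H).
    H, V = rand_sym(n), rand_sym(n)
    assert majorizes(eigvals_desc(H), eigvals_desc(H + V) - eigvals_desc(V))

    # (ii) Block formula at A = Diag(a) with blocks I_1 = {1,2,3}, I_2 = {4,5}, I_3 = {6}.
    a = np.array([3.0, 3.0, 3.0, 1.0, 1.0, -2.0])
    A = np.diag(a)
    blocks = [np.arange(0, 3), np.arange(3, 5), np.arange(5, 6)]   # 0-based index blocks
    t = 1e-6
    fd = (eigvals_desc(A + t * H) - eigvals_desc(A)) / t           # finite-difference derivative
    for I in blocks:
        X = np.eye(n)[:, I]                                        # X_l = [e_i | i in I_l]
        formula = eigvals_desc(X.T @ H @ X)                        # lambda(X_l^T H X_l)
        assert np.allclose(np.sort(fd[I])[::-1], formula, atol=1e-4)

The finite-difference step only approximates λ′(A; H), so the comparison uses a loose tolerance; the exact identity is the content of the formula used in the proof of Theorem 7.2.5.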


Chapter 8 Conclusions

This work provides the background and development of matrix perturbation theory, including three different approaches to the important theorem of Lidskii: first by Wielandt's minimax theorem; second by the Courant-Fischer theorem; and third via nonsmooth variational analysis using the Lebourg mean value theorem for Lipschitz functions. Examining partitions of equal eigenvalues allows us to apply Hiriart-Urruty and Ye's directional derivative formula for symmetric matrices. This analysis led to Lewis's generalization of Schur's theorem. We offered a simple generalization of Schur's theorem. Without the directional derivative formula we showed that partition majorization holds over equal-eigenvalue blocks under two conditions. Finally, using the formula, we showed that the result holds without any conditions.


Bibliography

[1] S.J. AXLER. Linear Algebra Done Right. Springer-Verlag, NY, 2004.
[2] R. BHATIA. Matrix Analysis. Springer-Verlag, NY, 1997.
[3] J.M. BORWEIN and A.S. LEWIS. Convex Analysis and Nonlinear Optimization. Springer-Verlag (CMS Books in Mathematics), NY, 2006.
[4] F.H. CLARKE. Optimization and Nonsmooth Analysis. Centre de Recherches Mathématiques, Montréal, 1989.
[5] K. FAN. On a theorem of Weyl concerning eigenvalues of linear transformations. Proceedings of the National Academy of Sciences of the United States of America, 35:652–683, 1949.
[6] E. FISCHER. Über quadratische Formen mit reellen Koeffizienten. Monatshefte für Mathematik und Physik, 16:234–249, 1905.
[7] W. FULTON. Eigenvalues, invariant factors, highest weights, and Schubert calculus. Bulletin (New Series) of the American Mathematical Society, 37, 1999.
[8] G.H. HARDY, J.E. LITTLEWOOD, and G. POLYA. Inequalities. Cambridge University Press, 1934.
[9] J.B. HIRIART-URRUTY and D. YE. Sensitivity analysis of all eigenvalues of a symmetric matrix. Numerische Mathematik, 70:45–72, 1995.
[10] A. HORN. Doubly stochastic matrices and the diagonal of a rotation matrix. American Journal of Mathematics, 76:620–630, 1954.
[11] A. HORN. Eigenvalues of sums of Hermitian matrices. Pacific Journal of Mathematics, 12:225–241, 1962.
[12] A. KNUTSON and T. TAO. The honeycomb model of GL(n) tensor products I: proof of the saturation conjecture. Journal of the American Mathematical Society, 12:1055–1090, 1999.
[13] A.S. LEWIS. Derivatives of spectral functions. Mathematics of Operations Research, 21:576–588, 1996.
[14] A.S. LEWIS. Lidskii's theorem via nonsmooth analysis. SIAM Journal on Matrix Analysis and Applications, 21:379–381, 1999.
[15] A.S. LEWIS. Nonsmooth analysis of eigenvalues. Mathematical Programming, 84:1–24, 1999.
[16] V.B. LIDSKII. The proper values of the sum and product of symmetric matrices. Doklady Akademii Nauk SSSR, 75:769–772, 1950.
[17] M.O. LORENZ. Methods of measuring the concentration of wealth. Publications of the American Statistical Association, 9:209–219, 1905.
[18] A. MARSHALL and I. OLKIN. Inequalities: Theory of Majorization and Its Applications. Academic Press, 1979.
[19] A. MAS-COLELL, M. WHINSTON, and J. GREEN. Microeconomic Theory. Oxford University Press, Inc., NY, 1995.
[20] C. MEYER. Matrix Analysis and Applied Linear Algebra. Society for Industrial and Applied Mathematics, PA, 2000.
[21] R.F. MUIRHEAD. Some methods applicable to identities and inequalities of symmetric algebraic functions of n letters. Proceedings of the Edinburgh Mathematical Society, 21:144–157, 1903.
[22] Y. NESTEROV. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Boston, 2004.
[23] M.A. NIELSEN. Majorization and its application to quantum information theory, 1999.
[24] A.L. PERESSINI, F.E. SULLIVAN, and J.J. UHL Jr. The Mathematics of Nonlinear Programming. Springer-Verlag New York, Inc., 1988.
[25] R. RADO. An inequality. Journal of the London Mathematical Society, 27:1–6, 1952.
[26] A.R. ROBERTS and D.E. VARBERG. Another proof that convex functions are locally Lipschitz. The American Mathematical Monthly, 81(9):1014–1016, 1974.
[27] W. RUDIN. Functional Analysis. McGraw Hill, NY, 1997.
[28] I. SCHUR. Über den Zusammenhang zwischen einem Problem der Zahlentheorie und einem Satz über algebraische Funktionen. Berlinische Mathematische Gesellschaft, 22:9–20, 1923.
[29] R. THOMPSON and L. FREEDE. On the eigenvalues of sums of Hermitian matrices. Linear Algebra and its Applications, 4:369–376, 1971.
[30] H. WEYL. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen. Mathematische Annalen, 71:441–479, 1912.
[31] H. WIELANDT. An extremum property of sums of eigenvalues. Proceedings of the American Mathematical Society, 6:106–110, 1955.


Appendix A Proof of Wielandt’s minimax principle

The results of this appendix are adapted from [2, Chapter III]. Assume M, Mi and N, Ni are subspaces of a vector space X of dimension n. Let E be a Euclidean space of dimension dim E.

A.1 Supporting propositions and definitions

By X \ M we denote the set of vectors in X that are not in M. Note that any vector in X \ M is nonzero and is not expressible as a linear combination of vectors in M.

Lemma A.1.1. Let M1 ⊃ M2 ⊃ . . . ⊃ Mk be a descending chain of subspaces of X with dim Mj ≥ k − j + 1. Let the set {wj ∈ Mj | j = 1, . . . , k − 1} be linearly independent and let U := span {w1, . . . , wk−1}. Then there exists a nonzero vector u ∈ M1 \ U such that U + span {u} has a basis {vj | j = 1, . . . , k} with vj ∈ Mj for j = 1, . . . , k.

Proof.

We prove the result by induction on k. Let k := 2; then the chain of

subspaces has two elements M1 ⊃ M2 with dim M1 ≥ 2 and dim M2 ≥ 1. Choose a nonzero vector w1 ∈ M1 and let U := span {w1 }. Then w1 either belongs to M2 or

to M1 \ M2. If w1 ∈ M2, then there exists a vector u ∈ M1 such that w1 and u are linearly independent and form a basis of U + span {u}. For the second case suppose w1 ∈ M1 \ M2; then there exists a vector u ∈ M2 linearly independent of w1 such that {w1, u} is a basis of U + span {u}. Assume the statement is true for k − 1. That is, it is true for any chain M2 ⊃ M3 ⊃ . . . ⊃ Mk of subspaces of X of length k − 1 with dim Mj+1 ≥ (k − 1) − j + 1 = k − j for j = 1, . . . , k − 1, and any linearly independent set of vectors {wj+1 ∈ Mj+1 | j = 1, . . . , k − 2}. Thus, if we let S := span {w2, . . . , wk−1} then there exists a vector v in M2 \ S such that S + span {v} has a basis {vj | j = 2, . . . , k} with vj ∈ Mj for j = 2, . . . , k. Suppose now that the premise of the lemma holds and apply the induction hypothesis. Since dim Mj+1 ≥ k − (j + 1) + 1 = k − j, we can apply the induction hypothesis to M2 ⊃ · · · ⊃ Mk to find a vector v in M2 \ S such that

S + span {v} = span {v2 , . . . , vk }

for some linearly independent vectors vj ∈ Mj, j = 2, . . . , k. We have two possibilities: either v ∈ U or v ∉ U. Suppose that v ∈ U. Since S ⊂ U, dim U = dim S + 1, and dim(S + span {v}) = k − 1, we conclude that U = S + span {v}. However, dim M1 ≥ k, so there exists a vector u in M1 \ U. Then {u, v2, . . . , vk} is a basis for U + span {u} and all requirements are now met with v1 := u. Suppose that v ∉ U; then we will show w1 ∉ S + span {v}. Suppose on the contrary that w1 did belong to S + span {v}; then w1 = α2 w2 + · · · + αk−1 wk−1 + α1 v. The coefficient α1 is nonzero, because w1 is not a linear combination of w2, . . . , wk−1.

Then

v = (1/α1) w1 − (α2/α1) w2 − · · · − (αk−1/α1) wk−1 ∈ U,

a contradiction. Then span {w1 , v2 , . . . , vk } is a k-dimensional space and

span {w1 , v2 , . . . , vk } = span {w1 } + span {v2 , . . . , vk } = span {w1 } + span {v} + S = span {v} + U.

Since v ∉ U we are done, by setting u := v ∈ M1 \ U and v1 := w1, because vj ∈ Mj for j = 1, . . . , k.

Proposition A.1.2. Fix indices 1 ≤ i_1 < i_2 < · · · < i_k ≤ n. Let N1 ⊂ N2 ⊂ · · · ⊂ Nk be an ascending chain of subspaces with dim Nj = i_j and let M1 ⊃ M2 ⊃ · · · ⊃ Mk be a descending chain of subspaces with dim Mj = n − i_j + 1. Then there exist linearly independent sets {vj ∈ Nj | j = 1, . . . , k} and {wj ∈ Mj | j = 1, . . . , k}, such that

span {v1 , . . . , vk } = span {w1 , . . . , wk }.

Proof.

For any j = 1, . . . , k we have dim(Nj ∩ Mj ) ≥ 1 by Proposition 5.1.1.

We will prove the proposition by induction on k. Let k := 1 and choose a nonzero vector u ∈ N1 ∩ M1 . Setting v1 := u =: w1 , we have v1 ∈ N1 and w1 ∈ M1 with span {v1 } = span {w1 }. Assume the statement is true for k − 1. That is, for any chains of subspaces N1 ⊂ N2 ⊂ · · · ⊂ Nk−1 and M1 ⊃ M2 ⊃ · · · ⊃ Mk−1 of length k −1 with dim Nj = ij and dim Mj = n − ij + 1, we can choose linearly independent sets {vj ∈ Nj | j =

1, . . . , k − 1} and {wj ∈ Mj | j = 1, . . . , k − 1}, such that span {v1, . . . , vk−1} = span {w1, . . . , wk−1}. Suppose now that the premise in the proposition holds and apply the induction hypothesis. Define U := span {v1, . . . , vk−1} = span {w1, . . . , wk−1}. Then U is a subspace of the largest space Nk. Let Sj := Mj ∩ Nk for j = 1, . . . , k. Note that

n ≥ dim Mj + dim Nk − dim(Mj ∩ Nk ) = (n − ij + 1) + ik − dim Sj .

This is equivalent to dim Sj ≥ i_k − i_j + 1 ≥ k − j + 1, because i_k − i_j must be at least k − j. Clearly, S1 ⊃ S2 ⊃ · · · ⊃ Sk are subspaces of Nk and wj ∈ Sj for j = 1, 2, . . . , k − 1, because wj ∈ Mj and wj ∈ U ⊆ Nk by definition. Applying Lemma A.1.1 produces a vector u ∈ S1 \ U ⊆ Nk such that U + span {u} has a basis {uj | j = 1, . . . , k} where uj ∈ Sj ⊂ Mj for j = 1, 2, . . . , k. Define vk := u; then vj ∈ Nj for j = 1, . . . , k. The sets {v1, . . . , vk} and {u1, . . . , uk} are linearly independent with span {v1, . . . , vk} = span {u1, . . . , uk}.

Corollary A.1.3. The sets of vectors {v1, . . . , vk} and {w1, . . . , wk} in Proposition A.1.2 can be chosen to be orthonormal when X = E.

Proof. Proposition A.1.2 guarantees there are linearly independent vectors vj ∈ Nj, wj ∈ Mj for j = 1, . . . , k, such that span {v1, . . . , vk} = span {w1, . . . , wk}. Apply the Gram-Schmidt procedure, Proposition 2.3.2, to get the desired orthonormal sets.

Lemma A.1.4. Let A : E → E be a self-adjoint operator with eigenvalues λ1(A) ≥ · · · ≥ λn(A) corresponding to an orthonormal set of eigenvectors {u1, . . . , un}. Define the subspace M := span {uj, . . . , uk} for 1 ≤ j < k ≤ n. Then any norm-one u ∈ M satisfies

λj(A) ≥ ⟨u, Au⟩ ≥ λk(A).    (A.1)

Proof. Since {ui | i = j, . . . , k} is a basis of M, there exist numbers αj, . . . , αk such that u = ∑_{i=j}^{k} αi ui with ∑_{i=j}^{k} αi^2 = 1. Then from

⟨u, Au⟩ = ∑_{i=j}^{k} ∑_{r=j}^{k} αr αi λi(A) δri = ∑_{i=j}^{k} αi^2 λi(A)

we conclude (A.1).
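
The two-sided bound (A.1) is a Rayleigh-quotient estimate and is easy to check numerically. The short Python sketch below (our own illustration, assuming NumPy; none of the identifiers come from the text) samples random unit vectors from span {uj, . . . , uk} for a random symmetric matrix and verifies λj(A) ≥ ⟨u, Au⟩ ≥ λk(A).

    # Numerical illustration of Lemma A.1.4 (not part of the thesis).
    import numpy as np

    rng = np.random.default_rng(1)
    n, j, k = 8, 2, 5                      # 1-based indices with j < k, as in the lemma
    B = rng.standard_normal((n, n))
    A = (B + B.T) / 2

    w, U = np.linalg.eigh(A)               # ascending eigenvalues, orthonormal eigenvectors
    lam = w[::-1]                          # lambda_1 >= ... >= lambda_n
    U = U[:, ::-1]                         # columns reordered to match lam

    M = U[:, j - 1:k]                      # orthonormal basis of span{u_j, ..., u_k}
    for _ in range(1000):
        c = rng.standard_normal(k - j + 1)
        u = M @ (c / np.linalg.norm(c))    # random unit vector in the span
        q = u @ A @ u                      # Rayleigh quotient <u, Au>
        assert lam[k - 1] - 1e-10 <= q <= lam[j - 1] + 1e-10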

Proposition A.1.5. Let A : E → E be a self-adjoint operator with eigenvalues λ1(A) ≥ · · · ≥ λn(A) corresponding to an orthonormal set of eigenvectors {u1, . . . , un}. Define subspaces Mj := span {u1, . . . , uj} and Nj := span {uj, . . . , un}, j = 1, . . . , n. Then for any indices 1 ≤ i_1 < i_2 < · · · < i_k ≤ n the following statements hold.

(i) Let {x_{i_j} ∈ M_{i_j} | j = 1, . . . , k} be an orthonormal set and M := span {x_{i_1}, . . . , x_{i_k}}. Then the compression PA|M satisfies λj(PA|M) ≥ λ_{i_j}(A) for j = 1, . . . , k.

(ii) Let {x_{i_j} ∈ N_{i_j} | j = 1, . . . , k} be an orthonormal set and N := span {x_{i_1}, . . . , x_{i_k}}. Then the compression PA|N satisfies λj(PA|N) ≤ λ_{i_j}(A) for j = 1, . . . , k.

Proof. Proposition 2.8.6 implies that PA|M : M → M is a self-adjoint operator, so it has eigenvalues λ1(PA|M) ≥ · · · ≥ λk(PA|M) with a corresponding orthonormal set of eigenvectors {y1, . . . , yk}. For some fixed j ∈ {1, . . . , k}, define the subspaces of M: X := span {x_{i_1}, . . . , x_{i_j}} and Y := span {yj, . . . , yk}. By Proposition 5.1.1, there exists a norm-one vector u ∈ X ∩ Y. Since u ∈ X ⊂ M_{i_j}:

λj(PA|M) = ⟨yj, PA|M yj⟩ ≥ ⟨u, PA|M u⟩        (by Lemma A.1.4, u ∈ Y)
         = ⟨u, PAu⟩                           (since u ∈ M)
         = ⟨u, Au⟩                            (by Definition 2.8.1)
         ≥ λ_{i_j}(A).                        (by Lemma A.1.4, u ∈ M_{i_j})

To show the second part of the proposition, notice that PA|N : N → N is also a self-adjoint operator with eigenvalues λ1(PA|N) ≥ · · · ≥ λk(PA|N) and a corresponding orthonormal set of eigenvectors {y1, . . . , yk}. For fixed j ∈ {1, . . . , k}, define the subspaces of N: X := span {x_{i_j}, . . . , x_{i_k}} and Y := span {y1, . . . , yj}. By Proposition 5.1.1, there exists a norm-one vector u ∈ X ∩ Y. Since u ∈ X ⊂ N_{i_j}:

λj(PA|N) = ⟨yj, PA|N yj⟩ ≤ ⟨u, PA|N u⟩        (by Lemma A.1.4, u ∈ Y)
         = ⟨u, PAu⟩                           (since u ∈ N)
         = ⟨u, Au⟩                            (by Definition 2.8.1)
         ≤ λ_{i_j}(A).                        (by Lemma A.1.4, u ∈ N_{i_j})

Thus, the second statement of the result holds.
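
Part (i) can also be illustrated numerically. The sketch below (ours, assuming NumPy; the greedy construction of the orthonormal set is only one convenient choice) builds an orthonormal set with x_{i_j} ∈ M_{i_j}, represents the compression PA|M in that basis as the matrix X^T A X, and checks that its j-th largest eigenvalue is at least λ_{i_j}(A).

    # Numerical sanity check of Proposition A.1.5(i) (not part of the thesis).
    import numpy as np

    rng = np.random.default_rng(2)
    n = 8
    idx = [2, 4, 7]                                # 1-based indices i_1 < i_2 < i_3

    B = rng.standard_normal((n, n))
    A = (B + B.T) / 2
    w, U = np.linalg.eigh(A)
    lam, U = w[::-1], U[:, ::-1]                   # lambda_1 >= ... >= lambda_n, matching eigenvectors

    # Greedily build an orthonormal set with x_{i_j} in M_{i_j} = span{u_1, ..., u_{i_j}}.
    xs = []
    for i in idx:
        v = U[:, :i] @ rng.standard_normal(i)      # random vector in M_{i_j}
        for x in xs:                               # orthogonalize against earlier choices
            v = v - (x @ v) * x
        xs.append(v / np.linalg.norm(v))
    X = np.column_stack(xs)                        # orthonormal columns, X[:, j-1] in M_{i_j}

    comp = X.T @ A @ X                             # matrix of the compression P A|_M in this basis
    mu = np.sort(np.linalg.eigvalsh(comp))[::-1]   # its eigenvalues, nonincreasing
    for j, i in enumerate(idx, start=1):
        assert mu[j - 1] >= lam[i - 1] - 1e-10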

A.2 Proof of Wielandt's minimax principle

Theorem A.2.1 (Wielandt's minimax principle). Let A : E → E be a self-adjoint operator. Then for any indices 1 ≤ i_1 < · · · < i_k ≤ n we have

∑_{j=1}^{k} λ_{i_j}(A) = max_{M1⊂···⊂Mk, dim Mj = i_j}  min_{xj ∈ Mj, {xj} o.n.}  ∑_{j=1}^{k} ⟨xj, Axj⟩,    (A.2)

∑_{j=1}^{k} λ_{i_j}(A) = min_{N1⊃···⊃Nk, dim Nj = n − i_j + 1}  max_{xj ∈ Nj, {xj} o.n.}  ∑_{j=1}^{k} ⟨xj, Axj⟩.    (A.3)

Proof. Assume the self-adjoint operator A has eigenvalues λ1(A) ≥ · · · ≥ λn(A) that correspond to the orthonormal set of eigenvectors {u1, . . . , un}. Define M_{i_j} := span {u1, . . . , u_{i_j}}; then by Lemma A.1.4 any length-one vector xj in M_{i_j} satisfies ⟨xj, Axj⟩ ≥ λ_{i_j}(A). Then

min_{xj ∈ M_{i_j}, {xj} o.n.} ∑_{j=1}^{k} ⟨xj, Axj⟩ ≥ ∑_{j=1}^{k} λ_{i_j}(A).    (A.4)

Letting Mj := M_{i_j} for j = 1, . . . , k defines an increasing chain of subspaces, so

sup_{M1⊂···⊂Mk, dim Mj = i_j}  min_{xj ∈ Mj, {xj} o.n.}  ∑_{j=1}^{k} ⟨xj, Axj⟩ ≥ ∑_{j=1}^{k} λ_{i_j}(A).

Now, let M1 ⊂ · · · ⊂ Mk be any ascending chain of subspaces with dim Mj = i_j, j = 1, . . . , k. We must find an orthonormal set {xj ∈ Mj | j = 1, . . . , k} such that

∑_{j=1}^{k} ⟨xj, Axj⟩ ≤ ∑_{j=1}^{k} λ_{i_j}(A).

Letting N_{i_j} := span {u_{i_j}, . . . , un}, j = 1, . . . , k, defines a descending chain of subspaces with dim N_{i_j} = n − i_j + 1. By Corollary A.1.3, there are orthonormal sets {xj ∈ Mj | j = 1, . . . , k} and {yj ∈ N_{i_j} | j = 1, . . . , k}, such that

N := span {y1 , . . . , yk } = span {x1 , . . . , xk }.

Let P be the projection of E onto N. Then

∑_{j=1}^{k} ⟨xj, Axj⟩ = ∑_{j=1}^{k} ⟨xj, A|N xj⟩              (since xj ∈ N)
                      = ∑_{j=1}^{k} ⟨xj, PA|N xj⟩             (by Definition 2.8.1)
                      = tr(PA|N) = ∑_{j=1}^{k} λj(PA|N)       (by Proposition 2.7.3)
                      ≤ ∑_{j=1}^{k} λ_{i_j}(A),

by the second statement of Proposition A.1.5. Then clearly,

min_{xj ∈ Mj, {xj} o.n.} ∑_{j=1}^{k} ⟨xj, Axj⟩ ≤ ∑_{j=1}^{k} λ_{i_j}(A).

Since this holds for any ascending chain M1 ⊂ · · · ⊂ Mk of subspaces with dim Mj = i_j, j = 1, . . . , k, we obtain the opposite inequality:

sup_{M1⊂···⊂Mk, dim Mj = i_j}  min_{xj ∈ Mj, {xj} o.n.}  ∑_{j=1}^{k} ⟨xj, Axj⟩ ≤ ∑_{j=1}^{k} λ_{i_j}(A).

Together the two inequalities establish Equation (A.2).

We now prove Equation (A.3). Let N_{i_j} := span {u_{i_j}, . . . , un}; then by Lemma A.1.4 any length-one vector xj in N_{i_j} satisfies ⟨xj, Axj⟩ ≤ λ_{i_j}(A). Then

max_{xj ∈ N_{i_j}, {xj} o.n.} ∑_{j=1}^{k} ⟨xj, Axj⟩ ≤ ∑_{j=1}^{k} λ_{i_j}(A).

Letting Nj := N_{i_j}, j = 1, . . . , k, defines a decreasing chain of subspaces with dim Nj = n − i_j + 1. Then

inf_{N1⊃···⊃Nk, dim Nj = n − i_j + 1}  max_{xj ∈ Nj, {xj} o.n.}  ∑_{j=1}^{k} ⟨xj, Axj⟩ ≤ ∑_{j=1}^{k} λ_{i_j}(A).

Let N1 ⊃ · · · ⊃ Nk be any descending chain of subspaces with dim Nj = n − i_j + 1. We must find an orthonormal set {xj ∈ Nj | j = 1, . . . , k} such that

∑_{j=1}^{k} ⟨xj, Axj⟩ ≥ ∑_{j=1}^{k} λ_{i_j}(A).

Letting M_{i_j} := span {u1, . . . , u_{i_j}}, j = 1, . . . , k, defines an ascending chain of subspaces with dim M_{i_j} = i_j. By Corollary A.1.3, there are orthonormal sets {xj ∈ Nj | j = 1, . . . , k} and {yj ∈ M_{i_j} | j = 1, . . . , k}, such that

M := span {y1 , . . . , yk } = span {x1 , . . . , xk }.

Let P be the projection of E onto M. Then

∑_{j=1}^{k} ⟨xj, Axj⟩ = ∑_{j=1}^{k} ⟨xj, A|M xj⟩              (since xj ∈ M)
                      = ∑_{j=1}^{k} ⟨xj, PA|M xj⟩             (by Definition 2.8.1)
                      = tr(PA|M) = ∑_{j=1}^{k} λj(PA|M)       (by Proposition 2.7.3)
                      ≥ ∑_{j=1}^{k} λ_{i_j}(A),

by the first statement of Proposition A.1.5. Then clearly,

max_{xj ∈ Nj, {xj} o.n.} ∑_{j=1}^{k} ⟨xj, Axj⟩ ≥ ∑_{j=1}^{k} λ_{i_j}(A).

Since this holds for any descending chain N1 ⊃ · · · ⊃ Nk of subspaces with dim Nj = n − i_j + 1, we obtain the opposite inequality

inf_{N1⊃···⊃Nk, dim Nj = n − i_j + 1}  max_{xj ∈ Nj, {xj} o.n.}  ∑_{j=1}^{k} ⟨xj, Axj⟩ ≥ ∑_{j=1}^{k} λ_{i_j}(A).

This establishes Equation (A.3), and the proof of Wielandt's theorem is complete.
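
The "≥" half of (A.2), namely inequality (A.4) together with the equality attained at xj = u_{i_j}, can be checked numerically as well. The following sketch (ours, assuming NumPy) does so for the chain M_{i_j} = span {u1, . . . , u_{i_j}}; the term-by-term bound coming from Lemma A.1.4 does not even require the xj to be mutually orthogonal.

    # Numerical sketch of inequality (A.4) and its equality case (not part of the thesis).
    import numpy as np

    rng = np.random.default_rng(3)
    n = 7
    idx = [1, 3, 6]                                   # 1-based indices i_1 < i_2 < i_3

    B = rng.standard_normal((n, n))
    A = (B + B.T) / 2
    w, U = np.linalg.eigh(A)
    lam, U = w[::-1], U[:, ::-1]                      # lambda_1 >= ... >= lambda_n and eigenvectors

    target = sum(lam[i - 1] for i in idx)             # sum_j lambda_{i_j}(A)

    # Equality is attained at x_j = u_{i_j}.
    assert np.isclose(sum(U[:, i - 1] @ A @ U[:, i - 1] for i in idx), target)

    # Lower bound (A.4) for random unit vectors x_j in M_{i_j} = span{u_1, ..., u_{i_j}}.
    for _ in range(200):
        total = 0.0
        for i in idx:
            c = rng.standard_normal(i)
            x = U[:, :i] @ (c / np.linalg.norm(c))    # random unit vector in M_{i_j}
            total += x @ A @ x
        assert total >= target - 1e-10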
