STATISTICAL MACHINE LEARNING IN THE T-EXPONENTIAL FAMILY OF DISTRIBUTIONS

A Dissertation Submitted to the Faculty of Purdue University by Nan Ding

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

May 2013 Purdue University West Lafayette, Indiana


To my family on earth and in heaven.


ACKNOWLEDGMENTS

I have been proud to be a Purdue Boilermaker for the last five years. During these years, I have been blessed to live and work with many kind and intelligent people. In particular, I would like to acknowledge a few people who have made a big impact on my life and studies.

First of all, I would like to express my utmost gratitude to my Ph.D. advisor, Prof. S.V.N. (Vishy) Vishwanathan. Vishy not only taught me how to do solid research, but also influenced me to become a better human being. Because of Vishy's Chinese connections, he always says "nu3 li4 gong1 zuo4, bu2 yao4 shui4 jiao4" (meaning "work hard, do not sleep"), and he himself always works as hard as, if not harder than, all of us. Although Vishy is a very strict advisor, he is also a very patient teacher. He kept teaching me that a presentation or a paper should be as well organized as the layers of an onion, and he kept helping me improve my skills. Although Vishy is very knowledgeable and smart, he is honest about the things he is uncertain of. Discussing with Vishy has always been enjoyable because of his intelligence, integrity, and passion. Of course, a Ph.D. is not just about reading and writing papers. During my four years in Vishy's group, he also provided me with a variety of experiences, including giving lectures, writing proposals, organizing seminars, and so on. In addition, Vishy always cares about our career development. He encourages and helps us build connections in the academic and industrial communities, and every year he has supported and referred us to valuable internships in top research labs and companies. There are so many things that Vishy has done for me, for which I will be forever grateful.

I would also like to deeply thank the following advisors of mine during the past five years: Prof. Alan Qi, who recruited me from Tsinghua University with the prestigious Ross Fellowship. Alan also served as my initial advisor and worked closely with me from day to night during my first year. Dr. Wray Buntine, who was my advisor during my visit

to National ICT Australia. It was a great experience working with Wray on nonparametric Bayesian models; together with Changyou Chen, we enjoyed a fruitful two-year collaboration that led to two publications. Dr. Cedric Archambeau and Shengbo Guo, who kindly hosted me during my internship at Xerox Research Centre Europe. The summer that I spent in France was truly enjoyable and fruitful. Prof. Manfred Warmuth, who graciously served as my co-advisor during Vishy's sabbatical and hosted my two visits to UC Santa Cruz. He is truly a master of the mind, who shared with me a wealth of brilliant ideas about both research and life. And finally, Prof. Jayanta Ghosh and Prof. David Gleich, who along with Alan served on my thesis committee. They gave me comprehensive comments and invaluable advice on my thesis and research.

Besides my co-authors and thesis committee members, there are a few other contributors to this thesis. Vasil Denchev helped me polish the thesis. Dr. Xinhua Zhang spent hours helping me set up and compile the PETSc and TAO packages so that I could run t-logistic regression on large-scale datasets. The idea of generalizing t-logistic regression to mismatch losses first came from Prof. Manfred Warmuth during a discussion at the IMA workshop at the University of Minnesota. The code and experiments on t-CRF are joint work with Changyou Chen at NICTA. I really appreciate their tremendous effort and support.

In addition, my Ph.D. study would not have been so wonderful without the generous help of, and joint work with, my colleagues and collaborators at Purdue, NICTA, XRCE, and UCSC. My thanks go to, but are not limited to, the following amazing people: Nguyen Cao, Shuhao Cao, Francois Caron, Xiaoxiao Chen, Yi Chen, Bo Dai, Jyotishka Datta, Lan Du, Yi Fang, Youhan Fang, Rupesh Gupta, Long He, Pei He, Dunxu Hu, Hongbin Kuang, Te Ke, Zhiqiang Lin, Fangjia Lu, Shin Matsushima, Hai Nguyen, Lichen Ni, Jiazhong Nie, Philip Ritchey, Ankan Saha, Huanyu Shao, Bin Shen, Sanvesh Srivastava, Zhaonan Sun, Xi Tan, Choonhui Teo, Tao Wang, Xu Wang, Rongjing Xiang, Jingjie Xiao, Chao Xu, Pinar Yanardag, Feng Yan, Jin Yu, Lin Yuan, Hyokun Yun, Dan Zhang, Lumin Zhang, and Yao Zhu. I also want to extend my sincere thanks to the following professors, senior researchers, and staff for their kind help: Doug Crabill, Stefania Delassus, Marian Duncan, William Gorman, Holly Graef, Sergey Kirshner, Chuanhai Liu, Hartmut Neven, Jennifer Neville, Shaun Ponders, Luo Si, and Jian Zhang.

Last but not least, I am incredibly grateful to my family for their love and support, which kept me moving forward during this unforgettable journey.


TABLE OF CONTENTS

                                                                          Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
  1.1 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . 4
  1.2 Collaborators and Related Publications . . . . . . . . . . . . . . . 7
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
  2.1 Noise Tolerance of Convex Losses . . . . . . . . . . . . . . . . . . 8
  2.2 Logistic Regression and Exponential Family of Distributions . . . . 10
  2.3 Φ-Exponential Family of Distributions . . . . . . . . . . . . . . . 13
      2.3.1 Φ-Logarithm . . . . . . . . . . . . . . . . . . . . . . . . . 13
      2.3.2 Φ-Exponential . . . . . . . . . . . . . . . . . . . . . . . . 14
      2.3.3 Φ-Exponential Family of Distributions . . . . . . . . . . . . 16
      2.3.4 T-Exponential Family of Distributions . . . . . . . . . . . . 18
  2.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 T-LOGISTIC REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . 21
  3.1 Binary Classification . . . . . . . . . . . . . . . . . . . . . . . 21
  3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
      3.2.1 Bayes-Risk Consistency . . . . . . . . . . . . . . . . . . . . 25
      3.2.2 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . 26
      3.2.3 Multiple Local Minima . . . . . . . . . . . . . . . . . . . . 30
  3.3 Multiclass Classification . . . . . . . . . . . . . . . . . . . . . 32
  3.4 Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . . 37
      3.4.1 Convex Multiplicative Programming . . . . . . . . . . . . . . 37
  3.5 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 39
      3.5.1 Noise Models . . . . . . . . . . . . . . . . . . . . . . . . . 41
      3.5.2 Experiment Design . . . . . . . . . . . . . . . . . . . . . . 45
      3.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
  3.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 T-DIVERGENCE BASED APPROXIMATE INFERENCE . . . . . . . . . . . . . . . . 52
  4.1 Variational Inference in Exponential Family of Distributions . . . . 52
      4.1.1 Mean Field Methods . . . . . . . . . . . . . . . . . . . . . . 54
      4.1.2 Assumed Density Filtering . . . . . . . . . . . . . . . . . . 56
  4.2 T-Entropy and T-Divergence . . . . . . . . . . . . . . . . . . . . . 58
      4.2.1 T-Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 59
      4.2.2 T-Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 62
  4.3 Variational Inference in T-Exponential Family of Distributions . . . 63
      4.3.1 Mean Field Methods . . . . . . . . . . . . . . . . . . . . . . 65
      4.3.2 Assumed Density Filtering . . . . . . . . . . . . . . . . . . 69
  4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 T-CONDITIONAL RANDOM FIELDS . . . . . . . . . . . . . . . . . . . . . . 74
  5.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . 74
      5.1.1 Undirected Graphical Models . . . . . . . . . . . . . . . . . 74
      5.1.2 Conditional Random Fields . . . . . . . . . . . . . . . . . . 75
      5.1.3 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . 77
      5.1.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 78
  5.2 T-Conditional Random Fields . . . . . . . . . . . . . . . . . . . . 79
  5.3 Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . 82
  5.4 2-D T-CRF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
  5.5 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 85
  5.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 GENERALIZED T-LOGISTIC REGRESSION . . . . . . . . . . . . . . . . . . . 92
  6.1 Binary Classification . . . . . . . . . . . . . . . . . . . . . . . 92
  6.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
  6.3 Multiclass Classification . . . . . . . . . . . . . . . . . . . . . 96
  6.4 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 96
  6.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
  7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
  7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
  Appendix A: Fundamentals of Convex Optimizations . . . . . . . . . . . . 108
      A.1 Convex Analysis . . . . . . . . . . . . . . . . . . . . . . . . 108
      A.2 Numerical Optimization . . . . . . . . . . . . . . . . . . . . . 110
  Appendix B: Technical Proofs and Verifications . . . . . . . . . . . . . 111
      B.1 Proof of Theorem 2.3.1 . . . . . . . . . . . . . . . . . . . . . 111
      B.2 Proof of Theorem 2.3.2 . . . . . . . . . . . . . . . . . . . . . 112
      B.3 Proof of Lemma 3.2.1 . . . . . . . . . . . . . . . . . . . . . . 113
      B.4 Proof of Lemma 3.2.2 . . . . . . . . . . . . . . . . . . . . . . 113
      B.5 Proof of Theorem 3.2.3 . . . . . . . . . . . . . . . . . . . . . 114
      B.6 Proof of Theorem 3.4.1 . . . . . . . . . . . . . . . . . . . . . 115
      B.7 Proof of Theorem 4.2.1 . . . . . . . . . . . . . . . . . . . . . 116
      B.8 Verification in Section 3.1 . . . . . . . . . . . . . . . . . . 116
      B.9 Verification in Section 3.2.2 . . . . . . . . . . . . . . . . . 117
      B.10 Verification in Section 3.3 . . . . . . . . . . . . . . . . . . 119
      B.11 Verification in Definition 4.2.2 . . . . . . . . . . . . . . . 120
      B.12 Verification of Equation (4.33) . . . . . . . . . . . . . . . . 120
      B.13 Verification in Section 4.3.1 . . . . . . . . . . . . . . . . . 121
      B.14 Verification in Section 4.3.2 . . . . . . . . . . . . . . . . . 123
      B.15 Verification in Section 6.2 . . . . . . . . . . . . . . . . . . 124
  Appendix C: Additional Figures of Section 3.5 . . . . . . . . . . . . . 125
  Appendix D: Additional Figures of Section 6.4 . . . . . . . . . . . . . 159
  Appendix E: Additional Tables of Section 3.5 . . . . . . . . . . . . . . 173
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

LIST OF TABLES

Table                                                                     Page

1.1 Some popular convex losses used for binary classification. The loss functions are plotted in Figure 1.1. . . . 2

2.1 A few examples of non-convex losses for binary classification. The loss functions are plotted in Figure 2.2. erf denotes the error function. . . . 10

3.1 The robustness of some loss functions for binary classification based on I_l(u). The verifications are provided in Appendix B.9. . . . 30

3.2 Average time (in milliseconds) spent by our iterative scheme and fsolve in computing Ĝ_t(â) for multiclass t-logistic regression. . . . 36

3.3 Summary of the binary classification datasets used in our experiments. n is the total number of examples, d is the number of features, and n+ : n− is the ratio of the number of positive to negative examples. M denotes a million. * denotes datasets not used in Chapter 6 due to high computational costs. . . . 41

3.4 Summary of the multiclass classification datasets used in our experiments. n is the total number of examples, d is the number of features, and nc is the number of classes. * denotes datasets not used in Chapter 6 due to high computational costs. . . . 42

3.5 The number of binary classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table counts the datasets where logistic regression is significantly better; the right part counts those where t-logistic regression is significantly better. The total number of datasets is 24 (the dna and ocr datasets are excluded). . . . 47

3.6 The number of multiclass classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table counts the datasets where logistic regression is significantly better; the right part counts those where t-logistic regression is significantly better. The total number of datasets is 9. . . . 47

4.1 The accumulated prediction error rate on the synthetic online dataset using Bayesian online learning. . . . 73

5.1 Comparisons of p(y1 | X) and p(y1 | y2, y3, X) between t-CRF (t = 1.1 and 1.5) and CRF (t = 1.0) in the 3-node chain example. . . . 81

5.2 Optimal parameters t and λ for CRF and t-CRF in the image denoising and image annotation tasks. (0%) denotes that no extreme noise is added; (20%) denotes that 20% extreme noise is added. . . . 87

6.1 The number of binary datasets for which each value of t2 for the Mis-0 loss is optimal based on cross-validation. The total number of datasets is 20. . . . 98

6.2 The number of binary datasets for which each value of t2 for the Mis-I loss is optimal based on cross-validation. The total number of datasets is 20. . . . 98

6.3 The number of binary datasets for which each value of t1 for the Mis-II loss is optimal based on cross-validation. The total number of datasets is 20. . . . 98

6.4 The number of binary classification datasets on which the test-error difference between mismatch losses is significant. Each column of the table counts the datasets where a certain type of mismatch loss has significantly lower test error than another. The total number of datasets is 20. . . . 99

6.5 The number of multiclass classification datasets on which the test-error difference between mismatch losses is significant. Each column of the table counts the datasets where a certain type of mismatch loss has significantly lower test error than another. The total number of datasets is 8. . . . 99

Appendix Table

E.1 CPU time spent on binary datasets (total time, average time per function evaluation) in seconds. . . . 174

E.2 CPU time spent on multiclass datasets (total time, average time per function evaluation) in seconds. . . . 175

LIST OF FIGURES

Figure                                                                    Page

1.1 Some commonly used convex surrogate loss functions, including hinge loss, logistic loss, and exponential loss, for binary classification. . . . 3

1.2 T-logistic loss for binary classification with different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), the t-logistic loss bends down as u ≪ 0, which caps the influence of outliers. . . . 5

2.1 The Long-Servedio dataset. Points with label +1 are in red, while points with label −1 are in blue. Each blob of data points plays one of three roles: large margin (25%), puller (25%), penalizer (50%). The black double arrow represents the true classifier. The red double arrow represents the optimal classifier of convex losses when 10% of the data labels are flipped (represented by the circles surrounding the blobs). The red double arrow is no longer able to classify the penalizers. . . . 9

2.2 Some commonly used non-convex loss functions, including ramp loss, sigmoid loss, and Savage loss, for binary classification. We omit the probit loss because it is very close to the sigmoid loss. . . . 11

2.3 The left figure depicts log_t for the various values of t indicated. The right figure zooms in to better depict the interval [0, 1] on which log_t is negative. . . . 14

2.4 The left figure depicts exp_t for the various values of t indicated. The right figure zooms in to better depict where exp_t can achieve the value zero. . . . 15

3.1 T-logistic loss for binary classification with different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), the t-logistic loss bends down as u ≪ 0, which caps the influence of outliers. . . . 24

3.2 An illustration of the three robust types. All three types of losses behave similarly for u > 0. As u → −∞, the Type-0 loss goes to +∞, the Type-I loss goes to a constant, and the Type-II loss goes to 0. . . . 29

3.3 The empirical risk of t-logistic regression (upper) and Savage loss (lower) on a toy two-dimensional dataset. T-logistic regression appears to be easier to optimize than Savage loss. . . . 31

3.4 Empirical risk of logistic regression and t-logistic regression on the one-dimensional example. The optimal solutions before and after adding the outlier are significantly different for logistic regression. In contrast, the global optimum of t-logistic regression stays the same. . . . 33

4.1 T-entropy corresponding to two well-known probability distributions. Left: the Bernoulli distribution p(z; µ). Right: the 1-dimensional Student's t-distribution p(z; 0, σ², v), where v = 2/(t−1) − 1. One recovers the SBG entropy by letting t = 1.0. . . . 61

4.2 T-divergence between two distributions. Top: Bernoulli distributions p1 = p(z; µ) and p2 = p(z; 0.5). Bottom: Student's t-distributions. Left: p1 = p(z; µ, 1, v) and p2 = p(z; 0, 1, v). Right: p1 = p(z; 0, σ², v) and p2 = p(z; 0, 1, v). v = 2/(t−1) − 1. One recovers the K-L divergence by letting t = 1.0. . . . 64

4.3 The negative t-divergence between the product of ten 1-dimensional Student's t-distributions and one 10-dimensional Student's t-distribution using the mean field approach for 500 iterations. . . . 69

4.4 The discrepancy D(i) between the true weight vector w(i) and the posterior mean E_{p̃i}[w] of p̃i(w) at each data point i from the synthetic online dataset using Bayesian online learning. Left: case I. Right: case II. . . . 73

5.1 The 3-node chain model. Each node indicates a variable. Each edge on the graph represents a dependency. . . . 75

5.2 The 3-node conditional chain model. Blue nodes indicate the labels; red nodes indicate the data variables. Each edge on the graph represents a factor. . . . 76

5.3 A 2-D conditional model. Blue nodes indicate the labels; red nodes indicate the observed input variables. . . . 84

5.4 Test error of t-CRF and CRF with and without extreme noise added. Left: image denoising task. Right: image annotation task. . . . 88

5.5 Image denoising task. Top row: the dataset (left: the input image; right: the true label). Middle row: the denoising result without extreme noise (left: CRF; right: t-CRF). Bottom row: the denoising result with extreme noise (left: CRF; right: t-CRF). . . . 89

5.6 Image annotation task. The first and third rows are the annotation results without extreme noise (left: CRF; right: t-CRF). The second and fourth rows are the annotation results with extreme noise (left: CRF; right: t-CRF). . . . 90

6.1 Generalized t-logistic regression with t2 = 1 and four different t1: t1 = 1, t1 = 0.7, t1 = 0.4, t1 = 0.1. . . . 95

Appendix Figure

C.1 Experiment on adult9 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 126
C.2 Experiment on alpha Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 127
C.3 Experiment on astro-ph Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 128
C.4 Experiment on aut-avn Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 129
C.5 Experiment on beta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 130
C.6 Experiment on covertype Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 131
C.7 Experiment on delta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 132
C.8 Experiment on epsilon Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 133
C.9 Experiment on gamma Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 134
C.10 Experiment on kdd99 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 135
C.11 Experiment on kdda Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 136
C.12 Experiment on kddb Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 137
C.13 Experiment on longservedio Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 138
C.14 Experiment on measewyner Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 139
C.15 Experiment on mushrooms Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 140
C.16 Experiment on news20 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 141
C.17 Experiment on real-sim Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 142
C.18 Experiment on reuters-c11 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 143
C.19 Experiment on reuters-ccat Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 144
C.20 Experiment on web8 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 145
C.21 Experiment on webspamtrigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 146
C.22 Experiment on webspamunigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 147
C.23 Experiment on worm Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 148
C.24 Experiment on zeta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 149
C.25 Generalization Performance on dna Dataset. . . . 149
C.26 Generalization Performance on ocr Dataset. . . . 149
C.27 Experiment on dna Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 150
C.28 Experiment on letter Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 151
C.29 Experiment on mnist Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 152
C.30 Experiment on protein Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 153
C.31 Experiment on rcv1 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 154
C.32 Experiment on sensitacoustic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 155
C.33 Experiment on sensitcombined Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 156
C.34 Experiment on sensitseismic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 157
C.35 Experiment on usps Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 158
D.1 Experiment on adult9 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 159
D.2 Experiment on alpha Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 159
D.3 Experiment on astro-ph Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 160
D.4 Experiment on aut-avn Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 160
D.5 Experiment on beta Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 161
D.6 Experiment on covertype Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 161
D.7 Experiment on delta Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 162
D.8 Experiment on gamma Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 162
D.9 Experiment on kdd99 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 163
D.10 Experiment on longservedio Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 163
D.11 Experiment on measewyner Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 164
D.12 Experiment on mushrooms Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 164
D.13 Experiment on news20 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 165
D.14 Experiment on real-sim Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 165
D.15 Experiment on reuters-c11 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 166
D.16 Experiment on reuters-ccat Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 166
D.17 Experiment on web8 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 167
D.18 Experiment on webspamunigram Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 167
D.19 Experiment on worm Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 168
D.20 Experiment on zeta Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 168
D.21 Experiment on dna Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 169
D.22 Experiment on letter Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 169
D.23 Experiment on mnist Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 170
D.24 Experiment on protein Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 170
D.25 Experiment on sensitacoustic Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 171
D.26 Experiment on sensitcombined Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 171
D.27 Experiment on sensitseismic Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 172
D.28 Experiment on usps Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 172

ABSTRACT

Ding, Nan Ph.D., Purdue University, May 2013. Statistical Machine Learning in the T-Exponential Family of Distributions. Major Professor: S.V.N. Vishwanathan.

The exponential family of distributions plays an important role in statistics and machine learning. It underlies numerous models, such as logistic regression and probabilistic graphical models. However, exponential family based probabilistic models are vulnerable to outliers. This dissertation aims to design machine learning models based on a more general distribution family, namely the t-exponential family of distributions, and to show that efficient inference algorithms exist for these models. We first focus on the classification problem and propose t-logistic regression, which replaces the exponential family in logistic regression with the t-exponential family and is more robust in the presence of label noise. Second, inspired by variational inference in the exponential family, we define a new t-entropy, which is the Fenchel conjugate of the log-partition function of the t-exponential family. By minimizing the t-divergence, the Bregman divergence based on the t-entropy, between the approximate and the true distribution, we develop efficient variational inference approaches for t-exponential family based graphical models. Using our inference procedure, we generalize conditional random fields (CRF) to t-CRF and show how the t-divergence based mean field approach can be used to approximate the log-partition function. Finally, the t-divergence is combined with t-logistic regression to obtain a generalized family of convex and non-convex loss functions for classification. Empirical evaluation of our models on a variety of datasets is presented to demonstrate their advantages.


1. INTRODUCTION

Consider the classic machine learning problem of binary classification: we are given m training data points {x₁, . . . , xₘ} and their corresponding labels {y₁, . . . , yₘ}, with xᵢ drawn from some vector space X and yᵢ ∈ {+1, −1}. The task is to learn a function f : X → {+1, −1} which can predict the labels of unseen data. In this dissertation, we focus on linear models: f(x) := sign(⟨Φ(x), θ⟩). Here Φ is a feature map, θ are the parameters of the model, ⟨·, ·⟩ denotes an inner product, and sign(z) = +1 if z > 0 and −1 otherwise. One way to learn θ is to define a loss function l(x, y, θ) and minimize the averaged loss, or empirical risk:

min_θ R_emp(θ) := (1/m) Σᵢ₌₁ᵐ l(xᵢ, yᵢ, θ).

In order to prevent overfitting to the training data, it is customary to add a regularizer Ω(θ) to R_emp(θ) and minimize the regularized risk:

min_θ J(θ) := λΩ(θ) + R_emp(θ).

Here λ is a scalar which trades off the importance of the regularizer and the empirical risk. While a variety of regularizers are commonly used (see e.g. [1]), we will restrict our attention to the L2-regularizer:

Ω(θ) = (1/2)‖θ‖₂².

Let â = ⟨Φ(x), θ⟩ and u(x, y, θ) := y · â denote the margin of (x, y). Where it is clear from context, we will use u to denote u(x, y, θ) and uᵢ to denote u(xᵢ, yᵢ, θ). Note that u > 0 if, and only if, f(x) = y, that is, x is correctly classified. Therefore, a natural loss function to define is the 0-1 loss:

l(x, y, θ) = 0 if u > 0, and 1 otherwise.   (1.1)

Unfortunately, the 0-1 loss is non-convex and non-smooth, and it is NP-hard to even approximately minimize the empirical risk with this loss [2]. Therefore, a lot of research effort has been directed towards finding surrogate losses which are computationally tractable. In particular, convex loss functions are in vogue, mainly because the regularized risk minimization problem can then be solved efficiently with readily available tools [3]. Table 1.1 summarizes a few popular convex losses and Figure 1.1 contrasts them with the 0-1 loss.¹

Table 1.1: Some popular convex losses used for binary classification. The loss functions are plotted in Figure 1.1.

Name         | Loss Function
Hinge        | l(x, y, θ) = max(0, 1 − u)
Exponential  | l(x, y, θ) = exp(−u)
Logistic     | l(x, y, θ) = log(1 + exp(−u))
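For concreteness, the three convex surrogates in Table 1.1 are one-line functions of the margin u. The sketch below (illustrative helper names, not the dissertation's code) also evaluates the logistic loss in a numerically stable form:

```python
import math

def hinge_loss(u):
    # Hinge loss: max(0, 1 - u)
    return max(0.0, 1.0 - u)

def exponential_loss(u):
    # Exponential loss: exp(-u)
    return math.exp(-u)

def logistic_loss(u):
    # Logistic loss: log(1 + exp(-u)), rewritten to avoid overflow for u << 0:
    # log(1 + exp(-u)) = max(0, -u) + log(1 + exp(-|u|))
    return max(0.0, -u) + math.log1p(math.exp(-abs(u)))

# All three upper-bound the 0-1 loss and vanish (or tend to 0) as u grows.
```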

Despite many successes of binary classification algorithms based on convex losses, as [4, 5] point out, those algorithms are not noise-tolerant² (see Section 2.1). Intuitively, as can be seen from Figure 1.1, the convex loss functions grow at least linearly on u ∈ (−∞, 0), which causes data with u ≪ 0 to become too important. There has been some recent and not-so-recent work on using non-convex loss functions to alleviate the above problem. Although these non-convex losses are empirically more robust, they also lose some key advantages of convex losses. For instance, finding

¹ Note that the logistic loss and the later t-logistic loss are plotted after dividing the losses by log(2).
² Although the analysis of [4] is carried out in the context of boosting, we believe the results hold for a larger class of algorithms which minimize a regularized risk with a convex loss function.


Fig. 1.1. Some commonly used convex surrogate loss functions, including hinge loss, logistic loss, and exponential loss, for binary classification.

the global optimum becomes very hard because the empirical risk may have multiple local minima. Moreover, unlike certain convex losses such as the logistic loss (see Section 2.2), those non-convex losses do not have a proper probabilistic interpretation. The probabilistic interpretation is important for generalizing these losses to more complex settings, such as modeling interacting factors, where probabilistic graphical models are widely applied [6, 7]. In this dissertation, we propose to investigate a non-convex loss function which is firmly grounded in probability theory. By extending logistic regression from the exponential family to the t-exponential family³, a natural extension of the exponential family of distributions studied in statistical physics [8–10] (reviewed in Section 2.3), we obtain t-logistic regression (as shown in Figure 1.2). Furthermore, we show that our loss can be generalized to more complicated probabilistic models, e.g. t-conditional random fields. In order to make inference efficient in these more complicated models, we study a new t-entropy which is the Fenchel conjugate of the log-partition function of the t-exponential family. We develop two variational inference methods by minimizing the t-divergence, the Bregman divergence of the t-entropy. Finally, we show that the t-divergence can also be combined with t-logistic regression to obtain a more general family of loss functions for classification.

1.1 Dissertation Outline

Our dissertation is structured as follows:

Chapter 2. Background  In this chapter, we review some related background material, including noise tolerance of convex losses, the probabilistic interpretation of logistic regression, the exponential family of distributions, and its generalization, the t-exponential family of distributions.

Chapter 3. T-Logistic Regression  In this chapter, we try to improve the robustness of logistic regression for classification. Our main idea is to use the t-exponential family³ to

³ Also known as the q-exponential family or the Tsallis distribution in statistical physics. C. Tsallis is one of the pioneers of nonextensive entropy and the generalized exponential family.


Fig. 1.2. T-logistic loss for binary classification with different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), the t-logistic loss bends down as u ≪ 0, which caps the influence of outliers.

replace the exponential family for modeling the conditional likelihood of the examples. We demonstrate the robustness of t-logistic regression both theoretically and empirically. We also show that the algorithm is empirically stable under random initialization.

Chapter 4. T-Divergence Based Approximate Inference  In order to work with multivariate probabilistic models, one key challenge is to make inference efficient. Approximate inference is an important technique for dealing with large-scale graphical models based on the exponential family of distributions. In this chapter, we extend the idea to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex conjugacy between the log-partition function of the t-exponential family and a new t-entropy. We propose two approximate inference algorithms for the t-exponential family of distributions.

Chapter 5. T-Conditional Random Fields  In this chapter, we propose the t-conditional random field (t-CRF), which generalizes conditional random fields to the t-exponential family. This new t-CRF abandons the Markov properties as well as the Hammersley-Clifford theorem, and appears to be more robust. It applies the mean field method based on t-divergence to make efficient inference.

Chapter 6. Generalized T-Logistic Regression  This chapter combines t-divergence and t-logistic regression for classification. We obtain a family of convex and non-convex loss functions with all types of robustness.

Chapter 7. Summary  We summarize our contributions and provide a discussion on future work in the last chapter of the dissertation.

Appendix A. Fundamentals of Convex Optimizations  Appendix A provides a brief review of some concepts and properties in convex analysis as well as two well-known numerical optimization methods used in this dissertation.

Appendix B. Technical Proofs and Verifications  Appendix B provides the technical proofs and verifications in this dissertation.

Appendices C/D/E. Additional Figures and Tables  Appendices C, D and E provide additional figures and tables from the empirical results in this dissertation.

1.2 Collaborators and Related Publications

Chapter 3 was joint work with S.V.N. Vishwanathan, Vasil Denchev, and Manfred Warmuth. The work was first published as "t-logistic regression" in Advances in Neural Information Processing Systems 23, 2010. Chapter 4 was joint work with S.V.N. Vishwanathan and Alan Qi. The work was first published as "t-divergence based approximate inference" in Advances in Neural Information Processing Systems 24, 2011. Chapter 5 was joint work with S.V.N. Vishwanathan and Changyou Chen. The work has not yet been published. Chapter 6 was joint work with S.V.N. Vishwanathan and Manfred Warmuth. The work has not yet been published.


2. BACKGROUND

In this chapter, we present some existing literature and background material that is used later in this dissertation. We first review the famous example proposed in [4], which shows that uniform random label noise defeats all convex classifiers. Next, we review logistic regression and discuss its probabilistic interpretation and its relation to the exponential family of distributions. Finally, we review the t-exponential family of distributions.

2.1 Noise Tolerance of Convex Losses

Convexity is a very attractive property because it ensures that the regularized risk minimization problem has a unique global optimum¹ [3]. However, as was recently shown by [4], learning algorithms based on convex loss functions are not robust to noise. In [4], the authors constructed an interesting dataset to show that convex losses are not tolerant to uniform label noise (label noise is added by flipping a portion of the labels of the training data). In their dataset, each data point has a 21-dimensional feature vector and plays one of three possible roles: large margin examples (25%; x₁,...,₂₁ = y); pullers (25%; x₁,...,₁₁ = y, x₁₂,...,₂₁ = −y); and penalizers (50%; randomly select 5 of the first 11 coordinates and 6 of the last 10 coordinates and set them to y, and set the remaining coordinates to −y). Note that all the data points have the same magnitude in terms of the L1, L2, and L∞ norms. This dataset is illustrated in Figure 2.1. We use red blobs to represent the points with label +1, and blue blobs to represent the points with label −1. Each blob plays one of the three roles as marked on the figure. Without adding label noise, the black double arrow (N-S) is the optimal classifier of the convex losses, which classifies the clean data perfectly. However, if we add 10% label noise into the dataset (represented by narrow red or blue circles surrounding the blue or red blobs), the optimal classifier of the convex losses

¹ By unique global optimum, we mean the uniqueness of the minimum objective.


Fig. 2.1. The Long-Servedio dataset. Points with label +1 are in red, while points with label −1 are in blue. Each blob of data points plays one of the three roles: large margin (25%), puller (25%), penalizer (50%). The black double arrow represents the true classifier. The red double arrow represents the optimal classifier of convex losses when 10% of data labels are flipped (represented by the circles surrounding the blobs). The red double arrow is no longer able to classify the penalizers.

changes to the red double arrow (NW-SE). Obviously, the new classifier is no longer able to distinguish the penalizers. We can intuitively see the reason from the shape of the convex loss functions. According to Figure 1.1, the convex losses grow at least linearly, with slope at least |l′(0)|, on u ∈ (−∞, 0), which introduces an extremely large loss for a data point with u ≪ 0. Therefore, the flipped large margin examples in Figure 2.1 dramatically increase the empirical risk of the black classifier, which becomes larger than that of the red classifier. Since convex losses are not robust against random label noise, many non-convex losses have been investigated to improve the robustness of the classifier. We list some, but not all, commonly used non-convex losses in Table 2.1. However, those non-convex losses have their own problems. First of all, although the non-convex losses are empirically more robust, they also lose some key advantages of convex losses; in particular, optimization may get stuck in local minima of their empirical risk. More importantly, over the past decades, probabilistic graphical models [6, 7] have been widely used as powerful and efficient tools to model interacting factors in multivariate data. However, none of those losses has a proper probabilistic interpretation (e.g. see Section 2.2), which largely limits their generalization to more complicated applications.
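The Long-Servedio construction described above can be sketched directly from the stated proportions. The generator below is illustrative (function and role names are ours, not from [4]); every feature is ±1, so all examples share the same L1, L2 and L∞ norms:

```python
import random

def make_long_servedio_example(y, role, rng):
    """One 21-dimensional feature vector for label y in {+1, -1}.

    role: 'large_margin' -> all 21 coordinates equal y
          'puller'       -> first 11 coordinates y, last 10 coordinates -y
          'penalizer'    -> a random 5 of the first 11 and 6 of the last 10
                            coordinates equal y, the rest equal -y
    """
    if role == 'large_margin':
        return [y] * 21
    if role == 'puller':
        return [y] * 11 + [-y] * 10
    # penalizer
    x = [-y] * 21
    for i in rng.sample(range(11), 5):
        x[i] = y
    for i in rng.sample(range(11, 21), 6):
        x[i] = y
    return x

rng = random.Random(0)
example = make_long_servedio_example(+1, 'penalizer', rng)
```

Note that a penalizer still has a (small) positive margin with respect to the true all-coordinates classifier: 11 coordinates agree with y and 10 disagree.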

Table 2.1: A few examples of non-convex losses for binary classification. The loss functions are plotted in Figure 2.2. erf is the error function.

Name     | Loss Function
Probit   | l(x, y, θ) = 1 − erf(u)
Ramp     | l(x, y, θ) = min(2, max(0, 1 − u))
Savage   | l(x, y, θ) = 4/(1 + exp(u))²
Sigmoid  | l(x, y, θ) = 2/(1 + exp(u))

2.2 Logistic Regression and Exponential Family of Distributions

In contrast, the logistic loss, the loss function of logistic regression, is well motivated from a probabilistic perspective. As shown in [11, 12] (also see Section 5.1), its generalization to probabilistic graphical models, e.g. conditional random fields, is natural and convenient. In this section, we briefly review logistic regression and its relation to the exponential family of distributions [13]. In statistics, the data points in a dataset are typically assumed to be independently and identically distributed (i.i.d.), which allows us to write the conditional likelihood of the entire dataset (X, y) = {(xᵢ, yᵢ)} with i = 1, . . . , m as

p(y | X, θ) = Πᵢ₌₁ᵐ p(yᵢ | xᵢ, θ).   (2.1)


Fig. 2.2. Some commonly used non-convex loss functions, including ramp loss, sigmoid loss, and Savage loss, for binary classification. We omit the probit loss because it is very close to sigmoid loss.

To avoid overfitting to the data, we add a prior p(θ) on the parameter θ. Therefore, according to the Bayes rule, the posterior of θ is obtained as p(θ | y, X) = p(y | X, θ)p(θ)/p(y | X), and the maximum a-posteriori (MAP) estimate of θ is obtained by minimizing

− log p(θ | y, X) = − Σᵢ₌₁ᵐ log p(yᵢ | xᵢ; θ) − log p(θ) + const.,   (2.2)

where log p(y | X) is neglected since it is independent of θ, and − log p(θ) serves as the regularizer. In logistic regression, p(y | x, θ) is modeled using the exponential family of distributions. The exponential family of distributions [13] of a set of random variables z is a parametric distribution family defined as²

p(z; θ) := exp(⟨Φ(z), θ⟩ − G(θ)),   (2.3)

where ⟨·, ·⟩ is the inner product, Φ(z) is a map from z ∈ Z to the sufficient statistics, and θ is commonly referred to as the natural parameter; it lives in the space dual to Φ(z) (see Theorem 4.1.1). G(θ) is a normalizer, also known as the log-partition function, which ensures that p(z; θ) is properly normalized:

G(θ) = log ∫_Z exp(⟨Φ(z), θ⟩) dz.   (2.4)

The exponential family of distributions has many important properties and applications. Since many of them are non-trivial, we will review and compare them with their generalizations later in Section 2.3 and Section 4.1. For binary logistic regression,

p(y | x; θ) = exp(⟨Φ(x, y), θ⟩ − G(x; θ)),   (2.5)

where Φ(x, y) = (y/2)Φ(x), so that

G(x; θ) = log[exp((1/2)⟨Φ(x), θ⟩) + exp(−(1/2)⟨Φ(x), θ⟩)].

² Traditionally, exponential family distributions are written as p(z; θ) := p₀(z) exp(⟨Φ(z), θ⟩ − G(θ)). For ease of exposition we ignore the base measure p₀(z) in this dissertation.
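The binary log-partition function above is easy to evaluate stably, and doing so confirms the logistic-loss identity derived next. A small sketch (helper names are ours):

```python
import math

def binary_log_partition(a_hat):
    # G(x; theta) = log(exp(a/2) + exp(-a/2)) for a = <Phi(x), theta>,
    # computed stably as G = |a|/2 + log(1 + exp(-|a|)).
    return abs(a_hat) / 2.0 + math.log1p(math.exp(-abs(a_hat)))

def neg_log_likelihood(a_hat, y):
    # -log p(y | x; theta) = -(y/2) a + G(a)
    return -(y / 2.0) * a_hat + binary_log_partition(a_hat)

def logistic_loss(u):
    # log(1 + exp(-u)), stable form
    return max(0.0, -u) + math.log1p(math.exp(-abs(u)))

# neg_log_likelihood(a, y) == logistic_loss(y * a) for every margin.
```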

The function l(x, y; θ) := − log p(y | x; θ) is the logistic loss of the data (x, y), because

l(x, y, θ) = −(y/2)⟨Φ(x), θ⟩ + log[exp((1/2)⟨Φ(x), θ⟩) + exp(−(1/2)⟨Φ(x), θ⟩)]
           = log(1 + exp(−y⟨Φ(x), θ⟩)).

2.3 φ-Exponential Family of Distributions

The convexity of logistic regression comes essentially from its use of the exponential family to model the conditional distribution. The thin-tailed nature of the exponential family makes it unsuitable for designing robust algorithms in the presence of noisy data. In the past several decades, effort has also been devoted to developing alternative, generalized distribution families in statistics [14, 15], statistical physics [8, 10], and most recently in machine learning [16]. Of particular interest to us is the t-exponential family, which was first proposed by Tsallis and co-workers [10, 17, 18]. It is a special case of the more general φ-exponential family of Naudts [8, 9]. In this section, we begin by reviewing the generalized log_φ and exp_φ functions which were introduced in statistical physics by [8, 9]. Then, these generalized exponential functions are used to define the φ-exponential family of distributions [9] and the t-exponential family of distributions as a special case.

2.3.1 φ-Logarithm

The φ-logarithm, log_φ,³ is defined as follows:

Definition 2.3.1 (φ-logarithm [8, 9]) Let φ : [0, +∞) → [0, +∞) be strictly positive and non-decreasing on (0, +∞). Define log_φ via

log_φ(x) := ∫₁ˣ 1/φ(y) dy.   (2.6)

If this integral converges for all finite x > 0, then log_φ is called a φ-logarithm.

³ Note that throughout this dissertation, log_φ and log_t are defined as in (2.6) and (2.7). The subscripts do not represent the log base.

To see that this definition generalizes log, simply set φ(y) = y. Clearly, the gradient of log_φ(x) is 1/φ(x), from which it follows that log_φ is a concave increasing function. Furthermore, log_φ is negative on (0, 1), positive on (1, +∞), with log_φ(1) = 0. Of course, the integral may diverge at x = 0. All of these are properties of the familiar log function. An important example is:

Example 1 (t-logarithm) Let φ(x) = xᵗ, t > 0. Then

log_t(x) := log(x) if t = 1, and (x^{1−t} − 1)/(1 − t) otherwise,   (2.7)

and

(d/dx) log_t(x) = x^{−t}.   (2.8)
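Definition (2.7) transcribes directly into code; the sketch below is an illustrative helper (the t = 1 case is handled separately, as in the definition):

```python
import math

def log_t(x, t):
    """t-logarithm of (2.7); reduces to the natural log at t = 1."""
    if x <= 0:
        raise ValueError("log_t is defined for x > 0")
    if t == 1:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

# log_t(1) = 0 for every t, and the derivative at x is x**(-t) as in (2.8).
```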

Figure 2.3 visualizes log_t for various values of t and contrasts it with the familiar log.


Fig. 2.3. The left figure depicts log_t for the various values of t indicated. The right figure zooms in to better depict the interval [0, 1], on which log_t is negative.

2.3.2 φ-Exponential

The inverse of log_φ is the φ-exponential function, denoted exp_φ. When log_φ takes on a finite value, this is well defined. But, unlike log, there is no guarantee that log_φ takes on all values in ℝ. Therefore, define exp_φ(x) = 0 if x is less than every element of range(log_φ), and exp_φ(x) = +∞ if x is larger than every element of range(log_φ). Properties of exp_φ, such as convexity, mirror those of log_φ [8, 9]. A key difference involves the fact that exp is the only non-trivial function which is its own derivative. However, exp_φ has the following property:

(d/dx) exp_φ(x) = φ(exp_φ(x)).   (2.9)

Example 2 (t-exponential) Let [x]₊ be x if x > 0 and 0 otherwise. Continuing with φ(x) = xᵗ, t > 0, we have

exp_t(x) := exp(x) if t = 1, and [1 + (1 − t)x]₊^{1/(1−t)} otherwise.   (2.10)

Elementary calculus shows that

(d/dx) exp_t(x) = exp_t(x)ᵗ.   (2.11)
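Similarly, (2.10) transcribes directly; the sketch below (illustrative helpers) also checks that exp_t inverts the t-logarithm of Example 1 where both are finite:

```python
import math

def exp_t(x, t):
    """t-exponential of (2.10): [1 + (1 - t) x]_+ ** (1 / (1 - t))."""
    if t == 1:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    return 0.0 if base <= 0 else base ** (1.0 / (1.0 - t))

def log_t(x, t):
    """t-logarithm of (2.7), the inverse of exp_t."""
    if t == 1:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

# For t > 1, exp_t(-x) decays only polynomially as x grows: a heavy tail,
# always strictly larger than exp(-x) for x large.
```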


Fig. 2.4. The left figure depicts exp_t for the various values of t indicated. The right figure zooms in to better depict when exp_t can achieve the value zero.

Figure 2.4 shows some t-exponential functions. They are convex, increasing functions. It is clear that exp_t decays towards 0 more slowly as t increases. This property leads to a family of heavy-tailed distributions for t > 1. Since log_φ is an increasing function, it follows that its inverse exp_φ is also an increasing function. Since φ is a non-decreasing function, ∇ₓ exp_φ(x) = (φ ∘ exp_φ)(x) is also an increasing function. Using Theorem 24.1 in [19], it follows that exp_φ is a strictly convex function.

2.3.3 φ-Exponential Family of Distributions

[9] used the φ-exponential function to define the parametric distribution family

p(z; θ) := exp_φ(⟨Φ(z), θ⟩ − G_φ(θ)),   (2.12)

where Φ(z) is a map from z ∈ Z to the sufficient statistics, and θ is the natural parameter. G_φ(θ) is the normalizer of the φ-exponential family, chosen such that

∫_Z exp_φ(⟨Φ(z), θ⟩ − G_φ(θ)) dz = 1,

and G_φ(θ) ≠ ∫_Z exp_φ(⟨Φ(z), θ⟩) dz in general. A closely related distribution, which often appears when working with φ-exponential families, is the so-called escort distribution:

Definition 2.3.2 (Escort distribution) Let φ : [0, +∞) → [0, +∞) be strictly positive and non-decreasing on (0, +∞). For a φ-exponential family of distributions,

q(z; θ) := φ(p(z; θ))/Z(θ)   (2.13)

is called the escort distribution of p(z; θ). Here Z(θ) = ∫_Z φ(p(z; θ)) dz is the normalizing constant which ensures that the escort distribution integrates to 1.

One of the crucial properties of exponential families is that the log-partition function G is convex, and it can be used to generate cumulants of the distribution simply by taking derivatives.
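Before continuing, note that in the discrete case the escort (2.13) with φ(x) = xᵗ is just a renormalized power of the probabilities. A minimal sketch (illustrative helper):

```python
def escort(p, t):
    """Escort of a discrete distribution p under phi(x) = x**t (Definition 2.3.2)."""
    powered = [pi ** t for pi in p]
    Z = sum(powered)
    return [w / Z for w in powered]

# For t > 1 the escort sharpens the distribution (high-probability states gain
# mass); for t < 1 it flattens it; t = 1 leaves it unchanged.
p = [0.7, 0.2, 0.1]
q = escort(p, 2.0)
```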

Theorem 2.3.1 (Log-partition function [13]) If the regularity condition

∇_θ ∫_Z p(z; θ) dz = ∫_Z ∇_θ p(z; θ) dz   (2.14)

holds, then

∇_θ G(θ) = E[Φ(z)],   ∇²_θ G(θ) = Var[Φ(z)],   (2.15)

and G(θ) is convex.

The proof of the above theorem is included in Appendix B.1. Somewhat surprisingly, G_φ(θ) of the φ-exponential family shares some similar properties with G(θ) of the exponential family. As the following theorem asserts, its first derivative can still be written as an expectation of Φ(z), but now with respect to the escort distribution, in contrast with Theorem 2.3.1. The proof of the theorem is included in Appendix B.2.

Theorem 2.3.2 (φ-log-partition function [9, 16]) The function G_φ(θ) is convex. Moreover, if the following regularity condition

∇_θ ∫_Z p(z; θ) dz = ∫_Z ∇_θ p(z; θ) dz   (2.16)

holds, then

∇_θ G_φ(θ) = E_{q(z;θ)}[Φ(z)].   (2.17)

Before moving on, we briefly discuss the regularity condition (2.16), which concerns the legality of swapping the differentiation over a parameter with the integration over the variables. Readers not interested in the following discussion may skip to Section 2.3.4. This is a fairly standard, yet technical, requirement, which is often proved using the Dominated Convergence Theorem (see e.g. Section 9.2 of [20]). It holds, for instance, when E_{q(z;θ)}|Φ(z)| < ∞ and |∇_θ G_φ(θ)| < ∞. Here |·| denotes the L1 norm. This condition may not hold for an arbitrary φ-exponential family. Here is one example:

Example 3 Let z ∈ [1, +∞) and Φ(z) = z. Consider the φ-exponential family with φ(x) = xᵗ (later referred to as the t-exponential family). Using (2.10) and (2.12), the resulting density can be written as

p(z; θ) = (1 + (1 − t)(θz − G_t(θ)))^{1/(1−t)}.

If we compute

E_{q(z;θ)}|Φ(z)| = E_{q(z;θ)}|z|
 = (1/Z(θ)) ∫₁^{+∞} (1 + (1 − t)(θz − G_t(θ)))^{t/(1−t)} |z| dz
 = (1/Z(θ)) ∫₁^{+∞} (z^{(1−t)/t} + (1 − t)(θz − G_t(θ)) z^{(1−t)/t})^{t/(1−t)} dz
 = (1/Z(θ)) ∫₁^{+∞} (T₁ + T₂)^{t/(1−t)} dz,

where T₁ := (1 − (1 − t)G_t(θ)) z^{(1−t)/t} and T₂ := (1 − t)θ z^{1/t}, then whenever t ≥ 2 the integral diverges, because lim_{z→+∞} T₁ + T₂ = O(z^{1/t}), and hence ∫₁^{+∞} (T₁ + T₂)^{t/(1−t)} dz → +∞.

2.3.4 T-Exponential Family of Distributions

One of the most important members of the φ-exponential family of distributions is the t-exponential family of distributions, which is defined by using the exp_t function (2.10) in (2.12):

p(z; θ) = exp_t(⟨Φ(z), θ⟩ − G_t(θ)).   (2.18)

In fact, the t-exponential family was first proposed in the 1980s by Tsallis [10, 21].⁴ The corresponding escort distribution is given by

q(z; θ) = p(z; θ)ᵗ / ∫ p(z; θ)ᵗ dz.   (2.19)

As can be seen in Figure 2.4, exp_t, for t > 1, decays towards 0 more slowly than the exp function. Consequently, the t-exponential family of distributions becomes a family of heavy-tailed distributions for t > 1. Although the concept of the t-exponential family is relatively new, distributions that belong to this family have been widely used for years. For example, in linear regression problems, it is well known that the Gaussian distribution is not robust if extreme outliers exist.

⁴ Note that Tsallis used the term q-exponential family. However, we prefer t-exponential family to avoid confusion between the exponent q and the escort distribution q.

Instead, the Student's t-distribution is a common substitute on noisy datasets; see e.g. [22]. Interestingly, the Student's t-distribution is actually a member of the t-exponential family.

Example 4 (Student's-t distribution) Recall that a k-dimensional Student's-t distribution St(z | µ, Σ, v) with 0 < v < 2 degrees of freedom has the following probability density function:

St(z | µ, Σ, v) = Γ((v + k)/2) / [(πv)^{k/2} Γ(v/2) |Σ|^{1/2}] · (1 + (z − µ)⊤(vΣ)^{−1}(z − µ))^{−(v+k)/2}.   (2.20)

Here Γ(·) denotes the usual Gamma function. To see that the Student's-t distribution is a member of the t-exponential family, first set −(v + k)/2 = 1/(1 − t) and

Ψ = (Γ((v + k)/2) / [(πv)^{k/2} Γ(v/2) |Σ|^{1/2}])^{−2/(v+k)}

to rewrite (2.20) as

St(z | µ, Σ, v) = (Ψ + Ψ · (z − µ)⊤(vΣ)^{−1}(z − µ))^{1/(1−t)}.   (2.21)

Next set Φ(z) = [z; zz⊤] and θ = [−2ΨKµ/(1 − t); ΨK/(1 − t)] with K := (vΣ)^{−1}. Then define

⟨Φ(z), θ⟩ = (Ψ/(1 − t)) (z⊤Kz − 2µ⊤Kz)  and  G_t(θ) = −(Ψ/(1 − t)) (µ⊤Kµ + 1) + 1/(1 − t)

to rewrite (2.21) as

St(z | µ, Σ, v) = (1 + (1 − t)(⟨Φ(z), θ⟩ − G_t(θ)))^{1/(1−t)}.

Comparing with (2.10) clearly shows that

St(z | µ, Σ, v) = exp_t(⟨Φ(z), θ⟩ − G_t(θ)).

Using (2.19) and some simple algebra yields the escort distribution of the Student's-t distribution:

q(z; θ) = St(z | µ, vΣ/(v + 2), v + 2).

Interestingly, the mean of the Student's-t pdf is µ and its variance is vΣ/(v − 2), while the mean and variance of the escort are µ and Σ, respectively.
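The reparametrization in Example 4 can be verified numerically in one dimension (k = 1). The values of v, µ and Σ below are arbitrary illustrative choices, not from the dissertation; the check exploits the identity Ψ^{1/(1−t)} = const:

```python
import math

# Illustrative 1-d Student's-t: v degrees of freedom, location mu, scale Sigma.
v, k = 3.0, 1
mu, Sigma = 0.5, 2.0
t = 1.0 + 2.0 / (v + k)  # solves -(v + k)/2 = 1/(1 - t)

const = math.gamma((v + k) / 2) / ((math.pi * v) ** (k / 2) * math.gamma(v / 2) * Sigma ** 0.5)
Psi = const ** (-2.0 / (v + k))

def student_pdf(z):
    # Density (2.20) for k = 1
    return const * (1.0 + (z - mu) ** 2 / (v * Sigma)) ** (-(v + k) / 2)

def texp_form(z):
    # The rewritten density (2.21); agrees because Psi ** (1/(1-t)) = const
    return (Psi + Psi * (z - mu) ** 2 / (v * Sigma)) ** (1.0 / (1.0 - t))
```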

2.4 Chapter Summary

In this chapter, we reviewed some background material that is used later in this dissertation. We illustrated the example of [4], which shows that convex losses are not tolerant to random label noise. We reviewed logistic regression and discussed its relation to the exponential family of distributions. Finally, we reviewed the t-exponential family as a special case of the more general φ-exponential family of distributions.


3. T-LOGISTIC REGRESSION

Logistic regression is not robust against random label noise, essentially because its conditional distribution is modeled by an exponential family distribution. This chapter introduces a new algorithm, t-logistic regression. The motivation of t-logistic regression is the same as that for using the Student's t-distribution in linear regression [22]. In classification, we believe that the robustness of logistic regression can likewise be improved by using a heavy-tailed t-exponential family distribution. We show that t-logistic regression is Bayes-risk consistent and more robust against outliers than convex losses, although it may yield multiple local minima due to non-convexity. Finally, we conduct extensive experiments, including tens of large-scale datasets, and show that t-logistic regression is robust against various types of label noise and in practice does not get stuck in bad local minima.

3.1 Binary Classification

In t-logistic regression, we model the conditional likelihood of a data point (x, y) by a t-exponential family distribution,

p(y | x; θ) = exp_t(⟨Φ(x, y), θ⟩ − G_t(x; θ)) = exp_t((y/2)⟨Φ(x), θ⟩ − G_t(x; θ)),   (3.1)

where t > 1, and the normalizer G_t(x; θ) is the solution of

exp_t((1/2)⟨Φ(x), θ⟩ − G_t(x; θ)) + exp_t(−(1/2)⟨Φ(x), θ⟩ − G_t(x; θ)) = 1.   (3.2)

By defining â = ⟨Φ(x), θ⟩ and G_t(â) = G_t(x; θ), we can simplify (3.2) to

exp_t(â/2 − G_t(â)) + exp_t(−â/2 − G_t(â)) = 1.   (3.3)

Note that G_t(â) = G_t(−â).

The key challenge in using the t-exponential family is that no closed-form solution¹ exists for computing G_t(â) in (3.3). However, we provide an iterative method which computes G_t(â) efficiently. The outline of the algorithm is described in Algorithm 1.

Algorithm 1: Iterative algorithm for computing G_t for binary t-logistic regression.
  Input: â ≥ 0
  Output: G_t(â)
  ã ← â;
  while ã not converged do
    Z(ã) ← 1 + exp_t(−ã);
    ã ← Z(ã)^{1−t} â;
  end
  G_t(â) ← −log_t(1/Z(ã)) + â/2;

The convergence of this iterative algorithm is verified in Appendix B.8. In practice, the algorithm takes less than 20 iterations to converge to an accuracy of 10⁻¹⁰. We plot the t-logistic loss

l(x, y, θ) = −log exp_t(u/2 − G_t(u)),

the negative logarithm of (3.1), as a function of the margin u = yâ in Figure 3.1. We find that the t-logistic loss is quasi-convex and bends down as the margin of a data point becomes too negative. The larger the t, the greater the bending-down effect. At t = 1, t-logistic regression reduces to logistic regression, and the loss function becomes convex. This is not surprising since at t = 1 the t-exponential family becomes the exponential family.

¹ There are a few exceptions. For example, when t = 2, G_t(â) = √(1 + â²/4).

=−

(3.5)

= − yq(−y| x; θ)Φ(x) p(y| x; θ)t−1 | {z }

(3.6)

ξ

where q(y| x; θ) =

p(y| x;θ)t , p(y| x;θ)t +p(−y| x;θ)t

(3.4) is from (2.9), and (3.5) is from (3.1) and The-

orem 2.3.2. In (3.6), the gradient of the loss function of (x, y) is associated with a forgetting variable ξ, which disappears as t = 1. As u = y hΦ(x), θi gets more negative, ξ decreases accordingly. Intuitively, the existence of forgetting variable improves the robustness of tlogistic regression by forgetting the influence of the outliers with low likelihood. We will discuss robustness in more detail in Section 3.2.2.

3.2

Properties In this section, we are going to discuss three key properties of t-logistic regression.

Firstly, we verify Bayes-risk consistency, which is an important statistical property of a binary loss function. Secondly, we formally show that t-logistic regression is robust against outliers compared to logistic regression. Thirdly, from a practical point of view, we investigate the local minima of non-convex loss functions. We show that the empirical risks of almost all non-convex losses including t-logistic regression may have multiple local minima. However, in practice, we will show in Section 3.5 that t-logistic regression is stable.


Fig. 3.1. T-logistic loss for binary classification with different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), the t-logistic loss bends down as u ≪ 0, which caps the influence of outliers.

3.2.1 Bayes-Risk Consistency

Since all surrogate losses are a substitute for the 0-1 loss, it is natural to ask whether a surrogate loss is statistically consistent. To answer this question, a crucial criterion known as Bayes-risk consistency is used (see e.g. [23, 24]). Let us denote by η(x) = p(y = 1 | x) the underlying true conditional distribution of the label y given x, and let â = ⟨Φ(x), θ⟩. The expected risk of a binary loss function l is

C_l(η, â) = E_η[l(yâ)] = ηl(â) + (1 − η)l(−â).

Since sign(â) predicts the label of a point based on its feature x, Bayes-risk consistency requires the optimal â* of the expected risk C_l(η, â) given η to have the same sign as the Bayes decision rule:

sign[â*] = sign[2η − 1].   (3.7)

[23] further shows that all three convex surrogate loss functions in Table 1.1 are Bayes-risk consistent. Now let us verify the Bayes-risk consistency of the t-logistic loss

l(yâ) = −log exp_t(yâ/2 − G_t(â)).

We have

C_l(η, â) = ηl(â) + (1 − η)l(−â)
 = −η log exp_t(â/2 − G_t(â)) − (1 − η) log exp_t(−â/2 − G_t(â))
 = −η log exp_t(â/2 − G_t(â)) − (1 − η) log(1 − exp_t(â/2 − G_t(â))),   (3.8)

where the last step uses exp_t(−â/2 − G_t(â)) = 1 − exp_t(â/2 − G_t(â)), which follows from (3.3). Let us define r = exp_t(â/2 − G_t(â)); then (3.8) becomes

−η log(r) − (1 − η) log(1 − r).

We can obtain the optimal r* by taking the derivative with respect to r and setting it to 0:

−η/r* + (1 − η)/(1 − r*) = 0,

26 ∗

a∗ )). Since which yields r∗ = η. Therefore, the optimal b a∗ satisfies η = expt ( ba2 − Gt (b ∗

a∗ )), we can take logt of the two and substract them, which yields, 1 − η = expt (− ba2 − Gt (b b a∗ = logt η − logt (1 − η).

(3.9)

It is clear that (3.9) satisfies (3.7), because logt is an increasing function and b a∗ > 0 if and only η > 12 . Therefore, t-logistic loss is Bayes-risk consistent.
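The stationarity argument above is easy to check numerically: for a fixed η, the inner objective −η log r − (1 − η) log(1 − r) should be minimized at r = η. The sketch below verifies this on a grid; the choice η = 0.7 and the grid resolution are illustrative choices of ours, not from the dissertation.

```python
import numpy as np

eta = 0.7                                   # illustrative true conditional p(y = 1 | x)
r = np.linspace(1e-4, 1 - 1e-4, 100001)     # candidate values of r = exp_t(a/2 - G_t(a))
risk = -eta * np.log(r) - (1 - eta) * np.log(1 - r)

r_star = r[np.argmin(risk)]                 # grid minimizer
print(r_star)                               # close to eta, matching r* = eta
```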

3.2.2 Robustness

In this section, we theoretically investigate the robustness of t-logistic regression. There is no unique definition of robustness (see e.g. [25, 26]), and we will mainly focus on two of them. Both definitions, however, require computing the global optimum, which is infeasible for non-convex losses. Instead, we use necessary conditions of the definitions and propose a function I_l(u) for visualization. Finally, we use I_l(u) to classify binary losses into three robust types, and show that t-logistic regression is fundamentally different from convex losses and from many of the other non-convex losses in Table 2.1.

Definitions of Robustness

Consider a dataset containing m data points x_1, ..., x_m with labels y_1, ..., y_m, and assume that θ* is the global optimum of the regularized risk
J(θ) = (1/m) Σ_{i=1}^m l(x_i, y_i, θ) + (λ/2) ‖θ‖₂².
For simplicity, assume that the loss function l(x, y, θ) is continuous and differentiable. From the optimality condition of a differentiable² objective function, θ* must satisfy
∇_θ J(θ*) = (1/m) Σ_{i=1}^m ∇_θ l(x_i, y_i, θ*) + λ θ* = 0.   (3.10)

²For nondifferentiable functions, one can replace the gradient by a subgradient and obtain a similar optimality condition.

Now assume that the dataset is augmented with a contaminated example (x̂, ŷ). The optimum on the contaminated dataset becomes θ̂*, which must satisfy
∇_θ J(θ̂*) + (1/m) ∇_θ l(x̂, ŷ, θ̂*) = 0.   (3.11)
The robustness of a loss function is essentially determined by the sensitivity of the optimum to the addition of a contaminated example, namely the difference between θ* and θ̂*. The two definitions of robustness that we consider are:

Definition 3.2.1 (Inspired by the influence function in [25]) For any dataset (x_1, y_1), ..., (x_m, y_m) and any (x̂, ŷ),
lim_{m→∞} θ̂* → θ*.

Definition 3.2.2 (Outlier proneness in [26]) For any dataset (x_1, y_1), ..., (x_m, y_m) and any (x̂, ŷ),
lim_{‖Φ(x̂)‖₂→∞} θ̂* → θ*,
where Φ(x) is a feature map from X to R^d.

Roughly speaking, Definition 3.2.1 states that a robust model should not be affected too much by changing a small portion of the data, and Definition 3.2.2 states that a robust model should ignore any extreme outliers. However, for non-convex losses it is very hard to characterize the difference between θ* and θ̂*, because the regularized risk may have multiple local minima (see Section 3.2.3). On the other hand, from (3.11) it is clear that a necessary condition for θ̂* → θ* is
(1/m) ‖∇_θ l(x̂, ŷ, θ*)‖₂ → 0.   (3.12)
Therefore, instead of working directly with Definitions 3.2.1 and 3.2.2, we investigate robustness through their necessary conditions, stated in Definitions 3.2.3 and 3.2.4 respectively.

Definition 3.2.3 For any x, y and θ, ‖∇_θ l(x, y, θ)‖₂ < ∞.

Definition 3.2.4 For any x, y and θ, lim_{‖Φ(x)‖₂→∞} ‖∇_θ l(x, y, θ)‖₂ = 0.

Robust Types

Since ‖∇_θ l(x, y, θ)‖₂ involves both θ and (x, y), it is more convenient to work with a function of the margin u = y⟨Φ(x), θ⟩ for visualization. To this end, we take the inner product between ∇_θ l(x, y, θ) and θ, and define a new function I_l(u):
⟨∇_θ l(x, y, θ), θ⟩ = ⟨l′(u) y Φ(x), θ⟩ = l′(u) u =: I_l(u),
where l(u) := l(x, y, θ), and l′(u) ≤ 0 for all losses of interest in this dissertation. The following two lemmas show that |I_l(u)| and ‖∇_θ l(x, y, θ)‖₂ define robustness almost equivalently. The proofs are provided in Appendix B.3 and B.4.

Lemma 3.2.1 If |I_l(u)| < ∞, then for any θ, x and y, the probability p(‖∇_θ l(x, y, θ)‖₂ < ∞) = 1. Furthermore, ‖∇_θ l(x, y, θ)‖₂ → ∞ if and only if the angle ψ between Φ(x) and θ is equal to π/2 and ‖Φ(x)‖₂ → ∞.

Lemma 3.2.2 If lim_{u→−∞} |I_l(u)| = 0, then for any θ, x and y, the probability
p(lim_{‖Φ(x)‖₂→∞} ‖∇_θ l(x, y, θ)‖₂ = 0) = 1.
Furthermore, lim_{‖Φ(x)‖₂→∞} ‖∇_θ l(x, y, θ)‖₂ ≠ 0 only if the angle ψ between Φ(x) and θ is equal to π/2.

Fig. 3.2. An illustration of I_l(u) for the three robust types: logistic (Type-0), t-logistic (Type-I), and Savage (Type-II). All three types behave similarly for u > 0. When u → −∞, the Type-0 loss goes to +∞, the Type-I loss goes to a constant, and the Type-II loss goes to 0.

Since all the commonly used losses are continuously defined on u ∈ R ∪ {±∞}, we have |l′(u)| < ∞ for |u| < ∞; therefore |I_l(u)| can only become unbounded as u → −∞. Based on lim_{u→−∞} |I_l(u)|, we classify binary losses into three robust types:
• Robust Loss 0 (Type-0): lim_{u→−∞} |I_l(u)| = ∞.
• Robust Loss I (Type-I): 0 < lim_{u→−∞} |I_l(u)| < ∞.
• Robust Loss II (Type-II): lim_{u→−∞} |I_l(u)| = 0.
An illustration of the three types of binary losses is provided in Figure 3.2, and in Table 3.1 we classify some binary losses by their robust type. It is easy to verify that all convex losses belong to Type-0; further verifications are provided in Appendix B.9. In particular, this classification separates t-logistic regression (Type-I) from Type-0 losses such as logistic regression, as well as from Type-II non-convex losses such as Savage loss. In later experiments, we will empirically compare these different types of losses.
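These limits are easy to check numerically for the closed-form losses. The sketch below evaluates I_l(u) = l′(u)·u at a large negative margin for the logistic (Type-0), sigmoid, and Savage (Type-II) losses, using derivatives we worked out by hand; the t-logistic case is omitted because its loss involves G_t. The function names are ours.

```python
import numpy as np

# I_l(u) = l'(u) * u for several margin losses (derivatives computed by hand)
def I_logistic(u):
    # l(u) = log(1 + exp(-u)),  l'(u) = -1 / (1 + exp(u))
    return -u / (1.0 + np.exp(u))

def I_sigmoid(u):
    # l(u) = 2 / (1 + exp(u)),  l'(u) = -2 exp(u) / (1 + exp(u))^2
    return -2.0 * u * np.exp(u) / (1.0 + np.exp(u)) ** 2

def I_savage(u):
    # l(u) = 4 / (1 + exp(u))^2,  l'(u) = -8 exp(u) / (1 + exp(u))^3
    return -8.0 * u * np.exp(u) / (1.0 + np.exp(u)) ** 3

u = -50.0                        # a badly misclassified (outlier) margin
print(abs(I_logistic(u)))        # grows without bound as u -> -inf (Type-0)
print(abs(I_sigmoid(u)))         # vanishes (Type-II)
print(abs(I_savage(u)))          # vanishes (Type-II)
```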


Table 3.1
The robustness of some loss functions for binary classification based on I_l(u). The verifications are provided in Appendix B.9.

Name          Loss Function                                  Robust Type
Hinge         l(x, y, θ) = max(0, 1 − u)                     0
Exponential   l(x, y, θ) = exp(−u)                           0
Logistic      l(x, y, θ) = log(1 + exp(−u))                  0
T-logistic    l(x, y, θ) = −log exp_t(u/2 − G_t(u))          I
Probit        l(x, y, θ) = 1 − erf(u)                        II
Ramp          l(x, y, θ) = min(2, max(0, 1 − u))             II
Savage        l(x, y, θ) = 4/(1 + exp(u))²                   II
Sigmoid       l(x, y, θ) = 2/(1 + exp(u))                    II

3.2.3 Multiple Local Minima

One of the key disadvantages of non-convex losses is that their empirical risk may have multiple local minima. To illustrate this, we used a two-dimensional toy dataset containing 50 points drawn uniformly from [−2, 2] × [−2, 2]. In Figure 3.3 we plot the empirical risk of the t-logistic loss alongside that of Savage loss. As can be seen, Savage loss yields a highly non-convex objective function with a large number of local optima. In contrast, even though we are averaging over non-convex loss functions, the resulting objective of t-logistic regression has a single global optimum. This behavior persists when we use different random samples, change the sampling scheme, or vary the number of data points. Moving on to higher-dimensional datasets such as Adult, USPS, and Web8³, we initialize the algorithm with different randomly chosen starting points and inspect the solutions obtained. The algorithm always arrives at the same solution (within numerical precision) [27].

³All available from http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/.

Fig. 3.3. The empirical risk of t-logistic regression (upper) and Savage loss (lower) on a toy two-dimensional dataset. t-logistic regression appears to be easier to optimize than Savage loss.

This interesting behavior once led us to conjecture that t-logistic regression has only one local minimum, which is the global minimum. However, in the next theorem we show that this conjecture is wrong: for any non-convex loss function, under some mild conditions, one can always construct a dataset whose empirical risk has multiple local minima. To the best of our knowledge, all existing non-convex losses satisfy these conditions. The following theorem considers the case of a 1-dimensional feature; the generalization to the multi-dimensional setting is straightforward. The proof of the theorem is in Appendix B.5.

Theorem 3.2.3 Consider a loss function l(u) := l(x, y, θ) that is smooth at u := yθx = 0. If l′(0) < 0, and there exist u_1 < 0, u_2 > 0, and ε > 0 such that l′(u_i) ≥ l′(0) + ε for i = 1, 2, then there exists a dataset whose empirical risk R_emp(θ) has at least two local minima.

An interesting observation is that these local minima are related to the robustness of non-convex losses. To see this, consider the following 1-dimensional example, which includes 30 clean data points with (x_i, y_i) = (1, 1) and one outlier with (x, y) = (−200, 1). We plot the empirical risk as a function of θ for both logistic regression and t-logistic regression in Figure 3.4. Before the outlier is added, the logistic loss (red dashed) and the t-logistic loss (purple solid) yield the same optimum. Once the outlier is added, however, the optimum of logistic regression is severely impacted by the outlier (blue dashed). t-logistic regression, on the other hand, although it acquires another local minimum, retains the same global optimum θ* as without the outlier (green solid).

3.3 Multiclass Classification

In this section, we extend t-logistic regression to multiclass classification, where the dataset consists of data points {x_1, ..., x_m} and corresponding labels {y_1, ..., y_m}, with y_i taking values in {1, ..., C}. Let us first briefly review multiclass logistic regression; the generalization to t-logistic regression is straightforward.

Fig. 3.4. Empirical risk R_emp(θ) of logistic regression and t-logistic regression on the one-dimensional example, trained on the clean data (subscript c) and with the outlier added (subscript o). The optimal solutions before and after adding the outlier are significantly different for logistic regression. In contrast, the global optimum of t-logistic regression stays the same.

In multiclass logistic regression, the conditional likelihood of a label y given x is p(y| x; θ) = exp(⟨Φ(x, y), θ⟩ − G(x; θ)), where
Φ(x, y) = (0, ..., 0, Φ(x), 0, ..., 0),   θ = (θ_1, ..., θ_C),
with Φ(x): X → R^d placed in the y-th of the C blocks. Here 0 denotes the d-dimensional all-zero vector, and θ is a (d × C)-dimensional vector. Therefore,
p(y| x; θ) = exp(⟨Φ(x), θ_y⟩ − G(x; θ)),   (3.13)
where the log-partition function is
G(x; θ) = log Σ_{c=1}^C exp(⟨Φ(x), θ_c⟩).   (3.14)
The multiclass logistic loss is the negative log-likelihood of a point (x, y), which equals
l(x, y, θ) = −log p(y| x; θ) = log Σ_{c=1}^C exp(⟨Φ(x), θ_c − θ_y⟩).
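Equations (3.13)-(3.14) are the familiar softmax model, and the multiclass logistic loss can be computed stably with a log-sum-exp. A minimal sketch, assuming the identity feature map Φ(x) = x and toy numbers of our own:

```python
import numpy as np
from scipy.special import logsumexp

def multiclass_logistic_loss(theta, x, y):
    """-log p(y|x; theta) with theta of shape (C, d) and Phi(x) = x."""
    a = theta @ x              # a_c = <Phi(x), theta_c>
    G = logsumexp(a)           # log-partition function (3.14), computed stably
    return G - a[y]            # = log sum_c exp(a_c - a_y)

theta = np.array([[0.2, -0.1],
                  [0.0,  0.3],
                  [-0.4, 0.1]])   # C = 3 classes, d = 2 features
x = np.array([1.0, 2.0])
print(multiclass_logistic_loss(theta, x, y=1))
```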

The main idea of multiclass t-logistic regression is the same as in the binary case. The conditional likelihood of a data point (x, y) is modeled by a conditional t-exponential family of distributions (t > 1):
p(y| x, θ) = exp_t(⟨Φ(x, y), θ⟩ − G_t(x; θ)) = exp_t(⟨Φ(x), θ_y⟩ − G_t(x; θ)),   (3.15)
where the log-partition function G_t satisfies
Σ_{c=1}^C exp_t(⟨Φ(x), θ_c⟩ − G_t(x; θ)) = 1.   (3.16)
Letting â_c = ⟨Φ(x), θ_c⟩ and G_t(â) = G_t(x; θ), we can simplify (3.16) as
Σ_{c=1}^C exp_t(â_c − G_t(â)) = 1.


Algorithm 2: Iterative algorithm for computing G_t in multiclass t-logistic regression.
Input: â
Output: G_t(â)
â* ← max(â);
ã ← â − â*;
while ã not converged do
    Z(ã) ← Σ_{c=1}^C exp_t(ã_c);
    ã ← Z(ã)^{1−t} (â − â*);
end
G_t(â) ← −log_t(1/Z(ã)) + â*;
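To make the iteration concrete, here is a small Python sketch of Algorithm 2 for t = 1.5, using the standard definitions exp_t(x) = [1 + (1 − t)x]_+^{1/(1−t)} and log_t(x) = (x^{1−t} − 1)/(1 − t). The tolerance and iteration cap are our own choices, not from the dissertation.

```python
import numpy as np

def exp_t(x, t):
    # exp_t(x) = [1 + (1 - t) x]_+ ^ (1 / (1 - t))
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def log_t(x, t):
    # log_t(x) = (x^(1 - t) - 1) / (1 - t), the inverse of exp_t on its range
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def compute_Gt(a_hat, t=1.5, tol=1e-12, max_iter=1000):
    """Algorithm 2: fixed-point iteration for the multiclass log-partition G_t."""
    a_star = np.max(a_hat)              # shift by the maximum for stability
    a_tilde = a_hat - a_star
    for _ in range(max_iter):
        Z = exp_t(a_tilde, t).sum()
        a_new = Z ** (1.0 - t) * (a_hat - a_star)
        if np.max(np.abs(a_new - a_tilde)) < tol:
            a_tilde = a_new
            break
        a_tilde = a_new
    Z = exp_t(a_tilde, t).sum()
    return -log_t(1.0 / Z, t) + a_star

# Sanity check: the resulting G_t normalizes the conditional distribution (3.16)
a = np.array([1.0, -0.5, 2.0])
G = compute_Gt(a, t=1.5)
print(exp_t(a - G, 1.5).sum())   # close to 1.0 at the fixed point
```

At a fixed point ã = Z^{1−t}(â − â*), the identity exp_t(log_t(a) + x a^{1−t}) = a exp_t(x) shows that exp_t(â_c − G_t) = exp_t(ã_c)/Z, so the C probabilities sum to one, which is what the check above confirms.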


Table 3.2
Average time (in milliseconds) spent by our iterative scheme and fsolve in computing Ĝ_t(â) for multiclass t-logistic regression.

C           10    20    30    40    50    60    70    80    90    100
fsolve      8.1   8.3   8.1   8.7   9.6   9.8   10.0  10.2  10.3  10.7
iterative   0.3   0.3   0.3   0.4   0.4   0.4   0.3   0.5   0.3   0.4

An iterative algorithm generalizing the one used in binary classification is applied to compute G_t(â); it is described in Algorithm 2. In practice, Algorithm 2 scales well with the number of classes C, making it efficient enough for problems involving a large number of classes. To illustrate, we let C ∈ {10, 20, ..., 100}, randomly generate â ∈ [−10, 10]^C, and compute the corresponding Ĝ_t(â). We compare the time spent estimating Ĝ_t(â) by the iterative scheme and by calling the Matlab fsolve function, averaged over 100 random draws, using Matlab 7.1 on a 2.93 GHz dual-core CPU. The results are presented in Table 3.2.

For a data point (x, y), the partial derivative of the multiclass t-logistic loss with respect to θ_n, where n ∈ {1, ..., C}, is
−∂/∂θ_n log p(y| x; θ) = −∂/∂θ_n log exp_t(⟨Φ(x), θ_y⟩ − G_t(x; θ))
= −(δ(y = n)Φ(x) − E_q[Φ(x, y)]) p(y| x; θ)^{t−1}
= −Φ(x) · (δ(y = n) − Σ_{c=1}^C δ(c = n) q(c| x; θ)) p(y| x; θ)^{t−1}
= −Φ(x) · (δ(y = n) − q(n| x; θ)) p(y| x; θ)^{t−1},   (3.17)
where q(n| x; θ) = p(n| x; θ)^t / Σ_{c=1}^C p(c| x; θ)^t. The gradient of (x, y) in (3.17) contains a forgetting variable ξ = p(y| x; θ)^{t−1}. Just as in binary classification, when t > 1 the influence of points with low likelihood p(y| x; θ) is capped by the ξ variable.
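The gradient formula (3.17) can be sanity-checked against numerical differentiation. In the sketch below we recover G_t by scalar root finding on the normalization constraint (3.16), a simple stand-in for Algorithm 2, and compare the analytic gradient with central finite differences; all toy numbers, brackets, and function names are ours.

```python
import numpy as np
from scipy.optimize import brentq

t = 1.5

def exp_t(x):
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def G_t(a):
    # Solve sum_c exp_t(a_c - G) = 1 for G; the root lies above max(a)
    f = lambda G: exp_t(a - G).sum() - 1.0
    return brentq(f, a.max(), a.max() + 10.0 * len(a))

def loss(theta, x, y):
    a = theta @ x
    return -np.log(exp_t(a - G_t(a))[y])        # -log p(y|x; theta)

def grad(theta, x, y):
    a = theta @ x
    p = exp_t(a - G_t(a))                       # p(c|x; theta)
    q = p**t / (p**t).sum()                     # escort distribution q(c|x; theta)
    xi = p[y] ** (t - 1.0)                      # forgetting variable
    delta = np.eye(len(a))[y]
    return -np.outer(delta - q, x) * xi         # (3.17), one row per theta_n

rng = np.random.default_rng(0)
theta = 0.1 * rng.standard_normal((3, 2))
x = np.array([0.5, -1.0]); y = 2

# central finite differences on each entry of theta
num = np.zeros_like(theta); eps = 1e-6
for n in range(3):
    for j in range(2):
        tp = theta.copy(); tp[n, j] += eps
        tm = theta.copy(); tm[n, j] -= eps
        num[n, j] = (loss(tp, x, y) - loss(tm, x, y)) / (2 * eps)

print(np.abs(grad(theta, x, y) - num).max())    # small, up to finite-difference error
```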

The definition of Bayes-risk consistency for multiclass classification losses was first discussed in [28]. As one can easily verify, multiclass t-logistic regression is also Bayes-risk consistent (see Appendix B.10 for the verification).

3.4 Optimization Methods

In this section, let us consider some practical issues, including how to optimize the objective function of t-logistic regression. The most straightforward approach is a gradient-based method such as L-BFGS (please refer to Section A.2 for more details). In particular, the gradient of t-logistic regression is given in (3.6) for binary classification and (3.17) for multiclass classification. Although in our experiments the algorithm converged every time using the L-BFGS solver, it is important to note that L-BFGS carries no convergence guarantee on non-convex objective functions. In the remainder of the section, we provide a different approach which is guaranteed to converge. For clarity, we discuss how to optimize the empirical risk; the regularized risk can be optimized in a similar way, as applied in [27].

3.4.1 Convex Multiplicative Programming

For t > 1, instead of directly minimizing R_emp(θ) = −log p(y | X, θ), one can equivalently minimize the objective function p(y | X, θ)^{1−t}:
P(θ) := p(y | X; θ)^{1−t} = Π_{i=1}^m p(y_i | x_i; θ)^{1−t}   (3.18)
       = Π_{i=1}^m (1 + (1 − t)(⟨Φ(x_i, y_i), θ⟩ − G_t(x_i; θ))),   (3.19)
where we denote the i-th factor by l_i(θ). Since t > 1 and G_t(x_i; θ) is convex, it is easy to see that each component l_i(θ) is positive and convex. Therefore, P(θ) is a product of positive convex functions l_i(θ); minimizing such a function is called convex multiplicative programming [29].

The optimal solutions of (3.19) can be obtained by solving the following parametric problem (see Theorem 2.1 of [29]):
min_ζ min_θ MP(θ, ζ) := Σ_{i=1}^m ζ_i l_i(θ)   s.t. ζ > 0, Π_{i=1}^m ζ_i ≥ 1.   (3.20)

Exact algorithms have been proposed for solving (3.20) (for instance, [29]). However, the computational cost of these algorithms grows exponentially with m, which makes them impractical for our purposes. Instead, we apply the following block coordinate descent method, the ζ-θ algorithm, whose main idea is to minimize (3.20) with respect to θ and ζ separately.

ζ-Step: Assume that θ is fixed, and denote l̃_i = l_i(θ) to rewrite (3.20) as
min_ζ MP(θ, ζ) = min_ζ Σ_{i=1}^m ζ_i l̃_i   s.t. ζ > 0, Π_{i=1}^m ζ_i ≥ 1.   (3.21)

Since the objective function is linear in ζ and the feasible region is a convex set, (3.21) is a convex optimization problem. Introducing a non-negative Lagrange multiplier γ ≥ 0, the Lagrangian and its partial derivative with respect to ζ_{i′} are
L(ζ, γ) = Σ_{i=1}^m ζ_i l̃_i + γ (1 − Π_{i=1}^m ζ_i),   (3.22)
∂L(ζ, γ)/∂ζ_{i′} = l̃_{i′} − γ Π_{i≠i′} ζ_i.   (3.23)
Setting the derivative to 0 gives γ = l̃_{i′} / Π_{i≠i′} ζ_i; since l̃_{i′} > 0, γ cannot be 0. By the K.K.T. conditions [3], Π_{i=1}^m ζ_i = 1, which in turn implies γ = l̃_{i′} ζ_{i′}, or
(ζ_1, ..., ζ_m) = (γ/l̃_1, ..., γ/l̃_m),   with γ = Π_{i=1}^m l̃_i^{1/m}.   (3.24)
There is an obvious connection between ζ_i and the forgetting variable ξ_i, because ζ_i ∝ 1/l̃_i = p(y_i | x_i, θ)^{t−1} = ξ_i.
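The closed-form ζ-step (3.24) is straightforward to implement. The sketch below computes γ as the geometric mean of the l̃_i (via a mean of logs for numerical stability) and checks the K.K.T. condition Π ζ_i = 1; the sample values of l̃ and the function name are ours.

```python
import numpy as np

def zeta_step(l_tilde):
    """Closed-form zeta-step (3.24): zeta_i = gamma / l_tilde_i,
    where gamma is the geometric mean of the l_tilde_i."""
    log_gamma = np.mean(np.log(l_tilde))        # log of prod_i l_tilde_i^(1/m)
    return np.exp(log_gamma - np.log(l_tilde))  # gamma / l_tilde_i

l_tilde = np.array([0.5, 2.0, 1.5, 0.8])        # positive convex factors l_i(theta)
zeta = zeta_step(l_tilde)
print(np.prod(zeta))                            # 1, the K.K.T. condition
```

Note that the point with the smallest l̃_i (i.e. the highest likelihood) receives the largest weight ζ_i, which is the connection to the forgetting variable noted above.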

θ-Step: In this step we fix ζ and solve for the optimal θ. This step is essentially the same as logistic regression, except that each component now carries a weight ζ_i:
min_θ MP(θ, ζ) = min_θ Σ_{i=1}^m ζ_i l_i(θ),   (3.25)
and the gradient is
∇_θ MP(θ, ζ) = (1 − t) Σ_{i=1}^m ζ_i (Φ(x_i, y_i) − E_q[Φ(x_i, y_i)]).   (3.26)

The gradient in (3.26) is very similar to the gradients in (3.6) and (3.17). The main difference is that the ζ-θ algorithm computes ζ and θ in two separate steps, while the gradient-based method computes ξ and θ in one step. The advantage of the ζ-θ algorithm, however, is its convergence guarantee, as shown in the following theorem; the proof is provided in Appendix B.6.

Theorem 3.4.1 The ζ-θ algorithm converges to a stationary point of the convex multiplicative programming problem.

3.5 Empirical Evaluation

We used 26 publicly available binary classification datasets and 9 multiclass classification datasets, and focused our study on two aspects: the generalization performance under various noise models, and the stability of the solution under random initialization. As comparators we use logistic regression and Savage loss⁴. Our main observation from these extensive empirical experiments is that t-logistic regression is more robust than logistic regression when the dataset is mixed with label noise. On the other hand, compared to Savage loss, which often gets stuck in different local minima under random initialization, t-logistic regression appears to be much more stable. These two observations make t-logistic regression an attractive algorithm for classification.

Datasets  Table 3.3 summarizes the binary classification datasets used in our experiments. adult9, astro-ph, news20, real-sim, reuters-c11, and reuters-ccat are from the same source as in [30]. aut-avn is from Andrew McCallum's home page⁵,

⁴The multiclass Savage loss is defined as
l(x, y; θ) = Σ_{c=1}^C (δ(y = c) − exp(⟨Φ(x), θ_c⟩) / Σ_{c′=1}^C exp(⟨Φ(x), θ_{c′}⟩))².
⁵http://www.cs.umass.edu/˜mccallum/data/sraa.tar.gz

covertype is from the UCI repository [31], worm is from [32], and kdd99 is from KDD Cup 1999⁶, while web8, webspam-u, webspam-t⁷, as well as kdda and kddb⁸, are from the LibSVM binary data collection⁹. The alpha, beta, delta, dna, epsilon, gamma, ocr and zeta datasets were obtained from the Pascal Large Scale Learning Workshop website [33]. measewyner is a synthetic dataset proposed in [34]: the input x is a 20-dimensional vector where each coordinate is uniformly distributed on [0, 1], and the label y is +1 if Σ_{j=1}^5 x_j ≥ 2.5 and −1 otherwise. Table 3.4 summarizes the multiclass classification datasets. For the dna and ocr binary classification datasets, we used the same training and testing partition as in [35] (80% for training and 20% for testing). For all other datasets, we used 70% of the labeled data for training and the remaining 30% for testing. In all cases, we added a constant feature as a bias.

Optimization algorithms  We optimize the empirical risk with an L2 regularizer using L-BFGS. We implemented all the loss functions using PETSc and TAO, which allow efficient use of large-scale parallel linear algebra, and used the Limited Memory Variable Metric (lmvm) variant of L-BFGS implemented in TAO. The optimization terminates when the decrease in the objective function value and the norm of the gradient are both less than 10⁻¹⁰, or when the maximum of 1000 function evaluations is reached.

Implementation and Hardware  All experiments are conducted on the Rossmann computing cluster at Purdue University, where each node has two 2.1 GHz 12-core AMD 6172 processors with 48 GB of physical memory. We ran our algorithms with 4 cores on one single

⁶http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
⁷webspam-u is the webspam-unigram and webspam-t is the webspam-trigram dataset. The original dataset can be found at http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html
⁸These datasets were derived from KDD Cup 2010. kdda is the first problem, algebra_2008_2009, and kddb is the second problem, bridge_to_algebra_2008_2009.
⁹http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/binary.html

Table 3.3
Summary of the binary classification datasets used in our experiments. n is the total # of examples, d is the # of features, and n+ : n− is the ratio of the number of positive vs. negative examples. M denotes a million. ∗ denotes datasets that are unused in Chapter 6 due to high computational costs.

dataset        n          d          n+ : n−
adult9         48,842     123        0.32
alpha          500,000    500        1.00
astro-ph       94,856     99,757     0.31
aut-avn        71,066     20,707     1.84
beta           500,000    500        1.00
covertype      581,012    54         0.57
delta          500,000    500        1.00
dna∗           50.00 M    800        3e−3
epsilon∗       500,000    2000       1.00
gamma          500,000    500        1.00
kdd99          5.21 M     127        4.04
kdda∗          8.92 M     20.22 M    5.80
kddb∗          20.01 M    29.89 M    6.18
longservedio   2000       21         1.00
measewyner     2000       20         1.00
mushrooms      8124       112        1.07
news20         19,954     7.26 M     1.00
ocr∗           3.50 M     1156       0.96
real-sim       72,201     2.97 M     0.44
reuters-c11    804,414    1.76 M     0.03
reuters-ccat   804,414    1.76 M     0.90
web8           59,245     300        0.03
webspam-t∗     350,000    16.61 M    1.54
webspam-u      350,000    254        1.54
worm           1.03 M     804        0.06
zeta           500,000    800.4 M    1.00

node for all datasets, except the dna and ocr binary classification datasets, where we used 16 cores across 16 nodes with 30 GB of memory per node.

3.5.1 Noise Models

One of the main objectives of our experiments is to test the robustness of the classification algorithms under different label-noise models. We therefore implement the following three noise models. For binary classification, they are generated as follows, using a flipping constant ρ ∈ [0, 1]:


Table 3.4
Summary of the multiclass classification datasets used in our experiments. n is the total # of examples, d is the # of features, and nc is the # of classes. ∗ denotes datasets that are unused in Chapter 6 due to high computational costs.

dataset          n         d        nc
dna              2,586     182      3
letter           15,500    18       26
mnist            70,000    782      10
protein          21,516    359      3
rcv1∗            534,130   47,238   52
sensitacoustic   98,528    52       3
sensitcombined   98,528    102      3
sensitseismic    98,528    52       3
usps             9298      258      10

Uniform Noise (Noise-1)  In the noise-1 model, we uniformly flip the label of each training example with probability ρ (see Algorithm 3).

Algorithm 3: Algorithm for generating the noise-1 model.
Input: Dataset (X, Y) := {(x_i, y_i)}, i = 1, ..., m.
Output: Dataset (X, Ŷ) := {(x_i, ŷ_i)}, i = 1, ..., m.
for i = 1, ..., m do
    rand = Uni[0, 1];
    if rand < ρ then ŷ_i = −y_i;
end

Unbalanced Noise (Noise-2)  The noise-2 model generates unbalanced label noise; that is, we flip only the labels of negatively labeled training examples, with probability ρ (see Algorithm 4).

Algorithm 4: Algorithm for generating the noise-2 model.
Input: Dataset (X, Y) := {(x_i, y_i)}, i = 1, ..., m.
Output: Dataset (X, Ŷ) := {(x_i, ŷ_i)}, i = 1, ..., m.
for i = 1, ..., m do
    rand = Uni[0, 1];
    if y_i < 0 and rand < ρ then ŷ_i = 1;
end
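The two flipping schemes above can be sketched in a few lines of vectorized code; the function names below are ours, and labels are assumed to be in {−1, +1}:

```python
import numpy as np

def noise1(y, rho, rng):
    """Uniform noise: flip each binary label {-1, +1} with probability rho."""
    flip = rng.uniform(size=y.shape) < rho
    return np.where(flip, -y, y)

def noise2(y, rho, rng):
    """Unbalanced noise: flip only negative labels, with probability rho."""
    flip = (y < 0) & (rng.uniform(size=y.shape) < rho)
    return np.where(flip, 1, y)

rng = np.random.default_rng(0)
y = np.array([1, -1, 1, -1, -1, 1])
print(noise1(y, 1.0, rng))   # rho = 1 flips every label
print(noise2(y, 1.0, rng))   # rho = 1 flips only the negative labels
```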

Unbalanced Large-Margin Noise (Noise-3)  The noise-3 model is intended to generate large-margin outliers; again, only examples with negative labels are flipped. To estimate the margin of each example, we first run logistic regression on the clean dataset, and then flip labels with a probability that favors the large-margin examples (see Algorithm 5).

Algorithm 5: Algorithm for generating the noise-3 model.
Input: Dataset (X, Y) := {(x_i, y_i)}, i = 1, ..., m.
Output: Dataset (X, Ŷ) := {(x_i, ŷ_i)}, i = 1, ..., m.
Train θ by running logistic regression on (X, Y) for 30 iterations;
for i = 1, ..., m do Compute u_i = y_i ⟨Φ(x_i), θ⟩; end
Compute u_max = max_i {u_i};
for i = 1, ..., m do Compute ũ_i = u_i − u_max; end
Compute ũ_min = min_i {ũ_i};
for i = 1, ..., m do Compute b_i = exp(−10·ũ_i / ũ_min); end
Compute Z = Σ_i exp(−10·ũ_i / ũ_min);
for i = 1, ..., m do
    rand = Uni[0, 1];
    if y_i < 0 and rand · Z < m · b_i · ρ then ŷ_i = 1;
end

For multiclass classification, in the noise-1 model we assign y_i to a uniformly random new label with probability ρ. In the noise-2 and noise-3 models, we only change the labels of examples with y_i ≤ C/2, where C is the total number of classes; the newly assigned label is drawn uniformly from (C/2, C].

3.5.2 Experiment Design

Since most of our datasets contain a large number of features, we used the identity feature map Φ(x) = x throughout the experiments. For t-logistic regression, we set t = 1.5.

Generalization Performance  Our first experiment compares the test error of the three algorithms under the three noise models with ρ ∈ {0.00, 0.05, 0.10}. We split the training set into 5 partitions for 5-fold cross validation. The candidate regularization constants are λ ∈ {10⁻², 10⁻⁴, ..., 10⁻¹⁰}, and the one that performs best on average across the validation sets is chosen. The model parameters in this experiment are initialized to all zeros.

Random Initialization  The second experiment is intended to compare the stability of the non-convex losses. In particular, we test whether the non-convex losses get stuck in different local minima when the model parameters are initialized differently. We use the regularization constant chosen by 5-fold cross validation in the previous experiment and pick one of the five folds for training. To obtain random initializations, each entry of the model parameter vector is drawn uniformly from [−10, 10]. The mean and standard deviation of the test error are reported over nine random initializations and one all-zero initialization. For the dna and ocr datasets, due to the large computational cost, we only report the generalization performance with the all-zero initialization; for these we do not split the training set for cross validation, but train on the entire training set with λ = 10⁻¹⁰.

3.5.3 Results

In Figures C.1 to C.35, we plot the performance of the three algorithms under the three noise models from left to right; each figure shows the performance on one dataset. For each noise model, we report the test error of the algorithms with ρ = 0.00 (blue), 0.05 (red), and 0.10 (yellow). On the first row of each figure, we report the test error of the three algorithms using 5-fold cross validation with the optimal λ for that dataset. For large values of λ (e.g. λ ∈ {10⁻², 10⁻⁴}), the test performance of the algorithms is mostly inferior. This is because most of the datasets used in our experiments contain a large number of examples and therefore require very mild regularization. On the other hand, datasets with higher noise tend to require stronger regularization. For instance, for binary classification with ρ = 0.00, the distribution of the optimal λ over [10⁻², 10⁻⁴, 10⁻⁶, 10⁻⁸, 10⁻¹⁰] is [7, 18, 90, 43, 58], while it becomes [16, 36, 82, 36, 46] for ρ = 0.10.

To give a quick overview of the robustness gains of t-logistic regression, Tables 3.5 and 3.6 summarize the number of datasets on which the test-error difference between logistic regression and t-logistic regression is significant, for the three noise models with ρ ∈ {0.00, 0.05, 0.10}. Across the binary classification datasets (Table 3.5), t-logistic regression performs better when label noise is added. In particular, when ρ = 0.05, t-logistic regression has a significant advantage in 48 cases, while logistic regression has only 5. When ρ = 0.10, the advantage of t-logistic regression shrinks, but it is still better in 42 cases versus 12 for logistic regression. In multiclass classification (Table 3.6), the advantage of t-logistic regression is even more salient. Savage loss appears to be even more robust than the t-logistic loss on a few datasets; however, it is unstable under random initialization of the model parameters.
On the second row of each figure, we report the test errors when the model parameters are randomly initialized. The performance of Savage loss fluctuates on more than half of the datasets. In contrast, the t-logistic loss converges to similar results on all datasets except longservedio; empirically, t-logistic regression therefore appears to be very stable. Logistic regression always converges to a similar result regardless of the initialization, because of its convexity.

Table 3.5
The number of binary classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table gives the number of datasets where logistic regression is significantly better; the right part gives the number where t-logistic regression is significantly better. The total number of datasets is 24 (the dna and ocr datasets are excluded).

           Logistic                t = 1.5
           0.00  0.05  0.10        0.00  0.05  0.10
Noise-1     4     1     6           5    15    11
Noise-2     4     2     4           5    16    17
Noise-3     4     2     2           5    17    14

Table 3.6
The number of multiclass classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table gives the number of datasets where logistic regression is significantly better; the right part gives the number where t-logistic regression is significantly better. The total number of datasets is 9.

           Logistic                t = 1.5
           0.00  0.05  0.10        0.00  0.05  0.10
Noise-1     1     1     1           5     6     8
Noise-2     1     1     1           5     7     6
Noise-3     1     0     0           5     6     7

In order to highlight the main difference between t-logistic regression and logistic regression, on the third row we plot the distribution of the forgetting variable ξ of t-logistic regression on one of the five folds with ρ = 0.10, plotting the points with noisy labels in red and the remaining points in blue. Recall that ξ denotes the influence of a point. In most cases, we observe that the ξ of the noisy data is smaller than that of the clean data, which indicates that the algorithm effectively identifies these noisy points and caps their influence.

Detailed Discussion on Selected Datasets

On the alpha dataset (Figure C.2), the performance of the three algorithms is close in the noise-1 model. However, in the noise-3 model, the test error of logistic regression rises by about 1.1% from the 21.9% of the clean dataset (ρ = 0.10). Although t-logistic regression also suffers from the unbalanced label noise, its test-error rise is only about 0.6%. Similar phenomena are observed on the astro-ph (Figure C.3), delta (Figure C.7), epsilon (Figure C.8), and gamma (Figure C.9) datasets. To understand why t-logistic regression works better, it is helpful to examine its ξ-distribution on the bottom row: in the noise-3 model on these datasets, the mean of the distribution of the forgetting variable ξ for the noisy points is much smaller than that for the clean points, so the influence of those noisy examples is capped. Savage loss also works well on these four datasets when the model parameters are initialized to all zeros, but its performance sometimes fluctuates under random initialization, e.g. on the delta and gamma datasets.

On the covertype dataset (Figure C.6), t-logistic regression has better test performance with or without label noise in the noise-1 and noise-2 models. The reason t-logistic regression works better even on the clean dataset may be that the original dataset is mixed with outliers. The generalization performance of Savage loss is comparable to that of t-logistic regression, but it is unstable under random initialization. In the noise-3 model, the performance of all three algorithms becomes worse; for t-logistic regression, the ξ-distribution indicates that the influence of half of the noisy examples is not successfully capped.

On the three KDD datasets, kdd99 (Figure C.10), kdda (Figure C.11), and kddb (Figure C.12), the number of positive labels is a few times larger than the number of negative labels.
Therefore, it appears that all the algorithms perform better in the noise-2 and noise-3 models, since the latter contain much fewer noisy examples. t-logistic regression outperforms logistic regression on the kdd99 and kdda datasets. However, somewhat surprisingly, it performs worse on the kddb dataset, although its ξ-distribution looks similar to that of the kdda dataset. On the longservedio dataset (Figure C.13), all three algorithms are able to perfectly classify the examples when ρ = 0.00. In the noise-1 and noise-2 models, t-logistic regression is clearly more robust against the label noise than logistic regression. In particular, we observe four distinct spikes in the ξ-distribution. From left to right, the first spike corresponds to the noisy large-margin examples, the second to the noisy pullers, the third to the clean pullers, and the rightmost to the clean large-margin examples. Logistic regression, on the other hand, is unable to discriminate between clean and noisy training samples, which leads to its poor performance. In the noise-3 model, more large-margin examples are flipped. Although the overall number of noisy examples may be smaller, the impact of these noisy examples is actually strengthened. Not only does logistic regression perform poorly in this case; even t-logistic regression performs much worse. The ξ-distribution of the noise-3 model is clearly different from that of the noise-1 and noise-2 models, as there is no spike representing the noisy large-margin examples. Furthermore, the flipped large-margin examples clearly create multiple local minima in the empirical risk of t-logistic regression, as its performance fluctuates with random initialization. The measewyner dataset (Figure C.14) is another dataset where t-logistic regression demonstrates a clear edge over logistic regression. Here t-logistic regression outperforms logistic regression in all three noise models.
One can clearly see from the ξ-distribution that all the red bars lie to the left of the blue bars. The distance between the red bars and the blue bars is even larger in the noise-2 and noise-3 models. Similar phenomena are observed on the ocr (Figure C.26), reuters-ccat (Figure C.19), webspamunigram (Figure C.22), worm (Figure C.23), and zeta (Figure C.24) datasets. The multiclass datasets seem to give a clearer edge to t-logistic regression. On the letter (Figure C.28), sensitacoustic (Figure C.32), sensitcombined (Figure

C.33), and sensitseismic (Figure C.34) datasets, the test performance of t-logistic regression is significantly better than logistic regression even without adding label noise. It is therefore reasonable to conjecture that mislabeling is more likely to occur in the multiclass datasets. On the dna (Figure C.27) and protein (Figure C.30) datasets, t-logistic regression is comparable to or slightly better than logistic regression. On the mnist (Figure C.29) and usps (Figure C.35) datasets, t-logistic regression performs much better when label noise is added. As is to be expected in such an extensive empirical evaluation, there are a few anomalies. On the aut-avn (Figure C.4), dna (Figure C.25), real-sim (Figure C.17), and webspamtrigram (Figure C.21) datasets, logistic regression has the best test accuracy in some of the noise models, although the ξ-distribution indicates that t-logistic regression caps the influence of the noisy examples. On the news20 dataset (Figure C.16), the ξ variables of t-logistic regression are almost identical for all examples, which makes its performance close to, or not as good as, logistic regression. On the beta dataset (Figure C.5), the test error of all the algorithms is around 50%.

CPU Time Comparison

One of the drawbacks of the t-exponential family is that there is no closed-form solution for the log-partition function. The main additional cost of t-logistic regression is the iterative numerical method used to compute Gt(xi; θ) for each example (xi, yi), which may impair the efficiency of the algorithm. In order to compare the time efficiency of the algorithms, we report timing results for the noise-3 model with ρ = 0.1, containing the total CPU time spent as well as the average CPU time per function evaluation, in Table E.1 and Table E.2. It is not surprising that t-logistic regression takes longer to train than logistic regression and the Savage loss on most of the datasets. When the number of samples is significantly larger than the dimension, e.g., on the covertype and kdd99 datasets, computing Gt(xi; θ) becomes the primary bottleneck of t-logistic regression and the loss in time efficiency is larger.
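The dissertation computes Gt(xi; θ) with an iterative numerical method whose details are not given here; the sketch below is one generic way to do it (bisection rather than whatever scheme the implementation actually used). It finds the t-log-partition G such that the t-exponentials of the shifted class activations sum to one; the function and activation values are illustrative only.

```python
import math

def exp_t(x, t):
    """t-exponential: exp_t(x) = [1 + (1 - t) x]_+ ** (1 / (1 - t)), for t != 1."""
    base = 1.0 + (1.0 - t) * x
    return 0.0 if base <= 0.0 else base ** (1.0 / (1.0 - t))

def log_partition_t(activations, t, tol=1e-12):
    """Find G such that sum_c exp_t(a_c - G) = 1, by bisection.

    The left-hand side is strictly decreasing in G, so bisection is safe.
    """
    lo, hi = min(activations), max(activations)
    # Widen the bracket until the root is enclosed.
    while sum(exp_t(a - lo, t) for a in activations) < 1.0:
        lo -= 1.0
    while sum(exp_t(a - hi, t) for a in activations) > 1.0:
        hi += 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(exp_t(a - mid, t) for a in activations) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# For two equal activations and t = 1.5, exp_t(-G) = 1/2 gives G = 2(sqrt(2) - 1).
G = log_partition_t([0.0, 0.0], t=1.5)
```

Because this root-finding loop runs once per example per function evaluation, it dominates the cost when the number of examples is large, which is consistent with the timing results above.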

3.6 Chapter Summary

In this chapter, we generalize logistic regression to t-logistic regression by using the t-exponential family of distributions. The new algorithm has a probabilistic interpretation and is more robust to label noise than logistic regression. We investigate the algorithm in both binary and multiclass classification. Although the loss function is non-convex, we show that t-logistic regression is Bayes-risk consistent and empirically stable against random initialization.


4. T-DIVERGENCE BASED APPROXIMATE INFERENCE

This chapter is devoted to using the t-exponential family of distributions in complicated models with a large number of random variables. From a computational perspective, the central issue is the efficient computation of the log-partition function Gt(θ). During the last decade, variational inference has become an important technique for dealing with large, intractable exponential family distributions, especially probabilistic graphical models. We first review the main idea of variational inference and introduce two well-known inference algorithms for the exponential family. We then extend them to the t-exponential family by defining a new t-entropy and a new t-divergence. Finally, the two approximate inference algorithms for the exponential family are generalized to the t-exponential family based on the t-divergence.

4.1 Variational Inference in Exponential Family of Distributions

One of the prominent applications of the exponential family of distributions is their use in modeling conditional independence between random variables via a graphical model. However, when the number of random variables is large and the underlying graph structure is complex, a number of computational issues need to be tackled in order to make inference feasible. The key challenge is the computation of the log-partition function. A number of inference techniques have been developed to solve this problem approximately. Two prominent approximate inference techniques are Markov Chain Monte Carlo (MCMC) methods [36] and variational methods [7, 37]. Variational methods have gained significant research traction, mostly because of their high efficiency and practical success in many applications. Essentially, these methods search for a proxy in an analytically tractable distribution family that approximates the true underlying distribution. To measure the closeness between the true and

approximate distribution, a proper divergence measure between the two distributions has to be defined. Among all types of divergence measures, the Kullback-Leibler (K-L) divergence has been the most widely studied. In particular, the K-L divergence between two distributions p1(z) := p(z; θ1) and p2(z) := p(z; θ2) is defined as

\[ D(p_1 \| p_2) = \int p_1(z) \log p_1(z) - p_1(z) \log p_2(z) \, dz, \tag{4.1} \]

which is the Bregman divergence^1 associated with the Shannon-Boltzmann-Gibbs (SBG) entropy,

\[ H(p(z)) := -\int p(z) \log p(z) \, dz = -E_{p(z)}[\log p(z)]. \tag{4.2} \]

The reason the K-L divergence has been so popular is mainly that the SBG entropy is closely connected to the exponential family of distributions and its log-partition function. As the following theorem demonstrates, the negative SBG entropy is the Fenchel conjugate of the log-partition function of the exponential family of distributions.

Theorem 4.1.1 (Theorem 2 [7]) For a k-dimensional exponential family distribution p(z; θ) = exp(⟨Φ(z), θ⟩ − G(θ)) and a k-dimensional vector µ, let θ(µ) (if it exists) be the parameter of p(z; θ) such that

\[ \mu = E_{p(z; \theta(\mu))}[\Phi(z)] = \int \Phi(z)\, p(z; \theta(\mu)) \, dz. \tag{4.3} \]

Furthermore,

\[ G^*(\mu) = \begin{cases} -H(p(z; \theta(\mu))) & \text{if } \theta(\mu) \text{ exists} \\ +\infty & \text{otherwise.} \end{cases} \tag{4.4} \]

By duality it also follows that

\[ G(\theta) = \sup_{\mu} \{ \langle \mu, \theta \rangle - G^*(\mu) \}. \tag{4.5} \]

^1 The Bregman divergence associated with F for points p, q ∈ Ω is D_F(p, q) = F(p) − F(q) − ⟨∇F(q), p − q⟩. See Appendix A.1 for more details.
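Theorem 4.1.1 can be checked numerically for the simplest case. The sketch below treats the Bernoulli distribution as a one-dimensional exponential family with Φ(z) = z and G(θ) = log(1 + e^θ), and verifies by grid search that the conjugate G*(µ) matches the negative SBG entropy at the mean µ; the function names are ours, not the dissertation's.

```python
import math

def G(theta):
    """Log-partition of the Bernoulli exponential family, Phi(z) = z."""
    return math.log(1.0 + math.exp(theta))

def neg_entropy(mu):
    """Negative SBG entropy of a Bernoulli distribution with mean mu."""
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

def conjugate(mu, lo=-20.0, hi=20.0, steps=60001):
    """G*(mu) = sup_theta { mu * theta - G(theta) }, by a dense grid search."""
    h = (hi - lo) / (steps - 1)
    return max(mu * (lo + i * h) - G(lo + i * h) for i in range(steps))

# Theorem 4.1.1: for a mean in the interior of the simplex, G*(mu) = -H(p(.; theta(mu))).
val = conjugate(0.3)
```

The grid supremum agrees with the closed form to within the grid resolution, which is exactly the conjugacy statement (4.4) specialized to this family.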

Variational methods try to find an approximate distribution $\tilde p$ from an analytically tractable exponential family that minimizes the K-L divergence to the true distribution p. Since the Bregman divergence is not symmetric, minimizing $D(\tilde p \| p)$ and $D(p \| \tilde p)$ gives different results. Therefore, there are two main types of variational inference methods, and in the following we review two classical algorithms.

4.1.1 Mean Field Methods

We briefly review mean field methods [7]. Suppose we are interested in approximating a k-dimensional multivariate distribution

\[ p(z; \theta) = \exp(\langle \Phi(z), \theta \rangle - G(\theta)), \]

where z = (z^1, ..., z^k). Since G(θ) and −H(p(z; θ(µ))) are Fenchel conjugates (Theorem 4.1.1),

\[ G(\theta) = \sup_{\mu \in \mathcal{M}} \{ \langle \mu, \theta \rangle + H(p(z; \theta(\mu))) \}, \]

where M denotes the set

\[ \mathcal{M} = \{ \mu \mid \exists \hat\theta \text{ s.t. } E_{p(z; \hat\theta)}[\Phi(z)] = \mu \}. \tag{4.6} \]

The problem which arises in computing G(θ) is that M and H(p(z; θ)) are generally hard to characterize for most non-trivial multivariate distributions. Instead, the mean field approximation replaces the set M by a subset of $\tilde\mu$ obtained from simpler distributions $\tilde p$. Let

\[ \tilde\mu = \int \Phi(z)\, \tilde p(z; \tilde\theta(\tilde\mu)) \, dz = E_{\tilde p(z; \tilde\theta(\tilde\mu))}[\Phi(z)], \tag{4.7} \]

and let $\hat{\mathcal{M}}$ denote the set of all such $\tilde\mu$. Since $\hat{\mathcal{M}} \subseteq \mathcal{M}$, clearly

\[ G(\theta) \geq \sup_{\tilde\mu \in \hat{\mathcal{M}}} \{ \langle \tilde\mu, \theta \rangle + H(\tilde p(z; \tilde\theta(\tilde\mu))) \}. \tag{4.8} \]

Moreover, the approximation error incurred as a result of replacing p with $\tilde p$ is $D(\tilde p \| p)$. To see this, use (4.2) and (4.7) to write

\[
G(\theta) - \sup_{\tilde\mu \in \hat{\mathcal{M}}} \{ \langle \tilde\mu, \theta \rangle + H(\tilde p(z; \tilde\theta(\tilde\mu))) \}
= \inf_{\tilde\mu \in \hat{\mathcal{M}}} \{ -H(\tilde p(z; \tilde\theta(\tilde\mu))) - \langle \tilde\mu, \theta \rangle + G(\theta) \}
\]
\[
= \inf_{\tilde\mu \in \hat{\mathcal{M}}} \left\{ \int \tilde p(z; \tilde\theta(\tilde\mu)) \log \tilde p(z; \tilde\theta(\tilde\mu)) \, dz - \int \tilde p(z; \tilde\theta(\tilde\mu)) \left( \langle \Phi(z), \theta \rangle - G(\theta) \right) dz \right\}
\]
\[
= \inf_{\tilde\mu \in \hat{\mathcal{M}}} \left\{ \int \tilde p(z; \tilde\theta(\tilde\mu)) \log \tilde p(z; \tilde\theta(\tilde\mu)) \, dz - \int \tilde p(z; \tilde\theta(\tilde\mu)) \log p(z; \theta) \, dz \right\}
= \inf_{\tilde\mu \in \hat{\mathcal{M}}} D(\tilde p \| p).
\]

Perhaps the simplest approximating distribution assumes that the random variables z^1, ..., z^k are independent, that is,

\[ \tilde p(z; \tilde\theta(\tilde\mu)) = \prod_{j=1}^{k} \tilde p(z^j; \tilde\theta^j(\tilde\mu^j)), \]

where

\[ \tilde p(z^j; \tilde\theta^j(\tilde\mu^j)) = \exp\!\left( \langle \Phi^j(z^j), \tilde\theta^j(\tilde\mu^j) \rangle - G^j(\tilde\theta^j(\tilde\mu^j)) \right). \]

For brevity, denote $\tilde p_j = \tilde p(z^j; \tilde\theta^j(\tilde\mu^j))$; then the K-L divergence $D(\tilde p \| p)$ can be written as

\[ D(\tilde p \| p) = \int \tilde p_n \left\{ \int \log \tilde p(z; \tilde\theta(\tilde\mu)) \prod_{j \neq n} \tilde p_j \, dz^j \right\} dz^n - \int \tilde p_n \left\{ \int \log p(z; \theta) \prod_{j \neq n} \tilde p_j \, dz^j \right\} dz^n \]

for any n ∈ {1, ..., k}. If we keep all $\tilde\theta^j(\tilde\mu^j)$ for j ≠ n fixed and minimize $D(\tilde p \| p)$ with respect to $\tilde\theta^n(\tilde\mu^n)$, it is easy to verify that the infimum is attained by setting

\[ \int \log \tilde p(z; \tilde\theta(\tilde\mu)) \prod_{j \neq n} \tilde p_j \, dz^j = \int \log p(z; \theta) \prod_{j \neq n} \tilde p_j \, dz^j + \text{const}. \]

Since

\[ \log \tilde p(z; \tilde\theta(\tilde\mu)) = \langle \Phi^n(z^n), \tilde\theta^n(\tilde\mu^n) \rangle - G^n(\tilde\theta^n(\tilde\mu^n)) + \sum_{j \neq n} \left( \langle \Phi^j(z^j), \tilde\theta^j(\tilde\mu^j) \rangle - G^j(\tilde\theta^j(\tilde\mu^j)) \right), \]

$\log p(z; \theta) = \langle \Phi(z), \theta \rangle - G(\theta)$, and $\int \prod_{j \neq n} \tilde p_j \, dz^j = 1$, the infimum condition can be rearranged as

\[ \langle \Phi^n(z^n), \tilde\theta^n(\tilde\mu^n) \rangle = \langle E_{\tilde p_{j \neq n}}[\Phi(z)], \theta \rangle + \text{const}, \tag{4.9} \]

where $E_{\tilde p_{j \neq n}}[\Phi(z)] = \int \Phi(z) \prod_{j \neq n} \tilde p_j \, dz^j$. We have absorbed all the terms which do not depend on z^n into the constant.

In summary, the mean field algorithm updates $\tilde\theta^n(\tilde\mu^n)$ to satisfy (4.9) while keeping all $\tilde\theta^j(\tilde\mu^j)$ for j ≠ n fixed. Different n are picked cyclically and $\tilde\theta^n(\tilde\mu^n)$ updated until a stationary point is reached.

Next, we compute the lower bound on G(θ) using the computed $\tilde\theta^n(\tilde\mu^n)$. Towards this end, observe that the SBG entropy of $\tilde p(z; \tilde\theta(\tilde\mu))$ is simply the sum of the SBG entropies of the individual random variables (which we are able to compute efficiently),

\[ H(\tilde p(z; \tilde\theta(\tilde\mu))) = \sum_{j=1}^{k} H(\tilde p_j(z^j; \tilde\theta^j(\tilde\mu^j))). \]

Plugging this into (4.8) yields the desired lower bound.
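The coordinate update (4.9) can be exercised on a toy example. For a bivariate Gaussian with mean m and precision matrix Λ, matching the coefficient of z^n on both sides of (4.9) gives the standard update for the factor means, whose fixed point is the true mean m; this is a sketch under that standard result, not code from the dissertation.

```python
def mean_field_gaussian(m, Lam, iters=100):
    """Cyclic mean-field updates of the factor means for a 2-D Gaussian
    with mean m and precision matrix Lam (instance of update (4.9))."""
    mu = [0.0, 0.0]  # initialize the factor means
    for _ in range(iters):
        for n in (0, 1):
            j = 1 - n
            # Coefficient of z^n in <E[Phi(z)], theta> with z^j at its factor mean:
            mu[n] = m[n] - Lam[n][j] / Lam[n][n] * (mu[j] - m[j])
    return mu

mu = mean_field_gaussian([1.0, -2.0], [[2.0, 0.8], [0.8, 1.0]])
```

The iteration contracts (the coupling ratio Λ12²/(Λ11Λ22) is below one here), so the factor means converge to the true mean while the factor variances, as is well known for mean field, would underestimate the true marginal variances.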

4.1.2 Assumed Density Filtering

This subsection reviews assumed density filtering [38]. Given an original distribution p(z), assumed density filtering obtains the approximate distribution $\tilde p(z; \tilde\theta)$ by minimizing the K-L divergence $D(p \| \tilde p)$. If

\[ \tilde p(z; \tilde\theta) = \exp(\langle \Phi(z), \tilde\theta \rangle - G(\tilde\theta)), \]

one can take the derivative of $D(p \| \tilde p)$ with respect to $\tilde\theta$, using the fact that $\nabla_{\tilde\theta} G(\tilde\theta) = E_{\tilde p}[\Phi(z)]$, and obtain

\[ E_p[\Phi(z)] = E_{\tilde p}[\Phi(z)]. \tag{4.10} \]

(4.10) is widely known as "moment matching". It is also the key idea of Expectation Propagation [37], which has many applications in graphical models. One of the major applications of assumed density filtering is estimating the approximate posterior in Bayesian online learning [39]. The objective of Bayesian online learning is to train a binary classification model based on an online stream of m training examples D_m = {(x_1, y_1), ..., (x_m, y_m)}. For simplicity, let us consider a linear model with Φ(x_i) = x_i parameterized by w, such that the label y_i is predicted by sign(⟨x_i, w⟩). For each training example (x_i, y_i), the conditional distribution of the label y_i given x_i and w is modeled as in [37]:

\[ t_i(w) := p(y_i \mid x_i; w) = \epsilon + (1 - 2\epsilon)\, \Theta(y_i \langle x_i, w \rangle), \tag{4.11} \]

where Θ(z) is the step function (Θ(z) = 1 if z > 0 and 0 otherwise), and ε is a small error tolerance variable that increases the robustness of the model^2. By making a standard i.i.d. assumption about the data, the posterior distribution after seeing the m-th example can be written as

\[ p(w \mid D_m) \propto p_0(w) \prod_{i=1}^{m} t_i(w), \]

where p_0(w) denotes a prior distribution, which is usually assumed to be a Gaussian p_0(w) = N(w; 0, I). As it turns out, the posterior p(w | D_m) is infeasible to compute exactly for m ≥ 2. However, by using assumed density filtering, we can find a multivariate Gaussian distribution that approximates the true posterior,

\[ p(w \mid D_m) \simeq \tilde p(w; \tilde\theta_{(m)}) := N(w; \tilde\mu_{(m)}, \tilde\Sigma_{(m)}). \]

We initialize $\tilde p(w; \tilde\theta_{(0)}) = p_0(w) = N(w; 0, I)$, and denote the approximate distribution after processing (x_1, y_1), ..., (x_i, y_i) by $\tilde p_i(w) := p(w; \tilde\theta_{(i)})$ for i ≥ 1. Define $p_i(w) \propto \tilde p_{i-1}(w)\, t_i(w)$; then the approximate posterior $\tilde p_i(w)$ is updated as

\[ \tilde p_i(w) = N(w; \tilde\mu_{(i)}, \tilde\Sigma_{(i)}) = \operatorname*{argmin}_{\mu, \Sigma} D(p_i(w) \| N(w; \mu, \Sigma)). \tag{4.12} \]

As was shown in [37], the solution of (4.12) is

\[ \tilde\mu_{(i)} = E_{p_i}[w] = \tilde\mu_{(i-1)} + \alpha_{(i)} y_i \tilde\Sigma_{(i-1)} x_i, \]
\[ \tilde\Sigma_{(i)} = E_{p_i}[w w^\top] - E_{p_i}[w] E_{p_i}[w]^\top = \tilde\Sigma_{(i-1)} - (\tilde\Sigma_{(i-1)} x_i) \left( \frac{\alpha_{(i)} y_i \langle x_i, \tilde\mu_{(i)} \rangle}{x_i^\top \tilde\Sigma_{(i-1)} x_i} \right) (\tilde\Sigma_{(i-1)} x_i)^\top, \]

where

\[ \alpha_{(i)} = \frac{(1 - 2\epsilon)\, N(z_{(i)}; 0, 1)}{\sqrt{x_i^\top \tilde\Sigma_{(i-1)} x_i} \left( \epsilon + (1 - 2\epsilon) \int_{-\infty}^{z_{(i)}} N(z; 0, 1) \, dz \right)} \quad \text{and} \quad z_{(i)} = \frac{y_i \langle x_i, \tilde\mu_{(i-1)} \rangle}{\sqrt{x_i^\top \tilde\Sigma_{(i-1)} x_i}}. \]

^2 (4.11) is equivalent to the 0-1 loss. Define u = y⟨x, w⟩ and l(x, y, w) = −log p(y | x; w); then l(x, y, w) = −log(1 − ε) if u > 0 and −log(ε) if u ≤ 0.
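The Gaussian update rules above are simple enough to run directly. The sketch below implements one ADF step in pure Python (no linear algebra library) and streams a few points labeled by a hypothetical true weight vector; the data and the helper names are ours, chosen only to exercise the equations.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def adf_update(mu, Sigma, x, y, eps=0.01):
    """One assumed-density-filtering step for the likelihood (4.11)."""
    d = len(x)
    Sx = [sum(Sigma[i][j] * x[j] for j in range(d)) for i in range(d)]
    s2 = sum(x[i] * Sx[i] for i in range(d))  # x^T Sigma x
    z = y * sum(x[i] * mu[i] for i in range(d)) / math.sqrt(s2)
    alpha = (1 - 2 * eps) * norm_pdf(z) / (
        math.sqrt(s2) * (eps + (1 - 2 * eps) * norm_cdf(z)))
    mu_new = [mu[i] + alpha * y * Sx[i] for i in range(d)]
    c = alpha * y * sum(x[i] * mu_new[i] for i in range(d)) / s2
    Sigma_new = [[Sigma[i][j] - c * Sx[i] * Sx[j] for j in range(d)]
                 for i in range(d)]
    return mu_new, Sigma_new

# Stream a few points labeled by a hypothetical true weight vector w = (1, -1).
w = (1.0, -1.0)
mu, Sigma = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for x in [(1.0, 0.0), (0.0, 1.0), (1.0, 0.5), (-1.0, 2.0), (2.0, -1.0)]:
    y = 1.0 if x[0] * w[0] + x[1] * w[1] > 0 else -1.0
    mu, Sigma = adf_update(mu, Sigma, x, y)
```

After a handful of updates, the posterior mean already points in the direction of the true weight vector, and the covariance shrinks along the directions that have been observed.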

4.2 T-Entropy and T-Divergence

We have briefly reviewed the main ideas of variational inference as well as two well-known algorithms for the exponential family of distributions. Our objective in this chapter is to generalize variational inference to the t-exponential family of distributions. First, we need a new entropy which plays the same role as the SBG entropy does for the exponential family. Various generalizations of the SBG entropy have been proposed in statistical physics and paired with the t-exponential family of distributions. Perhaps the most well-known among them is the Tsallis entropy [10]:

\[ H_{tsallis}(p) := -\int p(z)^t \log_t p(z) \, dz. \tag{4.13} \]

Naudts [8, 9] proposed the more general φ-exponential family of distributions (see Section 2.3). Corresponding to this family, an entropy-like measure called the information content I_φ(p), as well as its divergence measure, are defined. The information content is the Fenchel conjugate of a function F(θ), where

\[ \nabla_\theta F(\theta) = E_p[\Phi(z)]. \tag{4.14} \]

Setting φ(x) = x^t in the Naudts framework recovers the t-exponential family. Interestingly, when φ(x) = (1/t) x^{2−t}, the information content I_φ is exactly the Tsallis entropy (4.13). Another well-known non-SBG entropy is the Rényi entropy [40]. The Rényi α-entropy (for α ≠ 1) of the probability distribution p(z) is defined as

\[ H_\alpha(p) = \frac{1}{1 - \alpha} \log \left( \int p(z)^\alpha \, dz \right). \tag{4.15} \]

Besides these entropies proposed in statistical physics, there are other efforts that work with generalized linear models or utilize different divergence measures, such as [14, 41–43]. Although all of the above generalized entropies are useful in their own way, unfortunately none of them is the Fenchel conjugate of the log-partition function Gt(θ) of the t-exponential family. As has been shown in Section 4.1 as well as in [7], this property is crucial for developing efficient variational inference approaches. In the following subsection, we define a new entropy which, to the best of our knowledge, has not been studied before. Note that although our main focus is the t-exponential family, we believe that our results can also be extended to the more general φ-exponential family of Naudts [9].

4.2.1 T-Entropy

Definition 4.2.1 (Inspired by Theorem 2 [7]) The t-entropy of a probability distribution p(z; θ) is defined as

\[ H_t(p(z; \theta)) := -\int q(z; \theta) \log_t p(z; \theta) \, dz = -E_q[\log_t p(z; \theta)], \tag{4.16} \]

where $q(z; \theta) = p(z; \theta)^t / Z(\theta)$ and $Z(\theta) = \int p(z; \theta)^t \, dz$.

It is straightforward to verify that the t-entropy is non-negative. Furthermore, if p(z; θ) is a t-exponential family distribution, the following theorem establishes the Fenchel conjugacy between −Ht(p(z; θ(µ))) and Gt(θ), the log-partition function of p(z; θ). The theorem extends Theorem 3.4 of [7] to the t-exponential family of distributions. The proof of the theorem is provided in Appendix B.7.

Theorem 4.2.1 For a k-dimensional t-exponential family of distributions p(z; θ) = exp_t(⟨Φ(z), θ⟩ − Gt(θ)) and a k-dimensional vector µ, let θ(µ) (if it exists) be the parameter of p(z; θ) such that

\[ \mu = E_{q(z; \theta(\mu))}[\Phi(z)] = \int \Phi(z)\, q(z; \theta(\mu)) \, dz. \tag{4.17} \]

Then

\[ G_t^*(\mu) = \begin{cases} -H_t(p(z; \theta(\mu))) & \text{if } \theta(\mu) \text{ exists} \\ +\infty & \text{otherwise,} \end{cases} \tag{4.18} \]

where $G_t^*(\mu)$ denotes the Fenchel conjugate of Gt(θ). By duality it also follows that

\[ G_t(\theta) = \sup_{\mu} \{ \langle \mu, \theta \rangle - G_t^*(\mu) \}. \tag{4.19} \]

From Theorem 4.2.1, we know that −Ht(p(z; θ(µ))) is a convex function because it is the Fenchel conjugate of a function (Theorem 1.1.2 in Chapter X of [44]). Below, we derive the t-entropy of two commonly used distributions. See Figure 4.1 for a graphical illustration.

Example 5 (T-entropy of the Bernoulli distribution) Assume a Bernoulli distribution p(z; µ) with parameter µ. The t-entropy is

\[ H_t(p(z; \mu)) = \frac{-\mu^t \log_t \mu - (1 - \mu)^t \log_t(1 - \mu)}{\mu^t + (1 - \mu)^t} = \frac{1 - \left( \mu^t + (1 - \mu)^t \right)^{-1}}{1 - t}. \tag{4.20} \]
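The two forms in (4.20) can be cross-checked against the definition (4.16). The sketch below computes the Bernoulli t-entropy once from the escort expectation and once from the closed form; the function names are ours.

```python
def log_t(x, t):
    """t-logarithm: log_t(x) = (x**(1 - t) - 1) / (1 - t), for t != 1."""
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def t_entropy_bernoulli(mu, t):
    """H_t from definition (4.16): escort expectation of -log_t p."""
    p = [mu, 1.0 - mu]
    Z = sum(pi ** t for pi in p)
    q = [pi ** t / Z for pi in p]  # escort distribution
    return -sum(qi * log_t(pi, t) for qi, pi in zip(q, p))

def t_entropy_bernoulli_closed(mu, t):
    """Closed form (4.20)."""
    Z = mu ** t + (1.0 - mu) ** t
    return (1.0 - 1.0 / Z) / (1.0 - t)
```

For t close to 1 both expressions approach the Shannon entropy −µ log µ − (1 − µ) log(1 − µ), as the example states.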

As t → 1, Ht(p(z; µ)) = −µ log µ − (1 − µ) log(1 − µ).

Example 6 (T-entropy of the Student's t-distribution) Assume that a k-dimensional Student's t-distribution p(z; µ, Σ, v)^3 is given by (2.20). Then the t-entropy of p(z; µ, Σ, v) is

\[ H_t(p(z; \mu, \Sigma, v)) = -\frac{\Psi}{1 - t}\left( 1 + k v^{-1} \right) + \frac{1}{1 - t}, \tag{4.21} \]

where K = (vΣ)^{−1}, v = 2/(t − 1) − k, and

\[ \Psi = \left( \frac{\Gamma((v + k)/2)}{(\pi v)^{k/2}\, \Gamma(v/2)\, |\Sigma|^{1/2}} \right)^{-2/(v + k)}. \]

As t → 1, v → +∞,

\[ H_t(p(z; \mu, \Sigma, v)) = \frac{k}{2} + \frac{k}{2} \log(2\pi) + \frac{1}{2} \log|\Sigma|. \]

^3 We abuse the notation µ here to denote the first moment of the Student's t-distribution.

Fig. 4.1. T-entropy corresponding to two well-known probability distributions. Left: the Bernoulli distribution p(z; µ). Right: the 1-dimensional Student's t-distribution p(z; 0, σ², v), where v = 2/(t − 1) − 1. One recovers the SBG entropy by letting t = 1.0.

Although t-entropy has not been studied in the past, as the following examples will show, it is closely related to some well-known generalized entropies.

Relation with the Tsallis Entropy Using (2.13), (4.13), and (4.16), it is straightforward to see that the t-entropy is a normalized version of the Tsallis entropy:

\[ H_t(p(z)) = -\frac{1}{\int p(z)^t \, dz} \int p(z)^t \log_t p(z) \, dz = \frac{1}{\int p(z)^t \, dz} H_{tsallis}(p(z)). \]

Relation with the Rényi Entropy We can equivalently rewrite the Rényi entropy as

\[ H_\alpha(p(z)) = \frac{1}{1 - \alpha} \log \left( \int p(z)^\alpha \, dz \right) = -\log \left( \left( \int p(z)^\alpha \, dz \right)^{-1/(1 - \alpha)} \right). \]

The t-entropy of p(z) (when t ≠ 1) is equal to

\[
H_t(p(z)) = -\frac{\int p(z)^t \log_t p(z) \, dz}{\int p(z)^t \, dz}
= -\frac{\int p(z)^t \left( p(z)^{1 - t} - 1 \right) dz}{(1 - t) \int p(z)^t \, dz}
\]
\[
= -\frac{1 - \int p(z)^t \, dz}{(1 - t) \int p(z)^t \, dz} \tag{4.22}
\]
\[
= -\log_t \left( \left( \int p(z)^t \, dz \right)^{-1/(1 - t)} \right), \tag{4.23}
\]

where (4.22) holds because ∫ p(z) dz = 1, and (4.23) because log_t(x) = (x^{1−t} − 1)/(1 − t). Therefore, when α = t,

\[ H_t(p(z)) = -\log_t\!\left( \exp(-H_\alpha(p(z))) \right). \]

When t and α → 1, both entropies reduce to the SBG entropy.
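The identity relating the t-entropy and the Rényi entropy can be checked numerically on a discrete distribution; the helper names below are ours.

```python
import math

def log_t(x, t):
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def t_entropy(p, t):
    """H_t from (4.16) for a discrete distribution p (a list of probabilities)."""
    Z = sum(pi ** t for pi in p)
    return -sum((pi ** t / Z) * log_t(pi, t) for pi in p)

def renyi_entropy(p, alpha):
    """Renyi alpha-entropy (4.15) for a discrete distribution."""
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

p = [0.2, 0.3, 0.5]
t = 1.4
# Identity from this section (alpha = t): H_t(p) = -log_t(exp(-H_alpha(p))).
lhs = t_entropy(p, t)
rhs = -log_t(math.exp(-renyi_entropy(p, t)), t)
```

Both sides agree to machine precision, which is exactly the statement derived from (4.23).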

4.2.2 T-Divergence

The next step is to define a divergence measure which pairs with the t-exponential family. Analogous to the K-L divergence, we define the t-divergence^4 as the Bregman divergence based on the t-entropy.

Definition 4.2.2 The t-divergence is the relative t-entropy between two distributions p1(z) and p2(z):

\[ D_t(p_1 \| p_2) = \int q_1(z) \log_t p_1(z) - q_1(z) \log_t p_2(z) \, dz. \tag{4.24} \]

The mathematical verification of (4.24) is provided in Appendix B.11. The t-divergence plays a central role in the variational inference derived shortly. Because it is a Bregman divergence, it has the following properties:

• Dt(p1 ‖ p2) ≥ 0 for all p1, p2; the equality holds only for p1 = p2.

^4 Note that the t-divergence is not a special case of the divergence measure of Naudts [9] because the entropies are defined differently. The derivations are fairly similar in spirit, though.

• Dt(p1 ‖ p2) ≠ Dt(p2 ‖ p1).

We give two examples of the t-divergence below. For the corresponding graphical illustrations, see Figure 4.2.

Example 7 (T-divergence between Bernoulli distributions) Given two Bernoulli distributions p1 := p(z; µ1) and p2 := p(z; µ2), the t-divergence Dt(p1 ‖ p2) between them is

\[ D_t(p_1 \| p_2) = \frac{\mu_1^t \log_t \mu_1 + (1 - \mu_1)^t \log_t(1 - \mu_1) - \mu_1^t \log_t \mu_2 - (1 - \mu_1)^t \log_t(1 - \mu_2)}{\mu_1^t + (1 - \mu_1)^t} = \frac{1 - \mu_1^t \mu_2^{1 - t} - (1 - \mu_1)^t (1 - \mu_2)^{1 - t}}{(1 - t)\left( \mu_1^t + (1 - \mu_1)^t \right)}. \]

As t → 1, Dt(p1 ‖ p2) = µ1 log(µ1/µ2) + (1 − µ1) log((1 − µ1)/(1 − µ2)).
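The two Bregman-divergence properties listed above, and the closed form of Example 7, can be checked directly from the definition (4.24) for discrete distributions; the helper names are ours.

```python
def log_t(x, t):
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def t_divergence(p1, p2, t):
    """D_t(p1 || p2) per (4.24) for discrete distributions; q1 is the escort of p1."""
    Z1 = sum(p ** t for p in p1)
    q1 = [p ** t / Z1 for p in p1]
    return sum(q * (log_t(a, t) - log_t(b, t)) for q, a, b in zip(q1, p1, p2))

def t_divergence_bernoulli_closed(mu1, mu2, t):
    """Closed form of Example 7."""
    Z1 = mu1 ** t + (1.0 - mu1) ** t
    num = 1.0 - mu1 ** t * mu2 ** (1.0 - t) \
              - (1.0 - mu1) ** t * (1.0 - mu2) ** (1.0 - t)
    return num / ((1.0 - t) * Z1)

p1, p2 = [0.3, 0.7], [0.6, 0.4]
```

Numerically, the divergence is non-negative, vanishes only when the two distributions coincide, and is visibly asymmetric in its two arguments.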

Example 8 (T-divergence between Student's t-distributions) Given two Student's t-distributions p1 := p(z; µ1, Σ1, v) and p2 := p(z; µ2, Σ2, v), the t-divergence Dt(p1 ‖ p2) between them is

\[ D_t(p_1 \| p_2) = \frac{\Psi_1}{1 - t}\left( 1 + k v^{-1} \right) + \frac{2\Psi_2}{1 - t} \mu_1^\top K_2 \mu_2 - \frac{\Psi_2}{1 - t} \mathrm{Tr}\!\left( K_2^\top \Sigma_1 \right) - \frac{\Psi_2}{1 - t} \mu_1^\top K_2 \mu_1 - \frac{\Psi_2}{1 - t} \left( \mu_2^\top K_2 \mu_2 + 1 \right), \tag{4.25} \]

where the definitions of K_i and Ψ_i are the same as in (4.21), and Tr denotes the trace of a matrix. As t → 1, v → +∞,

\[ D_t(p_1 \| p_2) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) + \log \frac{|\Sigma_2|}{|\Sigma_1|} - k \right). \]

4.3 Variational Inference in T-Exponential Family of Distributions

In this section, we extend the two variational inference methods from Section 4.1 to the t-exponential family.


Fig. 4.2. T -divergence between two distributions. Top: Bernoulli distributions p1 = p(z; µ) and p2 = p(z; 0.5). Bottom: Student’s t-distributions. Left: p1 = p(z; µ, 1, v) and p2 = p(z; 0, 1, v). Right: p1 = p(z; 0, σ 2 , v) and p2 = p(z; 0, 1, v). v = 2/(t − 1) − 1. One recovers the K-L divergence by letting t = 1.0.

4.3.1 Mean Field Methods

This subsection introduces the mean field method for the t-exponential family. Consider the k-dimensional multivariate t-exponential family of distributions

\[ p(z; \theta) = \exp_t(\langle \Phi(z), \theta \rangle - G_t(\theta)), \]

where z = (z^1, ..., z^k). From Theorem 4.2.1,

\[ G_t(\theta) = \sup_{\mu \in \mathcal{M}} \{ \langle \mu, \theta \rangle + H_t(p(z; \theta(\mu))) \}, \]

where

\[ \mathcal{M} = \{ \mu \mid \exists \hat\theta \text{ s.t. } E_{q(z; \hat\theta)}[\Phi(z)] = \mu \}. \tag{4.26} \]

Similar to the case of the exponential family (see Section 4.1.1), if we replace p by $\tilde p(z; \tilde\theta(\tilde\mu))$, then the approximation error of Gt(θ) incurred is given by the t-divergence

\[ G_t(\theta) - \sup_{\tilde\mu \in \hat{\mathcal{M}}} \{ \langle \tilde\mu, \theta \rangle + H_t(\tilde p(z; \tilde\theta(\tilde\mu))) \} = \inf_{\tilde\mu \in \hat{\mathcal{M}}} D_t(\tilde p \| p), \tag{4.27} \]

where

\[ \hat{\mathcal{M}} = \{ \tilde\mu \mid \exists \hat\theta \text{ s.t. } E_{\tilde q(z; \hat\theta)}[\Phi(z)] = \tilde\mu \}. \]

The simplest way is to approximate p(z; θ) by

\[ \tilde p(z; \tilde\theta(\tilde\mu)) = \prod_{j=1}^{k} \tilde p(z^j; \tilde\theta^j), \quad \text{where} \quad \tilde p(z^j; \tilde\theta^j) = \exp_t\!\left( \langle \Phi^j(z^j), \tilde\theta^j \rangle - G_t^j(\tilde\theta^j) \right). \tag{4.28} \]

Denote $\tilde p_j = \tilde p(z^j; \tilde\theta^j)$ and let $\tilde q_j$ be the corresponding escort distribution. For any n, one can write the t-divergence as

\[ D_t(\tilde p \| p) = \int \tilde q_n \left( \int \log_t \tilde p(z; \tilde\theta) \prod_{j \neq n} \tilde q_j \, dz^j \right) dz^n - \int \tilde q_n \left( \int \log_t p(z; \theta) \prod_{j \neq n} \tilde q_j \, dz^j \right) dz^n. \]

If we keep all $\tilde\theta^j$ for j ≠ n fixed, then the t-divergence is minimized by setting

\[ \int \log_t \tilde p(z; \tilde\theta) \prod_{j \neq n} \tilde q_j \, dz^j = \int \log_t p(z; \theta) \prod_{j \neq n} \tilde q_j \, dz^j + \text{const}. \tag{4.29} \]

Using the fact that $\int \prod_{j \neq n} \tilde q_j \, dz^j = 1$, we can write

\[ \int \log_t \tilde p(z; \tilde\theta) \prod_{j \neq n} \tilde q_j \, dz^j = \frac{1}{1 - t} \int \tilde p^{1 - t}(z; \tilde\theta) \prod_{j \neq n} \tilde q_j \, dz^j - \frac{1}{1 - t}, \]
\[ \int \log_t p(z; \theta) \prod_{j \neq n} \tilde q_j \, dz^j = \frac{1}{1 - t} \int p^{1 - t}(z; \theta) \prod_{j \neq n} \tilde q_j \, dz^j - \frac{1}{1 - t}. \]

Since p(z; θ) is in the t-exponential family,

\[ \int p^{1 - t}(z; \theta) \prod_{j \neq n} \tilde q_j \, dz^j = \int \left( 1 + (1 - t)\left( \langle \Phi(z), \theta \rangle - G_t(\theta) \right) \right) \prod_{j \neq n} \tilde q_j \, dz^j = 1 + (1 - t)\left( \langle E_{\tilde q_{j \neq n}}[\Phi(z)], \theta \rangle - G_t(\theta) \right), \tag{4.30} \]

where $E_{\tilde q_{j \neq n}}[\Phi(z)] = \int \Phi(z) \prod_{j \neq n} \tilde q_j \, dz^j$. Similarly,

\[ \int \tilde p^{1 - t}(z; \tilde\theta) \prod_{j \neq n} \tilde q_j \, dz^j = \int \left( 1 + (1 - t)\left( \langle \Phi^n(z^n), \tilde\theta^n \rangle - G_t^n(\tilde\theta^n) \right) \right) \prod_{j \neq n} \left( 1 + (1 - t)\left( \langle \Phi^j(z^j), \tilde\theta^j \rangle - G_t^j(\tilde\theta^j) \right) \right) \tilde q_j \, dz^j \]
\[ = \left( 1 + (1 - t)\left( \langle \Phi^n(z^n), \tilde\theta^n \rangle - G_t^n(\tilde\theta^n) \right) \right) \prod_{j \neq n} \left( 1 + (1 - t)\left( \langle E_{\tilde q_j}[\Phi^j(z^j)], \tilde\theta^j \rangle - G_t^j(\tilde\theta^j) \right) \right), \tag{4.31} \]

where $E_{\tilde q_j}[\Phi^j(z^j)] = \int \Phi^j(z^j)\, \tilde q_j \, dz^j$. Putting together (4.30) and (4.31) by using (4.29) yields

\[ 1 + (1 - t)\left( \langle E_{\tilde q_{j \neq n}}[\Phi(z)], \theta \rangle - G_t(\theta) \right) = \left( 1 + (1 - t)\left( \langle \Phi^n(z^n), \tilde\theta^n \rangle - G_t^n(\tilde\theta^n) \right) \right) \prod_{j \neq n} \left( 1 + (1 - t)\left( \langle E_{\tilde q_j}[\Phi^j(z^j)], \tilde\theta^j \rangle - G_t^j(\tilde\theta^j) \right) \right) + \text{const}. \]

Absorbing all the terms which do not depend on z^n into the constant, we obtain the following update equation:

\[ \langle \Phi^n(z^n), \tilde\theta^n \rangle = \langle E_{\tilde q_{j \neq n}}[\Phi(z)], \theta \rangle \prod_{j \neq n} \exp_t\!\left( \langle E_{\tilde q_j}[\Phi^j(z^j)], \tilde\theta^j \rangle - G_t^j(\tilde\theta^j) \right)^{t - 1} + \text{const}. \tag{4.32} \]

Different n are picked cyclically and $\tilde\theta^n(\tilde\mu^n)$ updated until a stationary point is reached.

After obtaining $\tilde p(z; \tilde\theta(\tilde\mu))$, one can compute $H_t(\tilde p(z; \tilde\theta(\tilde\mu)))$ and plug it into (4.27) to obtain the lower bound of Gt(θ). Note that, unlike the SBG entropy, the t-entropy $H_t(\tilde p(z; \tilde\theta(\tilde\mu)))$ does not factorize easily. However, as is shown in Appendix B.12, one can still compute a closed form solution:

\[ H_t(\tilde p(z; \tilde\theta(\tilde\mu))) = (1 - t)^{k - 1} \prod_{j=1}^{k} \left( H_t(p_j(z_j)) + \frac{1}{(1 - t) Z_j} \right) - \frac{1}{1 - t} \prod_{j=1}^{k} \frac{1}{Z_j}, \tag{4.33} \]

where $Z_j = \int p_j(z_j)^t \, dz_j$.

K = (v Σ)−1 , Ψ=

, (πv)k/2 Γ(v/2)| Σ |1/2    Ψ 1 µ> K µ + 1 + . Gt (θ) = − 1−t 1−t

68 In particular, our task is to approximate it by k one-dimensional Student’s t-distributions with degrees of freedom v˜. The reason that we choose this example is because the true logpartition Gt (θ) is analytically computable, so that we can see how close the approximation is. In order to make the approximate distribution have the same t, we choose v˜ such that 1 t−1

=

v+k 2

=

v ˜ +1 , 2

which yields v˜ = v + k − 1. The approximate distribution is ˜ = p ˜(z; θ)

k Y

˜j ) = p ˜(z ; θ j

j=1

k Y

St(z j ; µ ˜j , σ ˜ j , v˜).

j=1

Using the representation of t-exponential family of distributions ˜ j ) = expt p ˜(z j ; θ

D

E  ˜ j − Gt (θ ˜j ) Φj (z j ), θ

j

˜ = [−2Ψ ˜ jK ˜j µ ˜ jK ˜ j /(1 − t)], where with Φj (z j ) = [z j ; (z j )2 ] and θ ˜j /(1 − t), Ψ ˜j

−1

j −2

˜j

K = v˜ (˜ σ ) , Ψ =



Γ((˜ v +1)/2) Γ(˜ v /2)(π v˜)1/2 σ ˜j

−2/(˜v +1) .

Now we apply the variational updates (4.32) to obtain the variational parameters µ ˜j and σ ˜ j for j = 1, . . . , k. Detailed derivations are provided in Appendix B.13. The resulting iterative updates are,  1 ˜ j6=n )> kj6=n,n , −2µ> kn +2(µ nn k  −(˜v +1)/ v˜ Γ(˜ v /2)2/ v˜ π 1/ v˜ n 2 n ˜n ˜ (˜ σ ) = K Ψ , · Γ((˜ v +1)/2)2/ v˜ v˜ !−1 Y Ψ ˜j ˜ nΨ ˜ n = Ψk nn ˜j where K +Ψ . v ˜ j6=n µ ˜n =

Here $\tilde\mu_{j \neq n}$ denotes the vector $(\tilde\mu^j)_{j = 1, \ldots, k,\ j \neq n}$, $k_n$ denotes the n-th column of K, and $k_{j \neq n, n}$ denotes the n-th column of K with its n-th element deleted. To empirically validate these updates, we use the product of ten 1-dimensional Student's t-distributions to approximate a 10-dimensional Student's t-distribution with degrees of freedom v = 5, which corresponds to setting t = 1.13. Both µ and Σ are generated randomly using Matlab. Overall, 500 variational updates were performed, and the negative t-divergence $-D_t(\tilde p \| p)$ is plotted as a function of the number of iterations in Figure 4.3. One can see that the t-divergence between the approximate distribution and the true distribution decreases monotonically until it reaches a stationary point. At that point, the t-divergence appears to be close to 0, which indicates that a reasonable approximation has been obtained.
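The mean update above has a convenient fixed point: for any symmetric positive-definite K, the cyclic sweeps converge to $\tilde\mu = \mu$, which gives a quick sanity check. The sketch below implements only the mean update (the scale update is omitted); the test matrix is ours.

```python
def mean_field_means(mu, K, sweeps=200):
    """Cyclic (Gauss-Seidel style) application of the mean update
    mu_tilde^n = -(1 / (2 k_nn)) * (-2 mu^T k_n + 2 (mu_tilde_{j!=n})^T k_{j!=n,n}),
    for a symmetric positive-definite K."""
    k = len(mu)
    mu_t = [0.0] * k
    for _ in range(sweeps):
        for n in range(k):
            dot_mu = sum(K[j][n] * mu[j] for j in range(k))          # mu^T k_n
            dot_rest = sum(K[j][n] * mu_t[j] for j in range(k) if j != n)
            mu_t[n] = -(-2.0 * dot_mu + 2.0 * dot_rest) / (2.0 * K[n][n])
    return mu_t

K = [[2.0, 0.5, 0.3], [0.5, 1.5, 0.2], [0.3, 0.2, 1.0]]  # diagonally dominant, SPD
mu_t = mean_field_means([1.0, -2.0, 0.5], K)
```

At convergence the stationarity conditions stack into K µ̃ = K µ, so for invertible K the factorized approximation exactly recovers the true mean, consistent with the empirical behavior reported above.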


Fig. 4.3. The negative t-divergence between the product of ten 1-dimensional Student’s t-distributions and one 10-dimensional Student’s t-distribution using the mean field approach for 500 iterations.

4.3.2 Assumed Density Filtering

This subsection introduces assumed density filtering in the t-exponential family. We approximate the original distribution p(z) by $\tilde p(z; \tilde\theta) = \exp_t(\langle \Phi(z), \tilde\theta \rangle - G_t(\tilde\theta))$ by minimizing

\[ D_t(p(z) \| \tilde p(z; \tilde\theta)) = \int q(z) \log_t p(z) - q(z) \log_t \tilde p(z; \tilde\theta) \, dz. \tag{4.34} \]

Using the fact that $\nabla_{\tilde\theta} G_t(\tilde\theta) = E_{\tilde q}[\Phi(z)]$, one can take the derivative of (4.34) with respect to $\tilde\theta$ and obtain

\[ E_q[\Phi(z)] = E_{\tilde q}[\Phi(z)]. \tag{4.35} \]

In other words, the approximate distribution is obtained by matching the escort expectation of Φ(z) between the two distributions. To illustrate our ideas on a non-trivial problem, we again apply them to the Bayesian online learning problem. But this time, instead of using a multivariate Gaussian prior as was done in [37], we use a Student's t-prior,

\[ p_0(w) = St(w; 0, I, v). \tag{4.36} \]

In addition, we will find a multivariate Student's t-distribution to approximate the true posterior

\[ p(w \mid D_m) \propto p_0(w) \prod_{i=1}^{m} t_i(w), \tag{4.37} \]

where t_i(w) = p(y_i | x_i, w) is defined in (4.11). We initialize $\tilde p(w; \tilde\theta_{(0)}) = p_0(w) = St(w; 0, I, v)$, and denote the approximate distribution after processing (x_1, y_1), ..., (x_i, y_i) by $\tilde p_i(w) := p(w; \tilde\theta_{(i)}) = St(w; \tilde\mu_{(i)}, \tilde\Sigma_{(i)}, v)$ for i ≥ 1. Define $p_i(w) \propto \tilde p_{i-1}(w)\, t_i(w)$; then the approximate posterior $\tilde p_i(w)$ is updated as

\[ \tilde p_i(w) = St(w; \tilde\mu_{(i)}, \tilde\Sigma_{(i)}, v) = \operatorname*{argmin}_{\mu, \Sigma} D_t(p_i(w) \| St(w; \mu, \Sigma, v)). \tag{4.38} \]

Assume that w is a k-dimensional parameter vector; then $\tilde p_i(w)$ is a k-dimensional Student's t-distribution with degrees of freedom v, for which Φ(w) = [w; w w^⊤] and t = 1 + 2/(v + k). From (4.35), the result of (4.38) matches the moments of Φ(w) between q_i(w) and $\tilde q_i(w)$,

\[ \int q_i(w)\, w \, dw = \int \tilde q_i(w)\, w \, dw, \quad \text{and} \tag{4.39} \]
\[ \int q_i(w)\, w w^\top \, dw = \int \tilde q_i(w)\, w w^\top \, dw, \tag{4.40} \]

where $\tilde q_i(w) \propto \tilde p_i(w)^t$, $q_i(w) \propto \tilde p_{i-1}(w)^t\, \tilde t_i(w)$, and

\[ \tilde t_i(w) = t_i(w)^t = \epsilon^t + \left( (1 - \epsilon)^t - \epsilon^t \right) \Theta(y_i \langle w, x_i \rangle), \tag{4.41} \]

where Θ(z) is the step function: Θ(z) = 1 if z > 0 and 0 otherwise. Solving (4.39) and (4.40) yields the following simple update rules, which are reminiscent of the Bayesian online learning algorithm with the Gaussian distribution [37]. The detailed derivation is provided in Appendix B.14.

\[ \tilde\mu_{(i)} = E_q[w] = \tilde\mu_{(i-1)} + \alpha_{(i)} y_i \tilde\Sigma_{(i-1)} x_i, \]
\[ \tilde\Sigma_{(i)} = E_q[w w^\top] - E_q[w] E_q[w]^\top = r_{(i)} \tilde\Sigma_{(i-1)} - (\tilde\Sigma_{(i-1)} x_i) \left( \frac{\alpha_{(i)} y_i \langle x_i, \tilde\mu_{(i)} \rangle}{x_i^\top \tilde\Sigma_{(i-1)} x_i} \right) (\tilde\Sigma_{(i-1)} x_i)^\top, \]

where

\[ \alpha_{(i)} = \frac{\left( (1 - \epsilon)^t - \epsilon^t \right) St(z_{(i)}; 0, 1, v)}{Z_{2(i)} \sqrt{x_i^\top \tilde\Sigma_{(i-1)} x_i}}, \quad r_{(i)} = Z_{1(i)} / Z_{2(i)}, \quad z_{(i)} = \frac{y_i \langle x_i, \tilde\mu_{(i-1)} \rangle}{\sqrt{x_i^\top \tilde\Sigma_{(i-1)} x_i}}, \]
\[ Z_{1(i)} = \int \tilde p_{i-1}(w)\, \tilde t_i(w) \, dw = \epsilon^t + \left( (1 - \epsilon)^t - \epsilon^t \right) \int_{-\infty}^{z_{(i)}} St(z; 0, 1, v) \, dz, \]
\[ Z_{2(i)} = \int \tilde q_{i-1}(w)\, \tilde t_i(w) \, dw = \epsilon^t + \left( (1 - \epsilon)^t - \epsilon^t \right) \int_{-\infty}^{z_{(i)}} St(z; 0, v/(v + 2), v + 2) \, dz. \]

˜ (i−1) yi xi , µ =q . > ˜ xi Σ(i−1) xi

α(i) = Z1(i)

Z2(i)

and z(i)
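Under this reconstruction, one filtering step can be sketched in plain Python. The names st_pdf, st_cdf, and adf_step are ours; the CDF uses a crude trapezoidal quadrature, and the Σ̃ update in particular follows our reading of the equations above, so it should be checked against the derivation in Appendix B.14:

```python
import math

def st_pdf(z, v, sigma2=1.0):
    """Density at z of a univariate Student's t with mean 0, scale sigma2, dof v."""
    c = math.gamma((v + 1.0) / 2.0) / (math.gamma(v / 2.0) * math.sqrt(v * math.pi * sigma2))
    return c * (1.0 + z * z / (v * sigma2)) ** (-(v + 1.0) / 2.0)

def st_cdf(z, v, sigma2=1.0, n=4000, lo=-40.0):
    """CDF by trapezoidal integration; crude but adequate for a sketch."""
    if z <= lo:
        return 0.0
    h = (z - lo) / n
    s = 0.5 * (st_pdf(lo, v, sigma2) + st_pdf(z, v, sigma2))
    s += sum(st_pdf(lo + i * h, v, sigma2) for i in range(1, n))
    return s * h

def adf_step(mu, Sigma, x, y, v, eps, t):
    """One assumed-density-filtering update with a Student's t posterior."""
    k = len(mu)
    Sx = [sum(Sigma[a][b] * x[b] for b in range(k)) for a in range(k)]   # Sigma x
    xSx = sum(x[a] * Sx[a] for a in range(k))                            # x' Sigma x
    z = y * sum(x[a] * mu[a] for a in range(k)) / math.sqrt(xSx)         # z_(i)
    c = (1.0 - eps) ** t - eps ** t
    Z1 = eps ** t + c * st_cdf(z, v)                                     # Z1_(i)
    Z2 = eps ** t + c * st_cdf(z, v + 2.0, sigma2=v / (v + 2.0))         # Z2_(i)
    alpha = c * st_pdf(z, v) / (Z2 * math.sqrt(xSx))
    r = Z1 / Z2
    mu_new = [mu[a] + alpha * y * Sx[a] for a in range(k)]
    # Rank-one covariance correction -- our reading of the garbled original.
    beta = alpha * y * sum(x[a] * mu_new[a] for a in range(k)) / xSx
    Sigma_new = [[r * Sigma[a][b] - beta * Sx[a] * Sx[b] for b in range(k)]
                 for a in range(k)]
    return mu_new, Sigma_new
```

For a k-dimensional w, the text sets t = 1 + 2/(v + k); e.g. v = 3 and k = 2 give t = 1.4.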

Synthetic Results  In classical binary classification problems, it is assumed that the underlying true classifier is fixed. However, in an online learning problem, the underlying classifier may change from time to time. In such a scenario, we require the learning algorithm to relearn the classifier quickly. The Student's t-distribution is a more conservative prior than the Gaussian distribution because of its heavy-tailed nature. As we will see on the following synthetic online dataset, the Bayesian online learning algorithm based on the Student's t-distribution is able to relearn the classifier much faster than the one based on the Gaussian distribution.

In our experiments, we generate a sequence of 4000 data points. Each data example x_i is randomly generated from a 100-dimensional isotropic Gaussian distribution N(0, I). In order to periodically change the underlying classifier, we partition the sequence evenly into 10 subsequences of length 400, and assign each subsequence a base weight parameter vector w̄_(s) ∈ {−1, +1}^100, where s ∈ {1, 2, ..., 10}. Each data point x_i is labeled as y_i = sign(⟨x_i, w_(i)⟩), where w_(i) = w̄_(s) + n_(i), s = ⌈i/400⌉, and n_(i) is generated from the uniform distribution on [−0.1, 0.1]. The base weight vector w̄_(s) is generated in two ways: (I) from U{−1, +1}^100, where U{a, b} denotes p(a) = p(b) = 0.5; (II) based on the previous base weight vector w̄_(s−1) such that, for s ∈ {2, ..., 10},

w̄_(s)^j = U{−1, +1} if j ∈ [10s − 9, 10s], and w̄_(s)^j = w̄_(s−1)^j otherwise.

We compare the Bayesian online learning algorithm with the Student's t-prior (with v = 3 and v = 10) against the one with the Gaussian prior. For both methods, we set ε = 0.01. We report the discrepancy D_(i) between the true weight vector w_(i) and the posterior mean E_p̃i[w] of p̃_i(w) at each data example i in Figure 4.4, where

D_(i) = Σ_{j=1}^{100} δ( w_(i)^j E_p̃i[w^j] > 0 ),

and the accumulated prediction error rate

E = (1/4000) Σ_{i=1}^{4000} δ( y_i ⟨x_i, E_p̃i[w]⟩ > 0 )

in Table 4.1. Here δ(·) is 0 if the condition inside (·) holds and 1 otherwise. According to Figure 4.4, the discrepancy curve is periodic due to the change of the base weight parameter every 400 data points. The discrepancy curves of the Student's t-distributions (red and green) drop much faster than the one of the Gaussian distribution (black). As a result, the accumulated error rates of the Student's t-distributions are also lower than that of the Gaussian distribution, as shown in Table 4.1.
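For concreteness, the synthetic stream described above can be generated as follows. This is a sketch: make_stream and its defaults are our naming, and the case-II window indexing is the [10s − 9, 10s] rule translated to 0-based coordinates:

```python
import random

def make_stream(n=4000, d=100, block=400, noise=0.1, case="I", seed=0):
    """Synthetic online stream: at the start of every block of `block` examples,
    the base weight vector w_bar is fully resampled (case I) or resampled only
    on a sliding 10-coordinate window (case II)."""
    rng = random.Random(seed)
    w_bar = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    data = []
    for i in range(n):
        if i > 0 and i % block == 0:
            if case == "I":
                w_bar = [rng.choice((-1.0, 1.0)) for _ in range(d)]
            else:
                start = (10 * (i // block)) % d      # 0-based window start
                for j in range(start, min(start + 10, d)):
                    w_bar[j] = rng.choice((-1.0, 1.0))
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # perturb the base vector with coordinate-wise uniform noise n_(i)
        w = [wj + rng.uniform(-noise, noise) for wj in w_bar]
        y = 1.0 if sum(a * b for a, b in zip(x, w)) >= 0.0 else -1.0
        data.append((x, y))
    return data
```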


Fig. 4.4. The discrepancy D_(i) between the true weight vector w_(i) and the posterior mean E_p̃i[w] of p̃_i(w) at each data example i on the synthetic online dataset using Bayesian online learning. Left: case I. Right: case II.

Table 4.1
The accumulated prediction error rate on the synthetic online dataset using Bayesian online learning.

           Gauss    v = 3    v = 10
Case I     0.337    0.242    0.254
Case II    0.150    0.130    0.128

4.4 Chapter Summary

In this chapter, we investigated the conjugacy of the log-partition function of the t-exponential family of distributions, and studied a new t-entropy. By minimizing the t-divergence, the Bregman divergence based on the t-entropy, we generalized two well-known approximate inference approaches for the exponential family to the t-exponential family.


5. T-CONDITIONAL RANDOM FIELDS

In classification, a classifier predicts the label of a data point without considering the other examples or labels. However, real-world data usually has underlying structure, and taking this structure into account is beneficial both for prediction and for modeling. In the exponential family, the conditional random field (CRF) is a well-known statistical modeling method which extends logistic regression by modeling the structure of the data and labels. We will first briefly review graphical models and CRFs. Then, we will introduce a new model, the t-conditional random field (t-CRF), as a generalization of the CRF. In order to perform inference in our new model, a novel mean field based approach is presented. Finally, we will give two examples which demonstrate the robustness of the t-CRF.

5.1 Conditional Random Fields

5.1.1 Undirected Graphical Models

Many real-world applications involve a large number of variables which depend on each other [45]. Examples include parsing natural language sentences, annotating images, etc. Over the past two decades, probabilistic graphical models have been used to model such dependencies [46]. Our focus in this dissertation is on undirected graphical models G = (V, E), which contain two main components:

• V: the set of graph vertices (nodes), which represent the variables;
• E: the set of graph edges, which represent the dependencies between variables.

Probably the simplest illustrative example (see Figure 5.1) is a 3-node chain consisting of three variables (z^1, z^2, z^3). There are two edges in the graph, (z^1, z^2) and (z^2, z^3), while z^1 and z^3 are not directly connected. Intuitively, the way to interpret the model is that the value of z^1 depends on z^2, while the value of z^2 depends on z^1 and z^3.

Fig. 5.1. The 3-node chain model. Each node indicates a variable. Each edge on the graph represents a dependency.

At the heart of a graphical model lies the Markov property: if a set U_1 ⊆ V is separated from U_3 by another set U_2 in G, then the joint distribution over V satisfies the conditional independence property U_1 ⊥ U_3 | U_2. In Figure 5.1, since z^1 and z^3 are separated by z^2, we can conclude that z^1 ⊥ z^3 | z^2. In other words, p(z^1 | z^2, z^3) = p(z^1 | z^2) and p(z^3 | z^2, z^1) = p(z^3 | z^2). Furthermore, the Hammersley-Clifford theorem [46] states that a probability distribution satisfies the Markov property with respect to an undirected graph if, and only if, its density can be factorized over the cliques (fully connected subgraphs) of the graph. In Figure 5.1, there are two cliques, (z^1, z^2) and (z^2, z^3). Therefore, the probability distribution factorizes as

p(z) = (1/Z) Ψ(z^1, z^2) Ψ(z^2, z^3),

where Z = ∫ Ψ(z^1, z^2) Ψ(z^2, z^3) dz is the normalization constant.

5.1.2 Conditional Random Fields

Conditional random fields [11, 12, 45], commonly abbreviated as CRFs, are graphical models which model the conditional distribution p(y | X) of the label vector y given the observed feature variables X. In order to illustrate the main idea of the CRF, we focus on the simplest 3-node chain model, slightly extending Figure 5.1 to a conditional model (see Figure 5.2). Interested readers may refer to [11, 12, 45] for more detailed discussions. The 3-node conditional chain model consists of three labels y = (y^1, y^2, y^3) as the random variables and

three observed (feature) variables X = (x^1, x^2, x^3). The observed variables are shown in red in the figure. In addition to the chain structure among the labels, each y^j is connected with a feature variable x^j ∈ R^d. For simplicity, we assume that y^j is binary and takes values in {0, 1}. Therefore, there are 2^3 = 8 possible configurations of this conditional model.

Fig. 5.2. The 3-node conditional chain model. Blue nodes indicate the labels; red nodes indicate the data variables. Each edge on the graph represents a factor.

The chain CRF in Figure 5.2 satisfies the Markov property; therefore, y^1 ⊥ y^3 | y^2, X. Furthermore, thanks to the Hammersley-Clifford theorem, the conditional distribution p(y | X) can be factorized over the cliques. Figure 5.2 contains two types of cliques, node cliques (x^j, y^j) and edge cliques (y^j, y^{j+1}); therefore,

p(y | X) = (1/Z(X)) ∏_{j=1}^{3} Ψ^v(x^j, y^j) · ∏_{j=1}^{2} Ψ^e(y^j, y^{j+1}),   (5.1)

where

Z(X) = Σ_{y ∈ {0,1}^3} ∏_{j=1}^{3} Ψ^v(x^j, y^j) · ∏_{j=1}^{2} Ψ^e(y^j, y^{j+1})   (5.2)

is a normalization constant which does not depend on y. Computing Z(X) requires summing over 2^3 = 8 different configurations of y. As the chain gets longer, the number of terms in the summation grows exponentially with the length of the chain. Therefore, we need efficient algorithms for computing Z(X). We will discuss one such algorithm in Section 5.1.4.

As for the choice of the conditional distribution, exponential family distributions have been widely used. Each clique is modeled by an exponential factor,

Ψ^v(x^j, y^j) = exp(⟨Φ^v(x^j, y^j), θ_v⟩),
Ψ^e(y^j, y^{j+1}) = exp(⟨Φ^e(y^j, y^{j+1}), θ_e⟩), and

p(y | X; θ) = (1/Z(X; θ)) ∏_{j=1}^{3} exp(⟨Φ^v(x^j, y^j), θ_v⟩) · ∏_{j=1}^{2} exp(⟨Φ^e(y^j, y^{j+1}), θ_e⟩)
            = exp( Σ_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ + Σ_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ − G(X; θ) ),   (5.3)

where G(X; θ) is the log-partition function, which is equal to log Z(X; θ). In (5.3), θ_v denotes the parameter for node cliques and θ_e denotes the parameter for edge cliques. For simplicity, we assume that

Φ^v(x^j, y^j) = (δ_0^j x^j, δ_1^j x^j),
Φ^e(y^j, y^r) = (δ_00^{jr}, δ_01^{jr}, δ_10^{jr}, δ_11^{jr}),

where δ_g^j := δ(y^j = g) = 1 if y^j = g and δ_g^j = 0 otherwise; and δ_gf^{jr} is the abbreviation of δ(y^j = g) δ(y^r = f), where f, g ∈ {0, 1}.

5.1.3 Parameter Estimation

Similar to logistic regression, the loss function of a CRF is defined as its negative log-likelihood,

l(X, y; θ) = − log p(y | X; θ)
           = − Σ_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ − Σ_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ + G(X; θ),

whose gradient with respect to θ_v and θ_e can be computed as follows:

(∂/∂θ_v) l(X, y; θ) = − Σ_{j=1}^{3} Φ^v(x^j, y^j) + Σ_{j=1}^{3} E_p[Φ^v(x^j, y^j)],
(∂/∂θ_e) l(X, y; θ) = − Σ_{j=1}^{2} Φ^e(y^j, y^{j+1}) + Σ_{j=1}^{2} E_p[Φ^e(y^j, y^{j+1})].

Here E_p denotes the expectation with respect to p(y | X; θ). Given m training examples {(X_1, y_1), ..., (X_m, y_m)}, the empirical risk is equal to the average loss,

R_emp(θ) = (1/m) Σ_{i=1}^{m} l(X_i, y_i; θ) = −(1/m) Σ_{i=1}^{m} log p(y_i | X_i; θ),

and as before, the regularized risk is given by J(θ) := λ Ω(θ) + R_emp(θ). We will use Ω(θ) = (1/2) ‖θ‖_2^2 as our regularizer. The model parameter θ is obtained by minimizing the regularized risk J(θ). Although other algorithms exist (e.g., SGD in [47]), we will use the L-BFGS algorithm.

5.1.4 Inference

The main computational issue in parameter estimation for a CRF is how to estimate p(y | X; θ) and compute Z(X). As discussed earlier, the cost of the naive computation of Z(X) based on (5.2) grows exponentially with the length of the chain, which is prohibitive for long chains. Fortunately, belief propagation [48], also known as the sum-product message passing algorithm, can be applied for efficient inference on the chain model. The basic idea of belief propagation is to take advantage of the factorization so that the sums over y are distributed and reused. For example, in the 3-node chain model,

Z(X) = Σ_{y ∈ {0,1}^3} ∏_{j=1}^{3} Ψ^v(x^j, y^j) · ∏_{j=1}^{2} Ψ^e(y^j, y^{j+1})
     = Σ_{y^3 ∈ {0,1}} Ψ^v(x^3, y^3) Σ_{y^2 ∈ {0,1}} Ψ^e(y^2, y^3) Ψ^v(x^2, y^2) Σ_{y^1 ∈ {0,1}} Ψ^e(y^1, y^2) Ψ^v(x^1, y^1).   (5.4)

By computing

α_1(y^1) = Ψ^v(x^1, y^1),
α_2(y^2) = Ψ^v(x^2, y^2) Σ_{y^1 ∈ {0,1}} Ψ^e(y^1, y^2) · α_1(y^1),
α_3(y^3) = Ψ^v(x^3, y^3) Σ_{y^2 ∈ {0,1}} Ψ^e(y^2, y^3) · α_2(y^2),

we have Z(X) = Σ_{y^3 ∈ {0,1}} α_3(y^3). Therefore, the number of summations drops from exponential (2^3) to linear (2 × 3) with respect to the chain length. The belief propagation algorithm can be further applied to any acyclic undirected graph, including trees. For more general graphical models which contain cycles, a variant called loopy belief propagation can be applied. However, loopy belief propagation may not converge, because the summations in Z(X) cannot be distributed as in (5.4). In such cases, other approximate inference methods such as mean field methods are widely applied.
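The forward pass above can be sketched for a chain of arbitrary length (the function names are ours; potentials are passed as plain nested lists, and the brute-force version is included only to illustrate the exponential-cost alternative):

```python
import itertools

def chain_partition(psi_v, psi_e):
    """Sum-product forward pass on a chain. psi_v[j][y] are node potentials,
    psi_e[j][y1][y2] are edge potentials between nodes j and j+1.
    Returns Z = sum over all label configurations of the product of potentials."""
    alpha = list(psi_v[0])                       # alpha_1(y^1) = Psi^v(x^1, y^1)
    for j in range(1, len(psi_v)):
        alpha = [psi_v[j][y2] * sum(psi_e[j - 1][y1][y2] * alpha[y1]
                                    for y1 in range(len(alpha)))
                 for y2 in range(len(psi_v[j]))]
    return sum(alpha)                            # Z(X) = sum_{y^n} alpha_n(y^n)

def brute_force_partition(psi_v, psi_e):
    """Naive enumeration over all k^n configurations, for comparison."""
    n, total = len(psi_v), 0.0
    for y in itertools.product(range(len(psi_v[0])), repeat=n):
        p = 1.0
        for j in range(n):
            p *= psi_v[j][y[j]]
        for j in range(n - 1):
            p *= psi_e[j][y[j]][y[j + 1]]
        total += p
    return total
```

Both functions agree on small chains, but chain_partition scales linearly in the chain length.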

5.2 T-Conditional Random Fields

The graphical models based on Markov properties are important and powerful for compactly representing multivariate distributions. However, they encode independence relations which are sometimes too strong. Some existing research addresses long-range dependencies by adding more edges, e.g., the skip-chain CRF [12]. However, taking into account long-range dependencies among all variables would require a complete graph, which dramatically increases the computational complexity. Furthermore, as we have already seen in previous chapters, the thin-tailed nature of the exponential family makes it potentially vulnerable to extreme outliers. In order to address these issues, we introduce the t-conditional random field (t-CRF), which replaces the exp function in the CRF with exp_t. Let us illustrate the t-CRF using the 3-node chain model in Figure 5.2, where

p(y | X; θ) = exp_t( Σ_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ + Σ_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ − G_t(X; θ) ).   (5.5)

Here, G_t(X; θ) is the log-partition function, which does not have an analytical solution for t ≠ 1. It satisfies

Σ_{y ∈ {0,1}^3} exp_t( Σ_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ + Σ_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ − G_t(X; θ) ) = 1.

We will discuss efficient computation of G_t(X; θ) in Section 5.3. Here we focus on the modeling implications. Unlike the exponential function, exp_t(a + b) ≠ exp_t(a) exp_t(b); therefore (5.5) cannot be factorized like (5.1) for a CRF. In addition, for t > 1, exp_t decays towards 0 more slowly than exp. Because of these differences between exp_t and exp, the following three properties of the t-CRF differ from a CRF:

• The Markov property and the Hammersley-Clifford theorem do not hold for the t-CRF;
• Even variables that are not adjacent to each other in the graph may have dependencies with each other;
• For t > 1, the t-CRF is more conservative¹ than the CRF.

The following example illustrates the above properties of the t-CRF.

Example 9  Consider the 3-node conditional chain model in Figure 5.2. Let x^j = 1 for j ∈ {1, 2, 3}, and θ_v = (1, −1), θ_e = (2, −2, −2, 2). We calculate the conditional probability of y^1 given y^2, y^3, X, as well as the marginal probability of y^1, for t = 1.1 and t = 1.5 in Table 5.1. To see the difference from a CRF (t = 1.0), we also include its probabilities for reference. For brevity, we use p(1|1, 1, X) to represent p(y^1 = 1 | y^2 = 1, y^3 = 1, X).

¹ By more conservative, we mean that the distribution is closer to the uniform distribution.

Table 5.1
Comparisons of p(y^1 | X) and p(y^1 | y^2, y^3, X) between the t-CRF (t = 1.1 and 1.5) and the CRF (t = 1.0) in the 3-node chain example.

               CRF      t = 1.1   t = 1.5
p(1| X)        0.9947   0.9801    0.8294
p(1|1, 1, X)   0.9975   0.9913    0.9256
p(1|0, 0, X)   0.1192   0.2316    0.3945
p(1|1, 0, X)   0.9975   0.9627    0.7466
p(1|0, 1, X)   0.1192   0.2542    0.4128

From Table 5.1, we can see that y^3 and y^1 in the t-CRF are not conditionally independent given y^2. The larger the t, the stronger the dependency. For example, when t = 1.5, p(1|1, 1, X) = 0.9256 and p(1|1, 0, X) = 0.7466; when t = 1.1, p(1|1, 1, X) = 0.9913 and p(1|1, 0, X) = 0.9627; when t = 1.0, the t-CRF reduces to the CRF and the two conditional probabilities are both equal to 0.9975. In addition, by comparing p(1| X), we can see that the marginal probability of y^1 becomes more conservative as t gets larger.
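The numbers in Table 5.1 can be reproduced by brute-force enumeration over the 2^3 configurations, solving the normalization condition for G_t by bisection (exp_t is decreasing in G, so the sum of exp_t probabilities crosses 1 exactly once). The function names are ours, and we fix the δ-ordering inside Φ^v to the one that reproduces the CRF column of the table:

```python
import itertools
import math

def exp_t(x, t):
    """t-exponential: [1 + (1-t)x]_+^(1/(1-t)); plain exp for t = 1."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        return 0.0 if t < 1.0 else float("inf")
    return base ** (1.0 / (1.0 - t))

def tcrf_joint(t, theta_v=(1.0, -1.0), theta_e=(2.0, -2.0, -2.0, 2.0),
               x=(1.0, 1.0, 1.0)):
    """Brute-force joint p(y | X) for the 3-node chain t-CRF of Example 9."""
    def energy(y):
        # node terms: delta-ordering chosen so the CRF column of Table 5.1 matches
        e = sum(theta_v[0] * x[j] if y[j] == 1 else theta_v[1] * x[j]
                for j in range(3))
        # edge terms indexed by the pair (y^j, y^{j+1})
        e += sum(theta_e[2 * y[j] + y[j + 1]] for j in range(2))
        return e
    ys = list(itertools.product((0, 1), repeat=3))
    lo, hi = -50.0, 50.0        # bisection: sum_y exp_t(E(y) - G) = 1
    for _ in range(200):
        G = 0.5 * (lo + hi)
        s = sum(exp_t(energy(y) - G, t) for y in ys)
        lo, hi = (G, hi) if s > 1.0 else (lo, G)
    return {y: exp_t(energy(y) - G, t) for y in ys}
```

Conditionals such as p(1|1, 1, X) follow by normalizing the joint over the free coordinate.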

Similar to the CRF, we define the loss function of the t-CRF as the negative log-likelihood,

l(X, y; θ) = − log p(y | X; θ),

whose gradient with respect to θ_v can be computed as

−(∂/∂θ_v) log p(y | X; θ) = Σ_{j=1}^{3} ( E_q[Φ^v(x^j, y^j)] − Φ^v(x^j, y^j) ) p(y | X; θ)^{t−1},

and the gradient with respect to θ_e is

−(∂/∂θ_e) log p(y | X; θ) = Σ_{j=1}^{2} ( E_q[Φ^e(y^j, y^{j+1})] − Φ^e(y^j, y^{j+1}) ) p(y | X; θ)^{t−1},

where q denotes the escort distribution q(y | X; θ) ∝ p(y | X; θ)^t. Similar to t-logistic regression, the gradient of the loss contains a forgetting variable ξ := p(y | X; θ)^{t−1}, which caps the influence of the low-likelihood examples.

5.3 Approximate Inference

In order to compute l(X, y; θ) and ∇_θ l(X, y; θ), we need to estimate p(y | X; θ), which requires efficient computation of G_t(X; θ). In the t-CRF, since the probability distribution is not factorizable, belief propagation is inapplicable even for the 3-node chain model. The only way to estimate p(y | X; θ) is to use approximate inference. We apply the mean field method and approximate p(y | X; θ) by p̃(y | X; θ̃), such that l(X, y; θ) and ∇_θ l(X, y; θ) are approximated as

l(X, y; θ) ≈ − log p̃(y | X; θ̃),   (5.6)

(∂/∂θ_v) l(X, y; θ) ≈ Σ_{j=1}^{3} ( E_q̃[Φ^v(x^j, y^j)] − Φ^v(x^j, y^j) ) p̃(y | X; θ̃)^{t−1},   (5.7)

(∂/∂θ_e) l(X, y; θ) ≈ Σ_{j=1}^{2} ( E_q̃[Φ^e(y^j, y^{j+1})] − Φ^e(y^j, y^{j+1}) ) p̃(y | X; θ̃)^{t−1},   (5.8)

where q̃ ∝ p̃(y | X; θ̃)^t is the escort of the approximate distribution.

The most straightforward way to approximate the conditional distribution p(y | X; θ) is by a product of univariate probability distribution functions,

p̃(y | X; θ̃) = ∏_{j=1}^{3} p̃(y^j | x^j; θ̃^j)
             = ∏_{j=1}^{3} exp_t( ⟨Φ^v(x^j, y^j), θ̃_v^j⟩ + ⟨Φ^e(y^j), θ̃_e^j⟩ − G̃_t(x^j; θ̃^j) ),

where the node feature Φ^v(x^j, y^j) = (δ_0^j x^j, δ_1^j x^j) is the same as that of the true distribution, and Φ^e(y^j) = (δ_0^j, δ_1^j) is a 2-dimensional feature reduced from the edge features Φ^e(y^j, y^r) of the true distribution. Note that p̃(y^j | x^j; θ̃^j) is the probability distribution of a univariate discrete variable, for which G̃_t(x^j; θ̃^j) can be estimated efficiently given θ̃^j using Algorithm 2 in Section 3.3.

The variational parameters θ̃ are obtained by minimizing the t-divergence D_t(p̃ ‖ p). Using the variational updates in (4.32), we update θ̃^j while fixing all other θ̃^r for r ≠ j via

⟨Φ^v(x^j, y^j), θ̃_v^j⟩ = ⟨Φ^v(x^j, y^j), θ_v⟩ ∏_{r ≠ j} p̃(E_q̃r[y] | x^r, θ̃^r)^{t−1} + const.,

⟨Φ^e(y^j), θ̃_e^j⟩ = ( Σ_{r ∈ N(j)} ⟨E_q̃r[Φ^e(y^j, y)], θ_e⟩ ) ∏_{r ≠ j} p̃(E_q̃r[y] | x^r, θ̃^r)^{t−1} + const.,

where N(j) = {z^r ∈ V | (z^j, z^r) ∈ E} denotes the neighborhood of node j, so that N(1) = {2}, N(2) = {1, 3}, N(3) = {2}; q̃_r denotes the escort distribution q(y | x^r; θ̃^r) ∝ p(y | x^r; θ̃^r)^t; and

p̃(E_q̃r[y] | x^r, θ̃^r) = exp_t( ⟨E_q̃r[Φ^v(x^r, y)], θ̃_v^r⟩ + ⟨E_q̃r[Φ^e(y)], θ̃_e^r⟩ − G̃_t(x^r; θ̃^r) ).

The algorithm iteratively sweeps through all the variational parameters θ̃^j for j ∈ {1, 2, 3} until a stationary point is reached. After obtaining all p̃(y^j | x^j; θ̃^j), we can then compute the escort distributions and plug them into (5.6), (5.7) and (5.8).

5.4 2-D T-CRF

Similar to a CRF, it is also possible to build a t-CRF from general graphs. In this section, we briefly discuss the t-CRF on a 2-D grid (see Figure 5.3), which will be used in the experiments. The conditional distribution is given by

p(y | X; θ) = exp_t( Σ_{j ∈ V} ⟨Φ^v(x^j, y^j), θ_v⟩ + Σ_{(j,r) ∈ E} ⟨Φ^e(y^j, y^r), θ_e⟩ − G_t(X; θ) ),

where, with some abuse of notation, we let V denote the set that contains all the label indices (or input variable indices), and E denote the set that contains all the edges between the labels. Φ^v(x^j, y^j) and Φ^e(y^j, y^r) are defined as in the chain model.


Fig. 5.3. A 2-D conditional model. Blue nodes indicate the labels; red nodes indicate the observed input variables.

The loss is defined as − log p(y | X; θ), and its gradients with respect to θ_v and θ_e are

−(∂/∂θ_v) log p(y | X; θ) = Σ_{j ∈ V} ( E_q[Φ^v(x^j, y^j)] − Φ^v(x^j, y^j) ) p(y | X; θ)^{t−1},

−(∂/∂θ_e) log p(y | X; θ) = Σ_{(j,r) ∈ E} ( E_q[Φ^e(y^j, y^r)] − Φ^e(y^j, y^r) ) p(y | X; θ)^{t−1},

where ξ := p(y | X; θ)^{t−1} is again the forgetting variable.

In order to perform inference, we again apply the mean field method, which approximates the conditional distribution p(y | X; θ) by

p̃(y | X; θ̃) = ∏_{j ∈ V} p̃(y^j | x^j; θ̃^j).

The variational updates for each θ̃^j are exactly the same as in the chain model. The algorithm iteratively sweeps through all the variational parameters θ̃^j for j ∈ V until a stationary point is reached.

5.5 Empirical Evaluation

In order to empirically compare the t-CRF and the CRF, we conduct the following two experiments.

Image Denoising Task  The first experiment is the image denoising task as in [47]. The objective of this task is to recover the original image (the labels) from noisy images (the input data). The original image is shown in Figure 5.5 (top left). It is a 64 × 64-pixel binary image, where each pixel y^j takes the value 0 (black) or 1 (white). The input images are created by adding random Gaussian noise to every pixel of the original image. An example input image is shown in Figure 5.5 (top right). The synthetic dataset contains 30 training images and 20 test images. We model each image by the 2-D model in Figure 5.3, where each node represents a pixel in the image: x^j is equal to the normalized grey-scale value of the pixel in the noisy image, and y^j is the label of the pixel in the original image.

Image Annotation Task  The second experiment is a man-made structure detection task as in [47, 49]. The dataset consists of images of size 384 × 256 pixels from the Corel database. Each image is divided into 24 × 16 = 384 patches, each of size 16 × 16 pixels. The whole dataset contains 108 training images and 129 test images. The objective of this task is to classify whether a patch contains a man-made structure (labeled as 1) or not (labeled as 0). We again apply the 2-D model in Figure 5.3, where each node represents a patch in the image. We use a different x^j from the one used in [47, 49], because those features yield poor test performance for a CRF². The reason might be that the extracted features are not rich enough. In order to get better results, in addition to the 14-dimensional three-scale features used in [49], we also include the edge orientation histograms (EOH) of the three scales, each of which has 36 bins.

Extreme Noise  We are especially interested in testing the robustness of the algorithms under extreme noise. Therefore, in addition to training the algorithms on the original dataset, we also generate some extreme noise: we randomly select 20% of the training images and turn all the labels of the selected images to 1.

Parameter Setting  We choose the regularization constant λ ∈ {1, 10^-2, 10^-4, 10^-6}. The choice of the parameter t is more complicated. As in the Student's t-distribution, the value of t is related not only to heavy-tailedness but also to the number of random variables k. In the Student's t-distribution, the relation between t and k is (t − 1)(v + k) = 2, where v > 0. The smaller the v, the heavier the tail of the distribution; when v → +∞, it reduces to the exponential family. Here we use a similar relation. In the image denoising task, where k = 4096, we choose t ∈ {1.0005, 1.0003, 1.0001, 1.00008, 1.00006}. In the image annotation task, where k = 384, we choose t ∈ {1.005, 1.003, 1.001, 1.0008, 1.0006}. In order to choose λ and t, we split the training set into 5 partitions for 5-fold cross validation.

Implementation and Optimization  We implement the t-CRF algorithm as well as its mean-field inference algorithm based on the UGM package, the Matlab code for undirected graphical models [50]. We use the L-BFGS method provided by the package for optimization. The model parameters θ are initialized to all zeros. We stop the L-BFGS algorithm when the change of the loss function or the norm of the gradient falls below 10^-7, or after a maximum of 1000 function evaluations. In each function evaluation, the maximum number of mean-field iterations is set to 50.

Results  The selected λ and t parameters for each task are provided in Table 5.2.

² In [47], the test error of the CRF is around 12%.

Table 5.2
Optimal parameters t and λ for the CRF and the t-CRF in the image denoising and image annotation tasks. (0%) denotes that no extreme noise is added and (20%) denotes that 20% extreme noise is added.

                          CRF         t-CRF
Image Denoising (0%)      λ = 1       λ = 10^-2, t = 1.0003
Image Denoising (20%)     λ = 10^-6   λ = 10^-6, t = 1.0003
Image Annotation (0%)     λ = 10^-6   λ = 10^-6, t = 1.0008
Image Annotation (20%)    λ = 1       λ = 10^-6, t = 1.0008

In Figure 5.4, we compare the test errors of the CRF and the t-CRF. The blue bars represent the test error on the clean dataset, and the red bars represent the test error with 20% extreme noise. Both the CRF and the t-CRF perform well on the clean dataset. However, after the extreme noise is added, the test error of the CRF increases significantly. In comparison, the t-CRF performs much better than the CRF under extreme noise. We display a test example from the image denoising task in Figure 5.5. Both the CRF and the t-CRF trained on the clean training dataset (center left/right) are able to recover the original image. However, the recovery quality of the CRF trained on the noisy dataset (bottom left) visibly deteriorates, while the quality of the t-CRF (bottom right) stays almost the same. We also show two images from the image annotation task in Figure 5.6. We can clearly see the difference between the predictions by the CRF before (1st and 3rd row, left) and after the extreme noise (2nd and 4th row, left), as the latter misclassifies the leaves and the river bank as man-made structures. On the other hand, the predictions by the t-CRF appear to be the same before (1st and 3rd row, right) and after adding extreme noise (2nd and 4th row, right). The above two experiments empirically demonstrate that the t-CRF is more robust against extreme noise than the CRF.


Fig. 5.4. Test error between t-CRF and CRF with and without extreme noise added. Left: image denoising task. Right: image annotation task.


Fig. 5.5. Image denoising task. Top row is the dataset: left is the input image; right is the true label. Middle row is the denoising result without extreme noise: left is CRF, right is t-CRF. Bottom row is the denoising result with extreme noise: left is CRF, right is t-CRF.


Fig. 5.6. Image annotation task. The first and the third rows are the annotation results without extreme noise: left is CRF, right is t-CRF. The second and the fourth rows are the annotation results with extreme noise: left is CRF, right is t-CRF.

5.6 Chapter Summary

This chapter proposed the t-conditional random field. The t-CRF abandons the Markov properties of graphical models and is able to model dependencies between variables which are not adjacent. In addition, the t-CRF is more robust than the CRF because of the heavy-tailedness of the exp_t function for t > 1. The experiments empirically validate the robustness of the t-CRF.


6. GENERALIZED T-LOGISTIC REGRESSION

The previous chapter focused on applying the t-exponential family to structured models. In this chapter, we return to the classification problem and study a generalization of t-logistic regression using the t-divergence. The generalized t-logistic regression contains a family of convex and non-convex losses, which will be investigated theoretically and empirically.

6.1 Binary Classification

To introduce our generalizations of the losses, let us begin with binary logistic regression. Logistic regression uses the conditional exponential family distribution to model a labeled example (x, y),

p(y | x; θ) = exp( (y/2) ⟨Φ(x), θ⟩ − G(x; θ) ),

and the logistic loss is equal to l(x, y; θ) = − log p(y | x; θ). Based on the probabilistic interpretation, the empirical risk of the logistic loss follows from the i.i.d. assumption on the data (see Section 2.2). However, from the perspective of Bregman divergences (see the definition in Appendix A.1), the logistic loss is also the K-L divergence between the empirical distribution δ(c = y)¹ and the conditional exponential family distribution p(c | x; θ). To see this,

D(δ(c = y) ‖ p(c | x; θ)) = Σ_{c ∈ {±1}} δ(c = y) log δ(c = y) − Σ_{c ∈ {±1}} δ(c = y) log p(c | x; θ),

where the first sum is equal to 0, so that

D(δ(c = y) ‖ p(c | x; θ)) = − log p(y | x; θ)
                          = − log exp( (y/2) ⟨Φ(x), θ⟩ − G(x; θ) )
                          = log(1 + exp(−y ⟨Φ(x), θ⟩)).

T-logistic regression generalizes logistic regression by replacing the exponential family distribution with a t-exponential family distribution. Since the t-divergence is a generalization of the K-L divergence, it is natural to consider replacing the K-L divergence with the t-divergence. This, however, abandons the i.i.d. assumption on the data. While this is somewhat controversial, in many cases it might actually capture the data generation process more accurately. Let t_1 denote the parameter of the t-divergence, and t_2 the parameter of the t-exponential family, such that

p(y | x; θ) = exp_{t_2}( (y/2) ⟨Φ(x), θ⟩ − G_{t_2}(x; θ) ).

Since the escort distribution of δ(c = y) is itself², the generalized t-logistic loss function of (x, y) is defined as

l(x, y; θ) = D_{t_1}(δ(c = y) ‖ p(c | x; θ))
           = Σ_{c ∈ {±1}} δ(c = y) log_{t_1} δ(c = y) − Σ_{c ∈ {±1}} δ(c = y) log_{t_1} p(c | x; θ),

where the first sum is again 0, so that

l(x, y; θ) = − log_{t_1} p(y | x; θ)
           = − log_{t_1} exp_{t_2}( (y/2) ⟨Φ(x), θ⟩ − G_{t_2}(x; θ) ).

¹ δ(c = y) = 1 if c = y, and δ(c = y) = 0 otherwise.
² It is easy to verify that δ(c = y)^t = δ(c = y).

Note that, when t_2 < 1, exp_{t_2}(x) is equal to 0 for x ∈ (−∞, 1/(t_2 − 1)]. Therefore, in this dissertation, we restrict our focus to the case t_1 > 0 and t_2 ≥ 1. The gradient of the generalized t-logistic loss with respect to θ is

∇_θ l(x, y; θ) = − ∇_θ log_{t_1} exp_{t_2}( (y/2) ⟨Φ(x), θ⟩ − G_{t_2}(x; θ) )
             = − ( (y/2) Φ(x) − E_q[(c/2) Φ(x)] ) p(y | x; θ)^{t_2 − t_1}
             = − ( (y/2) − (y/2) q(y | x; θ) + (y/2) q(−y | x; θ) ) Φ(x) p(y | x; θ)^{t_2 − t_1}
             = − y q(−y | x; θ) Φ(x) p(y | x; θ)^{t_2 − t_1},   (6.1)

where E_q denotes the expectation over the escort q(c | x; θ) ∝ p(c | x; θ)^{t_2}, and the last factor is the forgetting variable ξ := p(y | x; θ)^{t_2 − t_1}. When t_2 = t_1, ξ reduces to 1. When t_2 > t_1, ξ caps the influence of the examples with low likelihood. On the other hand, when t_2 < t_1, ξ boosts the influence of the examples with low likelihood.

6.2 Properties

In this section, we discuss properties of the generalized t-logistic regression for different t_1 and t_2. To this end, it is convenient to write the loss function in terms of the margin u = y ⟨Φ(x), θ⟩, such that

l(x, y; θ) = − log_{t_1} exp_{t_2}( u/2 − G_{t_2}(u) ),   (6.2)

where G_{t_2}(u) satisfies

exp_{t_2}( u/2 − G_{t_2}(u) ) + exp_{t_2}( −u/2 − G_{t_2}(u) ) = 1.

When t_1 ≥ t_2, the loss function is always convex. To see this: when t_1 = t_2, log_{t_1} and exp_{t_2} cancel out, and G_{t_2}(u) is convex; logistic regression is the special case t_1 = t_2 = 1. When t_1 > t_2, the composite function − log_{t_1} exp_{t_2} is convex and nonincreasing³. Since u/2 − G_{t_2}(u) is concave, the composition of the two functions makes the loss convex for t_1 > t_2 (see (3.10) of Section 3.2.4 in [3]).

³ It is easy to verify, because −(∂/∂x) log_{t_1} exp_{t_2}(x) = − exp_{t_2}(x)^{t_2 − t_1}.

t1 = 1 (logistic)

6

t1 = 0.7

4

t1 = 0.4 t1 = 0.1

2

0-1 loss -4

-2

0

2

4

margin

Fig. 6.1. Generalized t-logistic regression with t2 = 1 and four different t1 : t1 = 1, t1 = 0.7, t1 = 0.4, t1 = 0.1.

When t_1 < t_2, the loss function is usually not convex. We refer to the case t_1 < t_2 as a mismatch loss and the case t_1 = t_2 as a matching loss. T-logistic regression is the special mismatch loss with t_1 = 1 and t_2 > 1. Another interesting mismatch loss arises when t_2 = 1 and t_1 < 1: the conditional distribution is then an exponential family, so there is no additional cost for estimating G_{t_2}(u). We plot this loss for different t_1 in Figure 6.1; the loss bends down more as t_1 decreases. Clearly, the losses with t_1 ≥ t_2 belong to the class of Robust Loss-0 from Section 3.2.2 because of convexity. To investigate the robustness of the mismatch losses with t_1 < t_2, we need to compute lim_{u→∞} |I(u)|; the detailed derivation is provided in Appendix B.15. Although their forgetting variables in (6.1) have similar functionality, somewhat surprisingly the mismatch losses cover all three robust types, depending on the value of t_1:

• t_1 > 1: Robust Loss-0;
• t_1 = 1: Robust Loss-I;
• t_1 < 1: Robust Loss-II.

In the following experiment, we focus on comparing these three types of mismatch losses.

6.3 Multiclass Classification

The extension to multiclass classification is quite straightforward for the generalized t-logistic regression. Assume that the label y ∈ {1, . . . , C}; then the loss function is

l(x, y; θ) = −log_{t_1} exp_{t_2}( ⟨Φ(x, y), θ⟩ − G_{t_2}(x; θ) ),

where Φ(x, y) = (0, . . . , 0, Φ(x), 0, . . . , 0) places Φ(x) in the y-th of C blocks (blocks 1, . . . , y − 1 and y + 1, . . . , C are zero), and θ = (θ_1, . . . , θ_C) stacks the per-class parameters.
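The stacked feature map is easy to form in code; a minimal numpy sketch (function name is our own, and classes are zero-indexed here rather than 1, . . . , C):

```python
import numpy as np

def stacked_features(phi_x, y, num_classes):
    """Place the feature vector phi_x in the y-th of num_classes blocks,
    zeros elsewhere, so that <Phi(x, y), theta> = <phi_x, theta_y>."""
    d = phi_x.shape[0]
    out = np.zeros(num_classes * d)
    out[y * d:(y + 1) * d] = phi_x
    return out

phi = np.array([1.0, 2.0])
theta = np.arange(6, dtype=float)  # theta = (theta_1, theta_2, theta_3) stacked
# <Phi(x, y=1), theta> picks out only the second block theta_2 = (2, 3):
assert np.dot(stacked_features(phi, 1, 3), theta) == 1.0 * 2 + 2.0 * 3
```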

6.4 Empirical Evaluation

In the experiment, we used 20 binary classification datasets from Table 3.3 and 8 multiclass classification datasets from Table 3.4. We focus on comparing the following three types of mismatch losses:

• Type-0: t_1 = 2 and t_2 > 2;
• Type-I: t_1 = 1 and t_2 > 1;
• Type-II: t_1 < 1 and t_2 = 1.

For reference, we also include logistic regression and Savage loss. We use the following shorthand: 'Logistic' for logistic regression; 'Savage' for Savage loss; 'Mis-0' for the Type-0 mismatch loss with {t_1 = 2, t_2 > 2}; 'Mis-I' for the Type-I mismatch loss with {t_1 = 1, t_2 > 1}; and 'Mis-II' for the Type-II mismatch loss with {t_1 < 1, t_2 = 1}.

Experimental Setting

The experimental setting, noise models, optimization algorithm, implementation and hardware are identical to Section 3.5. The only difference is that we select the t_1 or t_2 parameter from a pool of candidates in order to extensively compare the different types of mismatch losses:

• Mis-0: t_2 ∈ {2.3, 2.6, . . . , 3.8};
• Mis-I: t_2 ∈ {1.3, 1.6, . . . , 2.8};
• Mis-II: t_1 ∈ {0.1, 0.2, . . . , 0.9}.

We also set the regularization constant to 10^{-10} so that the impact of the regularizer is small compared to that of the loss functions.

Results

From Figure D.1 to Figure D.28, we report the test error from 5-fold cross validation with all-zero initialization (top) and the test error under ten random initializations on one of the five folds (bottom). All datasets are mixed with the different noise models with ρ = 0.00 (blue), 0.05 (red), 0.10 (yellow); see the definition of ρ in Section 3.5. We find that the optimal losses of all three types bend down more as the dataset becomes noisier. For example, Table 6.1, Table 6.2 and Table 6.3 summarize, under the noise-2 model, the number of binary datasets for which each value of t_2 (for Mis-0 and Mis-I) or t_1 (for Mis-II) is optimal based on cross validation. As ρ increases, t_2 tends to be larger for both Mis-0 and Mis-I, while t_1 for Mis-II tends to be smaller. For example, t_2 = 3.8 for Mis-0 is optimal on 7 datasets for ρ = 0.00 but on 11 datasets for ρ = 0.05; t_2 = 2.5 for Mis-I is optimal on 6 datasets for ρ = 0.00 but on 13 datasets for ρ = 0.10; and t_1 = 0.2 for Mis-II is optimal on 4 datasets for ρ = 0.00 but on 7 datasets for ρ = 0.05. To quickly compare the generalization performance of the different types of mismatch losses, Table 6.4 and Table 6.5 summarize the number of datasets on which the test errors differ significantly under the noise-2 model; the comparisons under the noise-1 and noise-3 models are similar. When the model parameter is initialized to all-zero, Mis-II appears to be the most robust while Mis-0 appears to be the

Table 6.1
The number of binary datasets for which each value of t_2 for the Mis-0 loss is optimal based on cross validation. The total number of datasets is 20.

t_2        2.3  2.6  2.9  3.2  3.5  3.8
ρ = 0.00    6    1    2    1    3    7
ρ = 0.05    3    1    1    0    4   11
ρ = 0.10    2    0    2    1    4   11

Table 6.2
The number of binary datasets for which each value of t_2 for the Mis-I loss is optimal based on cross validation. The total number of datasets is 20.

t_2        1.3  1.6  1.9  2.2  2.5  2.8
ρ = 0.00    7    5    0    2    6    0
ρ = 0.05    3    7    0    0   10    0
ρ = 0.10    2    4    0    1   13    0

Table 6.3
The number of binary datasets for which each value of t_1 for the Mis-II loss is optimal based on cross validation. The total number of datasets is 20.

t_1        0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
ρ = 0.00    3    4    0    3    1    3    0    3    3
ρ = 0.05    3    7    0    3    4    1    0    1    1
ρ = 0.10    4    7    0    3    4    0    0    1    1

least robust. This result is consistent with the robustness type to which each of the losses belongs. Furthermore, although Mis-0 and logistic regression both belong to the type-0 robust

Table 6.4
The number of binary classification datasets on which the test error between mismatch losses differs significantly. Each column is the number of datasets on which one type of mismatch loss achieves significantly lower test error than another (the six columns correspond to the six ordered pairs of loss types). The total number of datasets is 20.

ρ = 0.00    0    0    8    1    9    3
ρ = 0.05    0    0   12    0   12    9
ρ = 0.10    0    0   14    0   14   12

Table 6.5
The number of multiclass classification datasets on which the test error between mismatch losses differs significantly. Each column is the number of datasets on which one type of mismatch loss achieves significantly lower test error than another (the six columns correspond to the six ordered pairs of loss types). The total number of datasets is 8.

ρ = 0.00    0    0    6    1    6    3
ρ = 0.05    0    0    6    0    7    5
ρ = 0.10    0    0    6    0    7    5

loss, Mis-0 appears to be more robust than logistic regression on most datasets because of its non-convexity. When the parameter is randomly initialized, on the other hand, the Mis-II loss becomes unstable on around half of the datasets. This phenomenon is very similar to Savage loss, which also belongs to the type-II robust losses. In contrast, the solutions of Mis-0 and Mis-I are very stable under random initialization.

6.5 Chapter Summary

In this chapter, we combined the t-divergence with t-logistic regression and proposed the generalized t-logistic regression for classification. By choosing different t_1 and t_2, we can obtain both convex and nonconvex losses, covering all three types of robustness. We empirically evaluated these losses on various binary and multiclass datasets.

7. SUMMARY

We conclude this dissertation with a summary of contributions and a discussion of future work.

7.1 Contributions

This dissertation is devoted to designing robust probabilistic models in machine learning based on the t-exponential family of distributions. Below we list and detail our main contributions.

Classification Using the T-Exponential Family  Our first contribution is to apply the t-exponential family in a probabilistic model for classification. Since the algorithm is based on the same probabilistic framework as logistic regression, we call it t-logistic regression. The algorithm is implemented using PETSc and TAO for efficient parallel computing. We tested our algorithm on a variety of publicly available datasets, which demonstrates the robustness and stability of the algorithm.

T-Entropy, T-Divergence, and Approximate Inference  Our second contribution is a new t-entropy and t-divergence. The t-entropy is an important concept because it is the Fenchel conjugate of the log-partition function of the t-exponential family. The t-divergence is the Bregman divergence based on the t-entropy. We further show that the t-divergence can be used to perform efficient approximate inference in multivariate t-exponential family distributions.

Graphical Models in the T-Exponential Family  Our third contribution is a generalization of conditional random fields (CRFs) using the t-exponential family. The new t-CRF appears to be more robust than the exponential-family-based CRF, and is able to capture interactions among nonadjacent nodes in a graphical model. The inference is based on the mean field method, which minimizes the t-divergence between the approximate and the true distribution.

Classification Using the T-Exponential Family and T-Divergence  Our fourth contribution is to further generalize t-logistic regression by replacing the K-L divergence with the t-divergence. This yields a larger family of loss functions for classification, including losses with different types of robustness.

7.2 Future Work

We list some potential future work in this section.

More Insights into Local Minima  We have theoretically shown that all non-convex losses may get stuck in local minima on some adversarial datasets. However, in our experiments we find that certain loss functions, such as t-logistic regression, are quite stable against random initialization. This may be because t-logistic regression creates far fewer local minima, or because the problematic local minima do not appear in real-world data. It would be interesting to characterize this phenomenon theoretically.

Boosting  Maximum entropy (maxent) and maximum likelihood estimation are dual problems [51]. Maximum likelihood models work with distributions which need to be normalized. In contrast, one can drop the normalization constraint from maxent problems and derive novel algorithms. Although this does not lead to probabilistic models, one can sidestep the computation of the log-partition function, which can be advantageous in some cases. As [51] show, dualizing classical maxent after dropping the normalization constraints yields AdaBoost. Similarly, a t-entropy based boosting algorithm can be derived and investigated.

φ-Exponential Family  One can further generalize the algorithms and theorems proposed in this dissertation to the φ-exponential family [8, 9]. However, one needs to find φ-functions which yield interesting and useful properties. It would also be interesting to investigate the physical meaning of the entropy and divergence proposed in this dissertation.

LIST OF REFERENCES

[1] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le, "Bundle methods for regularized risk minimization," Journal of Machine Learning Research, vol. 11, pp. 311–365, January 2010.
[2] S. Ben-David, N. Eiron, and P. M. Long, "On the difficulty of approximately maximizing agreements," Journal of Computer and System Sciences, vol. 66, no. 3, pp. 496–514, 2003.
[3] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, England: Cambridge University Press, 2004.
[4] P. Long and R. Servedio, "Random classification noise defeats all convex potential boosters," Machine Learning Journal, vol. 78, no. 3, pp. 287–304, 2010.
[5] N. Manwani and P. S. Sastry, "Noise tolerance under risk minimization," 2012. [http://arxiv.org/pdf/1109.5231].
[6] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Report 649, UC Berkeley, Department of Statistics, September 2003.
[7] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[8] J. Naudts, "Deformed exponentials and logarithms in generalized thermostatistics," Physica A, vol. 316, pp. 323–334, 2002. [http://arxiv.org/pdf/cond-mat/0203489].
[9] J. Naudts, "Estimators, escort probabilities, and φ-exponential families in statistical physics," Journal of Inequalities in Pure and Applied Mathematics, vol. 5, no. 4, 2004.
[10] C. Tsallis, "Possible generalization of Boltzmann–Gibbs statistics," Journal of Statistical Physics, vol. 52, pp. 479–487, 1988.
[11] J. D. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: probabilistic models for segmenting and labeling sequence data," in Proceedings of the International Conference on Machine Learning, vol. 18, (San Francisco, CA), pp. 282–289, Morgan Kaufmann, 2001.
[12] C. Sutton and A. McCallum, "An introduction to conditional random fields for relational learning," Introduction to Statistical Relational Learning, 2006.
[13] O. E. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory. New York: John Wiley and Sons, 1978.
[14] P. Grünwald and A. Dawid, "Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory," Annals of Statistics, vol. 32, no. 4, pp. 1367–1433, 2004.
[15] C. R. Shalizi, "Maximum likelihood estimation for q-exponential (Tsallis) distributions," 2007. [http://arxiv.org/abs/math.ST/0701854].
[16] T. D. Sears, Generalized Maximum Entropy, Convexity, and Machine Learning. PhD thesis, Australian National University, 2008.
[17] A. Sousa and C. Tsallis, "Student's t- and r-distributions: unified derivation from an entropic variational principle," Physica A, vol. 236, pp. 52–57, 1994.
[18] C. Tsallis, R. S. Mendes, and A. R. Plastino, "The role of constraints within generalized nonextensive statistics," Physica A: Statistical and Theoretical Physics, vol. 261, pp. 534–554, 1998.
[19] R. T. Rockafellar, Convex Analysis, vol. 28 of Princeton Mathematics Series. Princeton, NJ: Princeton University Press, 1970.
[20] J. S. Rosenthal, A First Look at Rigorous Probability Theory. World Scientific Publishing, 2006.
[21] M. Gell-Mann and C. Tsallis, eds., Nonextensive Entropy. Santa Fe Institute Studies in the Sciences of Complexity, Oxford, 2004.
[22] A. Zellner, "Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms," Journal of the American Statistical Association, vol. 71, no. 354, pp. 400–405, 1976.
[23] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds," Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
[24] H. Masnadi-Shirazi, N. Vasconcelos, and V. Mahadevan, "On the design of robust classifiers for computer vision," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[25] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions. New York: Wiley, 1986.
[26] A. O'Hagan, "On outlier rejection phenomena in Bayes inference," Journal of the Royal Statistical Society, vol. 41, no. 3, pp. 358–367, 1979.
[27] N. Ding and S. V. N. Vishwanathan, "t-Logistic regression," in Advances in Neural Information Processing Systems 23, 2010.
[28] A. Tewari and P. L. Bartlett, "On the consistency of multiclass classification methods," Journal of Machine Learning Research, vol. 8, pp. 1007–1025, 2007.
[29] T. Kuno, Y. Yajima, and H. Konno, "An outer approximation method for minimizing the product of several convex functions on a convex set," Journal of Global Optimization, vol. 3, pp. 325–335, September 1993.
[30] C. J. Hsieh, K. W. Chang, C. J. Lin, S. S. Keerthi, and S. Sundararajan, "A dual coordinate descent method for large-scale linear SVM," in Proceedings of the International Conference on Machine Learning, pp. 408–415, 2008.
[31] C. J. Merz and P. M. Murphy, "UCI repository of machine learning databases," 1998. Irvine, CA: University of California, Department of Information and Computer Science.
[32] V. Franc and S. Sonnenburg, "Optimized cutting plane algorithm for support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 320–327, 2008.
[33] S. Sonnenburg, V. Franc, E. Yom-Tov, and M. Sebag, "Pascal large scale learning challenge," 2008. [http://largescale.ml.tu-berlin.de/workshop/].
[34] D. Mease and A. Wyner, "Evidence contrary to the statistical view of boosting," Journal of Machine Learning Research, vol. 9, pp. 131–156, February 2008.
[35] X. Zhang, A. Saha, and S. V. N. Vishwanathan, "Smoothing multivariate performance measures," Journal of Machine Learning Research, vol. 13, pp. 3589–3646, 2013.
[36] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman & Hall, 1995.
[37] T. Minka, Expectation Propagation for Approximative Bayesian Inference. PhD thesis, MIT Media Labs, Cambridge, USA, 2001.
[38] X. Boyen and D. Koller, "Tractable inference for complex stochastic processes," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1998.
[39] M. Opper, "A Bayesian approach to online learning," in Online Learning in Neural Networks, pp. 363–378, Cambridge University Press, 1998.
[40] A. Rényi, "On measures of information and entropy," in Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, pp. 547–561, 1960.
[41] J. D. Lafferty, "Additive models, boosting, and inference for generalized divergences," in Proceedings of the Annual Conference on Computational Learning Theory, vol. 12, pp. 125–133, ACM Press, New York, NY, 1999.
[42] T. Minka, "Divergence measures and message passing," Report 173, Microsoft Research, 2005.
[43] I. Csiszár, "Information type measures of differences of probability distributions and indirect observations," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299–318, 1967.
[44] J. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms, I and II, vols. 305 and 306. Springer-Verlag, 1996.
[45] C. Sutton and A. McCallum, "An introduction to conditional random fields," Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373, 2011.
[46] M. Meila, "Lecture 3: Graphical models of conditional independence," STAT 535 Statistical Learning: Modeling, Prediction and Computing, 2011.
[47] S. V. N. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy, "Accelerated training of conditional random fields with stochastic gradient methods," in Proceedings of the International Conference on Machine Learning, (New York, NY, USA), pp. 969–976, ACM Press, 2006.
[48] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[49] S. Kumar and M. Hebert, "Man-made structure detection in natural images using a causal multiscale random field," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[50] M. Schmidt, "UGM: Matlab code for undirected graphical models," 2007. [http://www.di.ens.fr/~mschmidt/Software/UGM.html].
[51] G. Lebanon and J. Lafferty, "Boosting and maximum likelihood for exponential models," in Advances in Neural Information Processing Systems 14 (T. G. Dietterich, S. Becker, and Z. Ghahramani, eds.), MIT Press, 2001.
[52] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming, vol. 45, no. 3, pp. 503–528, 1989.

APPENDICES

Appendix A: Fundamentals of Convex Optimization

A.1 Convex Analysis

In this section, we review some concepts and properties from convex analysis. All definitions and most properties can be found in [3, 44].

Definition A.1 (Convex set) A set C ⊆ R^d is convex if for any two points x_1, x_2 ∈ C and any λ ∈ (0, 1), we have λ x_1 + (1 − λ) x_2 ∈ C. In other words, the line segment between any two points of C must lie in C.

Definition A.2 (Open set) A set C ⊆ R^d is open if for any point x ∈ C there exists an ε > 0 such that z ∈ C for all z with ‖z − x‖ < ε. In other words, there is an ε-ball around x, B_ε(x) := {z : ‖z − x‖ < ε}, which is contained in C.

Definition A.3 (Convex function) Given a convex set C, a function f : C → R is convex if for any two points x_1, x_2 ∈ C and any λ ∈ (0, 1), we have f(λ x_1 + (1 − λ) x_2) ≤ λ f(x_1) + (1 − λ) f(x_2). In general, we can define a generalized function f : R^d → R̄ = R ∪ {+∞} such that f(x) = +∞ for all x ∉ C, and we call C, on which f is finite, the domain of f: dom f := {x ∈ R^d : f(x) < ∞}.

Definition A.4 (Subgradient and subdifferential) Given a function f and a point x with f(x) < ∞, a vector u is called a subgradient of f at x if

f(y) − f(x) ≥ ⟨y − x, u⟩  for all y ∈ R^d.

The set of all such u is called the subdifferential of f at x, denoted ∂f(x). The function f is convex iff ∂f(x) is nonempty for all x with f(x) < ∞. If f is moreover differentiable at x, then ∂f(x) is a singleton comprising the gradient of f at x, ∇f(x).

Definition A.5 (Bregman divergence) Let F : Ω → R be a continuously differentiable, strictly convex function defined on a closed convex set Ω. The Bregman divergence associated with F for points p, q ∈ Ω is the difference between the value of F at p and the value of the first-order Taylor expansion of F around q evaluated at p:

D_F(p, q) = F(p) − F(q) − ⟨∇F(q), p − q⟩.

Definition A.6 (Fenchel–Legendre conjugate) Given a function f : R^d → R̄, its Fenchel dual is defined as

f*(µ) := sup_{x ∈ R^d} { ⟨x, µ⟩ − f(x) }.

Example 10 (Relative entropy) Suppose

f(x) = Σ_{i=1}^d x_i ln( x_i / (1/d) )  if x ∈ Δ_d, and +∞ otherwise,

where Δ_d is the d-dimensional simplex {x ∈ [0, 1]^d : Σ_i x_i = 1}. Then

f*(µ) = ln( (1/d) Σ_{i=1}^d exp(µ_i) ).

Theorem A.1 (Dual connection) f*, as a supremum of linear functions, is always convex and closed. If f is convex and closed, then f(x) + f*(µ) − ⟨x, µ⟩ ≥ 0, and equality is attained iff µ ∈ ∂f(x) iff x ∈ ∂f*(µ). Furthermore, f** = f.
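Example 10 can be sanity-checked numerically by maximizing ⟨w, µ⟩ − f(w) over a grid on the 2-simplex and comparing against the closed-form conjugate (the grid resolution and test point below are chosen arbitrarily):

```python
import math

def f(w, d):
    """Relative entropy to the uniform distribution: sum_i w_i ln(w_i / (1/d))."""
    return sum(wi * math.log(wi * d) for wi in w if wi > 0)

def f_star(mu, d):
    """Claimed conjugate: ln((1/d) * sum_i exp(mu_i))."""
    return math.log(sum(math.exp(m) for m in mu) / d)

# Check f*(mu) = sup_w <w, mu> - f(w) on the 2-simplex, parameterized by p = w_1.
mu = [0.3, -1.2]
sup = max(p * mu[0] + (1 - p) * mu[1] - f([p, 1 - p], 2)
          for p in [i / 10000 for i in range(1, 10000)])
assert abs(sup - f_star(mu, 2)) < 1e-4
```

The supremum is attained at w_i ∝ exp(µ_i), which is the Fenchel–Young equality case of Theorem A.1.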

A.2 Numerical Optimization

In this section, we review two widely used numerical optimization algorithms. The limited-memory BFGS method is the most popular quasi-Newton algorithm for large-scale optimization; coordinate descent methods also have a wide range of applications, including the expectation-maximization algorithm in statistics.

Limited-Memory BFGS Method

The BFGS method is named for its discoverers Broyden, Fletcher, Goldfarb, and Shanno. Like other quasi-Newton methods, its basic idea is to estimate the (inverse) Hessian matrix from the changes in gradients. Denote the k-th iterate by x_k and the function value by f_k, and let the change in the k-th iteration be s_k = x_{k+1} − x_k and the change in gradients be y_k = ∇f_{k+1} − ∇f_k. The approximate inverse Hessian H_{k+1} is obtained by solving

min_H ‖H − H_k‖_W  s.t.  H = Hᵀ,  H y_k = s_k.

Here ‖A‖_W = ‖W^{1/2} A W^{1/2}‖_F, where ‖·‖_F denotes the Frobenius norm and W is any matrix satisfying W s_k = y_k. The update for H_{k+1} turns out to be

H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ,  where ρ_k = 1/(y_kᵀ s_k).

The limited-memory BFGS (L-BFGS) algorithm is a limited-memory variant of the BFGS method. Unlike the original BFGS method, which stores the dense matrix H_k, L-BFGS only stores a few vectors that represent the approximation implicitly. There are multiple published approaches to using a history of updates to form the direction vector; the interested reader may refer to [52] for details.

Coordinate Descent Method

Coordinate descent methods search for a descent direction over a subset of the coordinates in each iteration in order to find the optimum. In the simplest version, one performs a line search along a single coordinate direction at the current point in each iteration, cycling through the coordinate directions over the course of the procedure. For example, in the n-th iteration, the j-th coordinate is updated by

x_{n+1}^j = argmin_{y ∈ R} f( x_n^1, . . . , x_n^{j−1}, y, x_n^{j+1}, . . . , x_n^d ),

with the other coordinates left unchanged. If one searches for a descent direction over more than one coordinate at a time, the method is sometimes called block coordinate descent. For example, the expectation-maximization (EM) algorithm can be viewed as a block coordinate descent method. In addition, the ζ-θ algorithm proposed to solve the convex multiplicative programming problem is also based on block coordinate descent.
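The cyclic variant is easy to state in code; the sketch below (a quadratic objective with exact per-coordinate minimization, names our own) converges to the least-squares solution:

```python
import numpy as np

def coordinate_descent(A, b, iters=100):
    """Cyclic coordinate descent on f(x) = 0.5 x^T A x - b^T x (A positive definite).
    Minimizing over coordinate j with the rest fixed gives
    x_j = (b_j - sum_{k != j} A_{jk} x_k) / A_{jj}."""
    x = np.zeros(len(b))
    for _ in range(iters):
        for j in range(len(b)):
            # add A_{jj} x_j back so the sum excludes the j-th coordinate
            x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = coordinate_descent(A, b)
assert np.allclose(x, np.linalg.solve(A, b), atol=1e-8)
```

For a symmetric positive-definite A this is exactly the Gauss–Seidel iteration, which is guaranteed to converge.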

Appendix B: Technical Proofs and Verifications

B.1 Proof of Theorem 2.3.1

Proof  Since the covariance matrix is positive semi-definite, showing that ∇²G(θ) = Var[Φ(z)] automatically implies that G is convex. To show (2.15), use the regularity condition and expand

∇_θ G(θ) = ( ∫ Φ(z) exp⟨Φ(z), θ⟩ dz ) / ( ∫ exp⟨Φ(z), θ⟩ dz ) = ∫ Φ(z) p(z; θ) dz = E[Φ(z)].   (B.1)

Next, take the second derivative, use (B.1), and the definition of the variance to write

∇²_θ G(θ) = ∫ Φ(z) [Φ(z) − ∇G(θ)]ᵀ p(z; θ) dz = E[Φ(z)Φ(z)ᵀ] − E[Φ(z)] E[Φ(z)]ᵀ = Var[Φ(z)].

B.2 Proof of Theorem 2.3.2

Proof  Because it is unclear how to compute the second derivative of G_φ, we cannot follow the same route as in Theorem 2.3.1 to prove convexity; the proof instead relies on more elementary arguments. Recall that exp_φ is an increasing and strictly convex function. Choose θ_1 and θ_2 such that G_φ(θ_i) < ∞ for i = 1, 2, let α ∈ (0, 1), and set θ_α = α θ_1 + (1 − α) θ_2. By the convexity of exp_φ,

∫ exp_φ( ⟨Φ(z), θ_α⟩ − α G_φ(θ_1) − (1 − α) G_φ(θ_2) ) dz
 ≤ α ∫ exp_φ( ⟨Φ(z), θ_1⟩ − G_φ(θ_1) ) dz + (1 − α) ∫ exp_φ( ⟨Φ(z), θ_2⟩ − G_φ(θ_2) ) dz = 1.

On the other hand, we also have

∫ exp_φ( ⟨Φ(z), θ_α⟩ − G_φ(θ_α) ) dz = 1.

Again using the fact that exp_φ is an increasing function, we can conclude from the above two equations that G_φ(θ_α) ≤ α G_φ(θ_1) + (1 − α) G_φ(θ_2). This shows that G_φ is a convex function.

We now show (2.17), using (2.16) and (2.13) combined with the fact that (d/du) exp_φ(u) = φ(exp_φ(u)):

0 = ∫ ∇_θ p(z; θ) dz = ∫ ∇_θ exp_φ( ⟨Φ(z), θ⟩ − G_φ(θ) ) dz   (by (2.16))
 = ∫ φ( exp_φ( ⟨Φ(z), θ⟩ − G_φ(θ) ) ) ( Φ(z) − ∇_θ G_φ(θ) ) dz
 = ( ∫ φ(p(z; θ)) dz ) ∫ q(z; θ) ( Φ(z) − ∇_θ G_φ(θ) ) dz   (by (2.13))
 = ( ∫ φ(p(z; θ)) dz ) ( E_{q(z;θ)}[Φ(z)] − ∇_θ G_φ(θ) ).

B.3 Proof of Lemma 3.2.1

Proof  For simplicity, assume ‖θ‖ = 1. Then

u = y⟨Φ(x), θ⟩ = y ‖Φ(x)‖ cos ψ  ⟹  |u| / |cos ψ| = ‖Φ(x)‖.

Therefore,

‖∇_θ l(x, y, θ)‖ = ‖l′(u) y Φ(x)‖ = | l′(u) u / cos ψ | = |I(u)| / |cos ψ|.

Since I(u) is bounded, say |I(u)| ≤ C, we have

P( ‖∇_θ l(x, y, θ)‖ → ∞ ) = P( |I(u)| / |cos ψ| → ∞ ) ≤ P( C / |cos ψ| → ∞ ) = P( cos ψ = 0 ).

Because there is no point mass at ψ = π/2, P(cos ψ = 0) = P(ψ = π/2) = 0.

B.4 Proof of Lemma 3.2.2

Proof  Using the results in Section B.3, if ψ ≠ π/2, then as ‖Φ(x)‖ → ∞, u = y ‖Φ(x)‖ cos ψ → ∞. Furthermore, since lim_{u→∞} I(u) = 0, we have

lim_{‖Φ(x)‖→∞} ‖∇_θ l(x, y, θ)‖ = lim_{u→∞} |I(u)| / |cos ψ| = 0.

Because there is no point mass at ψ = π/2, we conclude that

P( lim_{‖Φ(x)‖→∞} ‖∇_θ l(x, y, θ)‖ = 0 ) = 1,

if lim_{u→∞} I(u) = 0.

B.5 Proof of Theorem 3.2.3

Proof  Since l(u) is smooth around u = 0, for any given ε there exists δ such that

l′(u) < l′(0) + ε/2  for u ∈ (−δ, δ).   (B.2)

Define U = max{−u_1, u_2}. We construct a set of data points x = {x_1, . . . , x_{n+1}} with labels y_i = 1, where x_1 = ⋯ = x_n = 1 and

x_{n+1} = −U/δ,  n = −( (l′(0) + ε/2) / l′(0) ) x_{n+1}.

The gradient of the empirical risk is

∇R_emp(θ) = (d/dθ) [ Σ_{i=1}^n l(θ x_i) + l(θ x_{n+1}) ]
 = n ( l′(θ) + (x_{n+1}/n) l′(θ x_{n+1}) )
 = n ( l′(θ) − ( l′(0) / (l′(0) + ε/2) ) l′( −Uθ/δ ) ).   (B.3)

Now let us examine the gradient at the three points 0, δu_1/U, and δu_2/U:

∇R_emp(0) = n ( l′(0) − ( l′(0) / (l′(0) + ε/2) ) l′(0) ) = n ε l′(0) / (2 l′(0) + ε) > 0,
∇R_emp(δu_1/U) = n ( l′(δu_1/U) − ( l′(0) / (l′(0) + ε/2) ) l′(u_1) ) < 0,   (B.4)
∇R_emp(δu_2/U) = n ( l′(δu_2/U) − ( l′(0) / (l′(0) + ε/2) ) l′(u_2) ) < 0,   (B.5)

where the inequalities in (B.4) and (B.5) follow from (B.2), the definition of U, and the assumptions on u_1 and u_2 in the theorem statement. Therefore, there are at least two points θ between δu_1/U and δu_2/U with ∇R_emp(θ) = 0. Since ∇R_emp(δu_1/U) < 0 and ∇R_emp(0) > 0, one local minimum lies in (δu_1/U, 0). On the other hand, since ∇R_emp(δu_2/U) < 0 and the function is bounded below, another local minimum lies in (δu_2/U, +∞).

B.6 Proof of Theorem 3.4.1

Proof  Since the objective function is bounded below by 0, we can prove convergence by showing that the algorithm decreases monotonically. In the k-th ζ-step, with current variables θ^(k−1) and ζ^(k−1), we fix θ^(k−1), denote l̃_i = l_i(θ^(k−1)), and minimize over ζ. It turns out that

ζ_i^(k) = (1/l̃_i) Π_{j=1}^m l̃_j^{1/m}.

Therefore,

MP(θ^(k−1), ζ^(k)) = min_ζ MP(θ^(k−1), ζ) = m P(θ^(k−1))^{1/m} ≤ MP(θ^(k−1), ζ^(k−1)).

The θ-step fixes ζ^(k) and minimizes over θ, which yields

MP(θ^(k), ζ^(k)) = min_θ MP(θ, ζ^(k)) ≤ MP(θ^(k−1), ζ^(k)) = m P(θ^(k−1))^{1/m}.

The above two inequalities hold with equality if and only if ζ^(k) = ζ^(k−1) and θ^(k) = θ^(k−1), in which case the algorithm has converged at the k-th iteration. Therefore, before convergence we have

MP(θ^(k), ζ^(k)) < m P(θ^(k−1))^{1/m} < MP(θ^(k−1), ζ^(k−1)).

Since P(θ) > 0, the monotonically decreasing sequence is bounded below, and the algorithm must converge at some point.

Next, we show that the converged point θ̃ is a stationary point of P(θ). Assume that (θ̃, ζ̃) is the convergence point; then the gradient in the θ-step satisfies

0 = Σ_{i=1}^m ζ̃_i (d l_i(θ)/dθ) |_{θ=θ̃} = Σ_{i=1}^m ( Π_{j=1}^m l_j(θ̃)^{1/m} / l_i(θ̃) ) (d l_i(θ)/dθ) |_{θ=θ̃}.

Since Π_{j=1}^m l_j(θ̃)^{1/m} is positive, this implies that

0 = Σ_{i=1}^m ( Π_{j=1}^m l_j(θ̃) / l_i(θ̃) ) (d l_i(θ)/dθ) |_{θ=θ̃} = (d/dθ) Π_{i=1}^m l_i(θ) |_{θ=θ̃} = (d P(θ)/dθ) |_{θ=θ̃}.

Therefore, θ̃ is a stationary point of P(θ).

B.7 Proof of Theorem 4.2.1

Proof  In view of (2.17) and (4.17), µ = E_{q(z;θ(µ))}[Φ(z)] = ∇_θ G_t(θ). We only need to consider the case when θ(µ) exists, since otherwise G*_t(µ) is trivially defined as +∞. When θ(µ) exists, clearly θ(µ) ∈ (∇G_t)^{−1}(µ). Therefore,

sup_θ { ⟨µ, θ⟩ − G_t(θ) } = sup_θ { ⟨ E_{q(z;θ(µ))}[Φ(z)], θ ⟩ − G_t(θ) }
 = ⟨ E_{q(z;θ(µ))}[Φ(z)], θ(µ) ⟩ − G_t(θ(µ))   (B.6)
 = ∫ q(z; θ(µ)) ( ⟨Φ(z), θ(µ)⟩ − G_t(θ(µ)) ) dz
 = ∫ q(z; θ(µ)) log_t p(z; θ(µ)) dz   (B.7)
 = −H_t( p(z; θ(µ)) ).

Equation (B.6) follows from the duality between θ(µ) and µ, while (B.7) holds because log_t p(z; θ(µ)) = ⟨Φ(z), θ(µ)⟩ − G_t(θ(µ)).

B.8 Verification in Section 3.1

In this section, we verify that the iterative algorithm for computing G_t converges. We only need to verify that ã^(k) converges to the ã corresponding to â. First of all, given â, since t > 1 and Z(ã) > 1, it is clear that 0 < ã < â. On the domain 0 < ã′ < â, it is easy to verify that Z(ã′)^{1−t} â − ã′ is a monotonically decreasing function of ã′ which crosses 0 only at ã. Therefore, when ã^(k) > ã we have ã^(k+1) < ã^(k), and when ã^(k) < ã we have ã^(k+1) > ã^(k).

We then prove that ã^(k) is a monotonically decreasing sequence, by mathematical induction. Since ã^(0) = â, we have ã^(1) < â = ã^(0). Next, assume that ã^(k) < ã^(k−1) in the k-th iteration. Since Z(ã^(k)) > Z(ã^(k−1)), we have ã^(k+1) < ã^(k). Therefore, ã^(k) is monotonically decreasing and bounded below by ã, so lim_{k→+∞} ã^(k) exists. Finally,

lim_{k→+∞} ã^(k) = lim_{k→+∞} ã^(k+1) = lim_{k→+∞} Z(ã^(k))^{1−t} â = Z( lim_{k→+∞} ã^(k) )^{1−t} â,   (B.8)

where (B.8) holds because Z(ã′)^{1−t} is continuous. Therefore, it follows that lim_{k→+∞} ã^(k) = ã.

B.9 Verification in Section 3.2.2

In this section, we verify the robustness types of the losses in Table 3.1.

Logistic Regression

I_l(u) = l′(u) u = −2u / (1 + exp(2u)).

As u → −∞, |I_l(u)| goes to infinity. Therefore, logistic regression belongs to Robust Loss 0. More generally, one can easily verify that all convex losses are Robust Loss 0, because lim_{u→−∞} |l′(u)| ≥ |l′(0)| > 0.

T-Logistic Regression

Define p(u) := exp_t(u − G_t(u)) and let q(u) be its escort distribution. Then

I_l(u) = l′(u) u = −2 q(−u) u · p(u)^{t−1}.

As u → −∞, q(−u) → 1 and p(u) → 0, and we have

lim_{u→−∞} I_l(u) = lim_{u→−∞} −2u · p(u)^{t−1}
 = lim_{u→−∞} −2u / ( 1 + (t−1)(G_t(u) − u) )
 = lim_{u→−∞} −2 / ( (t−1)(q(u) − q(−u) − 1) )   (B.9)
 = lim_{u→−∞} −1 / ( (t−1)(−q(−u) − 1) )
 = 1 / (2(t−1)),   (B.10)

where (B.9) follows by applying l'Hôpital's rule. As u → +∞, q(−u) → 0 and p(u) → 1, and we have

lim_{u→+∞} I_l(u) = lim_{u→+∞} −2u · p(−u)^t / ( p(−u)^t + p(u)^t )
 = lim_{u→+∞} −2u p(−u)^{t−1} · p(−u).

Similar to (B.10), we have lim_{u→+∞} −2u · p(−u)^{t−1} = 1/(2(t−1)). Furthermore, since p(−u) → 0, we conclude that lim_{u→+∞} I_l(u) = 0. Therefore, t-logistic regression belongs to Robust Loss I.

Savage Loss

The non-convex Savage loss, widely used in the neural networks community, is

l(u) = (1 − σ(u))² = σ(−u)²,  where σ(u) = 1/(1 + exp(−u)) and σ(u) + σ(−u) = 1.

Then

I_l(u) = −2u · σ(−u) σ′(−u) = −2u · σ(−u) · σ(−u) σ(u).

Since lim_{u→+∞} u σ(−u)² = 0 and lim_{u→−∞} u σ(u) = 0, we have lim_{u→±∞} |I_l(u)| = 0. Therefore, Savage loss belongs to Robust Loss II.

B.10

Verification in Section 3.3

In this section, we verify the Bayes-risk consistency property of multiclass t-logistic regression. The Bayes-risk consistency of a multiclass classification loss was first discussed in [28]. Define

â(x) = (â_1, …, â_C), where â_c(x) : X → R is the margin of x in class c;
η = (η_1, …, η_C), where η_c = p(y = c | x) is the true conditional probability of class c;
l(â) = (l_1, …, l_C), where l_c = l(â, c).

The conditional risk of the multiclass loss l can be written as

C_l(η, â) = E_{c|x}[l(â, c)] = Σ_{c=1}^C η_c l_c.

Definition B.7 A Bayes-risk consistent loss function for multiclass classification is a loss function l for which, given any η, the minimizer â* of C_l(η, â) satisfies

argmin_c l_c(â*) ⊆ argmax_c η_c.                                    (B.11)

For the t-logistic loss, we have l_c = −log exp_t(â_c − G_t(â)), and

C_l(η, â) = Σ_{c=1}^C η_c l_c = −Σ_{c=1}^C η_c log exp_t(â_c − G_t(â)).

Minimizing over â results in the â* which satisfies

η_c = exp_t(â*_c − G_t(â*)).                                        (B.12)

Because log is a monotonically increasing function,

argmin_c l_c(â*) = argmax_c η_c.

Therefore, the multiclass t-logistic loss is also Bayes-risk consistent.
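For the ordinary logistic case (t = 1) the same argument can be checked numerically: at the risk minimizer the softmax of â* recovers η, so the smallest per-class loss selects the most probable class. A minimal sketch (ours, not from the thesis):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

eta = np.array([0.2, 0.5, 0.3])    # true conditional class probabilities
a_star = np.log(eta)               # risk minimizer for t = 1: softmax(a_star) == eta
losses = -np.log(softmax(a_star))  # per-class loss l_c at the minimizer
print(losses.argmin(), eta.argmax())  # both select class 1
```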

B.11

Verification in Definition 4.2.2

In this section, we verify that Equation (4.24) is the Bregman divergence between q and q̃ based on the t-entropy. Let us define q_r(z) = q(z) + r(q̃(z) − q(z)), where r ∈ [0, 1]. Clearly, q_0(z) = q(z) and q_1(z) = q̃(z). Define p_r(z) = q_r(z)^{1/t} / ∫ q_r(z)^{1/t} dz.

First, assume the regularity condition holds, and take the derivative of H_t(q_r(z)) with respect to r:

d/dr H_t(q_r(z)) = d/dr ∫ q_r(z) log_t p_r(z) dz
= ∫ d/dr ( q_r(z) log_t p_r(z) ) dz
= ∫ (q̃(z) − q(z)) log_t p_r(z) dz + ∫ q_r(z) d/dr ( log_t p_r(z) ) dz
= ∫ (q̃(z) − q(z)) log_t p_r(z) dz + ∫ (q_r(z)/p_r(z)^t) (dp_r(z)/dr) dz
= ∫ (q̃(z) − q(z)) log_t p_r(z) dz + ( ∫ q_r(z)^{1/t} dz ) · d/dr ∫ p_r(z) dz
= ∫ (q̃(z) − q(z)) log_t p_r(z) dz,

where the second summand vanishes because ∫ p_r(z) dz = 1 for all r. The Bregman divergence between q(z) and q̃(z) based on −H_t(q) is equal to

D_t(q ‖ q̃) = −H_t(q_1(z)) + H_t(q_0(z)) − d/dr H_t(q_r(z))|_{r=0}
= ∫ ( q(z) log_t p(z) − q̃(z) log_t p̃(z) − (q(z) − q̃(z)) log_t p̃(z) ) dz
= ∫ ( q(z) log_t p(z) − q(z) log_t p̃(z) ) dz.

B.12

Verification of Equation (4.33)

Assuming there are N independent variables x = (x_1, …, x_N) with p(x) = ∏_{i=1}^N p_i(x_i), it is obvious that its escort is

q(x) = p^t(x)/Z = ∏_{i=1}^N p_i^t(x_i)/Z_i = ∏_{i=1}^N q_i(x_i),     (B.13)

which indicates Z = ∏_{i=1}^N Z_i. Now, combining with (2.7), the t-entropy of p(x) is

H_t(p(x)) = −∫ q(x) log_t p(x) dx                                    (B.14)
= −∫ (p^t(x)/Z) · (p^{1−t}(x) − 1)/(1−t) dx
= −(1/((1−t)Z)) ( 1 − ∫ p^t(x) dx ).                                 (B.15)

Using the fact that

H_t(p_i(x_i)) = −(1/((1−t)Z_i)) ( 1 − ∫ p_i^t(x_i) dx_i ),           (B.16)

we have

∫ p_i^t(x_i) dx_i = (1−t)Z_i ( H_t(p_i(x_i)) + 1/((1−t)Z_i) ).       (B.17)

Besides, since p^t(x) = ∏_{i=1}^N p_i^t(x_i), we further have

∫ p^t(x) dx = ∏_{i=1}^N ∫ p_i^t(x_i) dx_i
            = ∏_{i=1}^N (1−t)Z_i ( H_t(p_i(x_i)) + 1/((1−t)Z_i) ).   (B.18)

Now combining with (B.15) gives

H_t(p(x)) = (1−t)^{N−1} ∏_{i=1}^N ( H_t(p_i(x_i)) + 1/((1−t)Z_i) ) − (1/(1−t)) ∏_{i=1}^N (1/Z_i).   (B.19)
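Equation (B.19) can be sanity-checked numerically. The sketch below (ours, not from the thesis) uses discrete distributions, replacing the integrals by sums, with escort normalizer Z_i = Σ p_i^t:

```python
import numpy as np

def t_entropy(p, t):
    # H_t(p) = -(1/((1-t) Z)) * (1 - sum p^t), with escort normalizer Z = sum p^t
    Z = np.sum(p ** t)
    return -(1.0 - Z) / ((1.0 - t) * Z)

t = 0.7
p1 = np.array([0.2, 0.8])
p2 = np.array([0.5, 0.3, 0.2])
p_joint = np.outer(p1, p2).ravel()   # independent joint distribution

Z1, Z2 = np.sum(p1 ** t), np.sum(p2 ** t)
lhs = t_entropy(p_joint, t)
# Equation (B.19) with N = 2 factors:
rhs = ((1 - t) ** (2 - 1)
       * (t_entropy(p1, t) + 1 / ((1 - t) * Z1))
       * (t_entropy(p2, t) + 1 / ((1 - t) * Z2))
       - (1 / (1 - t)) * (1 / (Z1 * Z2)))
print(lhs, rhs)  # the two expressions agree
```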

B.13

Verification in Section 4.3.1

In this section, we provide the intermediate derivations used to obtain the mean-field updates for approximating the multivariate Student's t-distribution. The approximate distribution is

p̃(z; θ̃) = ∏_{j=1}^k p̃(z^j; θ̃^j) = ∏_{j=1}^k St(z^j; μ̃^j, σ̃^j, ṽ).

Using the representation of the t-exponential family of distributions,

p̃(z^j; θ̃^j) = exp_t( ⟨Φ^j(z^j), θ̃^j⟩ − G_t(θ̃^j) ),

with Φ^j(z^j) = [z^j; (z^j)²] and θ̃^j = [−2Ψ̃^j K̃^j μ̃^j/(1−t); Ψ̃^j K̃^j/(1−t)], where

K̃^j = ṽ^{−1}(σ̃^j)^{−2},
Ψ̃^j = ( Γ((ṽ+1)/2) / (Γ(ṽ/2)(πṽ)^{1/2} σ̃^j) )^{−2/(ṽ+1)},
G_t(θ̃^j) = −(1/(1−t)) ( Ψ̃^j K̃^j (μ̃^j)² + Ψ̃^j − 1 ).

Now we can write

⟨Φ^n(z^n), θ̃^n⟩ = (1/(1−t)) Ψ̃^n ( −2K̃^n μ̃^n z^n + K̃^n (z^n)² ),

⟨E_{q̃,j≠n}[Φ(z)], θ⟩ = (1/(1−t)) Ψ ( −2μ⊤K E_{q̃,j≠n}[z] + tr( K E_{q̃,j≠n}[z z⊤] ) )
= (1/(1−t)) Ψ ( −2μ⊤k^n z^n + 2(μ̃^{j≠n})⊤ k^{j≠n,n} z^n + k^{nn}(z^n)² ) + const.,

where μ̃^{j≠n} denotes the vector (μ̃^j)_{j=1,…,k, j≠n}, k^n denotes the n-th column of K, and k^{j≠n,n} denotes the n-th column of K after its n-th element is deleted. Using μ̃^j = E_{q̃^j}[z^j] and (σ̃^j)² = E_{q̃^j}[(z^j)²] − E_{q̃^j}[z^j]²,

exp_t( ⟨θ̃^j, E_{q̃^j}[Φ^j(z^j)]⟩ − G_t^j(θ̃^j) )^{t−1}
= exp_t( (Ψ̃^j K̃^j/(1−t)) ( −2μ̃^j E_{q̃^j}[z^j] + E_{q̃^j}[(z^j)²] ) − G_t^j(θ̃^j) )^{t−1}
= exp_t( (Ψ̃^j K̃^j/(1−t)) ( (σ̃^j)² − (μ̃^j)² ) − G_t^j(θ̃^j) )^{t−1}
= exp_t( (1/(1−t)) ( Ψ̃^j/ṽ − 1 + Ψ̃^j ) )^{t−1}
= ( Ψ̃^j/ṽ + Ψ̃^j )^{−1},

where the last step uses exp_t(x)^{t−1} = (1 + (1−t)x)^{−1}. Plugging into (4.32), the iterative updates for the Student's t-distribution are given by

μ̃^n = −(1/(2k^{nn})) ( −2μ⊤k^n + 2(μ̃^{j≠n})⊤ k^{j≠n,n} ),

(σ̃^n)² = (K̃^n Ψ̃^n)^{−(ṽ+1)/ṽ} · Γ(ṽ/2)^{2/ṽ} π^{1/ṽ} / ( Γ((ṽ+1)/2)^{2/ṽ} ṽ ),

where K̃^n Ψ̃^n = Ψ k^{nn} ∏_{j≠n} ( Ψ̃^j/ṽ + Ψ̃^j )^{−1}.

B.14

Verification in Section 4.3.2

In this section, we verify the updates of the Bayesian online learning algorithm based on the Student's t-distribution in Section 4.3.2. Assumed density filtering matches the moments

∫ q_i(w) w dw = ∫ q̃_i(w) w dw,                                      (B.20)
∫ q_i(w) w w⊤ dw = ∫ q̃_i(w) w w⊤ dw.                                (B.21)

In order to compute the moments, we first make use of

p̃_{i−1}(w) = St(w; μ̃_{(i−1)}, Σ̃_{(i−1)}, v),
q̃_{i−1}(w) = St(w; μ̃_{(i−1)}, v Σ̃_{(i−1)}/(v+2), v+2),

and get the following relations:

Z_{1(i)} = ∫ p̃_{i−1}(w) t̃_i(w) dw                                   (B.22)
         = ε^t + ((1−ε)^t − ε^t) ∫_{−∞}^{z_{(i)}} St(z; 0, 1, v) dz,  (B.23)
Z_{2(i)} = ∫ q̃_{i−1}(w) t̃_i(w) dw                                   (B.24)
         = ε^t + ((1−ε)^t − ε^t) ∫_{−∞}^{z_{(i)}} St(z; 0, v/(v+2), v+2) dz,   (B.25)

f_{(i)} = (1/Z_{2(i)}) ∇_μ Z_{1(i)} = y_i α_{(i)} x_i,               (B.26)
F_{(i)} = (1/Z_{2(i)}) ∇_Σ Z_{1(i)} = −(1/2) ( y_i α_{(i)} ⟨x_i, μ̃_{(i−1)}⟩ / (x_i⊤ Σ̃_{(i−1)} x_i) ) x_i x_i⊤,   (B.27)

where

α_{(i)} = ((1−ε)^t − ε^t) St(z_{(i)}; 0, 1, v) / ( Z_{2(i)} √(x_i⊤ Σ̃_{(i−1)} x_i) )   and   z_{(i)} = y_i ⟨x_i, μ̃_{(i−1)}⟩ / √(x_i⊤ Σ̃_{(i−1)} x_i).

Equations (B.23) and (B.25) are analogous to Eq. (5.17) in [37]. By assuming that a regularity condition[1] holds, ∫ and ∇ can be interchanged in ∇Z_{1(i)} of (B.26) and (B.27).

[1] This is a fairly standard technical requirement which is often proved using the Dominated Convergence Theorem (see e.g. Section 9.2 of [20]).

Next, by combining with (B.22) and (B.24), we obtain the expectations of q_i(w) from Z_{1(i)} and Z_{2(i)} (similar to Eq. (5.12) and (5.13) in [37]):

E_q[w] = (1/Z_{2(i)}) ∫ q̃_{i−1}(w) t̃_i(w) w dw = μ̃_{(i−1)} + Σ̃_{(i−1)} f_{(i)},   (B.28)

E_q[w w⊤] − E_q[w] E_q[w]⊤
= (1/Z_{2(i)}) ∫ q̃_{i−1}(w) t̃_i(w) w w⊤ dw − E_q[w] E_q[w]⊤
= r_{(i)} Σ̃_{(i−1)} − Σ̃_{(i−1)} ( f_{(i)} f_{(i)}⊤ − 2F_{(i)} ) Σ̃_{(i−1)},        (B.29)

where r_{(i)} = Z_{1(i)}/Z_{2(i)} and E_q[·] denotes the expectation with respect to q_i(w).

Since the mean and variance of q̃_i(w) are μ̃_{(i)} and Σ̃_{(i)}, combining with (B.26) and (B.27) we obtain

μ̃_{(i)} = E_q[w] = μ̃_{(i−1)} + α_{(i)} y_i Σ̃_{(i−1)} x_i,           (B.30)
Σ̃_{(i)} = E_q[w w⊤] − E_q[w] E_q[w]⊤                                 (B.31)
        = r_{(i)} Σ̃_{(i−1)} − (Σ̃_{(i−1)} x_i) ( α_{(i)} y_i ⟨x_i, μ̃_{(i)}⟩ / (x_i⊤ Σ̃_{(i−1)} x_i) ) (Σ̃_{(i−1)} x_i)⊤.   (B.32)
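The update pair (B.30) and (B.32) amounts to a rank-one posterior update. Below is a schematic sketch (ours, not the thesis code); it assumes α_{(i)} and r_{(i)} = Z_{1(i)}/Z_{2(i)} have already been computed from the moment integrals, and simply applies the algebra of (B.30) and (B.32):

```python
import numpy as np

def adf_step(mu, Sigma, x, y, alpha, r):
    """One online update of the posterior mean/covariance, following (B.30)-(B.32).

    alpha and r = Z1/Z2 are assumed precomputed from the moment integrals.
    """
    Sx = Sigma @ x
    mu_new = mu + alpha * y * Sx                        # (B.30)
    scale = alpha * y * float(x @ mu_new) / float(x @ Sx)
    Sigma_new = r * Sigma - scale * np.outer(Sx, Sx)    # (B.32)
    return mu_new, Sigma_new

mu, Sigma = np.zeros(3), np.eye(3)
x, y = np.array([1.0, -0.5, 2.0]), 1.0
mu1, Sigma1 = adf_step(mu, Sigma, x, y, alpha=0.3, r=0.95)
assert np.allclose(Sigma1, Sigma1.T)   # the covariance update stays symmetric
```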

B.15

Verification in Section 6.2

In this section, we verify the three types of mismatch losses via lim_{u→−∞} I(u). Let us denote p(u) = exp_{t2}(u/2 − G_{t2}(u)) and q(u) ∝ exp_{t2}(u/2 − G_{t2}(u))^{t2}, where p(u) + p(−u) = 1 and q(u) + q(−u) = 1. Using (2.17),

∂G_{t2}(u)/∂u = (q(u) − q(−u))/2 = q(u) − 1/2,

therefore

∂p(u)/∂u = exp_{t2}(u/2 − G_{t2}(u))^{t2} · ∂/∂u ( u/2 − G_{t2}(u) ) = p(u)^{t2} (1 − q(u)).

The first derivative of l(u) is equal to

l′(u) = −∂ log_{t1} p(u)/∂u = −p(u)^{−t1} ∂p(u)/∂u = −p(u)^{t2−t1} (1 − q(u)).

As u → −∞, p(u) → 0 and q(u) → 0, so that

lim_{u→−∞} I(u) = lim_{u→−∞} l′(u)u = lim_{u→−∞} −u(1 − q(u))/p(u)^{t1−t2} = lim_{u→−∞} −u/p(u)^{t1−t2}.   (B.33)

When t1 < t2, both the numerator and the denominator of (B.33) go to infinity. Therefore, we apply L'Hôpital's rule to (B.33):

lim_{u→−∞} I(u) = lim_{u→−∞} −u / p(u)^{t1−t2}
= lim_{u→−∞} −1 / ( (t1−t2) p(u)^{t1−t2−1} · p(u)^{t2} (1 − q(u)) )   (B.34)
= lim_{u→−∞} 1 / ( (t2−t1) p(u)^{t1−1} (1 − q(u)) )
= lim_{u→−∞} 1 / ( (t2−t1) p(u)^{t1−1} ),                             (B.35)

where (B.34) follows from L'Hôpital's rule. As u → +∞, p(u) → 1 and q(u) → 1. Similar to the derivations in (B.35), we have

lim_{u→+∞} I(u) = −lim_{u→+∞} q(−u)u,

and since q(−u) ∝ p(−u)^{t2} decays faster than u grows, lim_{u→+∞} I(u) = 0.
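The mismatch losses combine a log_{t1} with a t2-exponential model. As a reference point, here is a minimal sketch (ours) of the two deformed primitives, assuming the standard definitions exp_t(x) = [1 + (1−t)x]_+^{1/(1−t)} and log_t(x) = (x^{1−t} − 1)/(1−t), which are mutual inverses on the domain where exp_t is positive and reduce to exp/log as t → 1:

```python
import math

def exp_t(x, t):
    # deformed exponential; at t = 1 it is the ordinary exponential
    if abs(t - 1.0) < 1e-12:
        return math.exp(x)
    return max(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def log_t(x, t):
    # deformed logarithm; at t = 1 it is the ordinary logarithm
    if abs(t - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

for t in (0.5, 1.3, 1.9):
    for x in (0.1, 0.5, 1.0):
        # stay inside the domain 1 + (1-t)x > 0 so that exp_t is invertible
        assert abs(log_t(exp_t(x, t), t) - x) < 1e-9
```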

Based on (B.35), we classify the mismatch losses with t1 < t2 into three robust types:
• t1 > 1: Robust Loss-0;
• t1 = 1: Robust Loss-I;
• t1 < 1: Robust Loss-II.

Appendix C: Additional Figures of Section 3.5

In this chapter, we provide the additional figures from the empirical evaluation of t-logistic regression.

[Figure: three columns of panels (Noise-1, Noise-2, Noise-3) plotting Test Error (%) and Frequency for t = 1.5, logistic, and Savage.]
Fig. C.1. Experiment on adult9 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.2. Experiment on alpha Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.3. Experiment on astro-ph Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.4. Experiment on aut-avn Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.5. Experiment on beta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.6. Experiment on covertype Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.7. Experiment on delta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.8. Experiment on epsilon Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.9. Experiment on gamma Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.10. Experiment on kdd99 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.11. Experiment on kdda Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.12. Experiment on kddb Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.13. Experiment on longservedio Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.14. Experiment on measewyner Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.15. Experiment on mushrooms Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.16. Experiment on news20 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.17. Experiment on real-sim Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.18. Experiment on reuters-c11 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.19. Experiment on reuters-ccat Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.20. Experiment on web8 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.21. Experiment on webspamtrigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.22. Experiment on webspamunigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.23. Experiment on worm Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.24. Experiment on zeta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.25. Generalization Performance on dna Dataset.

Fig. C.26. Generalization Performance on ocr Dataset.

Fig. C.27. Experiment on dna Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.28. Experiment on letter Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.29. Experiment on mnist Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.30. Experiment on protein Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.31. Experiment on rcv1 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.32. Experiment on sensitacoustic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Fig. C.33. Experiment on sensitcombined Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

[Figure: test-error and frequency plots under the Noise-1, Noise-2, and Noise-3 settings, comparing the logistic, t = 1.5, and savage losses.]
Fig. C.34. Experiment on sensitseismic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

[Figure: test-error and frequency plots under the Noise-1, Noise-2, and Noise-3 settings, comparing the logistic, t = 1.5, and savage losses.]
Fig. C.35. Experiment on usps Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.

Appendix D: Additional Figures of Section 6.4

In this chapter, we provide the additional figures from the empirical evaluation of generalized t-logistic regression.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.1. Experiment on adult9 Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.2. Experiment on alpha Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.3. Experiment on astro-ph Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.4. Experiment on aut-avn Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.5. Experiment on beta Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.6. Experiment on covertype Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.7. Experiment on delta Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.8. Experiment on gamma Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.9. Experiment on kdd99 Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.10. Experiment on longservedio Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.11. Experiment on measewyner Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.12. Experiment on mushrooms Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.13. Experiment on news20 Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.14. Experiment on real-sim Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.15. Experiment on reuters-c11 Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.16. Experiment on reuters-ccat Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.17. Experiment on web8 Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.18. Experiment on webspamunigram Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.19. Experiment on worm Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.20. Experiment on zeta Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.21. Experiment on dna Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.22. Experiment on letter Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.23. Experiment on mnist Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.24. Experiment on protein Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.25. Experiment on sensitacoustic Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.26. Experiment on sensitcombined Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.27. Experiment on sensitseismic Dataset. Top: Generalization Performance; Bottom: Random Initialization.

[Figure: test-error plots under the Noise-1, Noise-2, and Noise-3 settings, comparing Logistic, Mis-0, Mis-I, Mis-II, and Savage.]
Fig. D.28. Experiment on usps Dataset. Top: Generalization Performance; Bottom: Random Initialization.

Appendix E: Additional Tables of Section 3.5

In this chapter, we provide the additional tables from the empirical evaluation of t-logistic regression.


Table E.1
CPU time spent on binary datasets (total time, average time per function evaluation), in seconds.

Dataset         | logistic          | t = 1.5           | Savage
adult9          | (0.30, 0.01)      | (1.37, 0.04)      | (0.16, 0.01)
alpha           | (134.75, 0.36)    | (283.94, 0.85)    | (164.48, 0.46)
astro-ph        | (3.12, 0.04)      | (7.06, 0.12)      | (3.25, 0.03)
aut-avn         | (0.94, 0.03)      | (4.48, 0.07)      | (0.86, 0.03)
beta            | (37.79, 0.79)     | (50.17, 1.43)     | (85.65, 2.76)
covertype       | (2.64, 0.03)      | (42.71, 0.49)     | (2.27, 0.03)
delta           | (94.49, 0.41)     | (188.50, 0.83)    | (90.03, 0.41)
epsilon         | (339.18, 2.49)    | (867.48, 2.39)    | (243.70, 2.54)
gamma           | (123.16, 0.40)    | (238.67, 0.85)    | (100.31, 0.42)
kdd99           | (40.47, 0.29)     | (560.90, 4.42)    | (28.69, 0.28)
kdda            | (1317.03, 5.21)   | (4223.01, 12.68)  | (1152.17, 6.03)
kddb            | (2717.57, 14.01)  | (8164.70, 30.02)  | (1278.16, 9.47)
longservedio    | (0.09, 0.00)      | (0.11, 0.01)      | (0.05, 0.00)
measewyner      | (0.07, 0.00)      | (0.14, 0.00)      | (0.10, 0.00)
mushrooms       | (0.07, 0.01)      | (0.15, 0.01)      | (0.12, 0.01)
news20          | (12.98, 0.28)     | (17.32, 0.35)     | (40.86, 0.48)
real-sim        | (0.60, 0.03)      | (1.67, 0.09)      | (0.70, 0.04)
reuters-c11     | (7.91, 0.72)      | (14.56, 1.46)     | (30.26, 2.75)
reuters-ccat    | (37.24, 0.33)     | (83.72, 1.20)     | (19.68, 0.21)
web8            | (0.58, 0.00)      | (0.84, 0.06)      | (0.20, 0.01)
webspamtrigram  | (1450.52, 12.29)  | (3008.55, 7.45)   | (799.84, 12.90)
webspamunigram  | (20.28, 0.08)     | (80.65, 0.35)     | (17.16, 0.08)
worm            | (38.88, 0.83)     | (79.65, 1.85)     | (52.63, 0.60)
zeta            | (365.85, 1.54)    | (297.36, 2.94)    | (274.32, 2.59)
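Each table cell pairs total optimization time with the average time per function evaluation. As a hypothetical sketch only (the dissertation's actual solver and timing harness are not shown here, and every name below is illustrative), the two numbers can be produced by counting loss evaluations around a simple optimizer loop:

```python
import time

def timed_minimize(loss_and_grad, w0, lr=0.1, steps=200):
    """Toy gradient-descent harness reporting (total time, average
    time per function evaluation), the two numbers in each cell.
    Illustrative only; not the dissertation's optimizer."""
    w = list(w0)
    n_evals = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        _, grad = loss_and_grad(w)  # one function/gradient evaluation
        n_evals += 1
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    total = time.perf_counter() - t0
    return w, total, total / n_evals

# Toy quadratic loss, minimized at w = (1, 1, 1).
loss = lambda w: (sum((wi - 1.0) ** 2 for wi in w),
                  [2.0 * (wi - 1.0) for wi in w])
w, total, per_eval = timed_minimize(loss, [0.0, 0.0, 0.0])
```

A real harness would wrap the loss passed to a quasi-Newton solver the same way, so total time and evaluation count are measured at the same boundary.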


Table E.2
CPU time spent on multiclass datasets (total time, average time per function evaluation), in seconds.

Dataset         | logistic         | t = 1.5           | Savage
dna             | (0.14, 0.00)     | (0.31, 0.01)      | (0.18, 0.00)
letter          | (2.34, 0.01)     | (63.88, 0.20)     | (15.15, 0.01)
mnist           | (47.47, 0.08)    | (99.76, 0.44)     | (36.34, 0.09)
protein         | (2.89, 0.01)     | (10.04, 0.04)     | (2.98, 0.01)
rcv1            | (3620.79, 3.56)  | (2391.85, 16.27)  | (726.63, 3.95)
sensitacoustic  | (3.46, 0.02)     | (37.36, 0.18)     | (6.63, 0.02)
sensitcombined  | (9.23, 0.04)     | (62.84, 0.20)     | (13.71, 0.03)
sensitseismic   | (3.11, 0.02)     | (109.98, 0.31)    | (10.97, 0.02)
usps            | (6.44, 0.04)     | (31.20, 0.06)     | (2.78, 0.02)
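Throughout these tables and figures, "t = 1.5" denotes t-logistic regression, whose loss is built from the Tsallis t-exponential. As a minimal sketch using the standard definitions (illustrative code, not taken from the dissertation's implementation):

```python
import math

def exp_t(x, t=1.5):
    # Tsallis t-exponential: exp_t(x) = [1 + (1 - t) x]_+^(1 / (1 - t)).
    # Reduces to exp(x) as t -> 1; for t > 1 it diverges as x -> 1 / (t - 1).
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        return 0.0 if t < 1.0 else float("inf")
    return base ** (1.0 / (1.0 - t))

def log_t(x, t=1.5):
    # Inverse of exp_t on x > 0: log_t(x) = (x^(1 - t) - 1) / (1 - t).
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)
```

For t = 1.5 the t-exponential has heavier tails than the ordinary exponential, which is the source of the robustness to label noise observed in these experiments.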

VITA

Nan Ding was born in Shanghai, China, on February 14, 1986. After completing his work at Weiyu High School in Shanghai, he entered Tsinghua University in Beijing, China. In June 2008, he completed a Bachelor of Engineering in Electronic Engineering. He subsequently entered the Department of Computer Science at Purdue University, West Lafayette, for graduate study. He completed a Master of Science in December 2010 and a Doctor of Philosophy in May 2013. His research interests are statistical machine learning, graphical models, Bayesian inference, and convex optimization.
