STATISTICAL MACHINE LEARNING IN THE T-EXPONENTIAL FAMILY OF DISTRIBUTIONS
A Dissertation Submitted to the Faculty of Purdue University by Nan Ding
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
May 2013 Purdue University West Lafayette, Indiana
To my family on earth and in heaven.
ACKNOWLEDGMENTS

I have been truly proud to be a Purdue Boilermaker for the last five years. During these years, I have been blessed to live and work with so many kind and intelligent people. In particular, I would like to acknowledge a few people who have made a big impact on my life and study.

First of all, I would like to express my utmost gratitude to my Ph.D. advisor, Prof. S.V.N. (Vishy) Vishwanathan. Vishy not only taught me how to do solid research, but also influenced me to be a better human being. Because of his Chinese connections, he likes to say "nu3 li4 gong1 zuo4, bu2 yao4 shui4 jiao4" (meaning: work hard, do not sleep), and he himself always works as hard as, if not harder than, all of us. Although Vishy is a very strict advisor, he is also a very patient teacher. He kept teaching me that a presentation or a piece of writing should be as well organized as peeling an onion, and he kept helping me improve my skills. Although Vishy is very knowledgeable and smart, he is honest about the things he is uncertain of. It has always been enjoyable to discuss with Vishy because of his intelligence, integrity, and passion. Of course, Ph.D. study is not just about reading and writing papers. During my four years in Vishy's group, he also provided me with a variety of experiences, including giving lectures, writing proposals, organizing seminars, and so on. In addition, Vishy always cares about our career development. He always encourages and helps us to build connections in the academic and industrial communities. He has also supported and referred us to valuable internships in top research labs and companies every year. There are so many things that Vishy has done for me, for which I will be forever grateful.

I would also like to express my deep appreciation to the following advisors of mine during the past five years: Prof. Alan Qi, who recruited me from Tsinghua University with the prestigious Ross Fellowship. Alan also served as my initial advisor and worked closely with me from day to night in my first year. Dr. Wray Buntine, who was my advisor during my visit to National ICT Australia. It was great working with Wray on nonparametric Bayesian models, and together with Changyou Chen we enjoyed a two-year collaboration which led to two publications. Dr. Cedric Archambeau and Shengbo Guo, who kindly hosted me during my internship at Xerox Research Centre Europe. The summer that I spent in France was really enjoyable and fruitful. Prof. Manfred Warmuth, who graciously served as my co-advisor during Vishy's sabbatical and hosted my two visits to UC Santa Cruz. He is truly a master of the mind, who shared with me his wealth of brilliant ideas in both research and life. And finally, Prof. Jayanta Ghosh and Prof. David Gleich, who along with Alan served on my thesis committee. They gave me comprehensive comments and invaluable advice on my thesis and research.

Besides my co-authors and the thesis committee members, there are a few other contributors to this thesis. Vasil Denchev helped me polish the thesis. Dr. Xinhua Zhang spent hours helping me set up and compile the PETSc and TAO packages so that I could run t-logistic regression on large-scale datasets. The idea of generalizing t-logistic regression to mismatch losses first came from Prof. Manfred Warmuth during a discussion at the IMA workshop at the University of Minnesota. The code and experiments on t-CRF are joint work with Changyou Chen at NICTA. I really appreciate their tremendous effort and support.

In addition, my Ph.D. study would not have been so wonderful without the generous help of, and joint work with, my colleagues and collaborators at Purdue, NICTA, XRCE, and UCSC. My thanks go to, but are not limited to, the following amazing people: Nguyen Cao, Shuhao Cao, Francois Caron, Xiaoxiao Chen, Yi Chen, Bo Dai, Jyotishka Datta, Lan Du, Yi Fang, Youhan Fang, Rupesh Gupta, Long He, Pei He, Dunxu Hu, Hongbin Kuang, Te Ke, Zhiqiang Lin, Fangjia Lu, Shin Matsushima, Hai Nguyen, Lichen Ni, Jiazhong Nie, Philip Ritchey, Ankan Saha, Huanyu Shao, Bin Shen, Sanvesh Srivastava, Zhaonan Sun, Xi Tan, Choonhui Teo, Tao Wang, Xu Wang, Rongjing Xiang, Jingjie Xiao, Chao Xu, Pinar Yanardag, Feng Yan, Jin Yu, Lin Yuan, Hyokun Yun, Dan Zhang, Lumin Zhang, and Yao Zhu. I also want to extend my sincere thanks to the following professors, senior researchers, and staff for their kind help: Doug Crabill, Stefania Delassus, Marian Duncan, William Gorman, Holly Graef, Sergey Kirshner, Chuanhai Liu, Hartmut Neven, Jennifer Neville, Shaun Ponders, Luo Si, and Jian Zhang.

Last but not least, I am incredibly grateful to my family for their love and support, which has kept me moving forward during this unforgettable journey.
TABLE OF CONTENTS

                                                                          Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
  1.1 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . 4
  1.2 Collaborators and Related Publications . . . . . . . . . . . . . . . 7
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
  2.1 Noise Tolerance of Convex Losses . . . . . . . . . . . . . . . . . . 8
  2.2 Logistic Regression and Exponential Family of Distributions . . . . 10
  2.3 Φ-Exponential Family of Distributions . . . . . . . . . . . . . . . 13
      2.3.1 Φ-Logarithm . . . . . . . . . . . . . . . . . . . . . . . . . 13
      2.3.2 Φ-Exponential . . . . . . . . . . . . . . . . . . . . . . . . 14
      2.3.3 Φ-Exponential Family of Distributions . . . . . . . . . . . . 16
      2.3.4 T-Exponential Family of Distributions . . . . . . . . . . . . 18
  2.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 T-LOGISTIC REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . 21
  3.1 Binary Classification . . . . . . . . . . . . . . . . . . . . . . . 21
  3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
      3.2.1 Bayes-Risk Consistency . . . . . . . . . . . . . . . . . . . . 25
      3.2.2 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . 26
      3.2.3 Multiple Local Minima . . . . . . . . . . . . . . . . . . . . 30
  3.3 Multiclass Classification . . . . . . . . . . . . . . . . . . . . . 32
  3.4 Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . . 37
      3.4.1 Convex Multiplicative Programming . . . . . . . . . . . . . . 37
  3.5 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 39
      3.5.1 Noise Models . . . . . . . . . . . . . . . . . . . . . . . . . 41
      3.5.2 Experiment Design . . . . . . . . . . . . . . . . . . . . . . 45
      3.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
  3.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 T-DIVERGENCE BASED APPROXIMATE INFERENCE . . . . . . . . . . . . . . . . 52
  4.1 Variational Inference in Exponential Family of Distributions . . . . 52
      4.1.1 Mean Field Methods . . . . . . . . . . . . . . . . . . . . . . 54
      4.1.2 Assumed Density Filtering . . . . . . . . . . . . . . . . . . 56
  4.2 T-Entropy and T-Divergence . . . . . . . . . . . . . . . . . . . . . 58
      4.2.1 T-Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 59
      4.2.2 T-Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 62
  4.3 Variational Inference in T-Exponential Family of Distributions . . . 63
      4.3.1 Mean Field Methods . . . . . . . . . . . . . . . . . . . . . . 65
      4.3.2 Assumed Density Filtering . . . . . . . . . . . . . . . . . . 69
  4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 T-CONDITIONAL RANDOM FIELDS . . . . . . . . . . . . . . . . . . . . . . 74
  5.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . 74
      5.1.1 Undirected Graphical Models . . . . . . . . . . . . . . . . . 74
      5.1.2 Conditional Random Fields . . . . . . . . . . . . . . . . . . 75
      5.1.3 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . 77
      5.1.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 78
  5.2 T-Conditional Random Fields . . . . . . . . . . . . . . . . . . . . 79
  5.3 Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . 82
  5.4 2-D T-CRF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
  5.5 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 85
  5.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 GENERALIZED T-LOGISTIC REGRESSION . . . . . . . . . . . . . . . . . . . 92
  6.1 Binary Classification . . . . . . . . . . . . . . . . . . . . . . . 92
  6.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
  6.3 Multiclass Classification . . . . . . . . . . . . . . . . . . . . . 96
  6.4 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 96
  6.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
  7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
  7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
  Appendix A: Fundamentals of Convex Optimizations . . . . . . . . . . . . 108
      A.1 Convex Analysis . . . . . . . . . . . . . . . . . . . . . . . . 108
      A.2 Numerical Optimization . . . . . . . . . . . . . . . . . . . . . 110
  Appendix B: Technical Proofs and Verifications . . . . . . . . . . . . . 111
      B.1  Proof of Theorem 2.3.1 . . . . . . . . . . . . . . . . . . . . 111
      B.2  Proof of Theorem 2.3.2 . . . . . . . . . . . . . . . . . . . . 112
      B.3  Proof of Lemma 3.2.1 . . . . . . . . . . . . . . . . . . . . . 113
      B.4  Proof of Lemma 3.2.2 . . . . . . . . . . . . . . . . . . . . . 113
      B.5  Proof of Theorem 3.2.3 . . . . . . . . . . . . . . . . . . . . 114
      B.6  Proof of Theorem 3.4.1 . . . . . . . . . . . . . . . . . . . . 115
      B.7  Proof of Theorem 4.2.1 . . . . . . . . . . . . . . . . . . . . 116
      B.8  Verification in Section 3.1 . . . . . . . . . . . . . . . . . . 116
      B.9  Verification in Section 3.2.2 . . . . . . . . . . . . . . . . . 117
      B.10 Verification in Section 3.3 . . . . . . . . . . . . . . . . . . 119
      B.11 Verification in Definition 4.2.2 . . . . . . . . . . . . . . . 120
      B.12 Verification of Equation (4.33) . . . . . . . . . . . . . . . . 120
      B.13 Verification in Section 4.3.1 . . . . . . . . . . . . . . . . . 121
      B.14 Verification in Section 4.3.2 . . . . . . . . . . . . . . . . . 123
      B.15 Verification in Section 6.2 . . . . . . . . . . . . . . . . . . 124
  Appendix C: Additional Figures of Section 3.5 . . . . . . . . . . . . . 125
  Appendix D: Additional Figures of Section 6.4 . . . . . . . . . . . . . 159
  Appendix E: Additional Tables of Section 3.5 . . . . . . . . . . . . . . 173
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
LIST OF TABLES

Table                                                                     Page
1.1 Some popular convex losses used for binary classification. The loss functions are plotted in Figure 1.1. . . . 2
2.1 A few examples of non-convex losses for binary classification. The loss functions are plotted in Figure 2.2. erf is the error function. . . . 10
3.1 The robustness of some loss functions for binary classification based on I_l(u). The verifications are provided in Appendix B.9. . . . 30
3.2 Average time (in milliseconds) spent by our iterative scheme and fsolve in computing Ĝ_t(â) for multiclass t-logistic regression. . . . 36
3.3 Summary of the binary classification datasets used in our experiments. n is the total number of examples, d is the number of features, and n+ : n− is the ratio of the number of positive vs. negative examples. M denotes a million. ∗ denotes the datasets that are unused in Chapter 6 due to high computational costs. . . . 41
3.4 Summary of the multiclass classification datasets used in our experiments. n is the total number of examples, d is the number of features, and nc is the number of classes. ∗ denotes the datasets that are unused in Chapter 6 due to high computational costs. . . . 42
3.5 The number of binary classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table counts the datasets where logistic regression is significantly better; the right part counts those where t-logistic regression is significantly better. The total number of datasets is 24 (the dna and ocr datasets are excluded). . . . 47
3.6 The number of multiclass classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table counts the datasets where logistic regression is significantly better; the right part counts those where t-logistic regression is significantly better. The total number of datasets is 9. . . . 47
4.1 The accumulated prediction error rate on the synthetic online dataset using Bayesian online learning. . . . 73
5.1 Comparisons of p(y^1 | X) and p(y^1 | y^2, y^3, X) between t-CRF (t = 1.1 and 1.5) and CRF (t = 1.0) in the 3-node chain example. . . . 81
5.2 Optimal parameters t and λ for CRF and t-CRF in the image denoising task and the image annotation task. (0%) denotes that no extreme noise is added and (20%) denotes that 20% extreme noise is added. . . . 87
6.1 The number of binary datasets for which each value of t2 for the Mis-0 loss is optimal based on cross-validation. The total number of datasets is 20. . . . 98
6.2 The number of binary datasets for which each value of t2 for the Mis-I loss is optimal based on cross-validation. The total number of datasets is 20. . . . 98
6.3 The number of binary datasets for which each value of t1 for the Mis-II loss is optimal based on cross-validation. The total number of datasets is 20. . . . 98
6.4 The number of binary classification datasets on which the test-error difference between mismatch losses is significant. Each column of the table counts the datasets where a certain type of mismatch loss has significantly lower test error than another. The total number of datasets is 20. . . . 99
6.5 The number of multiclass classification datasets on which the test-error difference between mismatch losses is significant. Each column of the table counts the datasets where a certain type of mismatch loss has significantly lower test error than another. The total number of datasets is 8. . . . 99

Appendix Table
E.1 CPU time spent on binary datasets (total time, average time per function evaluation), in seconds. . . . 174
E.2 CPU time spent on multiclass datasets (total time, average time per function evaluation), in seconds. . . . 175
LIST OF FIGURES

Figure                                                                    Page
1.1 Some commonly used convex surrogate loss functions, including the hinge loss, logistic loss, and exponential loss, for binary classification. . . . 3
1.2 T-logistic loss for binary classification with different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), the t-logistic loss bends down as u → −∞, which caps the influence of outliers. . . . 5
2.1 The Long-Servedio dataset. Points with label +1 are in red, while points with label −1 are in blue. Each blob of data points plays one of three roles: large margin (25%), puller (25%), penalizer (50%). The black double arrow represents the true classifier. The red double arrow represents the optimal classifier of convex losses when 10% of the data labels are flipped (represented by the circles surrounding the blobs). The red double arrow is no longer able to classify the penalizers. . . . 9
2.2 Some commonly used non-convex loss functions, including the ramp loss, sigmoid loss, and Savage loss, for binary classification. We omit the probit loss because it is very close to the sigmoid loss. . . . 11
2.3 The left figure depicts log_t for the various values of t indicated. The right figure zooms in to better depict the interval [0, 1], in which log_t is negative. . . . 14
2.4 The left figure depicts exp_t for the various values of t indicated. The right figure zooms in to better depict when exp_t can achieve the value zero. . . . 15
3.1 T-logistic loss for binary classification with different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), the t-logistic loss bends down as u → −∞, which caps the influence of outliers. . . . 24
3.2 An illustration of the three robust types. All three types of losses behave similarly for u > 0. When u → −∞, the Type-0 loss goes to +∞, the Type-I loss goes to a constant, and the Type-II loss goes to 0. . . . 29
3.3 The empirical risk of t-logistic regression (upper) and the Savage loss (lower) on a toy two-dimensional dataset. T-logistic regression appears to be easier to optimize than the Savage loss. . . . 31
3.4 Empirical risk of logistic regression and t-logistic regression on the one-dimensional example. The optimal solutions before and after adding the outlier are significantly different for logistic regression. In contrast, the global optimum of t-logistic regression stays the same. . . . 33
4.1 T-entropy corresponding to two well-known probability distributions. Left: the Bernoulli distribution p(z; µ). Right: the 1-dimensional Student's t-distribution p(z; 0, σ², v), where v = 2/(t − 1) − 1. One recovers the SBG entropy by letting t = 1.0. . . . 61
4.2 T-divergence between two distributions. Top: Bernoulli distributions p1 = p(z; µ) and p2 = p(z; 0.5). Bottom: Student's t-distributions. Left: p1 = p(z; µ, 1, v) and p2 = p(z; 0, 1, v). Right: p1 = p(z; 0, σ², v) and p2 = p(z; 0, 1, v). v = 2/(t − 1) − 1. One recovers the K-L divergence by letting t = 1.0. . . . 64
4.3 The negative t-divergence between the product of ten 1-dimensional Student's t-distributions and one 10-dimensional Student's t-distribution using the mean field approach for 500 iterations. . . . 69
4.4 The discrepancy D(i) between the true weight vector w(i) and E_{p̃i}[w], the posterior mean of p̃i(w), at each data point i from the synthetic online dataset using Bayesian online learning. Left: case I. Right: case II. . . . 73
5.1 The 3-node chain model. Each node indicates a variable. Each edge of the graph represents a dependency. . . . 75
5.2 The 3-node conditional chain model. Blue nodes indicate the labels; red nodes indicate the data variables. Each edge of the graph represents a factor. . . . 76
5.3 A 2-D conditional model. Blue nodes indicate the labels; red nodes indicate the observed input variables. . . . 84
5.4 Test error of t-CRF and CRF with and without extreme noise added. Left: image denoising task. Right: image annotation task. . . . 88
5.5 Image denoising task. The top row is the dataset: left is the input image; right is the true label. The middle row is the denoising result without extreme noise: left is CRF, right is t-CRF. The bottom row is the denoising result with extreme noise: left is CRF, right is t-CRF. . . . 89
5.6 Image annotation task. The first and third rows are the annotation results without extreme noise: left is CRF, right is t-CRF. The second and fourth rows are the annotation results with extreme noise: left is CRF, right is t-CRF. . . . 90
6.1 Generalized t-logistic regression with t2 = 1 and four different t1: t1 = 1, t1 = 0.7, t1 = 0.4, t1 = 0.1. . . . 95

Appendix Figure
C.1  Experiment on adult9 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 126
C.2  Experiment on alpha Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 127
C.3  Experiment on astro-ph Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 128
C.4  Experiment on aut-avn Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 129
C.5  Experiment on beta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 130
C.6  Experiment on covertype Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 131
C.7  Experiment on delta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 132
C.8  Experiment on epsilon Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 133
C.9  Experiment on gamma Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 134
C.10 Experiment on kdd99 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 135
C.11 Experiment on kdda Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 136
C.12 Experiment on kddb Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 137
C.13 Experiment on longservedio Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 138
C.14 Experiment on measewyner Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 139
C.15 Experiment on mushrooms Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 140
C.16 Experiment on news20 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 141
C.17 Experiment on real-sim Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 142
C.18 Experiment on reuters-c11 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 143
C.19 Experiment on reuters-ccat Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 144
C.20 Experiment on web8 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 145
C.21 Experiment on webspamtrigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 146
C.22 Experiment on webspamunigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 147
C.23 Experiment on worm Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 148
C.24 Experiment on zeta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 149
C.25 Generalization Performance on dna Dataset. . . . 149
C.26 Generalization Performance on ocr Dataset. . . . 149
C.27 Experiment on dna Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 150
C.28 Experiment on letter Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 151
C.29 Experiment on mnist Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 152
C.30 Experiment on protein Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 153
C.31 Experiment on rcv1 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 154
C.32 Experiment on sensitacoustic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 155
C.33 Experiment on sensitcombined Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 156
C.34 Experiment on sensitseismic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 157
C.35 Experiment on usps Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables. . . . 158
D.1  Experiment on adult9 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 159
D.2  Experiment on alpha Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 159
D.3  Experiment on astro-ph Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 160
D.4  Experiment on aut-avn Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 160
D.5  Experiment on beta Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 161
D.6  Experiment on covertype Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 161
D.7  Experiment on delta Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 162
D.8  Experiment on gamma Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 162
D.9  Experiment on kdd99 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 163
D.10 Experiment on longservedio Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 163
D.11 Experiment on measewyner Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 164
D.12 Experiment on mushrooms Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 164
D.13 Experiment on news20 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 165
D.14 Experiment on real-sim Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 165
D.15 Experiment on reuters-c11 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 166
D.16 Experiment on reuters-ccat Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 166
D.17 Experiment on web8 Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 167
D.18 Experiment on webspamunigram Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 167
D.19 Experiment on worm Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 168
D.20 Experiment on zeta Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 168
D.21 Experiment on dna Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 169
D.22 Experiment on letter Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 169
D.23 Experiment on mnist Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 170
D.24 Experiment on protein Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 170
D.25 Experiment on sensitacoustic Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 171
D.26 Experiment on sensitcombined Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 171
D.27 Experiment on sensitseismic Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 172
D.28 Experiment on usps Dataset. Top: Generalization Performance; Bottom: Random Initialization. . . . 172
ABSTRACT

Ding, Nan Ph.D., Purdue University, May 2013. Statistical Machine Learning in the T-Exponential Family of Distributions. Major Professor: S.V.N. Vishwanathan.

The exponential family of distributions plays an important role in statistics and machine learning. It underlies numerous models, such as logistic regression and probabilistic graphical models. However, exponential-family-based probabilistic models are vulnerable to outliers. This dissertation aims to design machine learning models based on a more general distribution family, namely the t-exponential family of distributions, and to show that efficient inference algorithms exist for these models. We first focus on the classification problem and propose t-logistic regression, which replaces the exponential family in logistic regression with the t-exponential family and is more robust in the presence of label noise. Second, inspired by variational inference in the exponential family, we define a new t-entropy, which is the Fenchel conjugate of the log-partition function of the t-exponential family. By minimizing the t-divergence, the Bregman divergence of the t-entropy, between the approximate and the true distribution, we develop efficient variational inference approaches for t-exponential-family-based graphical models. Using our inference procedure, we generalize conditional random fields (CRF) to t-CRF, and show how the t-divergence based mean field approach can be used to approximate the log-partition function. Finally, the t-divergence is combined with t-logistic regression to obtain a generalized family of convex and non-convex loss functions for classification. Empirical evaluation of our models on a variety of datasets is presented to demonstrate their advantages.
1. INTRODUCTION

Consider the classic machine learning problem of binary classification: we are given m training data points {x_1, …, x_m} and their corresponding labels {y_1, …, y_m}, with x_i drawn from some vector space X and y_i ∈ {+1, −1}. The task is to learn a function f : X → {+1, −1} which can predict the labels on unseen data. In this dissertation, we focus on linear models: f(x) := sign(⟨Φ(x), θ⟩). Here Φ is a feature map, θ are the parameters of the model, ⟨·,·⟩ denotes an inner product, and sign(z) = +1 if z > 0 and −1 otherwise. One way to learn θ is to define a loss function l(x, y, θ) and minimize the averaged loss, or empirical risk:

    min_θ R_emp(θ) := (1/m) ∑_{i=1}^m l(x_i, y_i, θ).

In order to prevent overfitting to the training data, it is customary to add a regularizer Ω(θ) to R_emp(θ) and minimize the regularized risk:

    min_θ J(θ) := λ Ω(θ) + R_emp(θ).

Here λ is a scalar which trades off the importance of the regularizer and the empirical risk. While a variety of regularizers are commonly used (see e.g. [1]), we will restrict our attention to the L2-regularizer: Ω(θ) = (1/2) ‖θ‖_2².
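The objective above is simple enough to state directly in code. The following is an illustrative sketch, not from the dissertation; the helper names (`regularized_risk`, `logistic_loss`) and the toy data are ours:

```python
import math

def regularized_risk(theta, data, loss, lam):
    """J(theta) = lam * ||theta||_2^2 / 2 + (1/m) * sum_i loss(x_i, y_i, theta)."""
    m = len(data)
    reg = 0.5 * lam * sum(w * w for w in theta)
    emp = sum(loss(x, y, theta) for x, y in data) / m
    return reg + emp

def logistic_loss(x, y, theta):
    """Linear model: margin u = y * <x, theta>; logistic surrogate loss."""
    u = y * sum(xi * wi for xi, wi in zip(x, theta))
    return math.log(1.0 + math.exp(-u))

data = [([1.0, 0.5], +1), ([-0.8, -1.2], -1), ([0.3, -0.4], +1)]
theta = [0.2, 0.1]
print(regularized_risk(theta, data, logistic_loss, lam=0.1))
```

Any differentiable loss with the same `(x, y, theta)` signature can be plugged in for `loss`, which is exactly how the surrogate losses of the next paragraphs enter the picture.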
Let â = ⟨Φ(x), θ⟩ and u(x, y, θ) := y · â denote the margin of (x, y). Where it is clear from context, we will use u to denote u(x, y, θ) and u_i to denote u(x_i, y_i, θ). Note that u > 0 if, and only if, f(x) = y, that is, x is correctly classified. Therefore, a natural loss function to define is the 0-1 loss:

    l(x, y, θ) = 0 if u > 0, and 1 otherwise.    (1.1)
Unfortunately, the 0-1 loss is non-convex, non-smooth, and it is NP-hard to even approximately minimize the empirical risk with this loss [2]. Therefore, a lot of research effort has been directed towards finding surrogate losses which are computationally tractable. In particular, convex loss functions are in vogue, mainly because the regularized risk minimization problem can be solved efficiently with readily available tools [3]. Table 1.1 summarizes a few popular convex losses and Figure 1.1 contrasts them with the 0-1 loss.¹
Table 1.1
Some popular convex losses used for binary classification. The loss functions are plotted in Figure 1.1.

Name        | Loss Function
Hinge       | l(x, y, θ) = max(0, 1 − u)
Exponential | l(x, y, θ) = exp(−u)
Logistic    | l(x, y, θ) = log(1 + exp(−u))
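For concreteness, the losses of Table 1.1 (together with the 0-1 loss) can be written as functions of the margin u alone. This snippet is an illustrative sketch; the function names are ours:

```python
import math

def hinge(u):    return max(0.0, 1.0 - u)
def exp_loss(u): return math.exp(-u)
def logistic(u): return math.log(1.0 + math.exp(-u))
def zero_one(u): return 0.0 if u > 0 else 1.0

# A small comparison table over a few margins.
for u in (-2.0, 0.0, 2.0):
    print(u, hinge(u), exp_loss(u), logistic(u), zero_one(u))
```

Note how all three convex surrogates keep growing as u decreases, which is precisely the behavior the noise-tolerance discussion below turns on.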
Despite many successes of the binary classification algorithms based on convex losses, as [4, 5] point out, those algorithms are not noise-tolerant² (see Section 2.1). Intuitively, as can be seen from Figure 1.1, the convex loss functions grow at least linearly as u ∈ (−∞, 0), which causes data with u ≪ 0 to become too important. There has been some recent and not-so-recent work on using non-convex loss functions to alleviate the above problem. Although these non-convex losses are empirically more robust, they also lose some key advantages of convex losses. For instance, finding
¹ Note that the logistic loss and the later t-logistic loss are plotted by dividing the losses by log(2).
² Although the analysis of [4] is carried out in the context of boosting, we believe the results hold for a larger class of algorithms which minimize a regularized risk with a convex loss function.
Fig. 1.1. Some commonly used convex surrogate loss functions, including hinge loss, logistic loss, and exponential loss, for binary classification.
the global optimum becomes very hard because the empirical risk may have multiple local minima. Unlike certain convex losses, such as the logistic loss (see Section 2.2), those non-convex losses do not have a proper probabilistic interpretation. The probabilistic interpretation is important for generalizing these losses to more complex settings, such as modeling interacting factors, where probabilistic graphical models are widely applied [6, 7]. In this dissertation, we propose to investigate a non-convex loss function which is firmly grounded in probability theory. By extending logistic regression from the exponential family to the t-exponential family³, a natural extension of the exponential family of distributions studied in statistical physics [8–10] (reviewed in Section 2.3), we obtain t-logistic regression (as shown in Figure 1.2). Furthermore, we show that our loss can be generalized to more complicated probabilistic models, e.g. t-conditional random fields. In order to make efficient inference in these complicated models, we study a new t-entropy which is the Fenchel conjugate of the log-partition function of the t-exponential family. We develop two variational inference methods by minimizing the t-divergence, the Bregman divergence of the t-entropy. Finally, we show that the t-divergence can also be combined with t-logistic regression to obtain a more generalized family of loss functions for classification.
1.1 Dissertation Outline

Our dissertation is structured as follows:
Chapter 2. Background  In this chapter, we review some related background material, including the noise tolerance of convex losses, the probabilistic interpretation of logistic regression, the exponential family of distributions, and its generalization, the t-exponential family of distributions.

Chapter 3. T-Logistic Regression  In this chapter, we try to improve the robustness of logistic regression for classification. Our main idea is to use the t-exponential family³ to

³ Also known as the q-exponential family or the Tsallis distribution in statistical physics. C. Tsallis is one of the pioneers of nonextensive entropy and the generalized exponential family.
Fig. 1.2. T-logistic loss for binary classification with different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), the t-logistic loss bends down as u ≪ 0, which caps the influence from outliers.
replace the exponential family for modeling the conditional likelihood of the examples. We demonstrate the robustness of t-logistic regression both theoretically and empirically. We also show that the algorithm is empirically stable under random initialization.

Chapter 4. T-Divergence Based Approximate Inference  In order to work with multivariate probabilistic models, one key challenge is to make inference efficient. Approximate inference is an important technique for dealing with large-scale graphical models based on the exponential family of distributions. In this chapter, we extend the idea to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex conjugacy between the log-partition function of the t-exponential family and a new t-entropy. We propose two approximate inference algorithms for the t-exponential family of distributions.

Chapter 5. T-Conditional Random Fields  In this chapter, we propose the t-conditional random field (t-CRF), which generalizes conditional random fields to the t-exponential family. This new t-CRF abandons the Markov properties as well as the Hammersley-Clifford theorem, and appears to be more robust. It applies the mean field method based on t-divergence to make efficient inference.

Chapter 6. Generalized T-Logistic Regression  This chapter combines the t-divergence and t-logistic regression for classification. We obtain a family of convex and non-convex loss functions with all types of robustness.

Chapter 7. Summary  We summarize our contributions and provide a discussion on future work in the last chapter of the dissertation.

Appendix A. Fundamentals of Convex Optimizations  Appendix A provides a brief review of some concepts and properties in convex analysis, as well as two well-known numerical optimization methods used in this dissertation.

Appendix B. Technical Proofs and Verifications  Appendix B provides the technical proofs and verifications in this dissertation.

Appendices C/D/E. Additional Figures and Tables  Appendices C, D and E provide additional figures and tables from the empirical results in this dissertation.
1.2 Collaborators and Related Publications

Chapter 3 was joint work with S.V.N. Vishwanathan, Vasil Denchev, and Manfred Warmuth. The work was first published as "t-logistic regression" in Advances in Neural Information Processing Systems 23, 2010.

Chapter 4 was joint work with S.V.N. Vishwanathan and Alan Qi. The work was first published as "t-divergence based approximate inference" in Advances in Neural Information Processing Systems 24, 2011.

Chapter 5 was joint work with S.V.N. Vishwanathan and Changyou Chen. The work has not yet been published.

Chapter 6 was joint work with S.V.N. Vishwanathan and Manfred Warmuth. The work has not yet been published.
2. BACKGROUND

In this chapter, we present some existing literature and background material that is used later in this dissertation. We will first review the famous example proposed in [4], which shows that uniform random label noise defeats all convex classifiers. Next, we review logistic regression and discuss its probabilistic interpretation and its relation to the exponential family of distributions. Finally, we review the t-exponential family of distributions.
2.1 Noise Tolerance of Convex Losses

Convexity is a very attractive property because it ensures that the regularized risk minimization problem has a unique global optimum¹ [3]. However, as was recently shown by [4], learning algorithms based on convex loss functions are not robust to noise. In [4], the authors constructed an interesting dataset to show that convex losses are not tolerant to uniform label noise (label noise is added by flipping a portion of the labels of the training data). In their dataset, each data point has a 21-dimensional feature vector and plays one of three possible roles: large margin examples (25%, x_{1,2,…,21} = y); pullers (25%, x_{1,…,11} = y, x_{12,…,21} = −y); and penalizers (50%; randomly select 5 of the first 11 coordinates and 6 of the last 10 coordinates and set them to y, and set the remaining coordinates to −y). Note that all the data points have the same magnitude in terms of the L1, L2, and L∞ norms. This dataset is illustrated in Figure 2.1. We use red blobs to represent the points with label +1, and blue blobs to represent the points with label −1. Each blob plays one of the three roles as marked on the figure. Without adding label noise, the black double arrow (N-S) is the optimal classifier of the convex losses, which classifies the clean data perfectly. However, if we add 10% label noise into the dataset (represented by narrow red or blue circles surrounding the blue or red blobs), the optimal classifier of the convex losses

¹ By unique global optimum, we mean the uniqueness of the minimum objective.
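The three roles are concrete enough to generate programmatically. The sketch below is our reconstruction of the construction described above, not code from the dissertation; the function names are hypothetical:

```python
import random

def long_servedio_example(y, role, rng):
    """One 21-dimensional point of the Long-Servedio-style construction."""
    if role == "large_margin":
        x = [y] * 21                       # all 21 coordinates equal to y
    elif role == "puller":
        x = [y] * 11 + [-y] * 10           # first 11 agree with y, last 10 disagree
    else:  # penalizer
        x = [-y] * 21
        for i in rng.sample(range(11), 5):        # 5 of the first 11 coordinates
            x[i] = y
        for i in rng.sample(range(11, 21), 6):    # 6 of the last 10 coordinates
            x[i] = y
    return x

def make_dataset(m, noise=0.1, seed=0):
    """m points in a 25% / 25% / 50% role mix, with uniform label-flip noise."""
    rng = random.Random(seed)
    roles = ["large_margin", "puller", "penalizer", "penalizer"]
    data = []
    for i in range(m):
        y = rng.choice([-1, +1])
        x = long_servedio_example(y, roles[i % 4], rng)
        if rng.random() < noise:           # flip a portion of the labels
            y = -y
        data.append((x, y))
    return data
```

Every point has coordinates in {−1, +1}, so all points indeed share the same L1, L2, and L∞ norms.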
Fig. 2.1. The Long-Servedio dataset. Points with label +1 are in red, while points with label −1 are in blue. Each blob of data points plays one of the three roles: large margin (25%), puller (25%), penalizer (50%). The black double arrow represents the true classifier. The red double arrow represents the optimal classifier of convex losses when 10% of data labels are flipped (represented by the circles surrounding the blobs). The red double arrow is no longer able to classify the penalizers.
changes to the red double arrow (NW-SE). Obviously, the new classifier is no longer able to distinguish the penalizers. We can intuitively see the reason from the shape of the convex loss functions. According to Figure 1.1, the convex losses grow at least linearly with slope |l′(0)| as u ∈ (−∞, 0), which introduces an extremely large loss from a data point with u ≪ 0. Therefore, the flipped large margin examples in Figure 2.1 dramatically increase the empirical risk of the black classifier, which becomes larger than that of the red classifier. Since convex losses are not robust against random label noise, many non-convex losses have been investigated to improve the robustness of the classifier. We list some, but not all, commonly used non-convex losses in Table 2.1. However, those non-convex losses have their own problems. First of all, although the non-convex losses are empirically more robust, they also lose some key advantages of convex losses; in particular, optimization may get stuck in local minima of the empirical risk. More importantly, over the past decades, probabilistic graphical models [6, 7] have been widely used as powerful and efficient tools to model interacting factors in multivariate data. However, none of those losses has a proper probabilistic interpretation (e.g. see Section 2.2), which largely limits their generalization to more complicated applications.
Table 2.1
A few examples of non-convex losses for binary classification. The loss functions are plotted in Figure 2.2. erf is the error function.

Name    | Loss Function
Probit  | l(x, y, θ) = 1 − erf(u)
Ramp    | l(x, y, θ) = min(2, max(0, 1 − u))
Savage  | l(x, y, θ) = 4/(1 + exp(u))²
Sigmoid | l(x, y, θ) = 2/(1 + exp(u))
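As with the convex losses earlier, the entries of Table 2.1 are one-liners in the margin u. This is our illustrative transcription; the function names are ours:

```python
import math

def probit(u):  return 1.0 - math.erf(u)
def ramp(u):    return min(2.0, max(0.0, 1.0 - u))
def savage(u):  return 4.0 / (1.0 + math.exp(u)) ** 2
def sigmoid(u): return 2.0 / (1.0 + math.exp(u))
```

Unlike the convex losses, all four are bounded as u → −∞ (probit and sigmoid by 2, ramp by 2, Savage by 4), which is the mechanism behind their empirical robustness.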
2.2 Logistic Regression and Exponential Family of Distributions

In contrast, the logistic loss, the loss function of logistic regression, is well motivated from a probabilistic perspective. As shown in [11, 12] (also see Section 5.1), its generalization to probabilistic graphical models, e.g. conditional random fields, is natural and convenient. In this section, we briefly review logistic regression and its relation to the exponential family of distributions [13]. In statistics, the data points in a dataset are typically assumed to be independently and identically distributed (i.i.d.), which allows us to write the conditional likelihood of the entire dataset (X, y) = {(x_i, y_i)}, i = 1, …, m, as

    p(y | X, θ) = ∏_{i=1}^m p(y_i | x_i, θ).    (2.1)
Fig. 2.2. Some commonly used non-convex loss functions, including ramp loss, sigmoid loss, and Savage loss, for binary classification. We omit the probit loss because it is very close to sigmoid loss.
To avoid overfitting to the data, we add a prior p(θ) on the parameter θ. Therefore, according to Bayes' rule, the posterior of θ is p(θ | y, X) = p(y | X, θ) p(θ) / p(y | X), and the maximum a-posteriori (MAP) estimate of θ is obtained by minimizing

    −log p(θ | y, X) = −∑_{i=1}^m log p(y_i | x_i; θ) − log p(θ) + const,    (2.2)

where log p(y | X) is neglected since it is independent of θ, and −log p(θ) serves as the regularizer. In logistic regression, p(y | x, θ) is modeled using the exponential family of distributions. The exponential family of distributions [13] of a set of random variables z is a parametric distribution family defined as²

    p(z; θ) := exp(⟨Φ(z), θ⟩ − G(θ)),    (2.3)

where ⟨·,·⟩ is the inner product, Φ(z) is a map from z ∈ Z to the sufficient statistics, and θ is commonly referred to as the natural parameter; it lives in the space dual to Φ(z) (see Theorem 4.1.1). G(θ) is a normalizer, also known as the log-partition function, which ensures that p(z; θ) is properly normalized:

    G(θ) = log ∫_Z exp(⟨Φ(z), θ⟩) dz.    (2.4)
The exponential family of distributions has many important properties and applications. Since many of them are non-trivial, we will review them and compare them with their generalizations later in Section 2.3 and Section 4.1. For binary logistic regression,

    p(y | x; θ) = exp(⟨Φ(x, y), θ⟩ − G(x; θ)),    (2.5)

where Φ(x, y) = (y/2) Φ(x), so that

    G(x; θ) = log[exp((1/2)⟨Φ(x), θ⟩) + exp(−(1/2)⟨Φ(x), θ⟩)].

² Traditionally, exponential family distributions are written as p(z; θ) := p_0(z) exp(⟨Φ(z), θ⟩ − G(θ)). For ease of exposition we ignore the base measure p_0(z) in this paper.

The function l(x, y; θ) := −log p(y | x; θ) is the logistic loss of the data point (x, y), because

    l(x, y, θ) = −(y/2)⟨Φ(x), θ⟩ + log[exp((1/2)⟨Φ(x), θ⟩) + exp(−(1/2)⟨Φ(x), θ⟩)]
               = log(1 + exp(−y ⟨Φ(x), θ⟩)).

2.3 Φ-Exponential Family of Distributions

The convexity of logistic regression arises essentially because it uses the exponential family to model the conditional distribution. The thin-tailed nature of the exponential family makes it unsuitable for designing robust algorithms against noisy data. In the past several decades, effort has also been devoted to developing alternate, generalized distribution families in statistics [14, 15], statistical physics [8, 10], and most recently in machine learning [16]. Of particular interest to us is the t-exponential family, which was first proposed by Tsallis and co-workers [10, 17, 18]. It is a special case of the more general φ-exponential family of Naudts [8, 9]. In this section, we begin by reviewing the generalized log_φ and exp_φ functions which were introduced in statistical physics by [8, 9]. Then, these generalized exponential functions are used to define the φ-exponential family of distributions [9] and the t-exponential family of distributions as a special case.
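The equivalence between the exponential-family form (2.5) and the logistic loss is easy to verify numerically. The following sanity-check sketch is ours, not from the dissertation:

```python
import math

def G(a):
    """Log-partition of the binary model: log(e^{a/2} + e^{-a/2}), a = <Phi(x), theta>."""
    return math.log(math.exp(a / 2) + math.exp(-a / 2))

def p(y, a):
    """Conditional likelihood p(y | x; theta) in exponential-family form (2.5)."""
    return math.exp(y * a / 2 - G(a))

def logistic_loss(y, a):
    return math.log(1.0 + math.exp(-y * a))

# -log p(y | x; theta) equals the logistic loss, and the two labels sum to one.
a = 1.7
for y in (+1, -1):
    assert abs(-math.log(p(y, a)) - logistic_loss(y, a)) < 1e-12
assert abs(p(+1, a) + p(-1, a) - 1.0) < 1e-12
```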
2.3.1 Φ-Logarithm

The φ-logarithm, log_φ³, is defined as follows:

Definition 2.3.1 (φ-logarithm [8, 9]) Let φ : [0, +∞) → [0, +∞) be strictly positive and non-decreasing on (0, +∞). Define log_φ via

    log_φ(x) := ∫_1^x 1/φ(y) dy.    (2.6)

If this integral converges for all finite x > 0, then log_φ is called a φ-logarithm.

³ Note that throughout this dissertation, log_φ and log_t are defined in (2.6) and (2.7). The subscripts do not represent the log base.
To see that this definition generalizes log, simply set φ(y) = y. Clearly, the gradient of log_φ(x) is 1/φ(x), from which it follows that log_φ is a concave increasing function. Furthermore, log_φ is negative on (0, 1) and positive on (1, +∞), with log_φ(1) = 0. Of course, the integral may diverge at x = 0. All these are properties of the familiar log function. An important example is

Example 1 (t-logarithm) Let φ(x) = x^t, t > 0. Then

    log_t(x) := log(x) if t = 1, and (x^{1−t} − 1)/(1 − t) otherwise,    (2.7)

and

    d/dx log_t(x) = x^{−t}.    (2.8)
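Definition (2.7) transcribes directly into code; the function name `log_t` is ours:

```python
import math

def log_t(x, t):
    """t-logarithm (2.7): the ordinary log at t = 1, else (x^(1-t) - 1) / (1 - t)."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)
```

As a quick check, log_t(1) = 0 for every t, the values approach log(x) as t → 1, and a finite difference recovers the derivative x^{−t} of (2.8).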
Figure 2.3 visualizes log_t for various values of t and contrasts it with the familiar log.
Fig. 2.3. The left figure depicts log_t for the various values of t indicated. The right figure zooms in to better depict the interval [0, 1] in which log_t is negative.
2.3.2 Φ-Exponential

The inverse of log_φ is the φ-exponential function, denoted exp_φ. When log_φ takes on a finite value, this is well defined. But, unlike log, there is no guarantee that log_φ takes on all values in R. Therefore, define exp_φ(x) = 0 if x is less than every element of range(log_φ) and exp_φ(x) = +∞ if x is larger than range(log_φ). Properties of exp_φ, such as convexity, mirror those of log_φ [8, 9]. A key difference involves the fact that exp is the only non-trivial function which is its own derivative. However, exp_φ has the following property:

    d/dx exp_φ(x) = φ(exp_φ(x)).    (2.9)

Example 2 (t-exponential) Let [x]_+ be x if x > 0 and 0 otherwise. Continuing with φ(x) = x^t, t > 0, we have

    exp_t(x) = exp(x) if t = 1, and [1 + (1 − t)x]_+^{1/(1−t)} otherwise.    (2.10)

Elementary calculus shows that

    d/dx exp_t(x) = (exp_t(x))^t.    (2.11)
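Definition (2.10) can likewise be transcribed directly, including the boundary conventions stated above (the [x]_+ clamp, and the 0 / +∞ values outside range(log_φ)); the function name `exp_t` is ours:

```python
import math

def exp_t(x, t):
    """t-exponential (2.10): exp(x) at t = 1, else [1 + (1 - t) x]_+ ^ (1 / (1 - t))."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        # Below range(log_t) for t < 1 the value is 0; above it for t > 1 it is +inf.
        return 0.0 if t < 1.0 else float("inf")
    return base ** (1.0 / (1.0 - t))
```

A quick check confirms exp_t(0) = 1, that exp_t inverts log_t, and the heavy-tail behavior for t > 1: exp_t(−10, t = 2) = 1/11, far larger than exp(−10).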
Fig. 2.4. The left figure depicts expt for the various values of t indicated. The right figure zooms in to better depict when expt can achieve the value zero.
Figure 2.4 shows some t-exponential functions. They are convex, increasing functions. It is obvious that exp_t decays towards 0 more slowly as t increases. This property leads to a family of heavy-tailed distributions for t > 1. Since log_φ is an increasing function, it follows that its inverse exp_φ is also an increasing function. Since φ is a non-decreasing function, ∇_x exp_φ(x) = (φ ∘ exp_φ)(x) is also an increasing function. Using Theorem 24.1 in [19], it follows that exp_φ is a strictly convex function.
2.3.3 Φ-Exponential Family of Distributions

[9] used the φ-exponential function to define the parametric distribution family

    p(z; θ) := exp_φ(⟨Φ(z), θ⟩ − G_φ(θ)),    (2.12)

where Φ(z) is a map from z ∈ Z to the sufficient statistics, and θ is the natural parameter. G_φ(θ) is the normalizer of the φ-exponential family, chosen such that

    ∫_Z exp_φ(⟨Φ(z), θ⟩ − G_φ(θ)) dz = 1;

unlike (2.4), G_φ(θ) cannot in general be recovered in closed form from ∫_Z exp_φ(⟨Φ(z), θ⟩) dz. A closely related distribution, which often appears when working with φ-exponential families, is the so-called escort distribution:

Definition 2.3.2 (Escort distribution) Let φ : [0, +∞) → [0, +∞) be strictly positive and non-decreasing on (0, +∞). For a φ-exponential family of distributions,

    q(z; θ) := φ(p(z; θ)) / Z(θ)    (2.13)

is called the escort distribution of p(z; θ). Here Z(θ) = ∫_Z φ(p(z; θ)) dz is the normalizing constant which ensures that the escort distribution integrates to 1.

One of the crucial properties of exponential families is that the log-partition function G is convex, and it can be used to generate cumulants of the distribution simply by taking derivatives.
Theorem 2.3.1 (Log-partition function [13]) If the regularity condition

    ∇_θ ∫_Z p(z; θ) dz = ∫_Z ∇_θ p(z; θ) dz    (2.14)

holds, then

    ∇_θ G(θ) = E[Φ(z)],    ∇²_θ G(θ) = Var[Φ(z)],    (2.15)

and G(θ) is convex.

The proof of the above theorem is included in Appendix B.1. Somewhat surprisingly, G_φ(θ) of the φ-exponential family shares similar properties with G(θ) of the exponential family. As the following theorem asserts, its first derivative can still be written as an expectation of Φ(z), but now with respect to the escort distribution, in contrast with Theorem 2.3.1. The proof of the theorem is included in Appendix B.2.

Theorem 2.3.2 (φ-log-partition function [9, 16]) The function G_φ(θ) is convex. Moreover, if the following regularity condition

    ∇_θ ∫_Z p(z; θ) dz = ∫_Z ∇_θ p(z; θ) dz    (2.16)

holds, then

    ∇_θ G_φ(θ) = E_{q(z;θ)}[Φ(z)].    (2.17)
Before moving on, we briefly discuss the regularity condition (2.16), which concerns the legality of swapping the differentiation over a parameter with the integration over the variables. Readers not interested in the following discussion may skip to Section 2.3.4. This is a fairly standard, yet technical, requirement, which is often proved using the Dominated Convergence Theorem (see e.g. Section 9.2 of [20]). It holds, for instance, when E_{q(z;θ)}|Φ(z)| < ∞ and |∇_θ G_φ(θ)| < ∞. Here |·| denotes the L1 norm. This condition may not hold for an arbitrary φ-exponential family. Here is one example:

Example 3 Let z ∈ [1, +∞) and Φ(z) = z. Consider the φ-exponential family where φ(x) = x^t (later referred to as the t-exponential family); using (2.10) and (2.12), the resulting density can be written as p(z; θ) = (1 + (1 − t)(θz − G_t(θ)))^{1/(1−t)}. If we compute

    E_{q(z;θ)}|Φ(z)| = E_{q(z;θ)}|z|
      = (1/Z(θ)) ∫_1^{+∞} (1 + (1 − t)(θz − G_t(θ)))^{t/(1−t)} |z| dz
      = (1/Z(θ)) ∫_1^{+∞} [z^{(1−t)/t} + (1 − t)(θz − G_t(θ)) z^{(1−t)/t}]^{t/(1−t)} dz
      = (1/Z(θ)) ∫_1^{+∞} [(1 − (1 − t)G_t(θ)) z^{(1−t)/t} + (1 − t)θ z^{1/t}]^{t/(1−t)} dz,
          with T1 := (1 − (1 − t)G_t(θ)) z^{(1−t)/t} and T2 := (1 − t)θ z^{1/t},

then whenever t ≥ 2 the integral diverges, because lim_{z→+∞} T1 + T2 = O(z^{1/t}), and hence ∫_1^{+∞} (T1 + T2)^{t/(1−t)} dz → +∞.

2.3.4 T-Exponential Family of Distributions
One of the most important members of the φ-exponential family of distributions is the t-exponential family of distributions, which is defined by using the exp_t function (2.10) in (2.12):

    p(z; θ) = exp_t(⟨Φ(z), θ⟩ − G_t(θ)).    (2.18)

In fact, the t-exponential family was first proposed in the 1980s by Tsallis [10, 21]⁴. The corresponding escort distribution is given by

    q(z; θ) = p(z; θ)^t / ∫_Z p(z; θ)^t dz.    (2.19)

As can be seen in Figure 2.4, exp_t, for t > 1, decays towards 0 more slowly than the exp function. Consequently, the t-exponential family of distributions becomes a family of heavy-tailed distributions for t > 1. Although the concept of the t-exponential family is relatively new, distributions that belong to this family have been widely used for years. For example, in linear regression problems, it is well known that the Gaussian distribution is not robust if extreme outliers exist.

⁴ Note that Tsallis used the term q-exponential family. However, we prefer using t-exponential family to avoid confusion between the exponent q and the escort distribution q.
19 Instead, the Student’s t-distribution is a common substitute in noisy dataset, see e.g. [22]. Interestingly, the Student’s t-distribution is actually a member of the t-exponential family. Example 4 (Student’s-t distribution) Recall that a k-dimensional Student’s-t distribution St(z |µ, Σ, v) with 0 < v < 2 degrees of freedom has the following probability density function: St(z |µ, Σ, v) =
Γ ((v + k)/2) (πv)k/2 Γ(v/2)| Σ |1/2
−(v+k)/2 1 + (z −µ)> (v Σ)−1 (z −µ) . (2.20)
Here Γ(·) denotes the usual Gamma function. To see that the Student’s-t distribution is a member of the t-exponential family, first set −(v + k)/2 = 1/(1 − t) and !−2/(v+k) Γ ((v + k)/2) Ψ= (πv)k/2 Γ(v/2)| Σ |1/2 to rewrite (2.20) as St(z |µ, Σ, v) = Ψ + Ψ · (z −µ)> (v Σ)−1 (z −µ)
1/(1−t)
.
(2.21)
Next set Φ(z) = [z; z z> ], θ = [−2Ψ K µ/(1 − t); Ψ K /(1 − t)] with K defined as K = (v Σ)−1 . Then define
Ψ hΦ(z), θi = z> K z −2µ> K z and 1−t Ψ 1 Gt (θ) = − µ> K µ + 1 + 1−t 1−t to rewrite (2.21) as St(z |µ, Σ, v) = (1 + (1 − t) (hΦ(z), θi − Gt (θ)))1/(1−t) . Comparing with (2.10) clearly shows that St(z |µ, Σ, v) = expt (hΦ(z), θi − Gt (θ)) . Using (2.19) and some simple algebra yields the escort distribution of Student’s-t distribution: q(z; θ) = St(z |µ, v Σ /(v + 2), v + 2) Interestingly, the mean of the Student’s-t pdf is µ, and its variance is v Σ /(v − 2) while the mean and variance of the escort are µ and Σ respectively.
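Example 4 can be checked numerically in the one-dimensional case (k = 1). The sketch below is our verification code, not from the dissertation; it evaluates (2.20) directly and via the exp_t parameterization with Ψ, K, and G_t as defined above, and compares the two:

```python
import math

def student_t_pdf(z, mu, sigma2, v):
    """k = 1 Student's-t density, direct transcription of (2.20)."""
    k = 1
    c = math.gamma((v + k) / 2) / (
        (math.pi * v) ** (k / 2) * math.gamma(v / 2) * math.sqrt(sigma2))
    return c * (1 + (z - mu) ** 2 / (v * sigma2)) ** (-(v + k) / 2)

def student_t_as_exp_t(z, mu, sigma2, v):
    """Same density written as exp_t(<Phi(z), theta> - G_t(theta)) per Example 4."""
    k = 1
    t = 1 + 2.0 / (v + k)                   # from -(v + k)/2 = 1/(1 - t)
    c = math.gamma((v + k) / 2) / (
        (math.pi * v) ** (k / 2) * math.gamma(v / 2) * math.sqrt(sigma2))
    psi = c ** (-2.0 / (v + k))
    K = 1.0 / (v * sigma2)
    inner = psi / (1 - t) * (z * z * K - 2 * mu * K * z)          # <Phi(z), theta>
    G = -psi / (1 - t) * (mu * K * mu + 1) + 1.0 / (1 - t)        # G_t(theta)
    x = inner - G
    return (1 + (1 - t) * x) ** (1.0 / (1 - t))                   # exp_t(x)

for z in (-2.0, 0.0, 0.7, 3.0):
    assert abs(student_t_pdf(z, 0.3, 1.5, 1.0)
               - student_t_as_exp_t(z, 0.3, 1.5, 1.0)) < 1e-12
```

For µ = 0, σ² = 1, v = 1 this recovers the Cauchy density 1/(π(1 + z²)), the heaviest-tailed member of the family (t = 2).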
2.4 Chapter Summary

In this chapter, we reviewed some background material that is used later in this dissertation. We illustrated the example by [4] and showed that convex losses are not tolerant to random label noise. We reviewed logistic regression and discussed its relation to the exponential family of distributions. Finally, we reviewed the t-exponential family as a special case of the more general φ-exponential family of distributions.
3. T-LOGISTIC REGRESSION

Logistic regression is not robust against random noise, essentially because its conditional distribution is modeled by an exponential family distribution. This chapter introduces a new algorithm, t-logistic regression. The motivation for t-logistic regression is the same as that for using the Student's t-distribution in linear regression [22]. In classification, we believe that the robustness of logistic regression can likewise be improved by using a heavy-tailed t-exponential family distribution. We show that t-logistic regression is Bayes-risk consistent and more robust against outliers than convex losses, although it may yield multiple local minima due to non-convexity. Finally, we conduct extensive experiments on tens of large-scale datasets and show that t-logistic regression is robust against various types of label noise and in practice does not get stuck in local minima.
3.1 Binary Classification

In t-logistic regression, we model the conditional likelihood of a data point (x, y) by a t-exponential family distribution,

    p(y | x; θ) = exp_t(⟨Φ(x, y), θ⟩ − G_t(x; θ)) = exp_t((y/2)⟨Φ(x), θ⟩ − G_t(x; θ)),    (3.1)

where t > 1, and the normalizer G_t(x; θ) is the solution of

    exp_t((1/2)⟨Φ(x), θ⟩ − G_t(x; θ)) + exp_t(−(1/2)⟨Φ(x), θ⟩ − G_t(x; θ)) = 1.    (3.2)

By defining â = ⟨Φ(x), θ⟩ and G_t(â) = G_t(x; θ), we can simplify (3.2) to

    exp_t(â/2 − G_t(â)) + exp_t(−â/2 − G_t(â)) = 1.    (3.3)

Note that G_t(â) = G_t(−â).
end
Gt (b a) ← − logt (1/Z(˜ a)) + ba2 ; The convergence of this iterative algorithm is verified in Appendix B.8. In practice, the algorithm takes less than 20 iterations to converge to an accuracy of 10−10 . We plot the t-logistic loss u l(x, y, θ) = − log expt ( − Gt (u)) 2 as the negative logarithm of (3.1) as a function of margin u = yb a in Figure 3.1. We find that the t-logistic loss is quasi-convex and bends down as the margin of a data point becomes too negative. The larger the t, the more is the bending down effect. As t = 1, the t-logistic regression reduces to logistic regression, and the loss function becomes convex. This is not surprising since at t = 1, the t-exponential family becomes the exponential family. 1
There are a few exceptions. For example, when t = 2, Gt (b a) =
q
1+
b a2 4 .
23 Mathematically, the bending of the loss is directly related to the gradient of t-logistic loss function. For a data point (x, y), the gradient with respect to θ is, y hΦ(x), θi − Gt (x; θ) ∇θ l(x, y, θ) = − ∇θ log expt 2 y t−1 y hΦ(x), θi − Gt (x; θ) expt hΦ(x), θi − Gt (x; θ) = − ∇θ 2 2 (3.4) 1 (yΦ(x) − Eq [yΦ(x)]) p(y| x; θ)t−1 2 1 = − (y − yq(y| x; θ) + yq(−y| x; θ)) Φ(x)p(y| x; θ)t−1 2
=−
(3.5)
= − yq(−y| x; θ)Φ(x) p(y| x; θ)t−1 | {z }
(3.6)
ξ
where q(y| x; θ) =
p(y| x;θ)t , p(y| x;θ)t +p(−y| x;θ)t
(3.4) is from (2.9), and (3.5) is from (3.1) and The-
orem 2.3.2. In (3.6), the gradient of the loss function of (x, y) is associated with a forgetting variable ξ, which disappears as t = 1. As u = y hΦ(x), θi gets more negative, ξ decreases accordingly. Intuitively, the existence of forgetting variable improves the robustness of tlogistic regression by forgetting the influence of the outliers with low likelihood. We will discuss robustness in more detail in Section 3.2.2.
3.2
Properties In this section, we are going to discuss three key properties of t-logistic regression.
Firstly, we verify Bayes-risk consistency, which is an important statistical property of a binary loss function. Secondly, we formally show that t-logistic regression is robust against outliers compared to logistic regression. Thirdly, from a practical point of view, we investigate the local minima of non-convex loss functions. We show that the empirical risks of almost all non-convex losses including t-logistic regression may have multiple local minima. However, in practice, we will show in Section 3.5 that t-logistic regression is stable.
24
loss
t = 1 (logistic)
6
t = 1.3 4 t = 1.6 t = 1.9 2 0-1 loss -4
-2
0
2
4
margin
Fig. 3.1. T -logistic loss for binary classification with four different t: t = 1.3, t = 1.6, t = 1.9. Unlike the logistic loss (which is t = 1), t-logistic loss bends down as u 0, which caps the influence from outliers.
25 3.2.1
Bayes-Risk Consistency
Since all surrogate losses are a substitute for the 0 − 1 loss, it is natural to ask whether a surrogate loss is statistical consistent. To answer this question, a crucial criterion which is known as Bayes-risk consistency is used (see e.g. [23, 24]). Let us denote η(x) = p(y = 1| x) to be the underlying true conditional distribution of the label y given x, and let b a= hΦ(x), θi. The expected risk of a binary loss function l is, Cl (η, b a) = Eη [l(yb a)] = ηl(b a) + (1 − η)l(−b a). Since sign(b a) predicts the label of point based on its feature x, Bayes-risk consistency requests the optimal b a∗ of the expected risk Cl (η, b a) given η to have the same sign as the Bayes decision rule, sign[b a∗ ] = sign[2η − 1].
(3.7)
[23] further shows that all the three convex surrogate loss functions in Table 1.1 are Bayes-risk consistent. Now let us verify the Bayes-Risk consistency for t-logistic loss l(yb a) = − log expt (
yb a − Gt (b a)). 2
We have Cl (η, b a) = ηl(b a) + (1 − η)l(−b a)
b a b a = −η log expt ( − Gt (b a)) − (1 − η) log expt (− − Gt (b a)) 2 2 b a b a = −η log expt ( − Gt (b a)) − (1 − η) log(1 − expt ( − Gt (b a))). 2 2 | {z }
(3.8)
=expt (− ab2 −Gt (b a))
where (3.8) is because of (3.3). Let us define r = expt ( ba2 − Gt (b a)), and then (3.8) becomes, −η log(r) − (1 − η) log(1 − r). We can obtain the optimal r∗ by taking the derivative of r and set it to 0, −
η 1−η − = 0, r∗ 1 − r∗
26 ∗
a∗ )). Since which yields r∗ = η. Therefore, the optimal b a∗ satisfies η = expt ( ba2 − Gt (b ∗
a∗ )), we can take logt of the two and substract them, which yields, 1 − η = expt (− ba2 − Gt (b b a∗ = logt η − logt (1 − η).
(3.9)
It is clear that (3.9) satisfies (3.7), because logt is an increasing function and b a∗ > 0 if and only η > 12 . Therefore, t-logistic loss is Bayes-risk consistent.
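The stationarity calculation above can be checked numerically. The following sketch (illustrative Python, not part of the dissertation; a brute-force grid search stands in for the derivative computation) verifies that the minimizer of $-\eta \log r - (1-\eta)\log(1-r)$ is indeed $r^* = \eta$:

```python
import math

def expected_risk(eta, r):
    # C_l(eta, r) = -eta*log(r) - (1-eta)*log(1-r), with r = exp_t(a/2 - G_t(a))
    return -eta * math.log(r) - (1.0 - eta) * math.log(1.0 - r)

def argmin_r(eta, steps=1000):
    # brute-force grid search over r in (0, 1)
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda r: expected_risk(eta, r))

for eta in (0.2, 0.5, 0.9):
    r_star = argmin_r(eta)
    assert abs(r_star - eta) < 1.0 / 1000 + 1e-12, (eta, r_star)
```

Since the risk is strictly convex in r, the grid minimizer lands on the grid point closest to η, matching the closed-form solution $r^* = \eta$.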
3.2.2
Robustness
In this section, we theoretically investigate the robustness of t-logistic regression. There is no unique definition of robustness (see e.g. [25, 26]); we will mainly focus on two of them. Both definitions, however, require computing the global optimum, which is infeasible for non-convex losses. Instead, we use necessary conditions of the definitions and propose a function $I_l(u)$ for visualization. Finally, we use $I_l(u)$ to classify binary losses into three robust types, and show that t-logistic regression is fundamentally different from the convex losses and from many other non-convex losses in Table 2.1.
Definitions of Robustness

Consider a dataset containing m data points $x_1, \ldots, x_m$ with labels $y_1, \ldots, y_m$, and assume that $\theta^*$ is the global optimum of the regularized risk

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, \theta) + \frac{\lambda}{2}\|\theta\|_2^2.$$

For simplicity, assume that the loss function $l(x, y, \theta)$ is continuous and differentiable. From the optimality condition of a differentiable² objective function, $\theta^*$ must satisfy

$$\nabla_\theta J(\theta^*) = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta l(x_i, y_i, \theta^*) + \lambda\,\theta^* = 0. \qquad (3.10)$$

² For nondifferentiable functions, one can replace the gradient by the subgradient and obtain a similar optimality condition.

Now assume that the dataset is augmented by a contaminated example $(\hat{x}, \hat{y})$. The optimum on the contaminated dataset becomes $\hat{\theta}^*$, and it must satisfy

$$\nabla_\theta J(\hat{\theta}^*) + \frac{1}{m}\nabla_\theta l(\hat{x}, \hat{y}, \hat{\theta}^*) = 0. \qquad (3.11)$$

The robustness of a loss function is essentially determined by the sensitivity of the optimum to the addition of a contaminated example, namely the difference between $\theta^*$ and $\hat{\theta}^*$. The two definitions of robustness that we consider are:

Definition 3.2.1 (Inspired by the influence function in [25]) For any dataset $(x_1, y_1), \ldots, (x_m, y_m)$ and $(\hat{x}, \hat{y})$,

$$\lim_{m \to \infty} \hat{\theta}^* \to \theta^*.$$

Definition 3.2.2 (Outlier proneness in [26]) For any dataset $(x_1, y_1), \ldots, (x_m, y_m)$ and $(\hat{x}, \hat{y})$,

$$\lim_{\|\Phi(\hat{x})\|_2 \to \infty} \hat{\theta}^* \to \theta^*,$$

where $\Phi(x)$ is a feature map from $\mathcal{X}$ to $\mathbb{R}^d$.

Roughly speaking, Definition 3.2.1 states that a robust model should not be affected too much by changing a small portion of the data, and Definition 3.2.2 states that a robust model should ignore any extreme outliers. However, for non-convex losses it is very hard to characterize the difference between $\theta^*$ and $\hat{\theta}^*$, because the regularized risk may have multiple local minima (see Section 3.2.3). On the other hand, from (3.11) it is clear that a necessary condition for $\hat{\theta}^* \to \theta^*$ is that

$$\frac{1}{m}\|\nabla_\theta l(\hat{x}, \hat{y}, \theta^*)\|_2 \to 0. \qquad (3.12)$$

Therefore, instead of working directly with Definitions 3.2.1 and 3.2.2, we investigate robustness via their necessary conditions, given in Definitions 3.2.3 and 3.2.4 respectively.
Definition 3.2.3 For any x, y and θ, $\|\nabla_\theta l(x, y, \theta)\|_2 < \infty$.

Definition 3.2.4 For any x, y and θ, $\lim_{\|\Phi(x)\|_2 \to \infty} \|\nabla_\theta l(x, y, \theta)\|_2 = 0$.
Robust Types

Since $\|\nabla_\theta l(x, y, \theta)\|_2$ involves both θ and (x, y), it is more convenient to work with a function of the margin $u = y\langle\Phi(x), \theta\rangle$ for visualization. To this end, we take the inner product between $\nabla_\theta l(x, y, \theta)$ and θ, and define a new function $I_l(u)$,

$$\langle\nabla_\theta l(x, y, \theta), \theta\rangle = \langle l'(u)\, y\, \Phi(x), \theta\rangle = l'(u)\, u =: I_l(u),$$

where $l(u) := l(x, y, \theta)$ and $l'(u) \leq 0$ for all losses of interest in this dissertation. Furthermore, the following two lemmas show that $|I_l(u)|$ and $\|\nabla_\theta l(x, y, \theta)\|_2$ define robustness almost equivalently. The proofs of the lemmas are provided in Appendix B.3 and B.4.

Lemma 3.2.1 If $|I_l(u)| < \infty$, then for any θ, x and y, $p(\|\nabla_\theta l(x, y, \theta)\|_2 < \infty) = 1$. Furthermore, $\|\nabla_\theta l(x, y, \theta)\|_2 \to \infty$ if and only if the angle ψ between $\Phi(x)$ and θ is equal to π/2 and $\|\Phi(x)\|_2 \to \infty$.

Lemma 3.2.2 If $\lim_{u \to -\infty} |I_l(u)| = 0$, then for any θ, x and y,

$$p\!\left(\lim_{\|\Phi(x)\|_2 \to \infty} \|\nabla_\theta l(x, y, \theta)\|_2 = 0\right) = 1.$$

Furthermore, $\lim_{\|\Phi(x)\|_2 \to \infty} \|\nabla_\theta l(x, y, \theta)\|_2 \neq 0$ if and only if the angle ψ between $\Phi(x)$ and θ is equal to π/2.
Fig. 3.2. An illustration of $I_l(u)$ for the three robust types: logistic (Type-0), t-logistic (Type-I), and Savage (Type-II). All three types of losses behave similarly for u > 0. As u → −∞, the Type-0 loss goes to +∞, the Type-I loss goes to a constant, and the Type-II loss goes to 0.
Since all commonly used losses are continuous on $u \in \mathbb{R}$, we have $|l'(u)| < \infty$ for $|u| < \infty$. Therefore, $|I_l(u)|$ can only be unbounded as $u \to -\infty$. Based on $\lim_{u \to -\infty} |I_l(u)|$, we classify binary losses into three robust types:

• Robust Loss 0 (Type-0): $\lim_{u \to -\infty} |I_l(u)| = \infty$.
• Robust Loss I (Type-I): $0 < \lim_{u \to -\infty} |I_l(u)| < \infty$.
• Robust Loss II (Type-II): $\lim_{u \to -\infty} |I_l(u)| = 0$.

An illustration of the three types of binary losses is provided in Figure 3.2, and Table 3.1 classifies some common binary losses by their robust type. It is easy to verify that all convex losses belong to Type-0; verifications for the other losses are provided in Appendix B.9. In particular, this classification differentiates t-logistic regression (Type-I) from Type-0 losses such as logistic regression, as well as from Type-II non-convex losses such as the Savage loss. In later experiments, we will empirically compare these different types of losses.
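To make the classification concrete, the closed forms of $I_l(u) = l'(u)\,u$ for the logistic and Savage losses (using the loss expressions from Table 3.1) can be evaluated at a very negative margin. This is an illustrative sketch, not code from the dissertation:

```python
import math

def I_logistic(u):
    # logistic: l(u) = log(1 + exp(-u)), so l'(u) = -1/(1 + exp(u))
    return -u / (1.0 + math.exp(u))

def I_savage(u):
    # Savage: l(u) = 4/(1 + exp(u))^2, so l'(u) = -8*exp(u)/(1 + exp(u))^3
    return -8.0 * math.exp(u) / (1.0 + math.exp(u)) ** 3 * u

u = -1000.0  # a badly misclassified (outlier) point
assert abs(I_logistic(u)) > 999.0  # Type-0: |I_l(u)| grows without bound
assert abs(I_savage(u)) < 1e-6     # Type-II: |I_l(u)| vanishes
```

The t-logistic loss sits between the two: $|I_l(u)|$ approaches a finite nonzero constant as $u \to -\infty$, so an outlier's influence is capped but not discarded entirely.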
Table 3.1
The robustness of some loss functions for binary classification based on $I_l(u)$. The verifications are provided in Appendix B.9.

Name        | Loss Function                                       | Robust Type
Hinge       | l(x, y, θ) = max(0, 1 − u)                          | 0
Exponential | l(x, y, θ) = exp(−u)                                | 0
Logistic    | l(x, y, θ) = log(1 + exp(−u))                       | 0
t-logistic  | l(x, y, θ) = −log exp_t(u/2 − G_t(u))               | I
Probit      | l(x, y, θ) = 1 − erf(u)                             | II
Ramp        | l(x, y, θ) = min(2, max(0, 1 − u))                  | II
Savage      | l(x, y, θ) = 4/(1 + exp(u))²                        | II
Sigmoid     | l(x, y, θ) = 2/(1 + exp(u))                         | II

3.2.3
Multiple Local Minima
One of the key disadvantages of non-convex losses is that their empirical risks may have multiple local minima. To illustrate this, we used a two-dimensional toy dataset containing 50 points drawn uniformly from [−2, 2] × [−2, 2]. We plot the empirical risk of the t-logistic loss as well as the Savage loss in Figure 3.3. As can be seen, the Savage loss yields a highly non-convex objective function with a large number of local optima. In contrast, even though we are averaging over non-convex loss functions, the resulting objective of t-logistic regression has a single global optimum. This behavior persists when we use different random samples, change the sampling scheme, or vary the number of data points. Moving to higher dimensional datasets such as Adult, USPS, and Web8³, we initialize the algorithm with different randomly chosen starting points and check the solution obtained. The algorithm always arrives at the same solution (within numerical precision) [27].

³ All available from http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/.
Fig. 3.3. The empirical risk of t-logistic regression (upper) and Savage loss (lower) on a toy two-dimensional dataset. t-logistic regression appears to be easier to optimize than Savage loss.
This interesting behavior once led us to conjecture that t-logistic regression has only one local minimum, which is the global minimum. However, in the next theorem, we show that this conjecture is wrong. For any non-convex loss function, under some mild conditions, one can always construct a dataset whose empirical risk has multiple local minima. To the best of our knowledge, all existing non-convex losses satisfy these conditions. The theorem considers the case where the feature is 1-dimensional; the generalization to the multi-dimensional setting is straightforward. The proof of the theorem is in Appendix B.5.

Theorem 3.2.3 Consider a loss function $l(u) := l(x, y, \theta)$ that is smooth at $u := y\theta x = 0$. If $l'(0) < 0$, and there exist $u_1 < 0$, $u_2 > 0$, and $\epsilon > 0$ such that $l'(u_i) \geq l'(0) + \epsilon$ for $i = 1, 2$, then there exists a dataset whose empirical risk $R_{\mathrm{emp}}(\theta)$ has at least two local minima.

An interesting observation is that these local minima are related to the robustness of non-convex losses. To see this, consider the following 1-dimensional example, which includes 30 clean data points with $(x_i, y_i) = (1, 1)$ and one outlier with $(x, y) = (-200, 1)$. We plot the empirical risk as a function of θ for both logistic regression and t-logistic regression in Figure 3.4. Before the outlier is added, both the logistic loss (red dashed) and the t-logistic loss (purple solid) yield the same optimum. Once the outlier is added, however, the optimum of logistic regression is severely impacted by the outlier (blue dashed). On the other hand, t-logistic regression, although the outlier creates another local minimum, retains the same global optimum $\theta^*$ as without the outlier (green solid).
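The mechanism behind this 1-dimensional example can be sketched numerically. The code below implements the binary t-logistic loss for t = 1.5, solving for $G_t(\hat{a})$ by bisection (an illustrative choice standing in for the fixed-point iteration of the next section; the bracket and tolerances are assumptions, not from the dissertation), and contrasts how much a single outlier's loss can grow:

```python
import math

T = 1.5  # the value of t used throughout the experiments

def exp_t(x):
    # t-exponential: [1 + (1-t)x]_+^{1/(1-t)}; for t > 1 it diverges once the base hits 0
    base = 1.0 + (1.0 - T) * x
    if base <= 0.0:
        return float("inf")
    return base ** (1.0 / (1.0 - T))

def G_t(a):
    # solve exp_t(a/2 - G) + exp_t(-a/2 - G) = 1 for G by bisection
    f = lambda G: exp_t(a / 2.0 - G) + exp_t(-a / 2.0 - G) - 1.0
    lo, hi = abs(a) / 2.0 - 2.0, abs(a) / 2.0 + 50.0  # f(lo) = +inf, f(hi) < 0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def t_logistic_loss(u):
    # l(u) = -log exp_t(u/2 - G_t(u)); G_t depends only on |a|, so u can be used directly
    return -math.log(exp_t(u / 2.0 - G_t(u)))

def logistic_loss(u):
    # numerically stable log(1 + exp(-u))
    return max(0.0, -u) + math.log1p(math.exp(-abs(u)))

# sanity check: at u = 0 both labels are equally likely, so the loss is log 2
assert abs(t_logistic_loss(0.0) - math.log(2.0)) < 1e-6

# doubling the outlier's (negative) margin adds ~200 to the logistic loss ...
assert logistic_loss(-400.0) - logistic_loss(-200.0) > 190.0
# ... but only a small constant to the t-logistic loss
assert t_logistic_loss(-400.0) - t_logistic_loss(-200.0) < 2.0
```

This slow (logarithmic) growth of the t-logistic loss at very negative margins is what lets the 30 clean points keep the global optimum in place, while the unbounded logistic terms drag the optimum away.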
3.3
Multiclass Classification

In this section, we extend t-logistic regression to multiclass classification, where the dataset consists of data points $\{x_1, \ldots, x_m\}$ and corresponding labels $\{y_1, \ldots, y_m\}$ with $y_i$ taking values in $\{1, \ldots, C\}$. Let us first briefly review multiclass logistic regression; the generalization to t-logistic regression is straightforward.
Fig. 3.4. Empirical risk of logistic regression and t-logistic regression on the one-dimensional example. The optimal solutions before and after adding the outlier are significantly different for logistic regression. In contrast, the global optimum of t-logistic regression stays the same.
In multiclass logistic regression, the conditional likelihood of a label y given x is $p(y\,|\,x; \theta) = \exp(\langle\Phi(x, y), \theta\rangle - G(x; \theta))$, where

$$\Phi(x, y) = (\underbrace{0, \ldots, 0}_{1,\ldots,y-1}, \Phi(x), \underbrace{0, \ldots, 0}_{y+1,\ldots,C}), \qquad \theta = (\theta_1, \ldots, \theta_C),$$

with $\Phi(x) : \mathcal{X} \to \mathbb{R}^d$. Here 0 denotes the d-dimensional all-zero vector, and θ is a (d · C)-dimensional vector. Therefore,

$$p(y\,|\,x; \theta) = \exp(\langle\Phi(x), \theta_y\rangle - G(x; \theta)), \qquad (3.13)$$

where the log-partition function is

$$G(x; \theta) = \log\!\left(\sum_{c=1}^{C} \exp(\langle\Phi(x), \theta_c\rangle)\right). \qquad (3.14)$$

The multiclass logistic loss is the negative log-likelihood of a point (x, y), which equals

$$l(x, y, \theta) = -\log p(y\,|\,x; \theta) = \log\!\left(\sum_{c=1}^{C} \exp(\langle\Phi(x), \theta_c - \theta_y\rangle)\right).$$
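The multiclass logistic loss above is usually computed by subtracting the maximum score before exponentiating, to avoid overflow; a small illustrative sketch (not from the dissertation):

```python
import math

def multiclass_logistic_loss(scores, y):
    # scores[c] = <Phi(x), theta_c>; loss = log sum_c exp(scores[c] - scores[y])
    m = max(scores)  # subtract the max for numerical stability
    return math.log(sum(math.exp(s - m) for s in scores)) + m - scores[y]

# two equally scored classes: the loss is log 2
assert abs(multiclass_logistic_loss([0.0, 0.0], 0) - math.log(2.0)) < 1e-12
# a confidently correct prediction has near-zero loss, even with huge scores
assert multiclass_logistic_loss([1000.0, 0.0, 0.0], 0) < 1e-6
```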
The main idea of multiclass t-logistic regression is the same as in binary t-logistic regression. The conditional likelihood of a data point (x, y) is modeled by a conditional t-exponential family of distributions (t > 1):

$$p(y\,|\,x; \theta) = \exp_t(\langle\Phi(x, y), \theta\rangle - G_t(x; \theta)) = \exp_t(\langle\Phi(x), \theta_y\rangle - G_t(x; \theta)), \qquad (3.15)$$

where the log-partition function $G_t$ satisfies

$$\sum_{c=1}^{C} \exp_t(\langle\Phi(x), \theta_c\rangle - G_t(x; \theta)) = 1. \qquad (3.16)$$

Let $\hat{a}_c = \langle\Phi(x), \theta_c\rangle$ and $G_t(\hat{a}) = G_t(x; \theta)$; then we can simplify (3.16) as

$$\sum_{c=1}^{C} \exp_t(\hat{a}_c - G_t(\hat{a})) = 1.$$
Algorithm 2: Iterative algorithm for computing $G_t$ in multiclass t-logistic regression.
  Input: $\hat{a}$
  Output: $G_t(\hat{a})$
  $\hat{a}^* \leftarrow \max(\hat{a})$;
  $\tilde{a} \leftarrow \hat{a} - \hat{a}^*$;
  while $\tilde{a}$ not converged do
    $Z(\tilde{a}) \leftarrow \sum_{c=1}^{C} \exp_t(\tilde{a}_c)$;
    $\tilde{a} \leftarrow Z(\tilde{a})^{1-t}(\hat{a} - \hat{a}^*)$;
  end
  $G_t(\hat{a}) \leftarrow -\log_t(1/Z(\tilde{a})) + \hat{a}^*$;
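Algorithm 2 transcribes directly into Python. The sketch below is illustrative (the tolerance, iteration cap, and test vector are assumptions, not from the dissertation), and it checks the defining property $\sum_c \exp_t(\hat{a}_c - G_t) = 1$ at the returned value:

```python
import math

def exp_t(x, t):
    # t-exponential: [1 + (1-t)x]_+^{1/(1-t)}
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        return 0.0 if t < 1.0 else float("inf")
    return base ** (1.0 / (1.0 - t))

def log_t(x, t):
    # t-logarithm, the inverse of exp_t: (x^{1-t} - 1)/(1-t)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def G_t(a_hat, t=1.5, tol=1e-12, max_iter=10000):
    # Algorithm 2: fixed-point iteration for the multiclass log-partition function
    a_star = max(a_hat)
    a_tilde = [a - a_star for a in a_hat]
    for _ in range(max_iter):
        Z = sum(exp_t(a, t) for a in a_tilde)
        new = [Z ** (1.0 - t) * (a - a_star) for a in a_hat]
        converged = max(abs(n - o) for n, o in zip(new, a_tilde)) < tol
        a_tilde = new
        if converged:
            break
    Z = sum(exp_t(a, t) for a in a_tilde)
    return -log_t(1.0 / Z, t) + a_star

# the result must normalize the conditional distribution: sum_c exp_t(a_c - G) = 1
a = [1.0, 0.0, -1.0]
G = G_t(a)
assert abs(sum(exp_t(ac - G, 1.5) for ac in a) - 1.0) < 1e-6
```

At the fixed point $\tilde{a} = Z(\tilde{a})^{1-t}(\hat{a} - \hat{a}^*)$, one can verify algebraically that $\exp_t(\hat{a}_c - G) = \exp_t(\tilde{a}_c)/Z(\tilde{a})$ with the returned G, so the normalization holds exactly.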
Table 3.2
Average time (in milliseconds) spent by our iterative scheme and fsolve in computing $\hat{G}_t(\hat{a})$ for multiclass t-logistic regression.

C         | 10  | 20  | 30  | 40  | 50  | 60  | 70   | 80   | 90   | 100
fsolve    | 8.1 | 8.3 | 8.1 | 8.7 | 9.6 | 9.8 | 10.0 | 10.2 | 10.3 | 10.7
iterative | 0.3 | 0.3 | 0.3 | 0.4 | 0.4 | 0.4 | 0.3  | 0.5  | 0.3  | 0.4
An iterative algorithm that generalizes the one used in binary classification is applied to compute $G_t(\hat{a})$; it is described in Algorithm 2. In practice, Algorithm 2 scales well with the number of classes C, making it efficient enough for problems involving a large number of classes. To illustrate, we let C ∈ {10, 20, ..., 100}, randomly generate $\hat{a} \in [-10, 10]^C$, and compute the corresponding $\hat{G}_t(\hat{a})$. We compare the time spent estimating $\hat{G}_t(\hat{a})$ by the iterative scheme and by calling the Matlab fsolve function, averaged over 100 random generations, using Matlab 7.1 on a 2.93 GHz dual-core CPU. The results are presented in Table 3.2.

For a data point (x, y), the partial derivative of the multiclass t-logistic loss function with respect to $\theta_n$, where $n \in \{1, \ldots, C\}$, is

$$-\frac{\partial}{\partial \theta_n} \log p(y\,|\,x; \theta) = -\frac{\partial}{\partial \theta_n} \log \exp_t(\langle\Phi(x), \theta_y\rangle - G_t(x; \theta))$$
$$= -(\delta(y = n)\Phi(x) - \mathbb{E}_q[\Phi(x, y)])\, p(y\,|\,x; \theta)^{t-1}$$
$$= -\Phi(x) \cdot \Big(\delta(y = n) - \sum_{c=1}^{C} \delta(c = n)\, q(c\,|\,x; \theta)\Big)\, p(y\,|\,x; \theta)^{t-1}$$
$$= -\Phi(x) \cdot (\delta(y = n) - q(n\,|\,x; \theta))\, \underbrace{p(y\,|\,x; \theta)^{t-1}}_{\xi}, \qquad (3.17)$$

where $q(n\,|\,x; \theta) = \frac{p(n\,|\,x; \theta)^t}{\sum_{c=1}^{C} p(c\,|\,x; \theta)^t}$. In (3.17), the gradient of (x, y) contains a forgetting variable $\xi = p(y\,|\,x; \theta)^{t-1}$. Just as in binary classification, when t > 1 the influence of points with low likelihood $p(y\,|\,x; \theta)$ is capped by the ξ variable.
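Given the class probabilities $p(c\,|\,x;\theta)$, the per-class gradient formula (3.17) and the forgetting variable ξ can be sketched as follows (an illustrative transcription; the function name and test values are assumptions):

```python
def t_logistic_gradient(phi_x, y, p, t=1.5):
    # p[c] = p(c|x;theta); q(c) = p(c)^t / sum_c p(c)^t as in (3.17)
    pt = [pc ** t for pc in p]
    Zq = sum(pt)
    q = [v / Zq for v in pt]
    xi = p[y] ** (t - 1.0)  # forgetting variable: small when p(y|x;theta) is small
    # partial derivative w.r.t. theta_n: -Phi(x) * (delta(y=n) - q(n)) * xi
    return [[-x * ((1.0 if n == y else 0.0) - q[n]) * xi for x in phi_x]
            for n in range(len(p))]

grads = t_logistic_gradient([1.0, 2.0], y=0, p=[0.7, 0.2, 0.1])
# the per-class gradients sum to the zero vector, since sum_n (delta(y=n) - q(n)) = 0
for j in range(2):
    assert abs(sum(g[j] for g in grads)) < 1e-12
```

Note how ξ multiplies the whole gradient: a noisy point with small $p(y\,|\,x;\theta)$ contributes little, which is exactly the capping effect discussed above.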
The definition of Bayes-risk consistency for multiclass classification losses was first discussed in [28]. As one can easily verify, multiclass t-logistic regression is also Bayes-risk consistent (see Appendix B.10 for verification).
3.4
Optimization Methods

In this section, we consider practical issues, including how to optimize the objective function of t-logistic regression. The most straightforward approach is a gradient-based method such as L-BFGS (see Section A.2 for details). In particular, the gradient of t-logistic regression is given in (3.6) for binary classification and in (3.17) for multiclass classification. Although in our experiments the algorithm converged every time using the L-BFGS solver, it is important to note that there is no convergence guarantee for L-BFGS on non-convex objective functions. In the remainder of this section, we provide a different approach that is guaranteed to converge. For clarity, we discuss how to optimize the empirical risk; the regularized risk can be optimized in a similar way, as was done in [27].
3.4.1
Convex Multiplicative Programming
For t > 1, instead of directly minimizing $R_{\mathrm{emp}}(\theta) = -\log p(y\,|\,X; \theta)$, one can equivalently minimize the objective function $p(y\,|\,X; \theta)^{1-t}$:

$$\mathcal{P}(\theta) \triangleq p(y\,|\,X; \theta)^{1-t} = \prod_{i=1}^{m} p(y_i\,|\,x_i; \theta)^{1-t} \qquad (3.18)$$
$$= \prod_{i=1}^{m} \underbrace{\big(1 + (1 - t)(\langle\Phi(x_i, y_i), \theta\rangle - G_t(x_i; \theta))\big)}_{l_i(\theta)}. \qquad (3.19)$$

Since t > 1 and $G_t(x_i; \theta)$ is convex, it is easy to see that each component $l_i(\theta)$ is positive and convex. Therefore, $\mathcal{P}(\theta)$ is a product of positive convex functions $l_i(\theta)$. Minimizing such a function is known as convex multiplicative programming [29].
The optimal solutions to problem (3.19) can be obtained by solving the following parametric problem (see Theorem 2.1 of [29]):

$$\min_{\zeta} \min_{\theta}\; \mathrm{MP}(\theta, \zeta) \triangleq \sum_{i=1}^{m} \zeta_i\, l_i(\theta) \quad \text{s.t.} \quad \zeta > 0,\; \prod_{i=1}^{m} \zeta_i \geq 1. \qquad (3.20)$$
Exact algorithms have been proposed for solving (3.20) (for instance, [29]). However, their computational cost grows exponentially with m, which makes them impractical for our purposes. Instead, we apply the following block coordinate descent method, the ζ-θ algorithm, whose main idea is to minimize (3.20) with respect to θ and ζ separately.

ζ-Step: Assume that θ is fixed, and denote $\tilde{l}_i = l_i(\theta)$ to rewrite (3.20) as:

$$\min_{\zeta} \mathrm{MP}(\theta, \zeta) = \min_{\zeta} \sum_{i=1}^{m} \zeta_i\, \tilde{l}_i \quad \text{s.t.} \quad \zeta > 0,\; \prod_{i=1}^{m} \zeta_i \geq 1. \qquad (3.21)$$
Since the objective function is linear in ζ and the feasible region is a convex set, (3.21) is a convex optimization problem. Introducing a non-negative Lagrange multiplier γ ≥ 0, the Lagrangian and its partial derivative with respect to $\zeta_{i'}$ are

$$L(\zeta, \gamma) = \sum_{i=1}^{m} \zeta_i\, \tilde{l}_i + \gamma \cdot \Big(1 - \prod_{i=1}^{m} \zeta_i\Big), \qquad (3.22)$$
$$\frac{\partial}{\partial \zeta_{i'}} L(\zeta, \gamma) = \tilde{l}_{i'} - \gamma \prod_{i \neq i'} \zeta_i. \qquad (3.23)$$

Setting the gradient to 0 gives $\gamma = \tilde{l}_{i'} / \prod_{i \neq i'} \zeta_i$. Since $\tilde{l}_{i'} > 0$, it follows that γ cannot be 0. By the K.K.T. conditions [3], $\prod_{i=1}^{m} \zeta_i = 1$. This in turn implies that $\gamma = \tilde{l}_{i'}\, \zeta_{i'}$, or

$$(\zeta_1, \ldots, \zeta_m) = (\gamma/\tilde{l}_1, \ldots, \gamma/\tilde{l}_m), \quad \text{with} \quad \gamma = \prod_{i=1}^{m} \tilde{l}_i^{\frac{1}{m}}. \qquad (3.24)$$

There is an obvious connection between $\zeta_i$ and the forgetting variable $\xi_i$, because $\zeta_i \propto 1/\tilde{l}_i = p(y_i\,|\,x_i; \theta)^{t-1} = \xi_i$.
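The closed-form ζ-step in (3.24) amounts to dividing the geometric mean of the per-example losses by each loss. A short illustrative sketch (using logs for numerical safety; the test values are assumptions):

```python
import math

def zeta_step(l_tilde):
    # (3.24): zeta_i = gamma / l_i, with gamma the geometric mean of the l_i
    m = len(l_tilde)
    gamma = math.exp(sum(math.log(v) for v in l_tilde) / m)
    return [gamma / v for v in l_tilde]

zeta = zeta_step([0.5, 2.0, 4.0])
assert abs(math.prod(zeta) - 1.0) < 1e-12  # feasibility: product of zeta_i is 1
assert all(z > 0.0 for z in zeta)          # positivity
assert zeta[0] == max(zeta)                # the well-fit point (small loss) gets the largest weight
```

The last assertion mirrors the connection to the forgetting variable: points with large loss (low likelihood) receive small weights ζ in the next θ-step.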
θ-Step: In this step we fix ζ and solve for the optimal θ. This step is essentially the same as logistic regression, except that each component carries a weight $\zeta_i$:

$$\min_{\theta} \mathrm{MP}(\theta, \zeta) = \min_{\theta} \sum_{i=1}^{m} \zeta_i\, l_i(\theta), \qquad (3.25)$$

and the gradient is

$$\nabla_\theta \mathrm{MP}(\theta, \zeta) = (1 - t) \sum_{i=1}^{m} \zeta_i\, (\Phi(x_i, y_i) - \mathbb{E}_q[\Phi(x_i, y_i)]). \qquad (3.26)$$
The gradient in (3.26) is very similar to the gradients in (3.6) and (3.17). The main difference is that the ζ-θ algorithm computes ζ and θ in two separate steps, while the gradient-based method computes ξ and θ in one step. However, the advantage of the ζ-θ algorithm is its convergence guarantee, shown in the following theorem. The proof is provided in Appendix B.6.

Theorem 3.4.1 The ζ-θ algorithm converges to a stationary point of the convex multiplicative programming problem.
3.5
Empirical Evaluation

We used 26 publicly available binary classification datasets and 9 multiclass classification datasets, and focused our study on two aspects: the generalization performance under various noise models, and the stability of the solution under random initialization. As comparators we use logistic regression and the Savage loss⁴. Our main observation from these extensive experiments is that t-logistic regression is more robust than logistic regression when the dataset is mixed with label noise. On the other hand, compared to the Savage loss, which often gets stuck in different local minima under random initialization, t-logistic regression appears to be much more stable. These two observations make t-logistic regression an attractive algorithm for classification.

Datasets  Table 3.3 summarizes the binary classification datasets used in our experiments. adult9, astro-ph, news20, real-sim, reuters-c11, reuters-ccat are from the same source as in [30]. aut-avn is from Andrew McCallum's home page⁵,
⁴ The multiclass Savage loss is defined as
$$l(x, y; \theta) = \sum_{c=1}^{C} \left( \delta(y = c) - \frac{\exp(\langle \Phi(x), \theta_c \rangle)}{\sum_{c'=1}^{C} \exp(\langle \Phi(x), \theta_{c'} \rangle)} \right)^2.$$
⁵ http://www.cs.umass.edu/˜mccallum/data/sraa.tar.gz.
covertype is from the UCI repository [31], worm is from [32], and kdd99 is from KDD Cup 1999⁶, while web8, webspam-u, webspam-t⁷, as well as kdda and kddb⁸, are from the LibSVM binary data collection⁹. The alpha, beta, delta, dna, epsilon, gamma, ocr and zeta datasets were obtained from the Pascal Large Scale Learning Workshop website [33]. measewyner is a synthetic dataset proposed in [34]: the input x is a 20-dimensional vector where each coordinate is uniformly distributed on [0, 1], and the label y is +1 if $\sum_{j=1}^{5} x_j \geq 2.5$ and −1 otherwise. Table 3.4 summarizes the multiclass classification datasets. For the dna and ocr binary classification datasets, we used the same training and testing partition as in [35] (80% for training and 20% for testing). For all other datasets, we used 70% of the labeled data for training and the remaining 30% for testing. In all cases, we added a constant feature as bias.

Optimization algorithms  We optimize the empirical risk with an L2 regularizer using L-BFGS. We implemented all the loss functions using PETSc and TAO, which allows efficient use of large-scale parallel linear algebra. We used the Limited Memory Variable Metric (lmvm) variant of L-BFGS implemented in TAO. The optimization terminates when the decrease in the objective function value and the norm of the gradient are less than 10⁻¹⁰, or when the maximum of 1000 function evaluations is reached.

Implementation and Hardware  All experiments are conducted on the Rossmann computing cluster at Purdue University, where each node has two 2.1 GHz 12-core AMD 6172 processors with 48 GB of physical memory. We ran our algorithms with 4 cores on one single

⁶ http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
⁷ webspam-u is the webspam-unigram dataset and webspam-t is the webspam-trigram dataset. The original dataset can be found at http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html.
⁸ These datasets were derived from KDD Cup 2010. kdda is the first problem, algebra 2008 2009, and kddb is the second problem, bridge to algebra 2008 2009.
⁹ http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/binary.html.
Table 3.3
Summary of the binary classification datasets used in our experiments. n is the total # of examples, d is the # of features, and n+ : n− is the ratio of the number of positive vs negative examples. M denotes a million. * denotes the datasets that are unused in Chapter 6 due to high computational costs.

dataset       | n       | d       | n+ : n−
adult9        | 48,842  | 123     | 0.32
alpha         | 500,000 | 500     | 1.00
astro-ph      | 94,856  | 99,757  | 0.31
aut-avn       | 71,066  | 20,707  | 1.84
beta          | 500,000 | 500     | 1.00
covertype     | 581,012 | 54      | 0.57
delta         | 500,000 | 500     | 1.00
dna*          | 50.00 M | 800     | 3e−3
epsilon*      | 500,000 | 2000    | 1.00
gamma         | 500,000 | 500     | 1.00
kdd99         | 5.21 M  | 127     | 4.04
kdda*         | 8.92 M  | 20.22 M | 5.80
kddb*         | 20.01 M | 29.89 M | 6.18
longservedio  | 2000    | 21      | 1.00
measewyner    | 2000    | 20      | 1.00
mushrooms     | 8124    | 112     | 1.07
news20        | 19,954  | 7.26 M  | 1.00
ocr*          | 3.50 M  | 1156    | 0.96
real-sim      | 72,201  | 2.97 M  | 0.44
reuters-c11   | 804,414 | 1.76 M  | 0.03
reuters-ccat  | 804,414 | 1.76 M  | 0.90
web8          | 59,245  | 300     | 0.03
webspam-t*    | 350,000 | 16.61 M | 1.54
webspam-u     | 350,000 | 254     | 1.54
worm          | 1.03 M  | 804     | 0.06
zeta          | 500,000 | 800.4 M | 1.00
node for all datasets, except the dna and ocr binary classification datasets, where we used 16 cores across 16 nodes with 30 GB of memory per node.
3.5.1
Noise Models
One of the main objectives of our experiment is to test the robustness of the classification algorithms under different label noise models. Therefore, we implement the following three kinds of noise models. For binary classification, the three types of noise models are generated in the following ways using a flipping constant ρ ∈ [0, 1]:
Table 3.4
Summary of the multiclass classification datasets used in our experiments. n is the total # of examples, d is the # of features, nc is the # of classes. * denotes the datasets that are unused in Chapter 6 due to high computational costs.

dataset         | n       | d      | nc
dna             | 2,586   | 182    | 3
letter          | 15,500  | 18     | 26
mnist           | 70,000  | 782    | 10
protein         | 21,516  | 359    | 3
rcv1*           | 534,130 | 47,238 | 52
sensitacoustic  | 98,528  | 52     | 3
sensitcombined  | 98,528  | 102    | 3
sensitseismic   | 98,528  | 52     | 3
usps            | 9298    | 258    | 10
Uniform Noise (Noise-1)  In the noise-1 model, we flip the label of each training example uniformly with probability ρ (see Algorithm 3).

Algorithm 3: Algorithm for generating the noise-1 model.
  Input: Dataset (X, Y) := {(x_i, y_i)}, where i = 1, ..., m.
  Output: Dataset (X, Ŷ) := {(x_i, ŷ_i)}, where i = 1, ..., m.
  for i = 1, ..., m do
    rand = Uni[0, 1];
    if rand < ρ then ŷ_i = −y_i;
  end
Unbalanced Noise (Noise-2)  The noise-2 model generates unbalanced label noise. In other words, we only flip the labels of negative-label training examples, with probability ρ (see Algorithm 4).

Algorithm 4: Algorithm for generating the noise-2 model.
  Input: Dataset (X, Y) := {(x_i, y_i)}, where i = 1, ..., m.
  Output: Dataset (X, Ŷ) := {(x_i, ŷ_i)}, where i = 1, ..., m.
  for i = 1, ..., m do
    rand = Uni[0, 1];
    if y_i < 0 and rand < ρ then ŷ_i = 1;
  end
Unbalanced Large-Margin Noise (Noise-3)  The noise-3 model is intended to generate large-margin outliers. Again, only examples with negative labels are flipped. To estimate the margin of each example, we first run logistic regression on the clean dataset; we then flip labels with a probability that favors the large-margin examples (see Algorithm 5).

Algorithm 5: Algorithm for generating the noise-3 model.
  Input: Dataset (X, Y) := {(x_i, y_i)}, where i = 1, ..., m.
  Output: Dataset (X, Ŷ) := {(x_i, ŷ_i)}, where i = 1, ..., m.
  Train θ by running logistic regression on (X, Y) for 30 iterations;
  for i = 1, ..., m do compute u_i = y_i ⟨Φ(x_i), θ⟩;
  Compute u_max = max_i {u_i};
  for i = 1, ..., m do compute ũ_i = u_i − u_max;
  Compute ũ_min = min_i {ũ_i};
  for i = 1, ..., m do compute b_i = exp(−10 · ũ_i / ũ_min);
  Compute Z = Σ_i exp(−10 · ũ_i / ũ_min);
  for i = 1, ..., m do
    rand = Uni[0, 1];
    if y_i < 0 and rand · Z < m · b_i · ρ then ŷ_i = 1;
  end

For multiclass classification, in the noise-1 model we assign y_i to a uniformly random new label with probability ρ. In the noise-2 and noise-3 models, we only change the labels of examples with y_i ≤ C/2, where C is the total number of classes; the newly assigned label is drawn uniformly from (C/2, C].
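The binary noise-1 and noise-2 models reduce to one-liners; the following sketch is illustrative (the seed and sample size are assumptions), with labels in {−1, +1}:

```python
import random

def noise1(y, rho, rng):
    # noise-1 (uniform): flip each label with probability rho
    return [-yi if rng.random() < rho else yi for yi in y]

def noise2(y, rho, rng):
    # noise-2 (unbalanced): flip only negative labels, with probability rho
    return [1 if yi < 0 and rng.random() < rho else yi for yi in y]

rng = random.Random(0)
y = [1 if i % 2 == 0 else -1 for i in range(10000)]
flipped = sum(a != b for a, b in zip(y, noise1(y, 0.10, rng)))
assert 0.07 * len(y) < flipped < 0.13 * len(y)  # roughly 10% of labels flipped
assert all(a == b or a < 0 for a, b in zip(y, noise2(y, 0.10, rng)))  # only negatives change
```

Noise-3 additionally weights the flip probability by the margin estimated from a preliminary logistic regression fit, as in Algorithm 5.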
3.5.2
Experiment Design
Since most of our datasets contain a large number of features, we used the identity feature map Φ(x) = x throughout the experiments. For t-logistic regression, we set t = 1.5.
Generalization Performance  Our first experiment compares the test error of the three algorithms under the three noise models with ρ ∈ {0.00, 0.05, 0.10}. We split the training set into 5 partitions for 5-fold cross validation. The candidate regularization constants are λ ∈ {10⁻², 10⁻⁴, ..., 10⁻¹⁰}, and the one that performs best on average on the validation sets is chosen. The model parameters in this experiment are initialized to all zeros.
Random Initialization  The second experiment compares the stability of the non-convex losses. In particular, we test whether the non-convex losses get stuck in different local minima when the model parameters are initialized differently. We use the regularization constant chosen by 5-fold cross validation in the previous experiment and pick one of the five folds for training. To obtain random initializations, each entry of the model parameter vector is drawn uniformly from [−10, 10]. The mean and standard deviation of the test error are reported over nine random initializations and one all-zero initialization. For the dna and ocr datasets, due to the large computational cost, we only report the generalization performance with all-zero initialization; we do not split the training set for cross validation, but train on the entire training set with λ = 10⁻¹⁰.
3.5.3
Results
From Figure C.1 to Figure C.35, we plot the performance of the three algorithms under the three noise models, from left to right; each figure shows the performance on one dataset. For each noise model, we report the test error of the algorithms with ρ = 0.00 (blue), 0.05 (red), and 0.10 (yellow). On the first row of each figure, we report the test error of the three algorithms using 5-fold cross validation with the optimal λ on that dataset. For large values of λ (e.g. λ ∈ {10⁻², 10⁻⁴}), the test performance of the algorithms is mostly inferior. This is because most of the datasets in our experiments contain a large number of examples and therefore require very mild regularization. On the other hand, datasets with higher noise tend to require larger regularization. For instance, for binary classification, if ρ = 0.00, the distribution of the optimal λ over [10⁻², 10⁻⁴, 10⁻⁶, 10⁻⁸, 10⁻¹⁰] is [7, 18, 90, 43, 58], while the distribution becomes [16, 36, 82, 36, 46] if ρ = 0.10. To give a quick overview of the robustness improvement of t-logistic regression, Tables 3.5 and 3.6 summarize the number of datasets on which the test error difference between logistic regression and t-logistic regression is significant under the three noise models with ρ ∈ {0.00, 0.05, 0.10}. Across a variety of binary classification datasets (Table 3.5), t-logistic regression performs better when label noise is added. In particular, when ρ = 0.05, t-logistic regression has a significant advantage in 48 cases, while logistic regression has only 5. When ρ = 0.10, the advantage of t-logistic regression shrinks, but it is still better in 42 cases versus 12 for logistic regression. In multiclass classification (Table 3.6), the advantage of t-logistic regression is even more salient. The Savage loss appears to be even more robust than the t-logistic loss on a few datasets; however, it is unstable under random initialization of the model parameters.
On the second row of each figure, we report the test errors when the model parameters are randomly initialized. We can see that the performance of the Savage loss fluctuates on more than half of the datasets. In contrast, the t-logistic loss converges to similar results on all datasets except longservedio. Empirically, therefore, t-logistic regression appears to be very stable. Logistic re-
Table 3.5
The number of binary classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table is the number of datasets where logistic regression is significantly better; the right part is the number of datasets where t-logistic regression is significantly better. The total number of datasets is 24 (the dna and ocr datasets are excluded).

Logistic | 0.00 | 0.05 | 0.10  ||  t = 1.5 | 0.00 | 0.05 | 0.10
Noise-1  | 4    | 1    | 6     ||  Noise-1 | 5    | 15   | 11
Noise-2  | 4    | 2    | 4     ||  Noise-2 | 5    | 16   | 17
Noise-3  | 4    | 2    | 2     ||  Noise-3 | 5    | 17   | 14
Table 3.6
The number of multiclass classification datasets on which logistic regression or t-logistic regression is significantly better than the other. The left part of the table is the number of datasets where logistic regression is significantly better; the right part is the number of datasets where t-logistic regression is significantly better. The total number of datasets is 9.

Logistic | 0.00 | 0.05 | 0.10  ||  t = 1.5 | 0.00 | 0.05 | 0.10
Noise-1  | 1    | 1    | 1     ||  Noise-1 | 5    | 6    | 8
Noise-2  | 1    | 1    | 1     ||  Noise-2 | 5    | 7    | 6
Noise-3  | 1    | 0    | 0     ||  Noise-3 | 5    | 6    | 7
gression always converges to a similar result regardless of the initialization, because of its convexity. To highlight the main difference between t-logistic regression and logistic regression, on the third row we plot the distribution of the forgetting variable ξ of t-logistic regression on one of the five folds with ρ = 0.10. To distinguish the points with noisy labels, we plot them in red and the other points in blue. Recall that ξ denotes the influence of a point. In most cases, we observe that the ξ of the noisy data is smaller than that of the clean data, which indicates that the algorithm is able to effectively identify these noisy points and cap their influence.
Detailed Discussion on Selected Datasets

On the alpha dataset (Figure C.2), the performance of the three algorithms is close in the noise-1 model. However, in the noise-3 model, the test error of logistic regression rises from 21.9% on the clean dataset by about 1.1% on the noisy dataset (ρ = 0.10). Although t-logistic regression also suffers from unbalanced label noise, its test error rise is only about 0.6%. Similar phenomena are observed on the astro-ph (Figure C.3), delta (Figure C.7), epsilon (Figure C.8) and gamma (Figure C.9) datasets. To understand why t-logistic regression works better, it is helpful to inspect its ξ-distribution on the bottom row: in the noise-3 model of these datasets, the mean of the distribution of the forgetting variable ξ of the noisy points is much smaller than that of the clean points, so the influence of those noisy examples is capped. The Savage loss also works well on these four datasets when the model parameters are initialized to all zeros, but its performance sometimes fluctuates under random initialization, e.g. on the delta and gamma datasets.

On the covertype dataset (Figure C.6), t-logistic regression has better test performance with or without label noise in the noise-1 and noise-2 models. The reason t-logistic regression works better even on the clean dataset may be that the original dataset is mixed with outliers. The generalization performance of the Savage loss is comparable to t-logistic regression, but it is unstable under random initialization. In the noise-3 model, the performance of all three algorithms becomes worse; for t-logistic regression, the ξ-distribution indicates that the influence of half of the noisy examples is not successfully capped.

On the three KDD datasets, kdd99 (Figure C.10), kdda (Figure C.11), and kddb (Figure C.12), the number of positive labels is a few times larger than the number of negative labels.
Therefore, it appears that all the algorithms perform better in the noise-2 and noise-3 models, since the latter contain far fewer noisy examples. t-logistic regression outperforms logistic regression on the kdd99 and kdda datasets. However, somewhat surprisingly, it performs worse on the kddb dataset, although its ξ-distribution looks similar to that of the kdda dataset.
On the longservedio dataset (Figure C.13), all three algorithms perfectly classify the examples when ρ = 0.00. In the noise-1 and noise-2 models, t-logistic regression is clearly more robust against the label noise than logistic regression. In particular, in the ξ-distribution we observe four distinct spikes. From left to right, the first spike corresponds to the noisy large-margin examples, the second to the noisy pullers, the third to the clean pullers, and the rightmost to the clean large-margin examples. Logistic regression, on the other hand, is unable to discriminate between clean and noisy training samples, which leads to its poor performance. In the noise-3 model, more large-margin examples are flipped. Although the overall number of noisy examples may be smaller, the impact of these noisy examples is actually strengthened. Not only does logistic regression perform poorly in this case, but t-logistic regression also performs much worse. The ξ-distribution of the noise-3 model is clearly different from that of the noise-1 and noise-2 models, as there is no spike representing the noisy large-margin examples. Furthermore, the flipped large-margin examples clearly create multiple local minima in the empirical risk of t-logistic regression, as its performance fluctuates with random initialization.
The measewyner dataset (Figure C.14) is another dataset where t-logistic regression demonstrates a clear edge over logistic regression. Here t-logistic regression outperforms logistic regression in all three noise models.
One can clearly see from the ξ-distribution that all the red bars lie to the left of the blue bars. The distance between the red bars and the blue bars is even larger in the noise-2 and noise-3 models. Similar phenomena are observed on the ocr (Figure C.26), reuters-ccat (Figure C.19), webspamunigram (Figure C.22), worm (Figure C.23), and zeta (Figure C.24) datasets.
The multiclass datasets seem to give a clearer edge to t-logistic regression. On the letter (Figure C.28), sensitacoustic (Figure C.32), sensitcombined (Figure C.33), and sensitseismic (Figure C.34) datasets, the test performance of t-logistic regression is significantly better than that of logistic regression even without adding label noise. It is therefore reasonable to conjecture that mislabeling is more likely to occur in the multiclass datasets. On the dna (Figure C.27) and protein (Figure C.30) datasets, t-logistic regression is comparable to or slightly better than logistic regression. On the mnist (Figure C.29) and usps (Figure C.35) datasets, t-logistic regression performs much better when label noise is added.
As is to be expected in such an extensive empirical evaluation, there are a few anomalies. On the aut-avn (Figure C.4), dna (Figure C.25), real-sim (Figure C.17), and webspamtrigram (Figure C.21) datasets, logistic regression has the best test accuracy in some of the noise models, although the ξ-distribution indicates that t-logistic regression caps the influence of the noisy examples. On the news20 dataset (Figure C.16), the ξ variables of t-logistic regression are almost identical for all examples, which makes its performance close to or not as good as logistic regression. On the beta dataset (Figure C.5), the test errors of all the algorithms are around 50%.
CPU Time Comparison
One of the drawbacks of the t-exponential family is that there is no closed-form solution for the log-partition function. The main additional cost of t-logistic regression is the iterative numerical method used to compute Gt(xi; θ) for each example (xi, yi), which may impair the efficiency of the algorithm. In order to compare the time efficiency of the algorithms, we report timing results for the noise-3 model with ρ = 0.1, including the total CPU time spent as well as the average CPU time per function evaluation, in Table E.1 and Table E.2. It is not surprising that t-logistic regression takes longer to train than logistic regression and the Savage loss on most of the datasets. When the number of samples is significantly larger than the number of dimensions, e.g., on the covertype and kdd99 datasets, the cost of computing Gt(xi; θ) becomes the primary bottleneck of t-logistic regression and the slowdown is more pronounced.
3.6 Chapter Summary
In this chapter, we generalize logistic regression to t-logistic regression by using the t-exponential family of distributions. The new algorithm has a probabilistic interpretation and is more robust to label noise than logistic regression. We investigate the algorithm in both binary and multiclass classification. Although the loss function is nonconvex, we show that t-logistic regression is Bayes-risk consistent and empirically stable against random initialization.
4. T-DIVERGENCE BASED APPROXIMATE INFERENCE
This chapter is devoted to using the t-exponential family of distributions in complicated models with a large number of random variables. From a computational perspective, the central issue is the efficient computation of the log-partition function Gt(θ). During the last decade, variational inference has become an important technique for dealing with large, intractable exponential family distributions, especially probabilistic graphical models. We first review the main idea of variational inference and introduce two well-known inference algorithms for the exponential family. We then extend them to the t-exponential family by defining a new t-entropy and a new t-divergence. Finally, the two approximate inference algorithms for the exponential family are generalized to the t-exponential family based on the t-divergence.
4.1 Variational Inference in the Exponential Family of Distributions
One of the prominent applications of the exponential family of distributions is modeling conditional independence between random variables via a graphical model. However, when the number of random variables is large and the underlying graph structure is complex, a number of computational issues need to be tackled in order to make inference feasible. The key challenge is the computation of the log-partition function. A number of inference techniques have been developed to solve this problem approximately. Two prominent approximate inference techniques are Markov chain Monte Carlo (MCMC) methods [36] and variational methods [7, 37]. Variational methods are gaining significant research traction, mostly because of their high efficiency and practical success in many applications. Essentially, these methods are premised on the search for a proxy in an analytically tractable distribution family that approximates the true underlying distribution. To measure the closeness between the true and
approximate distributions, a proper divergence measure between the two distributions has to be defined. Among all types of divergence measures, the Kullback-Leibler (K-L) divergence has been the most widely studied. In particular, the K-L divergence between two distributions p1(z) := p(z; θ1) and p2(z) := p(z; θ2) is defined as

D(p1 ‖ p2) = ∫ p1(z) log p1(z) − p1(z) log p2(z) dz,   (4.1)

which is the Bregman divergence¹ associated with the Shannon-Boltzmann-Gibbs (SBG) entropy,

H(p(z)) := −∫ p(z) log p(z) dz = −E_{p(z)}[log p(z)].   (4.2)
The reason that the K-L divergence has been so popular is mainly that the SBG entropy is closely connected to the exponential family of distributions and its log-partition function. As demonstrated in the following theorem, the negative SBG entropy is the Fenchel conjugate of the log-partition function of the exponential family of distributions.

Theorem 4.1.1 (Theorem 2 [7]) For a k-dimensional exponential family distribution p(z; θ) = exp(⟨Φ(z), θ⟩ − G(θ)) and a k-dimensional vector µ, let θ(µ) (if it exists) be the parameter of p(z; θ) such that

µ = E_{p(z;θ(µ))}[Φ(z)] = ∫ Φ(z) p(z; θ(µ)) dz.   (4.3)

Furthermore,

G*(µ) = −H(p(z; θ(µ))) if θ(µ) exists, and G*(µ) = +∞ otherwise.   (4.4)

By duality it also follows that

G(θ) = sup_µ {⟨µ, θ⟩ − G*(µ)}.   (4.5)

¹The Bregman divergence associated with F for points p, q ∈ Ω is D_F(p, q) = F(p) − F(q) − ⟨∇F(q), p − q⟩. See Appendix A.1 for more details.
Variational methods try to find an approximate distribution p̃ in an analytically tractable exponential family that minimizes the K-L divergence to the true distribution p. Since the Bregman divergence is not symmetric, minimizing D(p̃ ‖ p) and minimizing D(p ‖ p̃) give different results. Therefore, there are two main types of variational inference methods, and in the following we review one classical algorithm of each type.
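Since the asymmetry of the K-L divergence is what separates the two families of methods, a quick numerical check makes it concrete. The following is an illustrative sketch (not from the dissertation; the Bernoulli parameters 0.2 and 0.7 are arbitrary):

```python
import math

def kl_bernoulli(p1, p2):
    """K-L divergence D(p1 || p2) between two Bernoulli distributions."""
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))

d_forward = kl_bernoulli(0.2, 0.7)   # D(p1 || p2)
d_reverse = kl_bernoulli(0.7, 0.2)   # D(p2 || p1)

# Both are non-negative, but they differ: the K-L divergence is not symmetric,
# so minimizing one or the other yields different approximations.
assert d_forward > 0 and d_reverse > 0
assert abs(d_forward - d_reverse) > 1e-3
```

Mean field methods minimize D(p̃ ‖ p), while assumed density filtering minimizes D(p ‖ p̃).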
4.1.1 Mean Field Methods
We briefly review mean field methods [7]. Suppose we are interested in approximating a k-dimensional multivariate distribution

p(z; θ) = exp(⟨Φ(z), θ⟩ − G(θ)),

where z = (z¹, ..., z^k). Since G(θ) and −H(p(z; θ(µ))) are Fenchel conjugates (Theorem 4.1.1),

G(θ) = sup_{µ∈M} {⟨µ, θ⟩ + H(p(z; θ(µ)))},
where M denotes the set

M = {µ | ∃ θ̂ s.t. E_{p(z;θ̂)}[Φ(z)] = µ}.   (4.6)

The problem which arises in computing G(θ) is that M and H(p(z; θ)) are generally hard to characterize for most non-trivial multivariate distributions. Instead, the mean field approximation replaces the set M by the subset of vectors µ̃ generated by simpler distributions p̃. Let

µ̃ = ∫ Φ(z) p̃(z; θ̃(µ̃)) dz = E_{p̃(z;θ̃(µ̃))}[Φ(z)],   (4.7)

and let M̂ denote the set of all such µ̃. Since M̂ ⊆ M, clearly

G(θ) ≥ sup_{µ̃∈M̂} {⟨µ̃, θ⟩ + H(p̃(z; θ̃(µ̃)))}.   (4.8)
Moreover, the approximation error incurred as a result of replacing p with p̃ is D(p̃ ‖ p). To see this, use (4.2) and (4.7) to write

G(θ) − sup_{µ̃∈M̂} {⟨µ̃, θ⟩ + H(p̃(z; θ̃(µ̃)))}
= inf_{µ̃∈M̂} {−H(p̃(z; θ̃(µ̃))) − ⟨µ̃, θ⟩ + G(θ)}
= inf_{µ̃∈M̂} [ ∫ p̃(z; θ̃(µ̃)) log p̃(z; θ̃(µ̃)) dz − ∫ p̃(z; θ̃(µ̃)) (⟨Φ(z), θ⟩ − G(θ)) dz ]
= inf_{µ̃∈M̂} [ ∫ p̃(z; θ̃(µ̃)) log p̃(z; θ̃(µ̃)) dz − ∫ p̃(z; θ̃(µ̃)) log p(z; θ) dz ]
= inf_{µ̃∈M̂} D(p̃ ‖ p).
Perhaps the simplest approximating distribution assumes that the random variables z¹, ..., z^k are independent, that is,

p̃(z; θ̃(µ̃)) = ∏_{j=1}^k p̃(z^j; θ̃^j(µ̃^j)),

where

p̃(z^j; θ̃^j(µ̃^j)) = exp(⟨Φ^j(z^j), θ̃^j(µ̃^j)⟩ − G^j(θ̃^j(µ̃^j))).

For brevity, denote p̃_j = p̃(z^j; θ̃^j(µ̃^j)); then the K-L divergence D(p̃ ‖ p) can be written as

D(p̃ ‖ p) = ∫ p̃_n { ∫ log p̃(z; θ̃(µ̃)) ∏_{j≠n} p̃_j dz^j } dz^n − ∫ p̃_n { ∫ log p(z; θ) ∏_{j≠n} p̃_j dz^j } dz^n

for any n ∈ {1, ..., k}. If we keep all θ̃^j(µ̃^j) for j ≠ n fixed and minimize D(p̃ ‖ p) with respect to θ̃^n(µ̃^n), it is easy to verify that the infimum is attained by setting

∫ log p̃(z; θ̃(µ̃)) ∏_{j≠n} p̃_j dz^j = ∫ log p(z; θ) ∏_{j≠n} p̃_j dz^j + const.

Since

log p̃(z; θ̃(µ̃)) = ⟨Φ^n(z^n), θ̃^n(µ̃^n)⟩ − G^n(θ̃^n(µ̃^n)) + ∑_{j≠n} ( ⟨Φ^j(z^j), θ̃^j(µ̃^j)⟩ − G^j(θ̃^j(µ̃^j)) ),
log p(z; θ) = ⟨Φ(z), θ⟩ − G(θ),

and ∫ ∏_{j≠n} p̃_j dz^j = 1, the infimum condition can be rearranged as

⟨Φ^n(z^n), θ̃^n(µ̃^n)⟩ = ⟨E_{p̃,j≠n}[Φ(z)], θ⟩ + const,   (4.9)

where E_{p̃,j≠n}[Φ(z)] = ∫ Φ(z) ∏_{j≠n} p̃_j dz^j. We have absorbed all the terms which do not depend on z^n into the constant.

In summary, the mean field algorithm updates θ̃^n(µ̃^n) to equalize the terms involving Φ^n(z^n) and satisfy (4.9), keeping all θ̃^j(µ̃^j) for j ≠ n fixed. Different n are picked cyclically and θ̃^n(µ̃^n) is updated until a stationary point is reached.
Next, we want to compute the lower bound on G(θ) using the computed θ̃^n(µ̃^n). Towards this end, observe that the SBG entropy of p̃(z; θ̃(µ̃)) is simply the sum of the SBG entropies of the individual random variables (which we are able to compute efficiently),

H(p̃(z; θ̃(µ̃))) = ∑_{j=1}^k H(p̃_j(z^j; θ̃^j(µ̃^j))).

Plugging this into (4.8) gives the desired lower bound.
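For a multivariate Gaussian, the cyclic update (4.9) reduces to setting each variational mean to the conditional mean of z^n given the other current means. The following toy illustration is our own sketch (the 2-dimensional Gaussian target and the update loop are not from the dissertation):

```python
import numpy as np

# Target: a correlated 2-D Gaussian N(mu, Sigma); the mean field proxy is a
# product of two independent 1-D Gaussians with means mu_tilde.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.0]])
K = np.linalg.inv(Sigma)          # precision matrix

mu_tilde = np.zeros(2)            # initialize the variational means
for _ in range(50):               # cyclic coordinate updates
    for n in range(2):
        j = 1 - n                 # the "other" coordinate
        # Conditional mean of z^n given z^j = mu_tilde[j] under p(z; theta):
        mu_tilde[n] = mu[n] - K[n, j] / K[n, n] * (mu_tilde[j] - mu[j])

# For a multivariate Gaussian the mean field fixed point recovers the true mean.
assert np.allclose(mu_tilde, mu, atol=1e-6)
```

At the fixed point the means match the true means, while the factorized variances 1/K_nn underestimate the true marginal variances — the usual mean field behavior.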
4.1.2 Assumed Density Filtering
This subsection reviews assumed density filtering [38]. Given an original distribution p(z), assumed density filtering obtains the approximate distribution p̃(z; θ̃) by minimizing the K-L divergence D(p ‖ p̃). If

p̃(z; θ̃) = exp(⟨Φ(z), θ̃⟩ − G(θ̃)),

then using the fact that ∇_θ̃ G(θ̃) = E_p̃[Φ(z)], one can take the derivative of D(p ‖ p̃) with respect to θ̃ and obtain

E_p[Φ(z)] = E_p̃[Φ(z)].   (4.10)
Equation (4.10) is widely known as "moment matching". It is also the key idea behind Expectation Propagation [37], which has many applications in graphical models. One of the major applications of assumed density filtering is to estimate an approximate posterior in Bayesian online learning [39]. The objective of Bayesian online learning is to train a binary classification model based on an online stream of m training examples Dm = {(x1, y1), ..., (xm, ym)}. For simplicity, consider a linear model with Φ(xi) = xi parameterized by w, such that the label yi is predicted by sign(⟨xi, w⟩). For each training example (xi, yi), the conditional distribution of the label yi given xi and w is modeled as in [37]:

ti(w) := p(yi | xi; w) = ε + (1 − 2ε) Θ(yi ⟨xi, w⟩),   (4.11)

where Θ(z) is the step function (Θ(z) = 1 if z > 0 and Θ(z) = 0 otherwise) and ε is a small error tolerance that increases the robustness of the model². Making a standard i.i.d. assumption about the data, the posterior distribution after seeing the m-th example can be written as

p(w | Dm) ∝ p0(w) ∏_{i=1}^m ti(w),

where p0(w) denotes a prior distribution, usually assumed to be a Gaussian p0(w) = N(w; 0, I). As it turns out, the posterior p(w | Dm) is infeasible to obtain for m ≥ 2. However, using assumed density filtering, we can find a multivariate Gaussian distribution to approximate the true posterior,

p(w | Dm) ≈ p̃(w; θ̃_(m)) := N(w; µ̃_(m), Σ̃_(m)).

²(4.11) is equivalent to the 0-1 loss. Define u = y⟨x, w⟩ and l(x, y, w) = −log p(y | x; w); then l(x, y, w) = −log(1 − ε) if u > 0 and l(x, y, w) = −log(ε) if u ≤ 0.
We initialize p̃(w; θ̃_(0)) = p0(w) = N(w; 0, I) and denote the approximate distribution after processing (x1, y1), ..., (xi, yi) by p̃i(w) := p(w; θ̃_(i)) for i ≥ 1. Define pi(w) ∝ p̃_{i−1}(w) ti(w); then the approximate posterior p̃i(w) is updated as

p̃i(w) = N(w; µ̃_(i), Σ̃_(i)) = argmin_{µ,Σ} D(pi(w) ‖ N(w; µ, Σ)).   (4.12)

As was shown in [37], the solution of (4.12) is

µ̃_(i) = E_p[w] = µ̃_(i−1) + α_(i) yi Σ̃_(i−1) xi,
Σ̃_(i) = E_p[w w^⊤] − E_p[w] E_p[w]^⊤ = Σ̃_(i−1) − (Σ̃_(i−1) xi) ( α_(i) yi ⟨xi, µ̃_(i)⟩ / (xi^⊤ Σ̃_(i−1) xi) ) (Σ̃_(i−1) xi)^⊤,

where

α_(i) = (1 − 2ε) N(z_(i); 0, 1) / ( √(xi^⊤ Σ̃_(i−1) xi) ( ε + (1 − 2ε) ∫_{−∞}^{z_(i)} N(z; 0, 1) dz ) )
and z_(i) = yi ⟨xi, µ̃_(i−1)⟩ / √(xi^⊤ Σ̃_(i−1) xi).
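These Gaussian updates can be sketched directly in code. The following is an illustrative implementation under our own choices (the synthetic 2-dimensional stream, seed, and ε = 0.01 are assumptions, not from the dissertation); the standard normal pdf and cdf come from `math`:

```python
import math
import numpy as np

def adf_update(mu, Sigma, x, y, eps=0.01):
    """One assumed-density-filtering step for the likelihood model (4.11)."""
    s = float(x @ Sigma @ x)                       # predictive variance term
    z = y * float(x @ mu) / math.sqrt(s)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # N(z; 0, 1)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))            # Phi(z)
    alpha = (1 - 2 * eps) * pdf / (math.sqrt(s) * (eps + (1 - 2 * eps) * cdf))
    mu_new = mu + alpha * y * (Sigma @ x)
    Sx = Sigma @ x
    coef = alpha * y * float(x @ mu_new) / s
    Sigma_new = Sigma - np.outer(Sx, Sx) * coef
    return mu_new, Sigma_new

# Tiny synthetic stream: labels from a fixed true weight vector, no label noise.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
mu, Sigma = np.zeros(2), np.eye(2)
X = rng.normal(size=(200, 2))
for x in X:
    y = 1.0 if x @ w_true > 0 else -1.0
    mu, Sigma = adf_update(mu, Sigma, x, y)

# The posterior mean should (approximately) separate the stream.
acc = np.mean(np.sign(X @ mu) == np.sign(X @ w_true))
assert acc > 0.8
```

Misclassified examples (z < 0) produce large α and pull µ̃ toward y·x; confidently classified examples barely change the posterior.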
4.2 T-Entropy and T-Divergence
We have briefly reviewed the main ideas of variational inference as well as two well-known algorithms in the exponential family of distributions. Our objective in this chapter is to generalize variational inference to the t-exponential family of distributions. First, we need to find a new entropy which plays the same role as the SBG entropy does in the exponential family. Various generalizations of the SBG entropy have been proposed in statistical physics and paired with the t-exponential family of distributions. Perhaps the most well known among them is the Tsallis entropy [10]:

H_tsallis(p) := −∫ p(z)^t log_t p(z) dz.   (4.13)

Naudts [8, 9] proposed the more general φ-exponential family of distributions (see Section 2.3). Corresponding to this family, an entropy-like measure called the information content I_φ(p), as well as its divergence measure, is defined. The information content is the Fenchel conjugate of a function F(θ), where

∇_θ F(θ) = E_p[Φ(z)].   (4.14)

Setting φ(x) = x^t in the Naudts framework recovers the t-exponential family. Interestingly, when φ(x) = (1/t) x^{2−t}, the information content I_φ is exactly the Tsallis entropy (4.13). Another well-known non-SBG entropy is the Rényi entropy [40]. The Rényi α-entropy (for α ≠ 1) of a probability distribution p(z) is defined as

H_α(p) = (1/(1−α)) log ∫ p(z)^α dz.   (4.15)

Besides these entropies proposed in statistical physics, there are other efforts that work with generalized linear models or utilize different divergence measures, such as [14, 41–43]. Although all of the above generalized entropies are useful in their own way, unfortunately none of them is the Fenchel conjugate of the log-partition function Gt(θ) of the t-exponential family. As shown in Section 4.1 as well as in [7], this property is crucial for developing efficient variational inference approaches. In the following subsection, we define a new entropy which, to the best of our knowledge, has not been studied before. Note that although our main focus is the t-exponential family, we believe that our results can also be extended to the more general φ-exponential family of Naudts [9].
4.2.1 T-Entropy
Definition 4.2.1 (Inspired by Theorem 2 [7]) The t-entropy of a probability distribution p(z; θ) is defined as

H_t(p(z; θ)) := −∫ q(z; θ) log_t p(z; θ) dz = −E_q[log_t p(z; θ)],   (4.16)

where q(z; θ) = p(z; θ)^t / Z(θ) and Z(θ) = ∫ p(z; θ)^t dz.
It is straightforward to verify that the t-entropy is non-negative. Furthermore, if p(z; θ) is a t-exponential family distribution, the following theorem establishes the Fenchel conjugacy between −H_t(p(z; θ(µ))) and Gt(θ), the log-partition function of p(z; θ). The theorem extends Theorem 3.4 of [7] to the t-exponential family of distributions. The proof is provided in Appendix B.7.

Theorem 4.2.1 For a k-dimensional t-exponential family of distributions p(z; θ) = exp_t(⟨Φ(z), θ⟩ − Gt(θ)) and a k-dimensional vector µ, let θ(µ) (if it exists) be the parameter of p(z; θ) such that

µ = E_{q(z;θ(µ))}[Φ(z)] = ∫ Φ(z) q(z; θ(µ)) dz.   (4.17)

Then

G*_t(µ) = −H_t(p(z; θ(µ))) if θ(µ) exists, and G*_t(µ) = +∞ otherwise,   (4.18)

where G*_t(µ) denotes the Fenchel conjugate of Gt(θ). By duality it also follows that

Gt(θ) = sup_µ {⟨µ, θ⟩ − G*_t(µ)}.   (4.19)
From Theorem 4.2.1, we know that −H_t(p(z; θ(µ))) is a convex function, because it is the Fenchel conjugate of a function (Theorem 1.1.2 in Chapter X of [44]). Below, we derive the t-entropy of two commonly used distributions. See Figure 4.1 for a graphical illustration.

Example 5 (T-entropy of the Bernoulli distribution) Assume a Bernoulli distribution p(z; µ) with parameter µ. Its t-entropy is

H_t(p(z; µ)) = (−µ^t log_t µ − (1−µ)^t log_t(1−µ)) / (µ^t + (1−µ)^t) = ((µ^t + (1−µ)^t)^{−1} − 1) / (t − 1).   (4.20)
As t → 1, H_t(p(z; µ)) = −µ log µ − (1 − µ) log(1 − µ).

Example 6 (T-entropy of the Student's t-distribution) Assume that a k-dimensional Student's t-distribution p(z; µ, Σ, v)³ is given by (2.20). Then the t-entropy of p(z; µ, Σ, v) is

H_t(p(z; µ, Σ, v)) = −(Ψ/(1−t))(1 + v^{−1}) + 1/(1−t),   (4.21)

where K = (vΣ)^{−1}, v = 2/(t−1) − k, and Ψ = ( Γ((v+k)/2) / ((πv)^{k/2} Γ(v/2) |Σ|^{1/2}) )^{−2/(v+k)}. As t → 1, v → +∞, H_t(p(z; µ, Σ, v)) = k/2 + (k/2) log(2π) + (1/2) log |Σ|.

³We abuse the notation µ here to denote the first moment of the Student's t-distribution.
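The closed form (4.20) is easy to sanity-check numerically against Definition 4.2.1. This is our own sketch (the values µ = 0.3 and t = 1.5 are arbitrary):

```python
import numpy as np

def log_t(x, t):
    """Deformed logarithm: log_t(x) = (x^(1-t) - 1) / (1 - t)."""
    return (x ** (1 - t) - 1) / (1 - t)

def t_entropy_bernoulli(mu, t):
    """t-entropy (4.16) of Bernoulli(mu), computed from the definition."""
    p = np.array([mu, 1 - mu])
    q = p ** t / np.sum(p ** t)          # escort distribution
    return -np.sum(q * log_t(p, t))

mu, t = 0.3, 1.5
S = mu ** t + (1 - mu) ** t
closed = (1 / S - 1) / (t - 1)           # closed form (4.20)
assert np.isclose(t_entropy_bernoulli(mu, t), closed)
assert closed >= 0                        # the t-entropy is non-negative
```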
Fig. 4.1. T-entropy of two well-known probability distributions. Left: the Bernoulli distribution p(z; µ). Right: the 1-dimensional Student's t-distribution p(z; 0, σ², v), where v = 2/(t − 1) − 1. One recovers the SBG entropy by letting t = 1.0.
Although the t-entropy has not been studied in the past, as the following relations show, it is closely related to some well-known generalized entropies.
Relation with the Tsallis Entropy
Using (2.13), (4.13), and (4.16), it is straightforward to see that the t-entropy is a normalized version of the Tsallis entropy:

H_t(p(z)) = −(1/∫ p(z)^t dz) ∫ p(z)^t log_t p(z) dz = (1/∫ p(z)^t dz) H_tsallis(p(z)).

Relation with the Rényi Entropy
We can equivalently rewrite the Rényi entropy as

H_α(p(z)) = (1/(1−α)) log ∫ p(z)^α dz = −log ( ∫ p(z)^α dz )^{−1/(1−α)}.

The t-entropy of p(z) (when t ≠ 1) is equal to

H_t(p(z)) = −( ∫ p(z)^t log_t p(z) dz ) / ( ∫ p(z)^t dz )
= −( ∫ p(z)^t (p(z)^{1−t} − 1) dz ) / ( (1−t) ∫ p(z)^t dz )
= −( 1 − ∫ p(z)^t dz ) / ( (1−t) ∫ p(z)^t dz )   (4.22)
= −log_t ( ∫ p(z)^t dz )^{−1/(1−t)},   (4.23)

where (4.22) uses ∫ p(z) dz = 1, and (4.23) uses log_t(x) = (x^{1−t} − 1)/(1−t).
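Both relations are easy to confirm numerically for a discrete distribution. This is an illustrative sketch (the distribution p and the choice t = 1.5 are arbitrary, not from the dissertation):

```python
import numpy as np

t = 1.5
p = np.array([0.1, 0.2, 0.3, 0.4])        # an arbitrary discrete distribution

def log_t(x, t):
    return (x ** (1 - t) - 1) / (1 - t)

Z = np.sum(p ** t)                        # escort normalizer
q = p ** t / Z                            # escort distribution
H_t = -np.sum(q * log_t(p, t))            # t-entropy (4.16)

H_tsallis = -np.sum(p ** t * log_t(p, t)) # Tsallis entropy (4.13)
assert np.isclose(H_t, H_tsallis / Z)     # normalized-Tsallis relation

# Identity (4.23): the t-entropy only depends on the escort normalizer Z.
assert np.isclose(H_t, -log_t(Z ** (-1 / (1 - t)), t))
```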
Therefore, when α = t,

H_t(p(z)) = −log_t(exp(−H_α(p(z)))).

When t and α → 1, both entropies reduce to the SBG entropy.

4.2.2 T-Divergence
The next step is to define a divergence measure which pairs with the t-exponential family. Analogous to the K-L divergence, we define the t-divergence⁴ as the Bregman divergence based on the t-entropy.

Definition 4.2.2 The t-divergence is the relative t-entropy between two distributions p1(z) and p2(z):

D_t(p1 ‖ p2) = ∫ q1(z) log_t p1(z) − q1(z) log_t p2(z) dz.   (4.24)

The mathematical verification of (4.24) is provided in Appendix B.11. The t-divergence plays a central role in the variational inference derived shortly. Because it is a Bregman divergence, it has the following properties:

• D_t(p1 ‖ p2) ≥ 0 for all p1, p2, with equality only for p1 = p2.

⁴Note that the t-divergence is not a special case of the divergence measure of Naudts [9], because the entropies are defined differently. The derivations are fairly similar in spirit, though.
• D_t(p1 ‖ p2) ≠ D_t(p2 ‖ p1).

We give two examples of the t-divergence below. For the corresponding graphical illustrations, see Figure 4.2.

Example 7 (T-divergence between Bernoulli distributions) Given two Bernoulli distributions p1 := p(z; µ1) and p2 := p(z; µ2), the t-divergence D_t(p1 ‖ p2) between them is

D_t(p1 ‖ p2) = ( µ1^t log_t µ1 + (1−µ1)^t log_t(1−µ1) − µ1^t log_t µ2 − (1−µ1)^t log_t(1−µ2) ) / ( µ1^t + (1−µ1)^t )
= ( 1 − µ1^t µ2^{1−t} − (1−µ1)^t (1−µ2)^{1−t} ) / ( (1−t)(µ1^t + (1−µ1)^t) ).

As t → 1, D_t(p1 ‖ p2) = µ1 log(µ1/µ2) + (1 − µ1) log((1−µ1)/(1−µ2)).
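The closed form in Example 7 can be checked against Definition 4.2.2, together with the non-negativity and asymmetry properties. This is our own sketch (the parameters 0.2, 0.6, and t = 1.3 are arbitrary):

```python
import numpy as np

t = 1.3

def log_t(x, t):
    return (x ** (1 - t) - 1) / (1 - t)

def t_div_bernoulli(mu1, mu2, t):
    """t-divergence (4.24) between Bernoulli(mu1) and Bernoulli(mu2)."""
    p1 = np.array([mu1, 1 - mu1])
    p2 = np.array([mu2, 1 - mu2])
    q1 = p1 ** t / np.sum(p1 ** t)        # escort of p1
    return np.sum(q1 * (log_t(p1, t) - log_t(p2, t)))

def closed_form(mu1, mu2, t):
    num = 1 - mu1 ** t * mu2 ** (1 - t) - (1 - mu1) ** t * (1 - mu2) ** (1 - t)
    return num / ((1 - t) * (mu1 ** t + (1 - mu1) ** t))

d12 = t_div_bernoulli(0.2, 0.6, t)
assert np.isclose(d12, closed_form(0.2, 0.6, t))          # matches Example 7
assert d12 > 0                                            # non-negativity
assert not np.isclose(d12, t_div_bernoulli(0.6, 0.2, t))  # asymmetry
```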
Example 8 (T-divergence between Student's t-distributions) Given two Student's t-distributions p1 := p(z; µ1, Σ1, v) and p2 := p(z; µ2, Σ2, v), the t-divergence D_t(p1 ‖ p2) between them is

D_t(p1 ‖ p2) = (Ψ1/(1−t))(1 + v^{−1}) + (2Ψ2/(1−t)) µ1^⊤ K2 µ2 − (Ψ2/(1−t)) ( Tr(K2^⊤ Σ1) + µ1^⊤ K2 µ1 ) − (Ψ2/(1−t)) ( µ2^⊤ K2 µ2 + 1 ),   (4.25)

where the definitions of K_i and Ψ_i are the same as in (4.21), and Tr denotes the trace of a matrix. As t → 1, v → +∞,

D_t(p1 ‖ p2) = (1/2) ( Tr(Σ2^{−1} Σ1) + (µ2 − µ1)^⊤ Σ2^{−1} (µ2 − µ1) + log(|Σ2|/|Σ1|) − k ).
4.3 Variational Inference in the T-Exponential Family of Distributions
In this section, we extend the two variational inference methods from Section 4.1 to the t-exponential family.
Fig. 4.2. T-divergence between two distributions. Top: Bernoulli distributions p1 = p(z; µ) and p2 = p(z; 0.5). Bottom: Student's t-distributions. Left: p1 = p(z; µ, 1, v) and p2 = p(z; 0, 1, v). Right: p1 = p(z; 0, σ², v) and p2 = p(z; 0, 1, v). v = 2/(t − 1) − 1. One recovers the K-L divergence by letting t = 1.0.
4.3.1 Mean Field Methods
This subsection introduces the mean field method for the t-exponential family. Consider the k-dimensional multivariate t-exponential family distribution

p(z; θ) = exp_t(⟨Φ(z), θ⟩ − Gt(θ)),

where z = (z¹, ..., z^k). From Theorem 4.2.1,

Gt(θ) = sup_{µ∈M} {⟨µ, θ⟩ + H_t(p(z; θ(µ)))},

where

M = {µ | ∃ θ̂ s.t. E_{q(z;θ̂)}[Φ(z)] = µ}.   (4.26)

Similar to the case of the exponential family (see Section 4.1.1), if we replace p by p̃(z; θ̃(µ̃)), with µ̃ drawn from a subset M̂ ⊆ M, then the approximation error of Gt(θ) incurred is given by the t-divergence

Gt(θ) − sup_{µ̃∈M̂} {⟨µ̃, θ⟩ + H_t(p̃(z; θ̃(µ̃)))} = inf_{µ̃∈M̂} D_t(p̃ ‖ p),   (4.27)

where

M̂ = {µ̃ | ∃ θ̂ s.t. E_{q̃(z;θ̂)}[Φ(z)] = µ̃}.

The simplest choice is to approximate p(z; θ) by

p̃(z; θ̃(µ̃)) = ∏_{j=1}^k p̃(z^j; θ̃^j), where   (4.28)
p̃(z^j; θ̃^j) = exp_t(⟨Φ^j(z^j), θ̃^j⟩ − G_t^j(θ̃^j)).

Denote p̃_j = p̃(z^j; θ̃^j) and let q̃_j be the corresponding escort distribution. For any n, one can write the t-divergence as

D_t(p̃ ‖ p) = ∫ q̃_n ( ∫ log_t p̃(z; θ̃) ∏_{j≠n} q̃_j dz^j ) dz^n − ∫ q̃_n ( ∫ log_t p(z; θ) ∏_{j≠n} q̃_j dz^j ) dz^n.

If we keep all θ̃^j for j ≠ n fixed, then the t-divergence is minimized by setting

∫ log_t p̃(z; θ̃) ∏_{j≠n} q̃_j dz^j = ∫ log_t p(z; θ) ∏_{j≠n} q̃_j dz^j + const.   (4.29)

Using the fact that ∫ ∏_{j≠n} q̃_j dz^j = 1, we can write

∫ log_t p̃(z; θ̃) ∏_{j≠n} q̃_j dz^j = (1/(1−t)) ∫ p̃^{1−t}(z; θ̃) ∏_{j≠n} q̃_j dz^j − 1/(1−t),
∫ log_t p(z; θ) ∏_{j≠n} q̃_j dz^j = (1/(1−t)) ∫ p^{1−t}(z; θ) ∏_{j≠n} q̃_j dz^j − 1/(1−t).

Since p(z; θ) is in the t-exponential family,

∫ p^{1−t}(z; θ) ∏_{j≠n} q̃_j dz^j = ∫ ( 1 + (1−t)(⟨Φ(z), θ⟩ − Gt(θ)) ) ∏_{j≠n} q̃_j dz^j
= 1 + (1−t)( ⟨E_{q̃,j≠n}[Φ(z)], θ⟩ − Gt(θ) ),   (4.30)

where E_{q̃,j≠n}[Φ(z)] = ∫ Φ(z) ∏_{j≠n} q̃_j dz^j. Similarly,

∫ p̃^{1−t}(z; θ̃) ∏_{j≠n} q̃_j dz^j
= ∫ ( 1 + (1−t)(⟨Φ^n(z^n), θ̃^n⟩ − G_t^n(θ̃^n)) ) ∏_{j≠n} ( 1 + (1−t)(⟨Φ^j(z^j), θ̃^j⟩ − G_t^j(θ̃^j)) ) q̃_j dz^j
= ( 1 + (1−t)(⟨Φ^n(z^n), θ̃^n⟩ − G_t^n(θ̃^n)) ) ∏_{j≠n} ( 1 + (1−t)(⟨E_{q̃_j}[Φ^j(z^j)], θ̃^j⟩ − G_t^j(θ̃^j)) ),   (4.31)

where E_{q̃_j}[Φ^j(z^j)] = ∫ Φ^j(z^j) q̃_j dz^j. Putting together (4.30) and (4.31) via (4.29), and absorbing all terms which do not depend on z^n into the constant, we obtain the following update equation:

⟨Φ^n(z^n), θ̃^n⟩ = ⟨E_{q̃,j≠n}[Φ(z)], θ⟩ ∏_{j≠n} exp_t( ⟨E_{q̃_j}[Φ^j(z^j)], θ̃^j⟩ − G_t^j(θ̃^j) )^{t−1} + const.   (4.32)

Different n are picked cyclically and θ̃^n(µ̃^n) is updated until a stationary point is reached. After obtaining p̃(z; θ̃(µ̃)), one can compute H_t(p̃(z; θ̃(µ̃))) and plug it into (4.27) to obtain the lower bound on Gt(θ). Note that, unlike the SBG entropy, the t-entropy H_t(p̃(z; θ̃(µ̃))) does not factorize easily. However, as shown in Appendix B.12, one can still compute a closed-form solution:

H_t(p̃(z; θ̃(µ̃))) = (1−t)^{k−1} ∏_{j=1}^k ( H_t(p_j(z_j)) + 1/((1−t) Z_j) ) − (1/(1−t)) ∏_{j=1}^k (1/Z_j),   (4.33)

where Z_j = ∫ p_j(z_j)^t dz_j.

Approximating Multivariate Student's T-Distribution
We illustrate the mean field method with an example which approximates a k-dimensional Student's t-distribution with degrees of freedom v and parameters µ and Σ. According to Example 4,

St(z; µ, Σ, v) = exp_t(⟨Φ(z), θ⟩ − Gt(θ)),

where Φ(z) = [z; z z^⊤], θ = [−2Ψ K µ/(1−t), Ψ K/(1−t)], 1/(t−1) = (v+k)/2, K = (vΣ)^{−1},

Ψ = ( Γ((v+k)/2) / ((πv)^{k/2} Γ(v/2) |Σ|^{1/2}) )^{−2/(v+k)}, and Gt(θ) = −(Ψ/(1−t))(µ^⊤ K µ + 1) + 1/(1−t).

In particular, our task is to approximate it by k one-dimensional Student's t-distributions with degrees of freedom ṽ. We choose this example because the true log-partition function Gt(θ) is analytically computable, so we can see how close the approximation is. In order to make the approximate distribution have the same t, we choose ṽ such that

1/(t−1) = (v+k)/2 = (ṽ+1)/2,

which yields ṽ = v + k − 1. The approximate distribution is

p̃(z; θ̃) = ∏_{j=1}^k p̃(z^j; θ̃^j) = ∏_{j=1}^k St(z^j; µ̃^j, σ̃^j, ṽ).

Using the representation of the t-exponential family of distributions,

p̃(z^j; θ̃^j) = exp_t(⟨Φ^j(z^j), θ̃^j⟩ − Gt(θ̃^j)),

with Φ^j(z^j) = [z^j; (z^j)²] and θ̃^j = [−2Ψ̃^j K̃^j µ̃^j/(1−t), Ψ̃^j K̃^j/(1−t)], where K̃^j = ṽ^{−1}(σ̃^j)^{−2} and Ψ̃^j = ( Γ((ṽ+1)/2) / (Γ(ṽ/2)(πṽ)^{1/2} σ̃^j) )^{−2/(ṽ+1)}.

Now we apply the variational updates (4.32) to obtain the variational parameters µ̃^j and σ̃^j for j = 1, ..., k; detailed derivations are provided in Appendix B.13. The resulting iterative updates are

µ̃^n = −(1/(2k^{nn})) ( −2µ^⊤ k^n + 2(µ̃^{j≠n})^⊤ k^{j≠n,n} ),
(σ̃^n)² = (K̃^n Ψ̃^n)^{−(ṽ+1)/ṽ} · ( Γ(ṽ/2)^{2/ṽ} π^{1/ṽ} ) / ( Γ((ṽ+1)/2)^{2/ṽ} ṽ ), where K̃^n Ψ̃^n = Ψ k^{nn} ∏_{j≠n} ( Ψ̃^j/ṽ + Ψ̃^j )^{−1}.

Here µ̃^{j≠n} denotes the vector (µ̃^j)_{j=1,...,k, j≠n}, k^n denotes the n-th column of K, k^{j≠n,n} denotes the n-th column of K with its n-th element deleted, and k^{nn} denotes the (n, n) entry of K. To empirically validate these updates, we use the product of ten 1-dimensional Student's t-distributions to approximate a 10-dimensional Student's t-distribution with degrees of freedom v = 5, which corresponds to setting t = 1.13. Both µ and Σ are generated randomly using Matlab. Overall, 500 variational updates were performed and the negative
t-divergence (−D_t(p̃ ‖ p)) is plotted as a function of the number of iterations in Figure 4.3. One can see that the t-divergence between the approximate distribution and the true distribution monotonically decreases until it reaches a stationary point. At that point, the t-divergence appears to be close to 0, which indicates that a reasonable approximation has been obtained.
Fig. 4.3. The negative t-divergence between the product of ten 1-dimensional Student’s t-distributions and one 10-dimensional Student’s t-distribution using the mean field approach for 500 iterations.
4.3.2 Assumed Density Filtering
This subsection introduces assumed density filtering in the t-exponential family. We approximate the original distribution p(z) by p̃(z; θ̃) = exp_t(⟨Φ(z), θ̃⟩ − Gt(θ̃)) by minimizing

D_t(p(z) ‖ p̃(z; θ̃)) = ∫ q(z) log_t p(z) − q(z) log_t p̃(z; θ̃) dz.   (4.34)

Using the fact that ∇_θ̃ Gt(θ̃) = E_q̃[Φ(z)], one can take the derivative of (4.34) with respect to θ̃ and obtain

E_q[Φ(z)] = E_q̃[Φ(z)].   (4.35)
In other words, the approximate distribution is obtained by matching the escort expectation of Φ(z) between the two distributions. To illustrate these ideas on a non-trivial problem, we again apply them to the Bayesian online learning problem. This time, instead of using a multivariate Gaussian prior as was done in [37], we use a Student's t-prior,

p0(w) = St(w; 0, I, v).   (4.36)

In addition, we will find a multivariate Student's t-distribution to approximate the true posterior

p(w | Dm) ∝ p0(w) ∏_{i=1}^m ti(w),   (4.37)

where ti(w) = p(yi | xi, w) is defined in (4.11). We initialize p̃(w; θ̃_(0)) = p0(w) = St(w; 0, I, v), and denote the approximate distribution after processing (x1, y1), ..., (xi, yi) by p̃i(w) := p(w; θ̃_(i)) = St(w; µ̃_(i), Σ̃_(i), v) for i ≥ 1. Define pi(w) ∝ p̃_{i−1}(w) ti(w); then the approximate posterior p̃i(w) is updated as

p̃i(w) = St(w; µ̃_(i), Σ̃_(i), v) = argmin_{µ,Σ} D_t(pi(w) ‖ St(w; µ, Σ, v)).   (4.38)
Assume that w is a k-dimensional parameter vector; then p̃i(w) is a k-dimensional Student's t-distribution with degrees of freedom v, for which Φ(w) = [w, w w^⊤] and t = 1 + 2/(v + k). From (4.35), the solution of (4.38) matches the escort moments of Φ(w) between qi(w) and q̃i(w),

∫ qi(w) w dw = ∫ q̃i(w) w dw, and   (4.39)
∫ qi(w) w w^⊤ dw = ∫ q̃i(w) w w^⊤ dw,   (4.40)
71 where q ˜i (w) ∝ p ˜i (w)t , qi (w) ∝ p ˜i−1 (w)t t˜i (w), and t˜i (w) = ti (w)t = t + (1 − )t − t Θ(yi hw, xi i),
(4.41)
where Θ(z) is the step function: Θ(z) = 1 if z > 0 and Θ(z) = 0 otherwise. Solving (4.39) and (4.40) yields the following simple update rules, which are reminiscent of the Bayesian online learning algorithm with a Gaussian distribution [37]. The detailed derivation is provided in Appendix B.14.

μ̃^(i) = Eq[w] = μ̃^(i−1) + α(i) yi Σ̃^(i−1) xi,

Σ̃^(i) = Eq[w w^⊤] − Eq[w] Eq[w]^⊤
      = r(i) Σ̃^(i−1) − (Σ̃^(i−1) xi) ( α(i) yi ⟨xi, μ̃^(i)⟩ / (xi^⊤ Σ̃^(i−1) xi) ) (Σ̃^(i−1) xi)^⊤,

where

α(i) = ( (1 − ε)^t − ε^t ) St(z(i); 0, 1, v) / ( Z2(i) √(xi^⊤ Σ̃^(i−1) xi) ),    r(i) = Z1(i) / Z2(i),

Z1(i) = ∫ p̃_{i−1}(w) t̃i(w) dw = ε^t + ( (1 − ε)^t − ε^t ) ∫_{−∞}^{z(i)} St(z; 0, 1, v) dz,

Z2(i) = ∫ q̃_{i−1}(w) t̃i(w) dw = ε^t + ( (1 − ε)^t − ε^t ) ∫_{−∞}^{z(i)} St(z; 0, v/(v + 2), v + 2) dz,

and z(i) = yi ⟨xi, μ̃^(i−1)⟩ / √(xi^⊤ Σ̃^(i−1) xi).
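A single step of these update rules can be sketched in code. This is a sketch under the reconstructed formulas above (the ε-dependent constants and the Σ-update coefficient are hard to read in some copies of the derivation, so treat them as assumptions); `st_pdf`, `st_cdf` and `student_t_online_update` are illustrative names, with the Student's t CDF approximated by simple midpoint integration.

```python
import math
import numpy as np

def st_pdf(z, v):
    """Density of the standard Student's t-distribution St(z; 0, 1, v)."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1.0 + z * z / v) ** (-(v + 1) / 2)

def st_cdf(z, v, n=20000):
    """CDF by midpoint integration from -50 (adequate for a sketch)."""
    lo = -50.0
    h = (z - lo) / n
    total = 0.0
    for i in range(n):
        total += st_pdf(lo + (i + 0.5) * h, v)
    return total * h

def student_t_online_update(mu, Sigma, x, y, v, eps):
    """One step of the Student's-t Bayesian online learning update,
    following the reconstructed formulas (an assumption, not verified
    against Appendix B.14)."""
    k = len(mu)
    t = 1.0 + 2.0 / (v + k)                       # t = 1 + 2/(v + k)
    s = float(x @ Sigma @ x)                      # x_i^T Sigma x_i
    z = y * float(x @ mu) / math.sqrt(s)          # z_(i)
    a = (1.0 - eps) ** t - eps ** t
    Z1 = eps ** t + a * st_cdf(z, v)
    # Z2 uses St(z; 0, v/(v+2), v+2); v/(v+2) is read as a squared scale.
    Z2 = eps ** t + a * st_cdf(z / math.sqrt(v / (v + 2.0)), v + 2.0)
    alpha = a * st_pdf(z, v) / (Z2 * math.sqrt(s))
    r = Z1 / Z2
    mu_new = mu + alpha * y * (Sigma @ x)
    Sx = Sigma @ x
    coef = alpha * y * float(x @ mu_new) / s      # reconstructed coefficient
    Sigma_new = r * Sigma - coef * np.outer(Sx, Sx)
    return mu_new, Sigma_new
```

With eps < 0.5 the step moves the posterior mean toward classifying (x, y) correctly, and the covariance update keeps Σ̃ symmetric.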
Synthetic Results  In classical binary classification problems, it is assumed that the underlying true classifier is fixed. However, in an online learning problem, the underlying classifier may change from time to time. In such a scenario, we require the learning algorithm to relearn the classifier quickly. The Student's t-distribution is a more conservative prior than the Gaussian distribution because of its heavy-tailed nature. As we will see on the following synthetic online dataset, the Bayesian online learning algorithm based on the Student's t-distribution is able to relearn the classifier much faster than the one based on the Gaussian distribution. In our experiments, we generate a sequence of 4000 data points. Each data example xi is randomly generated from a 100-dimensional isotropic Gaussian distribution N(0, I). In order to periodically change the underlying classifier, we partition the sequence evenly into 10 subsequences of length 400, and assign each subsequence a base weight parameter vector w̄^(s) ∈ {−1, +1}^100, where s ∈ {1, 2, . . . , 10}.
Each data point xi is labeled as yi = sign(⟨xi, w(i)⟩), where w(i) = w̄^(s) + n(i), s = ⌈i/400⌉, and n(i) is generated from the uniform distribution on [−0.1, 0.1]. The base weight vector w̄^(s) is generated in two ways: (I) from U{−1, +1}^100, where U{a, b} denotes p(a) = p(b) = 0.5; (II) based on the previous base weight vector w̄^(s−1) such that, for s ∈ {2, . . . , 10},

w̄_j^(s) = U{−1, +1} if j ∈ [10s − 9, 10s],  and  w̄_j^(s) = w̄_j^(s−1) otherwise.

We compare the Bayesian online learning algorithm with the Student's t-prior (with v = 3 and v = 10) against the one with the Gaussian prior. For both methods, we let ε = 0.01. We report the discrepancy D(i) between the true weight vector w(i) and the posterior mean E_{p̃i}[w] of p̃i(w) at each data point i in Figure 4.4, where

D(i) = ∑_{j=1}^{100} δ( w_j^(i) E_{p̃i}[w_j] > 0 ),
and the accumulated prediction error rate

E = (1/4000) ∑_{i=1}^{4000} δ( yi ⟨xi, E_{p̃i}[w]⟩ > 0 )
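The drifting stream (case I) and the discrepancy metric can be sketched as follows; `make_drifting_stream` and `discrepancy` are hypothetical helper names, and the exact sampling details are assumptions based on the description above.

```python
import numpy as np

def make_drifting_stream(n=4000, d=100, block=400, jitter=0.1, seed=0):
    """Case I of the synthetic online dataset: the base weight vector
    w_bar is redrawn from {-1,+1}^d at the start of every block."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))              # x_i ~ N(0, I)
    y = np.empty(n)
    W = np.empty((n, d))                         # true w_(i) per example
    for s in range(n // block):
        w_bar = rng.choice([-1.0, 1.0], size=d)  # base weights for block s
        for i in range(s * block, (s + 1) * block):
            W[i] = w_bar + rng.uniform(-jitter, jitter, size=d)
            y[i] = np.sign(X[i] @ W[i])
    return X, y, W

def discrepancy(w_true, w_est):
    """D(i): number of coordinates whose signs disagree (delta(.) is 0
    when the condition holds and 1 otherwise, as in the text)."""
    return int(np.sum(w_true * w_est <= 0))
```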
in Table 4.1. Here δ(·) is 0 if the condition inside (·) holds and 1 otherwise. From Figure 4.4, we can see that the discrepancy curve is periodic due to the change of the base weight parameter every 400 data points. The discrepancy curves of the Student's t-distributions (red and green) drop much faster than that of the Gaussian distribution (black). As a result, the accumulated error rates of the Student's t-distributions are also lower than that of the Gaussian distribution, as shown in Table 4.1.
Fig. 4.4. The discrepancy D(i) between the true weight vector w(i) and Ep˜i [w] the posterior mean of p ˜i (w) at each data i from the synthetic online dataset using Bayesian online learning. Left: case I. Right: case II.
Table 4.1
The accumulated prediction error rate on the synthetic online dataset using Bayesian online learning.

          Gauss    v = 3    v = 10
Case I    0.337    0.242    0.254
Case II   0.150    0.130    0.128

4.4 Chapter Summary

In this chapter, we investigated the conjugacy of the log-partition function of the t-exponential family of distributions, and studied a new t-entropy. By minimizing the t-divergence, the Bregman divergence based on the t-entropy, we generalized two well-known approximate inference approaches for the exponential family to the t-exponential family.
5. T-CONDITIONAL RANDOM FIELDS

In classification, a classifier predicts the label of a data point without considering the other examples or labels. However, real-world data usually has underlying structure, and taking this structure into account is beneficial both for prediction and for modeling. In the exponential family, the conditional random field (CRF) is a well-known statistical modeling method which extends logistic regression by modeling the structure of the data and labels. We will first briefly review graphical models and CRFs. Then, we will introduce a new model, the t-conditional random field (t-CRF), as a generalization of the CRF. In order to perform inference in our new model, a novel mean field based approach is presented. Finally, we will give two examples which demonstrate the robustness of the t-CRF.
5.1 Conditional Random Fields

5.1.1 Undirected Graphical Models
Many real-world applications involve a large number of variables which depend on each other [45]. Examples include parsing natural language sentences, annotating images, etc. Over the past two decades, probabilistic graphical models have been used to model such dependencies [46]. Our focus in this dissertation is on undirected graphical models G = (V, E), which contain two main components:
• V: the set of graph vertices (nodes), which represent the variables;
• E: the set of graph edges, which represent the dependencies between variables.
Probably the simplest illustrative example (see Figure 5.1) is a 3-node chain which consists of three variables (z^1, z^2, z^3). There are two edges in the graph, (z^1, z^2) and (z^2, z^3), while z^1 and z^3 are not directly connected. Intuitively, the way to interpret the model is that the value of z^1 depends on z^2, while the value of z^2 depends on z^1 and z^3.
Fig. 5.1. The 3-node chain model. Each node indicates a variable. Each edge on the graph represents a dependency.
At the heart of a graphical model lies the Markov property: if a set U1 ⊆ V is separated from U3 by another set U2 in G, then this corresponds to the conditional independence property U1 ⊥ U3 | U2 in the joint distribution over V. In Figure 5.1, since z^1 and z^3 are separated by z^2, we can conclude that z^1 ⊥ z^3 | z^2. In other words, p(z^1 | z^2, z^3) = p(z^1 | z^2) and p(z^3 | z^2, z^1) = p(z^3 | z^2). Furthermore, the Hammersley-Clifford theorem [46] states that a probability distribution satisfies the Markov property with respect to an undirected graph if, and only if, its density can be factorized over the cliques (fully connected subgraphs) of the graph. In Figure 5.1, there are two cliques, (z^1, z^2) and (z^2, z^3). Therefore, the probability distribution factorizes as

p(z) = (1/Z) Ψ(z^1, z^2) Ψ(z^2, z^3),

where Z = ∫ Ψ(z^1, z^2) Ψ(z^2, z^3) dz is the normalization constant.

5.1.2 Conditional Random Fields
The conditional random field [11, 12, 45], commonly abbreviated as CRF, is a graphical model of the conditional distribution p(y | X) of the label vector y given the observed feature variables X. In order to illustrate the main idea of a CRF, we will focus on the simplest 3-node chain model, slightly extending Figure 5.1 to a conditional model (see Figure 5.2). Interested readers may refer to [11, 12, 45] for more detailed discussions. The 3-node conditional chain model consists of three labels y = (y^1, y^2, y^3) as the random variables and three observed (feature) variables X = (x^1, x^2, x^3). The observed variables are shown in red in the figure. In addition to the chain structure among the labels, each y^j is connected with a feature variable x^j ∈ R^d. For simplicity, we assume that y^j is binary and takes values in {0, 1}. Therefore, there are 2^3 = 8 possible configurations of this conditional model.
Fig. 5.2. The 3-node conditional chain model. Blue nodes indicate the labels; red nodes indicate the data variables. Each edge on the graph represents a factor.
The chain CRF in Figure 5.2 satisfies the Markov property; therefore, y^1 ⊥ y^3 | y^2, X. Furthermore, thanks to the Hammersley-Clifford theorem, the conditional distribution p(y | X) can be factorized over cliques. Figure 5.2 contains two types of cliques, node cliques (x^j, y^j) and edge cliques (y^j, y^{j+1}); therefore,

p(y | X) = (1/Z(X)) ∏_{j=1}^{3} Ψ^v(x^j, y^j) · ∏_{j=1}^{2} Ψ^e(y^j, y^{j+1}),   (5.1)

where

Z(X) = ∑_{y∈{0,1}^3} ∏_{j=1}^{3} Ψ^v(x^j, y^j) · ∏_{j=1}^{2} Ψ^e(y^j, y^{j+1})   (5.2)

is a normalization constant which does not depend on y. Computing Z(X) requires summing over 2^3 = 8 different configurations of y. As the chain gets longer, the number of terms in the summation grows exponentially with the length of the chain. Therefore, we need efficient algorithms for computing Z(X). We will discuss one such algorithm in Section 5.1.4.
As for the choice of the conditional distribution, exponential family distributions have been widely used. Each clique is modeled by an exponential factor,

Ψ^v(x^j, y^j) = exp( ⟨Φ^v(x^j, y^j), θ_v⟩ ),
Ψ^e(y^j, y^{j+1}) = exp( ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ ), and

p(y | X; θ) = (1/Z(X; θ)) ∏_{j=1}^{3} exp( ⟨Φ^v(x^j, y^j), θ_v⟩ ) · ∏_{j=1}^{2} exp( ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ )
            = exp( ∑_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ + ∑_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ − G(X; θ) ),   (5.3)

where G(X; θ) is the log-partition function, which is equal to log Z(X; θ). In (5.3), θ_v denotes the parameter for node cliques and θ_e denotes the parameter for edge cliques. For simplicity, we assume that

Φ^v(x^j, y^j) = (δ^j_0 x^j, δ^j_1 x^j),
Φ^e(y^j, y^r) = (δ^{jr}_{00}, δ^{jr}_{01}, δ^{jr}_{10}, δ^{jr}_{11}),

where δ^j_g := δ(y^j = g) = 1 if y^j = g and δ^j_g = 0 otherwise; and δ^{jr}_{gf} is the abbreviation of δ(y^j = g)δ(y^r = f), with f, g ∈ {0, 1}.

5.1.3 Parameter Estimation
Similar to logistic regression, the loss function of a CRF is defined as its negative log-likelihood,

l(X, y; θ) = − log p(y | X; θ) = − ∑_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ − ∑_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ + G(X; θ),
whose gradient with respect to θ_v and θ_e can be computed as follows:

∂/∂θ_v l(X, y; θ) = − ∑_{j=1}^{3} Φ^v(x^j, y^j) + ∑_{j=1}^{3} E_p[Φ^v(x^j, y^j)],

∂/∂θ_e l(X, y; θ) = − ∑_{j=1}^{2} Φ^e(y^j, y^{j+1}) + ∑_{j=1}^{2} E_p[Φ^e(y^j, y^{j+1})].
Here E_p denotes the expectation with respect to p(y | X; θ). Given m training examples {(X_1, y_1), . . . , (X_m, y_m)}, the empirical risk is the average loss,

R_emp(θ) = (1/m) ∑_{i=1}^{m} l(X_i, y_i; θ) = −(1/m) ∑_{i=1}^{m} log p(y_i | X_i; θ),

and, as before, the regularized risk is given by J(θ) := λΩ(θ) + R_emp(θ). We will use Ω(θ) = (1/2) ‖θ‖²₂ as our regularizer. The model parameter θ is obtained by minimizing the regularized risk J(θ). Although other algorithms exist (e.g., SGD in [47]), we will use the L-BFGS algorithm.
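For the 3-node chain, the loss above can be computed by brute-force enumeration of the 2^3 configurations. The parameterization below (one weight vector per label for node cliques, one scalar per label pair for edge cliques) is an illustrative instantiation of Φ^v and Φ^e, not necessarily the exact one used later in the experiments.

```python
import itertools
import math
import numpy as np

def chain_score(y, X, theta_v, theta_e):
    """Unnormalized score of the 3-node chain: node terms plus edge terms.
    theta_v[label] is a weight vector, theta_e[label, label] a scalar
    (an illustrative parameterization of the clique features)."""
    s = sum(float(theta_v[y[j]] @ X[j]) for j in range(3))
    s += sum(float(theta_e[y[j], y[j + 1]]) for j in range(2))
    return s

def chain_neg_log_likelihood(y, X, theta_v, theta_e):
    """-log p(y|X; theta) via enumeration of all 2^3 configurations."""
    scores = [chain_score(c, X, theta_v, theta_e)
              for c in itertools.product((0, 1), repeat=3)]
    m = max(scores)                               # log-sum-exp for stability
    log_Z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_Z - chain_score(y, X, theta_v, theta_e)
```

Since Z(X; θ) is at least as large as the exponentiated score of any single configuration, the negative log-likelihood is always nonnegative.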
5.1.4 Inference

The main computational issue in parameter estimation for a CRF is how to estimate p(y | X; θ) and compute Z(X). As discussed earlier, the cost of the naive computation of Z(X) based on (5.2) grows exponentially with respect to the length of the chain, which is prohibitive for long chains. Fortunately, belief propagation [48], also known as the sum-product message passing algorithm, can be applied for efficient inference on the chain model. The basic idea of belief propagation is to take advantage of the factorization so that the sums over y are distributed and reused. For example, in the 3-node chain model,

Z(X) = ∑_{y∈{0,1}^3} ∏_{j=1}^{3} Ψ^v(x^j, y^j) · ∏_{j=1}^{2} Ψ^e(y^j, y^{j+1})
     = ∑_{y^3∈{0,1}} Ψ^v(x^3, y^3) ∑_{y^2∈{0,1}} Ψ^e(y^2, y^3) Ψ^v(x^2, y^2) ∑_{y^1∈{0,1}} Ψ^e(y^1, y^2) Ψ^v(x^1, y^1).   (5.4)

By computing

α_1(y^1) = Ψ^v(x^1, y^1),
α_2(y^2) = Ψ^v(x^2, y^2) ∑_{y^1∈{0,1}} Ψ^e(y^1, y^2) · α_1(y^1),
α_3(y^3) = Ψ^v(x^3, y^3) ∑_{y^2∈{0,1}} Ψ^e(y^2, y^3) · α_2(y^2),

we have Z(X) = ∑_{y^3∈{0,1}} α_3(y^3). Therefore, the number of summations drops from exponential (2^3) to linear (2 × 3) with respect to the chain length. The belief propagation algorithm can be further applied to any acyclic undirected graph, including trees. For more general graphical models which contain cycles, a variant called loopy belief propagation can be applied. However, loopy belief propagation may not converge, because the summations in Z(X) cannot be distributed as in (5.4). In such cases, other approximate inference methods such as mean field methods are widely applied.
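The forward recursion above can be sketched as follows, with a brute-force reference implementation to check it against; the potential-table layout is an assumption.

```python
import itertools
import numpy as np

def chain_partition(Psi_v, Psi_e):
    """Z(X) via the forward (alpha) recursion. Psi_v is a list of
    length-2 arrays of node potentials; Psi_e is a list of 2x2 arrays
    with Psi_e[j][a, b] = Psi^e(y^j = a, y^{j+1} = b)."""
    alpha = np.asarray(Psi_v[0], dtype=float)          # alpha_1(y^1)
    for j in range(1, len(Psi_v)):
        # alpha_j(y^j) = Psi^v_j(y^j) * sum_{y^{j-1}} Psi^e * alpha_{j-1}
        alpha = np.asarray(Psi_v[j]) * (alpha @ np.asarray(Psi_e[j - 1]))
    return float(alpha.sum())

def chain_partition_bruteforce(Psi_v, Psi_e):
    """Reference: sum over all 2^n configurations of the chain."""
    n = len(Psi_v)
    Z = 0.0
    for y in itertools.product(range(2), repeat=n):
        term = 1.0
        for j in range(n):
            term *= Psi_v[j][y[j]]
        for j in range(n - 1):
            term *= Psi_e[j][y[j], y[j + 1]]
        Z += term
    return Z
```

The recursion costs O(n) sums of constant size, while the brute force costs O(2^n), matching the exponential-to-linear reduction described in the text.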
5.2 T-Conditional Random Fields

The graphical models based on Markov properties are important and powerful for compactly representing multivariate distributions. However, they encode independence relations which are sometimes too strong. There is some existing research devoted to providing solutions to long-range dependencies by adding more edges, e.g. the skip-chain CRF [12]. However, considering long-range dependencies among all variables would require a complete graph, which dramatically increases the computational complexity. Furthermore, as we have already seen in previous chapters, the thin-tailed nature of the exponential family makes it potentially vulnerable to extreme outliers.

In order to address the above issues, we introduce the t-conditional random field (t-CRF), which replaces the exp function in a CRF with expt. Let us illustrate the t-CRF using the 3-node chain model in Figure 5.2, where

p(y | X; θ) = expt( ∑_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ + ∑_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ − Gt(X; θ) ).   (5.5)

Here, Gt(X; θ) is the log-partition function, which does not have an analytical solution for t ≠ 1. It satisfies

∑_{y∈{0,1}^3} expt( ∑_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ + ∑_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ − Gt(X; θ) ) = 1.

We will discuss efficient computation of Gt(X; θ) in Section 5.3. Here we focus on the modeling implications. Unlike the exponential function, expt(a + b) ≠ expt(a) expt(b); therefore (5.5) cannot be factorized like (5.1) for a CRF. In addition, for t > 1, expt decays towards 0 more slowly than exp. Because of these differences between expt and exp, the following three properties of the t-CRF differ from a CRF:
• The Markov property and the Hammersley-Clifford theorem do not hold for the t-CRF;
• Even variables that are not adjacent to each other in the graph may depend on each other;
• For t > 1, the t-CRF is more conservative¹ than the CRF.
The following example illustrates the above properties of the t-CRF.

Example 9 Consider the 3-node conditional chain model in Figure 5.2. Let x^j = 1 for j ∈ {1, 2, 3}, and θ_v = (1, −1), θ_e = (2, −2, −2, 2).

¹ By more conservative, we mean that the distribution is closer to the uniform distribution.
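The expt normalization of a t-CRF can be sketched by bisection on Gt, since the total mass ∑_y expt(score(y) − G) is decreasing in G. The scores below are hypothetical, and the sketch is meant to show the normalization mechanics (the text actually uses Algorithm 2 in Section 3.3 for this), not to reproduce Table 5.1.

```python
import math

def exp_t(x, t):
    """expt(x): exp(x) for t = 1; otherwise [1 + (1 - t) x]_+^{1/(1-t)}
    (for t > 1 the value diverges as x approaches 1/(t-1) from below)."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        return float('inf') if t > 1.0 else 0.0
    return base ** (1.0 / (1.0 - t))

def t_crf_probs(scores, t, iters=200):
    """Given unnormalized configuration scores, find G_t by bisection so
    that sum_y expt(score(y) - G_t) = 1, then return the probabilities."""
    lo, hi = min(scores) - 50.0, max(scores) + 50.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mass = sum(exp_t(s - mid, t) for s in scores)
        if mass > 1.0:
            lo = mid            # too much mass -> increase G_t
        else:
            hi = mid
    G = 0.5 * (lo + hi)
    p = [exp_t(s - G, t) for s in scores]
    total = sum(p)
    return [v / total for v in p]   # remove residual bisection error
```

At t = 1 this reduces to an ordinary softmax; for t > 1 the resulting distribution is closer to uniform, matching the "more conservative" behavior discussed above.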
We calculate the conditional probability of y^1 given y^2, y^3, X, as well as the marginal probability of y^1, with t = 1.1 and t = 1.5 in Table 5.1. To see the difference from a CRF (t = 1.0), we also include its probabilities for reference. For brevity, we use p(1|1, 1, X) to represent p(y^1 = 1 | y^2 = 1, y^3 = 1, X).
Table 5.1
Comparison of p(y^1 | X) and p(y^1 | y^2, y^3, X) between the t-CRF (t = 1.1 and 1.5) and the CRF (t = 1.0) in the 3-node chain example.

                 CRF (t = 1.0)   t = 1.1   t = 1.5
p(1 | X)            0.9947        0.9801    0.8294
p(1 | 1, 1, X)      0.9975        0.9913    0.9256
p(1 | 0, 0, X)      0.1192        0.2316    0.3945
p(1 | 1, 0, X)      0.9975        0.9627    0.7466
p(1 | 0, 1, X)      0.1192        0.2542    0.4128
From Table 5.1, we can see that y^3 and y^1 in the t-CRF are not conditionally independent given y^2. The larger the t, the stronger the dependency. For example, when t = 1.5, p(1|1, 1, X) = 0.9256 and p(1|1, 0, X) = 0.7466; when t = 1.1, p(1|1, 1, X) = 0.9913 and p(1|1, 0, X) = 0.9627; when t = 1.0, the t-CRF reduces to the CRF and the two conditional probabilities are both equal to 0.9975. In addition, by comparing p(1| X), we can see that the marginal probability of y^1 becomes more conservative as t gets larger.
Similar to CRF, we define the loss function of t-CRF as the negative log-likelihood, l(X, y; θ) = − log p(y | X; θ),
whose gradient with respect to θ_v can be computed as

−∂/∂θ_v log p(y | X; θ)
  = −∂/∂θ_v log expt( ∑_{j=1}^{3} ⟨Φ^v(x^j, y^j), θ_v⟩ + ∑_{j=1}^{2} ⟨Φ^e(y^j, y^{j+1}), θ_e⟩ − Gt(X; θ) )
  = ∑_{j=1}^{3} ( Eq[Φ^v(x^j, y^j)] − Φ^v(x^j, y^j) ) p(y | X; θ)^{t−1},

and the gradient with respect to θ_e is

−∂/∂θ_e log p(y | X; θ) = ∑_{j=1}^{2} ( Eq[Φ^e(y^j, y^{j+1})] − Φ^e(y^j, y^{j+1}) ) p(y | X; θ)^{t−1},
where q denotes the escort distribution q(y | X; θ) ∝ p(y | X; θ)t . Similar to t-logistic regression, the gradient of the loss contains a forgetting variable ξ := p(y | X; θ)t−1 , which caps the influence of the low-likelihood examples.
5.3 Approximate Inference

In order to compute l(X, y; θ) and ∇_θ l(X, y; θ), we need to estimate p(y | X; θ), which requires efficient computation of Gt(X; θ). In the t-CRF, since the probability distribution is not factorizable, belief propagation is inapplicable even for the 3-node chain model. The only way to estimate p(y | X; θ) is to use approximate inference. We apply the mean field method and approximate p(y | X; θ) by p̃(y | X; θ̃), such that l(X, y; θ) and ∇_θ l(X, y; θ) are approximated as

l(X, y; θ) ≈ − log p̃(y | X; θ̃),   (5.6)

∂/∂θ_v l(X, y; θ) ≈ ∑_{j=1}^{3} ( Eq̃[Φ^v(x^j, y^j)] − Φ^v(x^j, y^j) ) p̃(y | X; θ̃)^{t−1},   (5.7)

∂/∂θ_e l(X, y; θ) ≈ ∑_{j=1}^{2} ( Eq̃[Φ^e(y^j, y^{j+1})] − Φ^e(y^j, y^{j+1}) ) p̃(y | X; θ̃)^{t−1},   (5.8)

where q̃ ∝ p̃(y | X; θ̃)^t is the escort of the approximate distribution.
The most straightforward way to approximate the conditional distribution p(y | X; θ) is by a product of univariate probability distribution functions,

p̃(y | X; θ̃) = ∏_{j=1}^{3} p̃(y^j | x^j; θ̃^j) = ∏_{j=1}^{3} expt( ⟨Φ^v(x^j, y^j), θ̃^j_v⟩ + ⟨Φ^e(y^j), θ̃^j_e⟩ − G̃t(x^j; θ̃^j) ),

where the node feature Φ^v(x^j, y^j) = (δ^j_0 x^j, δ^j_1 x^j) is the same as that of the true distribution, and Φ^e(y^j) = (δ^j_0, δ^j_1) is a 2-dimensional feature reduced from the edge features Φ^e(y^j, y^r) of the true distribution. Note that p̃(y^j | x^j, θ̃^j) is the probability distribution of a univariate discrete variable, for which G̃t(x^j; θ̃^j) can be estimated efficiently given θ̃^j using Algorithm 2 in Section 3.3.

The variational parameters θ̃ are obtained by minimizing the t-divergence Dt(p̃ ‖ p). Using the variational updates in (4.32), we update θ̃^j while fixing all other θ̃^r for r ≠ j via

⟨Φ^v(x^j, y^j), θ̃^j_v⟩ = ⟨Φ^v(x^j, y^j), θ_v⟩ ∏_{r≠j} p̃(Eq̃r[y] | x^r, θ̃^r)^{t−1} + const.,

⟨Φ^e(y^j), θ̃^j_e⟩ = ∑_{r∈N(j)} ⟨Eq̃r[Φ^e(y^j, y)], θ_e⟩ ∏_{r≠j} p̃(Eq̃r[y] | x^r, θ̃^r)^{t−1} + const.,

where N(j) = {z^r ∈ V | (z^j, z^r) ∈ E} denotes the neighborhood of node j: N(1) = {2}, N(2) = {1, 3}, N(3) = {2}; q̃r denotes the escort distribution q(y | x^r; θ̃^r) ∝ p̃(y | x^r; θ̃^r)^t; and

p̃(Eq̃r[y] | x^r, θ̃^r) = expt( ⟨Eq̃r[Φ^v(x^r, y)], θ̃^r_v⟩ + ⟨Eq̃r[Φ^e(y)], θ̃^r_e⟩ − G̃t(x^r; θ̃^r) ).

The algorithm iteratively sweeps through all the variational parameters θ̃^j for j ∈ {1, 2, 3} until a stationary point is reached. After obtaining all p̃(y^j | x^j; θ̃^j), we can then compute the escort distributions and plug them into (5.6), (5.7) and (5.8).
5.4 2-D T-CRF

Similar to a CRF, it is also possible to build a t-CRF from general graphs. In this section, we briefly discuss the t-CRF on a 2-D grid (see Figure 5.3), which will be used in the experiments. The conditional distribution is given by

p(y | X; θ) = expt( ∑_{j∈V} ⟨Φ^v(x^j, y^j), θ_v⟩ + ∑_{(j,r)∈E} ⟨Φ^e(y^j, y^r), θ_e⟩ − Gt(X; θ) ),

where, with some abuse of notation, we let V denote the set that contains all the label indices (or input variable indices), and E the set that contains all the edges between the labels. Φ^v(x^j, y^j) and Φ^e(y^j, y^r) are defined as in the chain model.
Fig. 5.3. A 2-D conditional model. Blue nodes indicate the labels; red nodes indicate the observed input variables.
The loss is defined as − log p(y | X; θ), and its gradients with respect to θ_v and θ_e are

−∂/∂θ_v log p(y | X; θ) = ∑_{j∈V} ( Eq[Φ^v(x^j, y^j)] − Φ^v(x^j, y^j) ) p(y | X; θ)^{t−1},

−∂/∂θ_e log p(y | X; θ) = ∑_{(j,r)∈E} ( Eq[Φ^e(y^j, y^r)] − Φ^e(y^j, y^r) ) p(y | X; θ)^{t−1},

where the factor ξ = p(y | X; θ)^{t−1} is again the forgetting variable. In order to perform inference, we again apply the mean field method, which approximates the conditional distribution p(y | X; θ) by

p̃(y | X; θ̃) = ∏_{j∈V} p̃(y^j | x^j; θ̃^j).

The variational updates for each θ̃^j are exactly the same as in the chain model. The algorithm iteratively sweeps through all the variational parameters θ̃^j for j ∈ V, until a stationary point is reached.
5.5 Empirical Evaluation

In order to empirically compare the t-CRF and the CRF, we conduct the following two experiments.

Image Denoising Task  The first experiment is the image denoising task of [47]. The objective of this task is to recover the original image (the labels) from noisy images (the input data). The original image is shown in Figure 5.5 (top left). It is a 64×64-pixel binary image, where each pixel y_i takes the value 0 (black) or 1 (white). The input images are created by adding random Gaussian noise to every pixel of the original image. An example input image is shown in Figure 5.5 (top right). The synthetic dataset contains 30 training images and 20 test images. We model each image by the 2-D model in Figure 5.3, where each node represents a pixel in the image: x^j is equal to the normalized grey-scale value of the pixel in the noisy image, and y^j is the label of the pixel in the original image.

Image Annotation Task  The second experiment is a man-made structure detection task as in [47, 49]. This dataset consists of images of size 384×256 pixels from the Corel database. Each image is divided into 24×16 = 384 patches, each of size 16×16 pixels. The whole dataset contains 108 training images and 129 test images. The objective of this task is to classify whether a patch contains a man-made structure (labeled as 1) or not (labeled as 0). We again apply the 2-D model in Figure 5.3, where each node represents a patch in the image. We use a different x^j from the one used in [47, 49], because those features yield poor test performance for a CRF². The reason might be that the extracted features are not rich enough. In order to get better results, in addition to the 14-dimensional three-scale features used in [49], we also include the edge orientation histograms (EOH) of the three scales, each of which has 36 bins.

Extreme Noise  We are especially interested in testing the robustness of the algorithms under extreme noise. Therefore, in addition to training the algorithms on the original dataset, we also generate some extreme noise. To this end, we randomly select 20% of the training images and turn all the labels on the selected images to 1.

Parameter Setting  We choose the regularization constant λ ∈ {1, 10^−2, 10^−4, 10^−6}. The choice of the parameter t is more complicated. Similar to the Student's t-distribution, the value of t is related not only to heavy-tailedness, but also to the number of random variables k. In the Student's t-distribution, the relation of t and k is (t − 1)(v + k) = 2, where v > 0. The smaller the v, the heavier the tail of the distribution; when v → +∞, it reduces to the exponential family. Here we use a similar relation. In the image denoising task, where k = 4096, we choose t ∈ {1.0005, 1.0003, 1.0001, 1.00008, 1.00006}. In the image annotation task, where k = 384, we choose t ∈ {1.005, 1.003, 1.001, 1.0008, 1.0006}. In order to choose λ and t, we split the training set into 5 partitions for 5-fold cross validation.

Implementation and Optimization  We implement the t-CRF algorithm as well as its mean-field inference algorithm based on the UGM package, the Matlab code for undirected graphical models [50]. We use the L-BFGS method provided by the package for optimization. The model parameters θ are initialized to all zeros. We stop the L-BFGS algorithm when the change of the loss function or the norm of the gradient falls below 10^−7 or a maximum of 1000 function evaluations have been performed. In each function evaluation, the maximum number of mean-field iterations is set to 50.

Results  The selected λ and t parameters for each task are provided in Table 5.2.

² In [47], the test error of the CRF is around 12%.
Table 5.2
Optimal parameters t and λ for CRF and t-CRF in the image denoising task and the image annotation task. (0%) denotes no extreme noise added and (20%) denotes 20% extreme noise added.

                          CRF          t-CRF
Image Denoising (0%)      λ = 1        λ = 10^−2, t = 1.0003
Image Denoising (20%)     λ = 10^−6    λ = 10^−6, t = 1.0003
Image Annotation (0%)     λ = 10^−6    λ = 10^−6, t = 1.0008
Image Annotation (20%)    λ = 1        λ = 10^−6, t = 1.0008
In Figure 5.4, we compare the test errors of the CRF and the t-CRF. The blue bars represent the test error on the clean dataset, and the red bars represent the test error with 20% extreme noise. Both CRF and t-CRF perform well on the clean dataset. However, after the extreme noise is added, the test error of the CRF significantly increases. In comparison, the t-CRF performs much better than the CRF under extreme noise. We display a test example from the image denoising task in Figure 5.5. Both CRF and t-CRF trained on the clean training dataset (central left/right) are able to recover the original image. However, the recovery quality of the CRF trained on the noisy dataset (bottom left) visibly deteriorates, while the quality of the t-CRF (bottom right) stays almost the same. We also show two images from the image annotation task in Figure 5.6. We can clearly see the difference between the predictions by the CRF before (1st and 3rd row left) and after the extreme noise (2nd and 4th row left), as the latter misclassifies the leaves and the river bank as man-made structures. On the other hand, the predictions by the t-CRF appear to be the same before (1st and 3rd row right) and after adding extreme noise (2nd and 4th row right). The above two experiments empirically demonstrate that the t-CRF is more robust against extreme noise than the CRF.
Fig. 5.4. Test error between t-CRF and CRF with and without extreme noise added. Left: image denoising task. Right: image annotation task.
Fig. 5.5. Image denoising task. Top row is the dataset: left is the input image; right is the true label. Middle row is the denoising result without extreme noise: left is CRF, right is t-CRF. Bottom row is the denoising result with extreme noise: left is CRF, right is t-CRF.
Fig. 5.6. Image annotation task. The first and the third rows are the annotation results without extreme noise: left is CRF, right is t-CRF. The second and the fourth rows are the annotation results with extreme noise: left is CRF, right is t-CRF.
5.6 Chapter Summary

This chapter proposed the t-conditional random field. The t-CRF abandons the Markov properties of graphical models and is able to model dependencies between variables that are not adjacent. In addition, the t-CRF is more robust than the CRF because of the heavy-tailedness of the expt function for t > 1. The experiments empirically validate the robustness of the t-CRF.
6. GENERALIZED T-LOGISTIC REGRESSION

The previous chapter focused on applying the t-exponential family to structured models. In this chapter, we will return to the classification problem and study a generalization of t-logistic regression using the t-divergence. The generalized t-logistic regression contains a family of convex and non-convex losses, which will be investigated theoretically and empirically.
6.1 Binary Classification

To introduce our generalizations of the losses, let us begin with binary logistic regression. Logistic regression uses the conditional exponential family distribution to model a labeled example (x, y),

p(y | x; θ) = exp( (y/2) ⟨Φ(x), θ⟩ − G(x; θ) ),

and the logistic loss is equal to l(x, y; θ) = − log p(y | x; θ). Based on the probabilistic interpretation, the empirical risk of the logistic loss follows from the i.i.d. assumption on the data (see Section 2.2). However, from the perspective of Bregman divergences (see the definition in Appendix A.1), the logistic loss is also the K-L divergence between the empirical distribution δ(c = y)¹ and the conditional exponential family distribution p(c | x; θ). To see this,

D(δ(c = y) ‖ p(c | x; θ))
  = ∑_{c∈{±1}} δ(c = y) log δ(c = y) − ∑_{c∈{±1}} δ(c = y) log p(c | x; θ)   (the first sum is 0)
  = − log p(y | x; θ)
  = − log exp( (y/2) ⟨Φ(x), θ⟩ − G(x; θ) )
  = log( 1 + exp(−y ⟨Φ(x), θ⟩) ).

T-logistic regression generalizes logistic regression by replacing the exponential family distribution with a t-exponential family distribution. Since the t-divergence is a generalization of the K-L divergence, it is natural to consider replacing the K-L divergence with the t-divergence. This, however, abandons the i.i.d. assumption on the data. While this is somewhat controversial, in many cases it might actually capture the data generation process more accurately. Let t1 denote the parameter in the t-divergence, and t2 the parameter in the t-exponential family, such that

p(y | x; θ) = expt2( (y/2) ⟨Φ(x), θ⟩ − Gt2(x; θ) ).

Since the escort distribution of δ(c = y) is itself², the generalized t-logistic loss function of (x, y) is defined as:

l(x, y; θ) = Dt1( δ(c = y) ‖ p(c | x; θ) )
  = ∑_{c∈{±1}} δ(c = y) logt1 δ(c = y) − ∑_{c∈{±1}} δ(c = y) logt1 p(c | x; θ)   (the first sum is 0)
  = − logt1 p(y | x; θ)
  = − logt1 expt2( (y/2) ⟨Φ(x), θ⟩ − Gt2(x; θ) ).

¹ δ(c = y) = 1 if c = y, and = 0 otherwise.
² It is easy to verify that δ(c = y)^t = δ(c = y).
Note that, when t2 < 1, expt2(x) is equal to 0 for x ∈ (−∞, 1/(t2 − 1)]. Therefore, in this dissertation, we will restrict our focus to the case t1 > 0 and t2 ≥ 1. The gradient of the generalized t-logistic regression with respect to θ is

∇_θ l(x, y; θ) = −∇_θ logt1 expt2( (y/2) ⟨Φ(x), θ⟩ − Gt2(x; θ) )
  = −( (y/2) Φ(x) − Eq[(y/2) Φ(x)] ) p(y | x; θ)^{t2−t1}
  = −( y/2 − (y/2) q(y | x; θ) + (y/2) q(−y | x; θ) ) Φ(x) p(y | x; θ)^{t2−t1}
  = −y q(−y | x; θ) Φ(x) p(y | x; θ)^{t2−t1},   (6.1)

where Eq denotes the expectation over q(y | x; θ) ∝ p(y | x; θ)^{t2}, and we write ξ := p(y | x; θ)^{t2−t1} for the last factor. When t2 = t1, ξ reduces to 1. When t2 > t1, ξ caps the influence of the examples with low likelihood. On the other hand, when t2 < t1, ξ boosts the influence of the examples with low likelihood.
6.2 Properties

In this section, we will discuss properties of the generalized t-logistic regression with different t1 and t2. To this end, it is convenient to write the loss function in terms of the margin u = y ⟨Φ(x), θ⟩, such that

l(x, y; θ) = − logt1 expt2( u/2 − Gt2(u) ),   (6.2)

where Gt2(u) satisfies

expt2( u/2 − Gt2(u) ) + expt2( −u/2 − Gt2(u) ) = 1.

When t1 ≥ t2, the loss function is always convex. To see this, when t1 = t2, logt1 and expt2 cancel out and Gt2(u) is convex; logistic regression is the special case t1 = t2 = 1. When t1 > t2, the composite function − logt1 expt2 is convex and nonincreasing³. Since u/2 − Gt2(u) is concave, the composition of the two functions makes the loss for t1 > t2 convex (see (3.10) of Section 3.2.4 in [3]).

³ This is easy to verify, because −(∂/∂x) logt1 expt2(x) = − expt2(x)^{t2−t1} ≤ 0.
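The margin form (6.2) can be evaluated numerically by solving for Gt2(u) with bisection, since the left side of the constraint is decreasing in G. This is a sketch, not the implementation used in the experiments:

```python
import math

def exp_t(x, t):
    """expt(x): exp(x) for t = 1; otherwise [1 + (1 - t) x]_+^{1/(1-t)}."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        return float('inf') if t > 1.0 else 0.0
    return base ** (1.0 / (1.0 - t))

def log_t(x, t):
    """logt(x) = (x^{1-t} - 1) / (1 - t) for t != 1, log(x) for t = 1."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def G_t2(u, t2, iters=200):
    """Solve expt2(u/2 - G) + expt2(-u/2 - G) = 1 for G by bisection."""
    lo, hi = -abs(u) / 2.0 - 50.0, abs(u) / 2.0 + 50.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if exp_t(u / 2.0 - mid, t2) + exp_t(-u / 2.0 - mid, t2) > 1.0:
            lo = mid            # too much mass -> increase G
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gen_t_logistic_loss(u, t1, t2):
    """Loss (6.2) as a function of the margin u = y <Phi(x), theta>."""
    return -log_t(exp_t(u / 2.0 - G_t2(u, t2), t2), t1)
```

For t1 = t2 = 1 this recovers the logistic loss log(1 + e^{−u}); for t1 < 1 and t2 = 1 the loss is bounded above by 1/(1 − t1), which matches the bending-down behavior shown in Figure 6.1.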
Fig. 6.1. Generalized t-logistic regression with t2 = 1 and four different t1 : t1 = 1, t1 = 0.7, t1 = 0.4, t1 = 0.1.
When t1 < t2, the loss function is usually not convex. We will refer to the case of t1 < t2 as the mismatch loss and the case of t1 = t2 as the matching loss. T-logistic regression is a special mismatch loss with t1 = 1 and t2 > 1. Another interesting mismatch loss arises when t2 = 1 and t1 < 1: the conditional distribution is then an exponential family distribution, and therefore there is no additional cost in estimating Gt2(u). We plot this loss function with different t1 in Figure 6.1. We can see that the loss function bends down more as t1 decreases. Clearly, the losses with t1 ≥ t2 belong to the class of Robust Loss-0 from Section 3.2.2 because of convexity. On the other hand, in order to investigate the robustness of the mismatch losses with t1 < t2, we need to compute lim_{u→∞} |I(u)|. The detailed derivation is provided in Appendix B.15. Although their forgetting variables in (6.1) have similar functionality, somewhat surprisingly the mismatch losses cover all three robust types, depending on the value of t1:
• t1 > 1: Robust Loss-0;
• t1 = 1: Robust Loss-I;
• t1 < 1: Robust Loss-II.
In the following experiment, we will focus on comparing these three types of mismatch losses.
6.3 Multiclass Classification

The extension to multiclass classification is quite straightforward for the generalized t-logistic regression. Assume that the label y ∈ {1, ..., C}. Then the loss function is

l(x, y; θ) = −log_{t_1} exp_{t_2}( <Φ(x, y), θ> − G_{t_2}(x; θ) ),

where Φ(x, y) = (0, ..., 0, Φ(x), 0, ..., 0) places Φ(x) in the y-th of C blocks (blocks 1, ..., y−1 and y+1, ..., C are zero), and θ = (θ_1, ..., θ_C).

6.4
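The block feature map above can be sketched in a few lines; this is a minimal illustrative version (function and variable names are my own, not the dissertation's code).

```python
import numpy as np

def joint_feature(phi_x, y, num_classes):
    # Multiclass feature map Phi(x, y): place phi(x) in the y-th block
    # (1-indexed), with zeros in blocks 1..y-1 and y+1..C.
    phi_x = np.asarray(phi_x, dtype=float)
    d = phi_x.shape[0]
    out = np.zeros(num_classes * d)
    out[(y - 1) * d : y * d] = phi_x
    return out
```

With this layout, <Φ(x, y), θ> = <Φ(x), θ_y> when θ is the concatenation (θ_1, ..., θ_C).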
Empirical Evaluation

In the experiment, we used 20 binary classification datasets from Table 3.3 and 8 multiclass classification datasets from Table 3.4. We focus on comparing the following three types of mismatch losses:
• Type-0: t_1 = 2 and t_2 > 2;
• Type-I: t_1 = 1 and t_2 > 1;
• Type-II: t_1 < 1 and t_2 = 1.
For reference, we also include logistic regression and Savage loss. We will use the following shorthand: 'Logistic' for logistic regression; 'Savage' for Savage loss; 'Mis-0' for the Type-0 mismatch loss with {t_1 = 2, t_2 > 2}; 'Mis-I' for the Type-I mismatch loss with {t_1 = 1, t_2 > 1}; and 'Mis-II' for the Type-II mismatch loss with {t_1 < 1, t_2 = 1}.
Experimental Setting

The experimental setting, noise models, optimization algorithm, implementation and hardware are identical to Section 3.5. The only difference is that we select the t_1 or t_2 parameter from a pool of candidates in order to extensively compare the different types of mismatch losses:
• Mis-0: t_2 ∈ {2.3, 2.6, ..., 3.8};
• Mis-I: t_2 ∈ {1.3, 1.6, ..., 2.8};
• Mis-II: t_1 ∈ {0.1, 0.2, ..., 0.9}.
We also set the regularization constant to 10^−10 so that the impact of the regularizer is small compared to the loss functions.

Results

From Figure D.1 to Figure D.28, we report the test error from 5-fold cross validation with all-zero initialization (top) and the test error under ten random initializations on one of the five folds (bottom). All datasets are mixed with different noise models with ρ = 0.00 (blue), 0.05 (red), 0.10 (yellow); see the definition of ρ in Section 3.5. We find that the wings of all three losses bend down more as the dataset gets noisier. For example, in Table 6.1, Table 6.2 and Table 6.3, we summarize the number of binary datasets under the noise-2 model for which each value of t_2 (for Mis-0 and Mis-I) or t_1 (for Mis-II) is optimal based on cross validation. As ρ increases, t_2 for Mis-0 and Mis-I tends to be larger, while t_1 for Mis-II tends to be smaller. For example, t_2 = 3.8 for Mis-0 is optimal in 7 datasets for ρ = 0.00 but in 11 datasets for ρ = 0.05; t_2 = 2.5 for Mis-I is optimal in 6 datasets for ρ = 0.00 but in 13 datasets for ρ = 0.10; t_1 = 0.2 for Mis-II is optimal in 4 datasets for ρ = 0.00 but in 7 datasets for ρ = 0.05. To quickly compare the generalization performance of the different types of mismatch losses, in Table 6.4 and Table 6.5 we summarize the number of datasets where the test errors are significantly different under the noise-2 model. The comparisons under the noise-1 and noise-3 models are similar to the noise-2 model. When the model parameter is initialized to all-zero, Mis-II appears to be the most robust while Mis-0 appears to be the
Table 6.1
The number of binary datasets for which each value of t_2 for the Mis-0 loss is optimal based on cross validation. The total number of datasets is 20.

t_2       | 2.3  2.6  2.9  3.2  3.5  3.8
ρ = 0.00  |  6    1    2    1    3    7
ρ = 0.05  |  3    1    1    0    4   11
ρ = 0.10  |  2    0    2    1    4   11
Table 6.2
The number of binary datasets for which each value of t_2 for the Mis-I loss is optimal based on cross validation. The total number of datasets is 20.

t_2       | 1.3  1.6  1.9  2.2  2.5  2.8
ρ = 0.00  |  7    5    0    2    6    0
ρ = 0.05  |  3    7    0    0   10    0
ρ = 0.10  |  2    4    0    1   13    0
Table 6.3
The number of binary datasets for which each value of t_1 for the Mis-II loss is optimal based on cross validation. The total number of datasets is 20.

t_1       | 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
ρ = 0.00  |  3    4    0    3    1    3    0    3    3
ρ = 0.05  |  3    7    0    3    4    1    0    1    1
ρ = 0.10  |  4    7    0    3    4    0    0    1    1
least robust. This result is consistent with the robustness type to which each of the losses belongs. Furthermore, although Mis-0 and logistic regression both belong to type-0 robust
Table 6.4
The number of binary classification datasets for which the test error between mismatch losses differs significantly. Each column of the table is the number of datasets where a certain type of mismatch loss has significantly lower test error than another. The total number of datasets is 20.

ρ = 0.00  |  0    0    8    1    9    3
ρ = 0.05  |  0    0   12    0   12    9
ρ = 0.10  |  0    0   14    0   14   12
Table 6.5
The number of multiclass classification datasets for which the test error between mismatch losses differs significantly. Each column of the table is the number of datasets where a certain type of mismatch loss has significantly lower test error than another. The total number of datasets is 8.

ρ = 0.00  |  0    0    6    1    6    3
ρ = 0.05  |  0    0    6    0    7    5
ρ = 0.10  |  0    0    6    0    7    5
loss, Mis-0 appears to be more robust than logistic regression on most datasets because of its non-convexity. On the other hand, when the parameter is randomly initialized, the Mis-II loss becomes unstable on around half of the datasets. This phenomenon is very similar to Savage loss, which also belongs to the type-II robust losses. In contrast, the solutions of Mis-0 and Mis-I are very stable against random initialization.
6.5 Chapter Summary

In this chapter, we combined the t-divergence with t-logistic regression and proposed the generalized t-logistic regression for classification. By choosing different t_1 and t_2, we can obtain both convex and nonconvex losses, as well as all three types of robustness. We empirically evaluated these losses on various binary and multiclass datasets.
7. SUMMARY

We conclude this dissertation with a summary of contributions and a discussion of future work.
7.1 Contributions

This dissertation is devoted to designing robust probabilistic models in machine learning based on the t-exponential family of distributions. Below we list and detail our main contributions.

Classification Using the T-Exponential Family   Our first contribution is to apply the t-exponential family in a probabilistic model for classification. Since the algorithm is based on the same probabilistic framework as logistic regression, we call it t-logistic regression. The algorithm is implemented using PETSc and TAO for efficient parallel computing. We tested our algorithm on a variety of publicly available datasets, which demonstrates the robustness and stability of the algorithm.

T-Entropy, T-Divergence, and Approximate Inference   Our second contribution is a new t-entropy and t-divergence. The t-entropy is an important concept because it is the Fenchel conjugate of the log-partition function of the t-exponential family, and the t-divergence is the Bregman divergence based on the t-entropy. We further show that the t-divergence can be used to perform efficient approximate inference on multivariate t-exponential family distributions.

Graphical Models in the T-Exponential Family   Our third contribution is a generalization of conditional random fields (CRFs) using the t-exponential family. The new t-CRF appears to be more robust than the exponential family based CRF, and is able to capture interactions among nonadjacent nodes in a graphical model. The inference is based on the mean field method, which minimizes the t-divergence between the approximate and the true distribution.

Classification Using the T-Exponential Family and T-Divergence   Our fourth contribution is to further generalize t-logistic regression by replacing the K-L divergence with the t-divergence. This yields a larger family of loss functions for classification, which includes losses with different types of robustness.
7.2 Future Work

We list some potential future work in this section.

More Insights into Local Minima   We have theoretically justified that all non-convex losses may get stuck in local minima on some adversarial datasets. However, in our experiments, we find that certain loss functions such as t-logistic regression are quite stable against random initialization. It may be that t-logistic regression creates far fewer local minima, or that the local minima do not appear in real-world data. It would be interesting to characterize this phenomenon theoretically.

Boosting   Maximum entropy (maxent) and maximum likelihood estimation are dual problems [51]. Maximum likelihood models work with distributions which need to be normalized. In contrast, one can drop the normalization constraint from maxent problems and derive novel algorithms. Although this does not lead to probabilistic models, one can sidestep the computation of the log-partition function, which can be advantageous in some cases. As [51] show, dualizing classical maxent after dropping the normalization constraints yields AdaBoost. Similarly, a t-entropy based boosting algorithm can be derived and investigated.

φ-Exponential Family   One can further generalize the algorithms and theorems proposed in this dissertation to the φ-exponential family [8, 9]. However, one needs to find φ-functions which yield interesting and useful properties. It would also be interesting to investigate the physical meaning of the entropy and divergence proposed in this dissertation.
LIST OF REFERENCES

[1] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le, "Bundle methods for regularized risk minimization," Journal of Machine Learning Research, vol. 11, pp. 311–365, January 2010.
[2] S. Ben-David, N. Eiron, and P. M. Long, "On the difficulty of approximately maximizing agreements," Journal of Computer and System Sciences, vol. 66, no. 3, pp. 496–514, 2003.
[3] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, England: Cambridge University Press, 2004.
[4] P. Long and R. Servedio, "Random classification noise defeats all convex potential boosters," Machine Learning Journal, vol. 78, no. 3, pp. 287–304, 2010.
[5] N. Manwani and P. S. Sastry, "Noise tolerance under risk minimization," 2012. [http://arxiv.org/pdf/1109.5231].
[6] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Report 649, UC Berkeley, Department of Statistics, September 2003.
[7] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[8] J. Naudts, "Deformed exponentials and logarithms in generalized thermostatistics," Physica A, vol. 316, pp. 323–334, 2002. [http://arxiv.org/pdf/cond-mat/0203489].
[9] J. Naudts, "Estimators, escort probabilities, and φ-exponential families in statistical physics," Journal of Inequalities in Pure and Applied Mathematics, vol. 5, no. 4, 2004.
[10] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics," Journal of Statistical Physics, vol. 52, pp. 479–487, 1988.
[11] J. D. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: probabilistic models for segmenting and labeling sequence data," in Proceedings of International Conference on Machine Learning, vol. 18, (San Francisco, CA), pp. 282–289, Morgan Kaufmann, 2001.
[12] C. Sutton and A. McCallum, "An introduction to conditional random fields for relational learning," in Introduction to Statistical Relational Learning, 2006.
[13] O. E. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory. New York: John Wiley and Sons, 1978.
[14] P. Grünwald and A. Dawid, "Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory," Annals of Statistics, vol. 32, no. 4, pp. 1367–1433, 2004.
[15] C. R. Shalizi, "Maximum likelihood estimation for q-exponential (Tsallis) distributions," 2007. [http://arxiv.org/abs/math.ST/0701854].
[16] T. D. Sears, Generalized Maximum Entropy, Convexity, and Machine Learning. PhD thesis, Australian National University, 2008.
[17] A. Sousa and C. Tsallis, "Student's t- and r-distributions: unified derivation from an entropic variational principle," Physica A, vol. 236, pp. 52–57, 1994.
[18] C. Tsallis, R. S. Mendes, and A. R. Plastino, "The role of constraints within generalized nonextensive statistics," Physica A: Statistical and Theoretical Physics, vol. 261, pp. 534–554, 1998.
[19] R. T. Rockafellar, Convex Analysis, vol. 28 of Princeton Mathematics Series. Princeton, NJ: Princeton University Press, 1970.
[20] J. S. Rosenthal, A First Look at Rigorous Probability Theory. World Scientific Publishing, 2006.
[21] M. Gell-Mann and C. Tsallis, eds., Nonextensive Entropy. Santa Fe Institute Studies in the Sciences of Complexity, Oxford, 2004.
[22] A. Zellner, "Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms," Journal of the American Statistical Association, vol. 71, no. 354, pp. 400–405, 1976.
[23] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds," Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
[24] H. Masnadi-Shirazi, N. Vasconcelos, and V. Mahadevan, "On the design of robust classifiers for computer vision," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[25] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions. New York: Wiley, 1986.
[26] A. O'Hagan, "On outlier rejection phenomena in Bayes inference," Journal of the Royal Statistical Society, vol. 41, no. 3, pp. 358–367, 1979.
[27] N. Ding and S. V. N. Vishwanathan, "t-Logistic regression," in Advances in Neural Information Processing Systems 23, 2010.
[28] A. Tewari and P. L. Bartlett, "On the consistency of multiclass classification methods," Journal of Machine Learning Research, vol. 8, pp. 1007–1025, 2007.
[29] T. Kuno, Y. Yajima, and H. Konno, "An outer approximation method for minimizing the product of several convex functions on a convex set," Journal of Global Optimization, vol. 3, pp. 325–335, September 1993.
[30] C. J. Hsieh, K. W. Chang, C. J. Lin, S. S. Keerthi, and S. Sundararajan, "A dual coordinate descent method for large-scale linear SVM," in Proceedings of International Conference on Machine Learning, pp. 408–415, 2008.
[31] C. J. Merz and P. M. Murphy, "UCI repository of machine learning databases," 1998. Irvine, CA: University of California, Department of Information and Computer Science.
[32] V. Franc and S. Sonnenburg, "Optimized cutting plane algorithm for support vector machines," in Proceedings of International Conference on Machine Learning, pp. 320–327, 2008.
[33] S. Sonnenburg, V. Franc, E. Yom-Tov, and M. Sebag, "Pascal large scale learning challenge," 2008. [http://largescale.ml.tu-berlin.de/workshop/].
[34] D. Mease and A. Wyner, "Evidence contrary to the statistical view of boosting," Journal of Machine Learning Research, vol. 9, pp. 131–156, February 2008.
[35] X. Zhang, A. Saha, and S. V. N. Vishwanathan, "Smoothing multivariate performance measures," Journal of Machine Learning Research, vol. 13, pp. 3589–3646, 2013.
[36] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman & Hall, 1995.
[37] T. Minka, Expectation Propagation for Approximate Bayesian Inference. PhD thesis, MIT Media Lab, Cambridge, USA, 2001.
[38] X. Boyen and D. Koller, "Tractable inference for complex stochastic processes," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1998.
[39] M. Opper, "A Bayesian approach to online learning," in Online Learning in Neural Networks, pp. 363–378, Cambridge University Press, 1998.
[40] A. Rényi, "On measures of information and entropy," in Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, pp. 547–561, 1960.
[41] J. D. Lafferty, "Additive models, boosting, and inference for generalized divergences," in Proceedings of the Annual Conference on Computational Learning Theory, vol. 12, pp. 125–133, ACM Press, New York, NY, 1999.
[42] T. Minka, "Divergence measures and message passing," Report 173, Microsoft Research, 2005.
[43] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299–318, 1967.
[44] J. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms, I and II, vol. 305 and 306. Springer-Verlag, 1996.
[45] C. Sutton and A. McCallum, "An introduction to conditional random fields," Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373, 2011.
[46] M. Meila, "Lecture 3: Graphical models of conditional independence," STAT 535 Statistical Learning: Modeling, Prediction and Computing, 2011.
[47] S. V. N. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy, "Accelerated training of conditional random fields with stochastic gradient methods," in Proceedings of International Conference on Machine Learning, (New York, NY, USA), pp. 969–976, ACM Press, 2006.
[48] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[49] S. Kumar and M. Hebert, "Man-made structure detection in natural images using a causal multiscale random field," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[50] M. Schmidt, "UGM: Matlab code for undirected graphical models," 2007. [http://www.di.ens.fr/~mschmidt/Software/UGM.html].
[51] G. Lebanon and J. Lafferty, "Boosting and maximum likelihood for exponential models," in Advances in Neural Information Processing Systems 14 (T. G. Dietterich, S. Becker, and Z. Ghahramani, eds.), MIT Press, 2001.
[52] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming, vol. 45, no. 3, pp. 503–528, 1989.
APPENDICES

Appendix A: Fundamentals of Convex Optimization

A.1 Convex Analysis

In this section, we review some concepts and properties from convex analysis. All definitions and most properties can be found in [3, 44].

Definition A.1 (Convex set) A set C ⊆ R^d is convex if for any two points x_1, x_2 ∈ C and any λ ∈ (0, 1), we have λ x_1 + (1 − λ) x_2 ∈ C. In other words, the line segment between any two points of C must lie in C.

Definition A.2 (Open set) A set C ⊆ R^d is open if for any point x ∈ C there exists an ε > 0 such that z ∈ C for all z with ||z − x|| < ε. In other words, there is an ε-ball around x, B_ε(x) := {z : ||z − x|| < ε}, which is contained in C.

Definition A.3 (Convex function) Given a convex set C, a function f : C → R is convex if for any two points x_1, x_2 ∈ C and any λ ∈ (0, 1), we have f(λ x_1 + (1 − λ) x_2) ≤ λ f(x_1) + (1 − λ) f(x_2). In general, we can define a generalized function f : R^d → R̄ = R ∪ {+∞} such that f(x) = +∞ for all x ∉ C, and we call the set C, on which f is finite, the domain of f:

dom f := { x ∈ R^d : f(x) < ∞ }.

Definition A.4 (Subgradient and subdifferential) Given a function f and a point x with f(x) < ∞, a vector u is called a subgradient of f at x if

f(y) − f(x) ≥ <y − x, u>,   ∀ y ∈ R^d.

The set of all such u is called the subdifferential of f at x, and is denoted by ∂f(x). A function f is convex iff ∂f(x) is nonempty for all x with f(x) < ∞. If f is moreover differentiable at x, then ∂f(x) is a singleton consisting of the gradient of f at x, ∇f(x).

Definition A.5 (Bregman divergence) Let F : Ω → R be a continuously differentiable, real-valued and strictly convex function defined on a closed convex set Ω. The Bregman divergence associated with F for points p, q ∈ Ω is the difference between the value of F at point p and the value of the first-order Taylor expansion of F around point q evaluated at point p:

D_F(p, q) = F(p) − F(q) − <∇F(q), p − q>.

Definition A.6 (Fenchel-Legendre conjugate) Given a function f : R^d → R̄, its Fenchel dual is defined as

f*(μ) := sup_{x ∈ R^d} { <x, μ> − f(x) }.
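Definition A.5 is easy to check numerically. The sketch below (my own illustrative code, not from the dissertation) shows the two standard instances: F(x) = ½||x||² yields half the squared Euclidean distance, and the negative entropy F(x) = Σ_i x_i ln x_i yields the K-L divergence between probability vectors.

```python
import numpy as np

def bregman(F, gradF, p, q):
    # D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
    return F(p) - F(q) - gradF(q).dot(p - q)

# F(x) = 0.5 ||x||^2 recovers half the squared Euclidean distance
F = lambda x: 0.5 * x.dot(x)
gradF = lambda x: x

# F(x) = sum_i x_i ln x_i (negative entropy) recovers the K-L divergence
# between probability vectors (both summing to one)
negent = lambda x: float(np.sum(x * np.log(x)))
grad_negent = lambda x: np.log(x) + 1.0
```

The K-L case is exactly the pattern used later for the t-divergence, with the t-entropy in place of the Shannon entropy.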
Example 10 (Relative entropy) Suppose

f(x) = Σ_{i=1}^d x_i ln( x_i / (1/d) )   if x ∈ Δ_d,   and   f(x) = +∞ otherwise,

where Δ_d is the d-dimensional simplex { x ∈ [0, 1]^d : Σ_i x_i = 1 }. Then

f*(μ) = ln( (1/d) Σ_{i=1}^d exp(μ_i) ).

Theorem A.1 (Dual connection) f*, as a supremum of linear functions, is always convex and closed. If f is convex and closed, then f(x) + f*(μ) − <x, μ> ≥ 0, and equality is attained iff μ ∈ ∂f(x) iff x ∈ ∂f*(μ). Furthermore, f** = f.
A.2 Numerical Optimization

In this section, we review two widely used numerical optimization algorithms. The limited-memory BFGS method is the most popular quasi-Newton algorithm for large-scale optimization, and coordinate descent methods also have a wide range of applications, including the expectation-maximization algorithm in statistics.

Limited-Memory BFGS Method

The BFGS method is named after its discoverers Broyden, Fletcher, Goldfarb, and Shanno. Like other quasi-Newton methods, its basic idea is to estimate the (inverse) Hessian from the changes in gradients. Denote the k-th iterate by x_k and the function value by f_k. Let the change in the k-th iteration be s_k = x_{k+1} − x_k and the change in gradients be y_k = ∇f_{k+1} − ∇f_k. The approximate inverse Hessian H_{k+1} in the k-th iteration is obtained from

min_H || H − H_k ||_W   s.t.   H = H^T,   H y_k = s_k.

Here || A ||_W = || W^{1/2} A W^{1/2} ||_F, where || · ||_F denotes the Frobenius norm and W is any matrix satisfying W s_k = y_k. The update of H_{k+1} turns out to be

H_{k+1} = ( I − ρ_k s_k y_k^T ) H_k ( I − ρ_k y_k s_k^T ) + ρ_k s_k s_k^T,   where ρ_k = 1 / (y_k^T s_k).

The limited-memory BFGS (L-BFGS) algorithm is a limited-memory variant of the BFGS method. Unlike the original BFGS method, which stores a dense matrix H_k, L-BFGS only stores a few vectors that represent the approximation implicitly. There are multiple published approaches to using a history of updates to form the direction vector; interested readers may refer to [52] for more details.
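The text does not spell out how L-BFGS forms the direction from the stored vectors; the standard device is the two-loop recursion, sketched below with a simple Armijo backtracking line search. This is a minimal illustrative implementation (all names are my own), not the PETSc/TAO code used in the dissertation.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    # Two-loop recursion: apply the inverse-Hessian approximation implied by
    # the stored (s_k, y_k) pairs to the current gradient, without ever
    # forming the dense matrix H_k.
    q = grad.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    if s_hist:  # initial scaling H_0 = gamma * I from the newest pair
        s, y = s_hist[-1], y_hist[-1]
        q *= s.dot(y) / y.dot(y)
    for (s, y, rho), a in zip(zip(s_hist, y_hist, rhos), reversed(alphas)):
        b = rho * y.dot(q)
        q += (a - b) * s
    return -q

def minimize(f, f_grad, x0, m=5, iters=100):
    x = x0.copy()
    g = f_grad(x)
    s_hist, y_hist = [], []
    for _ in range(iters):
        d = lbfgs_direction(g, s_hist, y_hist)
        step = 1.0
        while f(x + step * d) > f(x) + 1e-4 * step * g.dot(d):
            step *= 0.5          # backtracking (Armijo) line search
        x_new = x + step * d
        g_new = f_grad(x_new)
        s, y = x_new - x, g_new - g
        if y.dot(s) > 1e-12:     # keep only pairs with positive curvature
            s_hist.append(s)
            y_hist.append(y)
            if len(s_hist) > m:  # limited memory: drop the oldest pair
                s_hist.pop(0)
                y_hist.pop(0)
        x, g = x_new, g_new
        if np.linalg.norm(g) < 1e-8:
            break
    return x
```

The positive-curvature check keeps the implicit H_k positive definite, which guarantees that the two-loop recursion returns a descent direction.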
Coordinate Descent Method

Coordinate descent methods are a class of algorithms which, in each iteration, search for a descent direction over a subset of the coordinates in order to find the optimum. In the simplest version, one performs a line search along a single coordinate direction at the current point in each iteration, cycling through the coordinate directions throughout the procedure. For example, in the n-th iteration, the j-th coordinate of x_{n+1} is given by

x^j_{n+1} = argmin_{y ∈ R} f( x^1_n, ..., x^{j−1}_n, y, x^{j+1}_n, ..., x^k_n ).

If one searches for a descent direction over more than one coordinate at a time, the method is sometimes called block coordinate descent. For example, the expectation-maximization (EM) algorithm can be viewed as a block coordinate descent method. In addition, the ζ-θ algorithm proposed to solve the convex multiplicative programming problem is also based on block coordinate descent.
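For a quadratic f(x) = ½ xᵀA x − bᵀx with A symmetric positive definite, the one-dimensional coordinate minimization has a closed form, and cyclic coordinate descent reduces to the Gauss-Seidel iteration. A minimal sketch (my own illustrative code):

```python
import numpy as np

def coordinate_descent(A, b, x0, sweeps=100):
    # Exact cyclic coordinate descent for f(x) = 0.5 x^T A x - b^T x.
    # Minimizing over coordinate j with the others fixed gives the
    # closed-form (Gauss-Seidel) update below.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(sweeps):
        for j in range(len(x)):
            x[j] = (b[j] - A[j].dot(x) + A[j, j] * x[j]) / A[j, j]
    return x
```

For symmetric positive definite A this iteration converges to the unique minimizer A⁻¹b.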
Appendix B: Technical Proofs and Verifications

B.1 Proof of Theorem 2.3.1

Proof. Since the covariance matrix is positive semi-definite, if we show that ∇²G(θ) = Var[Φ(z)], then it automatically implies that G is convex. To show (2.15), use the regularity condition and expand

∇_θ G(θ) = ∫ Φ(z) exp<Φ(z), θ> dz / ∫ exp<Φ(z), θ> dz = ∫ Φ(z) p(z; θ) dz = E[Φ(z)].   (B.1)

Next take the second derivative, use (B.1), and the definition of variance to write

∇²_θ G(θ) = ∫ Φ(z) [ Φ(z) − ∇G(θ) ]^T p(z; θ) dz = E[ Φ(z) Φ(z)^T ] − E[Φ(z)] E[Φ(z)]^T = Var[Φ(z)].
B.2 Proof of Theorem 2.3.2

Proof. Because it is unclear how to compute the second derivative of G_φ, we cannot follow the same route as in Theorem 2.3.1 to prove convexity; the proof therefore relies on more elementary arguments. Recall that exp_φ is an increasing and strictly convex function. Choose θ_1 and θ_2 such that G_φ(θ_i) < ∞ for i = 1, 2, let α ∈ (0, 1), and set θ_α = α θ_1 + (1 − α) θ_2. By convexity of exp_φ,

∫ exp_φ( <Φ(z), θ_α> − α G_φ(θ_1) − (1 − α) G_φ(θ_2) ) dz
  ≤ α ∫ exp_φ( <Φ(z), θ_1> − G_φ(θ_1) ) dz + (1 − α) ∫ exp_φ( <Φ(z), θ_2> − G_φ(θ_2) ) dz = 1.

On the other hand, we also have

∫ exp_φ( <Φ(z), θ_α> − G_φ(θ_α) ) dz = 1.

Again using the fact that exp_φ is an increasing function, we can conclude from the above two equations that

G_φ(θ_α) ≤ α G_φ(θ_1) + (1 − α) G_φ(θ_2).

This shows that G_φ is a convex function. We now show (2.17), using (2.16) and (2.13) combined with the fact that (d/du) exp_φ(u) = φ(exp_φ(u)):

0 = ∫ ∇_θ p(z; θ) dz = ∫ ∇_θ exp_φ( <Φ(z), θ> − G_φ(θ) ) dz   (by (2.16))
  = ∫ φ( exp_φ( <Φ(z), θ> − G_φ(θ) ) ) ( Φ(z) − ∇_θ G_φ(θ) ) dz
  = ∫ φ( p(z; θ) ) dz · ∫ q(z; θ) ( Φ(z) − ∇_θ G_φ(θ) ) dz   (by (2.13))
  = ∫ φ( p(z; θ) ) dz · ( E_{q(z;θ)}[Φ(z)] − ∇_θ G_φ(θ) ).
B.3 Proof of Lemma 3.2.1

Proof. For simplicity, assume that ||θ|| = 1. Then

u = y <Φ(x), θ> = y ||Φ(x)|| cos ψ   ⇒   |u| / |cos ψ| = ||Φ(x)||.

Therefore,

||∇_θ l(x, y, θ)|| = || l'(u) y Φ(x) || = | l'(u) u / cos ψ | = |I(u)| / |cos ψ|.

Since I(u) is bounded, say |I(u)| ≤ C, we have

P( ||∇_θ l(x, y, θ)|| → ∞ ) = P( |I(u)| / |cos ψ| → ∞ ) ≤ P( C / |cos ψ| → ∞ ) = P( cos ψ = 0 ).

Because there is no point mass at ψ = π/2,

P( cos ψ = 0 ) = P( ψ = π/2 ) = 0.

B.4 Proof of Lemma 3.2.2
Proof. Using the results in Section B.3, if ψ ≠ π/2, then as ||Φ(x)|| → ∞ we have u = y <Φ(x), θ> = y ||Φ(x)|| cos ψ → ∞. Furthermore, since lim_{u→∞} I(u) = 0, we have

lim_{||Φ(x)|| → ∞} ||∇_θ l(x, y, θ)|| = lim_{u→∞} |I(u)| / |cos ψ| = 0.

Because there is no point mass at ψ = π/2, we conclude that

P( lim_{||Φ(x)|| → ∞} ||∇_θ l(x, y, θ)|| = 0 ) = 1,
B.5 Proof of Theorem 3.2.3

Proof. Since l(u) is smooth around u = 0, for any given ε there exists δ such that

l'(u) < l'(0) + ε/2,   for u ∈ (−δ, δ).   (B.2)

Define U = max{−u_1, u_2}. We construct a set of data points x = {x_1, ..., x_{n+1}} with labels y_i = 1, where x_1 = ... = x_n = 1 and

x_{n+1} = −U/δ,   n = −( (l'(0) + ε/2) / l'(0) ) x_{n+1}.

The gradient of the empirical risk is

∇R_emp(θ) = (d/dθ) [ Σ_{i=1}^n l(θ x_i) + l(θ x_{n+1}) ]
          = n l'(θ) + x_{n+1} l'(θ x_{n+1})
          = n ( l'(θ) − ( l'(0) / (l'(0) + ε/2) ) l'(−Uθ/δ) ).

Now let us investigate the gradient at the following three points: 0, δu_1/U, and δu_2/U.

When θ = 0:

∇R_emp(0) = n ( l'(0) − ( l'(0) / (l'(0) + ε/2) ) l'(0) ) = n ε l'(0) / (2 l'(0) + ε) > 0.   (B.3)

When θ = δu_1/U:

∇R_emp(δu_1/U) = n ( l'(δu_1/U) − ( l'(0) / (l'(0) + ε/2) ) l'(u_1) )
               < n ( l'(0) + ε/2 − ( l'(0) / (l'(0) + ε/2) ) (l'(0) + ε) ) = n ε² / (4 (l'(0) + ε)) < 0.   (B.4)

When θ = δu_2/U:

∇R_emp(δu_2/U) = n ( l'(δu_2/U) − ( l'(0) / (l'(0) + ε/2) ) l'(u_2) )
               < n ( l'(0) + ε/2 − ( l'(0) / (l'(0) + ε/2) ) (l'(0) + ε) ) = n ε² / (4 (l'(0) + ε)) < 0.   (B.5)

Here the inequalities in (B.4) and (B.5) are due to (B.2) and the definition of U. Therefore, there are at least two values of θ in (δu_1/U, δu_2/U) with ∇R_emp(θ) = 0. Since ∇R_emp(δu_1/U) < 0 and ∇R_emp(0) > 0, one local minimum lies in (δu_1/U, 0). On the other hand, since ∇R_emp(δu_2/U) < 0 and the function is lower bounded, the other local minimum lies in (δu_2/U, +∞).
B.6 Proof of Theorem 3.4.1

Proof. Since the objective function is bounded below by 0, we can prove convergence by showing that the algorithm decreases monotonically. In the k-th ζ-step, the current variables are θ^(k−1) and ζ^(k−1); we fix θ^(k−1), denote l̃ = l(θ^(k−1)), and minimize over ζ. It turns out that

ζ_i^(k) = (1 / l̃_i) ( Π_{j=1}^m l̃_j )^{1/m}.

Therefore,

MP(θ^(k−1), ζ^(k)) = min_ζ MP(θ^(k−1), ζ) = m P(θ^(k−1))^{1/m} ≤ MP(θ^(k−1), ζ^(k−1)).

The θ-step fixes ζ^(k) and minimizes over θ. The result is

MP(θ^(k), ζ^(k)) = min_θ MP(θ, ζ^(k)) ≤ MP(θ^(k−1), ζ^(k)) = m P(θ^(k−1))^{1/m}.

The above two inequalities hold with equality if and only if ζ^(k) = ζ^(k−1) and θ^(k) = θ^(k−1), from which the convergence of the algorithm at the k-th iteration follows. Therefore, before convergence we have

MP(θ^(k), ζ^(k)) < m P(θ^(k−1))^{1/m} < MP(θ^(k−1), ζ^(k−1)).

But since P(θ) > 0, the algorithm must converge at some point.

Next, we show that the converged point θ̃ is a stationary point of P(θ). Assume that θ̃ and ζ̃ is the point of convergence. Then the gradient at the θ-step satisfies

0 = Σ_{i=1}^m ζ̃_i (d l_i(θ)/dθ) |_{θ=θ̃} = Σ_{i=1}^m [ ( Π_{j=1}^m l_j(θ̃) )^{1/m} / l_i(θ̃) ] (d l_i(θ)/dθ) |_{θ=θ̃}.

Since ( Π_{j=1}^m l_j(θ̃) )^{1/m} is positive, this implies that

0 = Σ_{i=1}^m [ Π_{j=1}^m l_j(θ̃) / l_i(θ̃) ] (d l_i(θ)/dθ) |_{θ=θ̃} = ( d ( Π_{i=1}^m l_i(θ) ) / dθ ) |_{θ=θ̃} = ( d P(θ)/dθ ) |_{θ=θ̃}.

Therefore, θ̃ is a stationary point of P(θ).
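The alternation above can be demonstrated on a toy instance of convex multiplicative programming. In the sketch below the losses l_i(θ) = (θ − c_i)² + 1 are my own illustrative choice (they are strictly positive and convex, and make the θ-step a weighted least-squares problem with a closed form); the ζ-step uses the closed form from the proof.

```python
import math

c = [0.0, 1.0, 3.0]   # hypothetical data defining l_i(theta) = (theta - c_i)^2 + 1
theta = 10.0
for _ in range(500):
    l = [(theta - ci) ** 2 + 1.0 for ci in c]
    gm = math.prod(l) ** (1.0 / len(l))
    zeta = [gm / li for li in l]                                # zeta-step
    theta = sum(z * ci for z, ci in zip(zeta, c)) / sum(zeta)   # theta-step

# at convergence, theta should be a stationary point of P(theta) = prod_i l_i
grad_P = 0.0
for i, ci in enumerate(c):
    other = math.prod((theta - cj) ** 2 + 1.0
                      for j, cj in enumerate(c) if j != i)
    grad_P += 2.0 * (theta - ci) * other
```

After the loop, grad_P is numerically zero, matching the stationarity argument in the proof.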
B.7 Proof of Theorem 4.2.1

Proof. In view of (2.17) and (4.17), µ = E_{q(z;θ(µ))}[Φ(z)] = ∇_θ G_t(θ). We only need to consider the case when θ(µ) exists, since otherwise G*_t(µ) is trivially defined as +∞. When θ(µ) exists, clearly θ(µ) ∈ (∇G_t)^{−1}(µ). Therefore, we have

sup_θ { <µ, θ> − G_t(θ) } = sup_θ { < E_{q(z;θ(µ))}[Φ(z)], θ > − G_t(θ) }
  = < E_{q(z;θ(µ))}[Φ(z)], θ(µ) > − G_t(θ(µ))   (B.6)
  = ∫ q(z; θ(µ)) ( <Φ(z), θ(µ)> − G_t(θ(µ)) ) dz
  = ∫ q(z; θ(µ)) log_t p(z; θ(µ)) dz   (B.7)
  = − H_t( p(z; θ(µ)) ).

Equation (B.6) follows from the duality between θ(µ) and µ, while (B.7) holds because log_t p(z; θ(µ)) = <Φ(z), θ(µ)> − G_t(θ(µ)).

B.8
Verification in Section 3.1

In this section, we verify that the iterative algorithm for computing G_t converges. We only need to verify that ã^(k) converges to the ã corresponding to â. First of all, given â, since t > 1 and Z(ã) > 1, it is clear that 0 < ã < â. On the domain 0 < ã' < â, it is easy to verify that Z(ã')^{1−t} â − ã' is a monotonically decreasing function of ã', and that it crosses 0 only at ã. Therefore, when ã^(k) > ã we have ã^(k+1) < ã^(k), and when ã^(k) < ã we have ã^(k+1) > ã^(k).

We then prove that ã^(k) is a monotonically decreasing sequence, by mathematical induction. Since ã^(0) = â, we have ã^(1) < â = ã^(0). Next assume that in the k-th iteration ã^(k) < ã^(k−1). Since Z(ã^(k)) > Z(ã^(k−1)), we have ã^(k+1) < ã^(k). Therefore, it follows that ã^(k) is monotonically decreasing and lower bounded by ã, so lim_{k→+∞} ã^(k) exists. Finally,

lim_{k→+∞} ã^(k) = lim_{k→+∞} ã^(k+1) = lim_{k→+∞} Z(ã^(k))^{1−t} â = Z( lim_{k→+∞} ã^(k) )^{1−t} â,   (B.8)
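The fixed-point iteration just verified is easy to simulate. In the sketch below, Z is a stand-in chosen only so that the assumptions used in the verification hold (Z > 1 everywhere and Z decreasing); it is not the actual Z of Section 3.1, and the names are illustrative.

```python
import math

def fixed_point(a_hat, Z, t, iters=200):
    # Iterate a~(k+1) = Z(a~(k))^(1 - t) * a_hat, starting from a~(0) = a_hat.
    a = a_hat
    for _ in range(iters):
        a = Z(a) ** (1.0 - t) * a_hat
    return a

Z = lambda a: 1.0 + math.exp(-a)   # hypothetical Z: decreasing, always > 1
t = 1.5
a_hat = 2.0
a_tilde = fixed_point(a_hat, Z, t)
```

At convergence, ã satisfies ã = Z(ã)^{1−t} â and lies strictly between 0 and â, as the verification predicts.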
B.9 Verification in Section 3.2.2

In this section, we verify the robust types of the losses in Table 3.1.

Logistic Regression

I_l(u) = l'(u) u = − 2u / ( 1 + exp(2u) ).

As u → −∞, I_l(u) goes to infinity. Therefore, logistic regression belongs to Robust Loss 0. Furthermore, one can easily verify that all convex losses are Robust Loss 0, because lim_{u→−∞} |l'(u)| ≥ |l'(0)| > 0.

T-Logistic Regression

Let us define p(u) := exp_t( u − G_t(u) ) and let q(u) be its escort distribution. Then

I_l(u) = l'(u) u = −2 q(−u) u · p(u)^{t−1}.

As u → −∞, q(−u) → 1 and p(u) → 0. We have

lim_{u→−∞} I_l(u) = lim_{u→−∞} −2u · p(u)^{t−1}
  = lim_{u→−∞} −2u / ( 1 + (t−1)( G_t(u) − u ) )
  = lim_{u→−∞} −2 / ( (t−1)( q(u) − q(−u) − 1 ) )   (B.9)
  = lim_{u→−∞} −1 / ( (t−1)( −q(−u) − 1 ) )
  = 1 / ( 2(t−1) ),   (B.10)

where (B.9) follows by applying L'Hôpital's rule. As u → +∞, q(−u) → 0 and p(u) → 1. We have

lim_{u→+∞} I_l(u) = lim_{u→+∞} −2u · p(−u)^t / ( p(−u)^t + p(u)^t )
  = lim_{u→+∞} ( −2u · p(−u)^{t−1} ) · p(−u).

Similar to (B.10), we have lim_{u→+∞} −2u · p(−u)^{t−1} = 1/(2(t−1)). Furthermore, since p(−u) → 0, we conclude that lim_{u→+∞} I_l(u) = 0. Therefore, t-logistic regression belongs to Robust Loss I.

Savage Loss

The non-convex Savage loss is widely used in the neural network community:

l(u) = ( 1 − σ(u) )² = σ(−u)²,   where σ(u) = 1 / ( 1 + exp(−u) ) and σ(u) + σ(−u) = 1.

I_l(u) = l'(u) u = −2u · σ(−u)² · σ(u).

Since lim_{u→+∞} σ(−u) = 0 and lim_{u→−∞} σ(u) = 0, we have lim_{u→±∞} |I_l(u)| = 0. Therefore, Savage loss belongs to Robust Loss II.
Verification in Section 3.3
In this section, we verify the Bayes-risk consistency property of the multiclass t-logistic regression. The Bayes-risk consistency of a multiclass classification loss was first discussed in [28]. Define, b a(x) = (b a1 , . . . , b aC ) where, b ac (x) : X → R the margin of x in class c. η = (η1 , . . . , ηC ) where, ηc = p(y = c| x) the true conditional probability of class c l(b a) = (l1 , . . . , lC ) where, lc = l(b a, c) The conditional risk of the multiclass loss l can be written as, Cl (η, b a) = Ec| x [l(b a, c)] =
C X
ηc lc
c=1
Definition B.7 A Bayes-risk consistent loss function for multiclass classification is the class of loss function l, for which given any η, b a∗ the minimizer of Cl (η, b a) satisfies argmin l(b a∗ ) ⊆ argmax η c
(B.11)
c
For t-logistic loss, we have lc = − log expt (b ac − Gt (b a)) And Cl (η, b a) =
C X
ηc lc
c=1
=−
C X c=1
ηc log expt (b ac − Gt (b a))
Minimizing over b a results in the b a∗ which satisfies ηc = expt (b a∗c − Gt (b a∗ )). Because that log is a monotonically increasing function, argmin l(b a∗ ) = argmax η c
c
Therefore, the multiclass t-logistic loss is also Bayes-risk consistent.
(B.12)
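The argument above can be illustrated numerically: pick any $\eta$, build a margin vector of the stated form $\hat{a}^*_c = \log_t \eta_c + G$ (the constant $G$ stands in for $G_t(\hat{a}^*)$; its exact value does not affect the argmin/argmax comparison), and check (B.11)/(B.12). A sketch, with `log_t`/`exp_t` implemented from their standard definitions:

```python
import numpy as np

def log_t(x, t):
    # Tsallis log: log_t(x) = (x^(1-t) - 1) / (1-t), for t != 1
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    # Tsallis exp, the inverse of log_t on its range
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

t = 1.5
rng = np.random.default_rng(0)
eta = rng.dirichlet(np.ones(5))           # arbitrary true class probabilities
G = 0.3                                   # placeholder for G_t(a*); any constant works here
a_star = log_t(eta, t) + G                # minimizer satisfies eta_c = exp_t(a*_c - G)
loss = -np.log(exp_t(a_star - G, t))      # t-logistic loss l_c = -log exp_t(a*_c - G) = -log eta_c
print(np.argmin(loss), np.argmax(eta))    # the two indices coincide
```

Since the per-class loss collapses to $-\log \eta_c$, the minimizing class is exactly the most probable one.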
B.11 Verification in Definition 4.2.2

In this section, we verify that Equation (4.24) is the Bregman divergence between $q$ and $\tilde{q}$ based on the t-entropy.

Let us define $q_r(z) = q(z) + r(\tilde{q}(z) - q(z))$, where $r \in [0, 1]$. Clearly, $q_0(z) = q(z)$ and $q_1(z) = \tilde{q}(z)$. Define $p_r(z) = q_r(z)^{1/t} / \int q_r(z)^{1/t}\, \mathrm{d}z$. Assuming that the regularity condition holds, let us take the derivative of $H_t(q_r(z))$ with respect to $r$:
\begin{align*}
\frac{\mathrm{d}}{\mathrm{d}r} H_t(q_r(z)) &= -\frac{\mathrm{d}}{\mathrm{d}r} \int q_r(z) \log_t p_r(z)\, \mathrm{d}z \\
&= -\int \frac{\mathrm{d}}{\mathrm{d}r} \bigl( q_r(z) \log_t p_r(z) \bigr)\, \mathrm{d}z \\
&= -\int (\tilde{q}(z) - q(z)) \log_t p_r(z)\, \mathrm{d}z - \int q_r(z) \frac{\mathrm{d}}{\mathrm{d}r} \bigl( \log_t p_r(z) \bigr)\, \mathrm{d}z \\
&= -\int (\tilde{q}(z) - q(z)) \log_t p_r(z)\, \mathrm{d}z - \int \frac{q_r(z)}{p_r(z)^t} \frac{\mathrm{d}p_r(z)}{\mathrm{d}r}\, \mathrm{d}z \\
&= -\int (\tilde{q}(z) - q(z)) \log_t p_r(z)\, \mathrm{d}z - \Bigl( \int q_r(z)^{1/t}\, \mathrm{d}z \Bigr)^t \frac{\mathrm{d}}{\mathrm{d}r} \underbrace{\int p_r(z)\, \mathrm{d}z}_{=1} \\
&= -\int (\tilde{q}(z) - q(z)) \log_t p_r(z)\, \mathrm{d}z.
\end{align*}
The Bregman divergence between $q(z)$ and $\tilde{q}(z)$ based on $-H_t(q)$ is equal to
\begin{align*}
D_t(q \,\|\, \tilde{q}) &= -H_t(q_0(z)) + H_t(q_1(z)) - \frac{\mathrm{d}}{\mathrm{d}r} H_t(q_r(z)) \Big|_{r=1} \\
&= \int \bigl[ q(z) \log_t p(z) - \tilde{q}(z) \log_t \tilde{p}(z) - (q(z) - \tilde{q}(z)) \log_t \tilde{p}(z) \bigr]\, \mathrm{d}z \\
&= \int \bigl[ q(z) \log_t p(z) - q(z) \log_t \tilde{p}(z) \bigr]\, \mathrm{d}z.
\end{align*}

B.12 Verification of Equation (4.33)
Assume there are $N$ independent variables $x = (x_1, \ldots, x_N)$ with $p(x) = \prod_{i=1}^{N} p_i(x_i)$. It is obvious that its escort is
$$q(x) = \frac{p^t(x)}{Z} = \prod_{i=1}^{N} \frac{p_i^t(x_i)}{Z_i} = \prod_{i=1}^{N} q_i(x_i), \tag{B.13}$$
which indicates $Z = \prod_{i=1}^{N} Z_i$. Now, combining with (2.7), the t-entropy of $p(x)$ is
\begin{align*}
H_t(p(x)) &= -\int q(x) \log_t p(x)\, \mathrm{d}x \tag{B.14} \\
&= -\frac{1}{Z} \int p^t(x)\, \frac{p^{1-t}(x) - 1}{1-t}\, \mathrm{d}x = -\frac{1}{(1-t)Z} \Bigl( 1 - \int p^t(x)\, \mathrm{d}x \Bigr). \tag{B.15}
\end{align*}
Using the fact that
$$H_t(p_i(x_i)) = -\frac{1}{(1-t)Z_i} \Bigl( 1 - \int p_i^t(x_i)\, \mathrm{d}x_i \Bigr), \tag{B.16}$$
we have
$$\int p_i^t(x_i)\, \mathrm{d}x_i = (1-t) Z_i \Bigl( H_t(p_i(x_i)) + \frac{1}{(1-t)Z_i} \Bigr). \tag{B.17}$$
Besides, since $p^t(x) = \prod_{i=1}^{N} p_i^t(x_i)$, we further have
$$\int p^t(x)\, \mathrm{d}x = \prod_{i=1}^{N} \int p_i^t(x_i)\, \mathrm{d}x_i = \prod_{i=1}^{N} (1-t) Z_i \Bigl( H_t(p_i(x_i)) + \frac{1}{(1-t)Z_i} \Bigr). \tag{B.18}$$
Now, combining with (B.15) gives
$$H_t(p(x)) = (1-t)^{N-1} \prod_{i=1}^{N} \Bigl( H_t(p_i(x_i)) + \frac{1}{(1-t)Z_i} \Bigr) - \frac{1}{1-t} \prod_{i=1}^{N} \frac{1}{Z_i}. \tag{B.19}$$
B.13 Verification in Section 4.3.1

In this section, we provide the intermediate derivations used to obtain the mean-field updates for approximating the multivariate Student's t-distribution. The approximate distribution is
$$\tilde{p}(z; \tilde{\theta}) = \prod_{j=1}^{k} \tilde{p}(z^j; \tilde{\theta}^j) = \prod_{j=1}^{k} \operatorname{St}(z^j; \tilde{\mu}^j, \tilde{\sigma}^j, \tilde{v}).$$
Using the representation of the t-exponential family of distributions,
$$\tilde{p}(z^j; \tilde{\theta}^j) = \exp_t \Bigl( \bigl\langle \Phi^j(z^j), \tilde{\theta}^j \bigr\rangle - G_t(\tilde{\theta}^j) \Bigr)$$
with $\Phi^j(z^j) = [z^j; (z^j)^2]$ and $\tilde{\theta}^j = [-2\tilde{\Psi}^j \tilde{K}^j \tilde{\mu}^j / (1-t);\; \tilde{\Psi}^j \tilde{K}^j / (1-t)]$, where
\begin{align*}
\tilde{K}^j &= \tilde{v}^{-1} (\tilde{\sigma}^j)^{-2}, \\
\tilde{\Psi}^j &= \biggl( \frac{\Gamma((\tilde{v}+1)/2)}{\Gamma(\tilde{v}/2)\, (\pi \tilde{v})^{1/2}\, \tilde{\sigma}^j} \biggr)^{-2/(\tilde{v}+1)}, \\
G_t(\tilde{\theta}^j) &= -\frac{1}{1-t} \Bigl( \tilde{\Psi}^j \tilde{K}^j (\tilde{\mu}^j)^2 + \tilde{\Psi}^j - 1 \Bigr).
\end{align*}
Now we can write
\begin{align*}
\bigl\langle \Phi^n(z^n), \tilde{\theta}^n \bigr\rangle &= \frac{1}{1-t} \tilde{\Psi}^n \Bigl( -2\tilde{K}^n \tilde{\mu}^n z^n + \tilde{K}^n (z^n)^2 \Bigr), \\
\bigl\langle \mathbb{E}_{\tilde{q}_{j \neq n}}[\Phi(z)], \theta \bigr\rangle &= \frac{1}{1-t} \Psi \Bigl( -2\mu^\top K\, \mathbb{E}_{\tilde{q}_{j \neq n}}[z] + \operatorname{tr} \bigl( K\, \mathbb{E}_{\tilde{q}_{j \neq n}}[z z^\top] \bigr) \Bigr) \\
&= \frac{1}{1-t} \Psi \Bigl( -2\mu^\top k^n z^n + 2(\tilde{\mu}^{j \neq n})^\top k^{j \neq n, n} z^n + k^{nn} (z^n)^2 \Bigr) + \text{const.},
\end{align*}
where $\tilde{\mu}^{j \neq n}$ denotes the vector $\tilde{\mu}^{j = 1 \ldots k,\, j \neq n}$, $k^n$ denotes the $n$-th column of $K$, and $k^{j \neq n, n}$ denotes the $n$-th column of $K$ after its $n$-th element is deleted. Using $\tilde{\mu}^j = \mathbb{E}_{\tilde{q}_j}[z^j]$ and $(\tilde{\sigma}^j)^2 = \mathbb{E}_{\tilde{q}_j}[(z^j)^2] - \mathbb{E}_{\tilde{q}_j}[z^j]^2$,
\begin{align*}
\exp_t \Bigl( \bigl\langle \tilde{\theta}^j, \mathbb{E}_{\tilde{q}_j}[\Phi^j(z^j)] \bigr\rangle - G_t^j(\tilde{\theta}^j) \Bigr)^{t-1}
&= \exp_t \biggl( \frac{\tilde{\Psi}^j \tilde{K}^j}{1-t} \Bigl( -2\tilde{\mu}^j\, \mathbb{E}_{\tilde{q}_j}[z^j] + \mathbb{E}_{\tilde{q}_j}[(z^j)^2] \Bigr) - G_t^j(\tilde{\theta}^j) \biggr)^{t-1} \\
&= \exp_t \biggl( \frac{\tilde{\Psi}^j \tilde{K}^j}{1-t} \Bigl( (\tilde{\sigma}^j)^2 - (\tilde{\mu}^j)^2 \Bigr) - G_t^j(\tilde{\theta}^j) \biggr)^{t-1} \\
&= \exp_t \biggl( \frac{1}{1-t} \Bigl( \frac{\tilde{\Psi}^j}{\tilde{v}} + \tilde{\Psi}^j - 1 \Bigr) \biggr)^{t-1} \\
&= \biggl( \frac{\tilde{\Psi}^j}{\tilde{v}} + \tilde{\Psi}^j \biggr)^{-1}.
\end{align*}
Plugging into (4.32), the iterative updates for the Student's t-distribution are given by
\begin{align*}
\tilde{\mu}^n &= -\frac{1}{2 k^{nn}} \Bigl( -2\mu^\top k^n + 2(\tilde{\mu}^{j \neq n})^\top k^{j \neq n, n} \Bigr), \\
(\tilde{\sigma}^n)^2 &= \bigl( \tilde{K}^n \tilde{\Psi}^n \bigr)^{-(\tilde{v}+1)/\tilde{v}} \cdot \frac{\Gamma(\tilde{v}/2)^{2/\tilde{v}}\, \pi^{1/\tilde{v}}}{\Gamma((\tilde{v}+1)/2)^{2/\tilde{v}}\, \tilde{v}}, \\
\text{where } \tilde{K}^n \tilde{\Psi}^n &= \Psi k^{nn} \prod_{j \neq n} \biggl( \frac{\tilde{\Psi}^j}{\tilde{v}} + \tilde{\Psi}^j \biggr)^{-1}.
\end{align*}
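The parameterization above can be verified numerically: with $t = (\tilde{v}+3)/(\tilde{v}+1)$, the $\exp_t$ form should reproduce the Student's t density exactly. A sketch for the one-dimensional case, using SciPy's `scipy.stats.t` as the reference density:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import t as student_t

# 1-d Student's t written as a t-exponential family member (notation of this section):
# p(z) = exp_t(<Phi(z), theta> - G_t(theta)), with t = (v+3)/(v+1).
v, mu, sigma = 5.0, 0.7, 1.4
t = (v + 3.0) / (v + 1.0)                 # so 1/(1-t) = -(v+1)/2
K = 1.0 / (v * sigma ** 2)
C = gamma((v + 1) / 2) / (gamma(v / 2) * np.sqrt(np.pi * v) * sigma)
Psi = C ** (-2.0 / (v + 1))

def exp_t(x, t):
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

z = np.linspace(-6, 6, 101)
inner = (Psi * K * (-2 * mu * z + z ** 2)) / (1 - t)   # <Phi(z), theta>
G = -(Psi * K * mu ** 2 + Psi - 1) / (1 - t)           # G_t(theta)
pdf_expt = exp_t(inner - G, t)
pdf_ref = student_t.pdf(z, df=v, loc=mu, scale=sigma)
print(float(np.max(np.abs(pdf_expt - pdf_ref))))       # agreement up to rounding
```

The check works because $\langle\Phi(z),\theta\rangle - G_t(\theta) = \frac{1}{1-t}\bigl(\Psi K (z-\mu)^2 + \Psi - 1\bigr)$, and raising $\Psi(1 + K(z-\mu)^2)$ to the power $-(\tilde{v}+1)/2$ recovers the normalized Student's t density.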
B.14 Verification in Section 4.3.2

In this section, we verify the updates of the Bayesian online learning algorithm based on the Student's t-distribution in Section 4.3.2. Assumed density filtering matches the moments
\begin{align*}
\int q_i(w)\, w\, \mathrm{d}w &= \int \tilde{q}_i(w)\, w\, \mathrm{d}w, \tag{B.20} \\
\int q_i(w)\, w w^\top\, \mathrm{d}w &= \int \tilde{q}_i(w)\, w w^\top\, \mathrm{d}w. \tag{B.21}
\end{align*}
In order to compute the moments, we first make use of
\begin{align*}
\tilde{p}_{i-1}(w) &= \operatorname{St}(w; \tilde{\mu}_{(i-1)}, \tilde{\Sigma}_{(i-1)}, v), \\
\tilde{q}_{i-1}(w) &= \operatorname{St}(w; \tilde{\mu}_{(i-1)}, v \tilde{\Sigma}_{(i-1)} / (v+2), v+2),
\end{align*}
and get the following relations:
\begin{align*}
Z_{1(i)} &= \int \tilde{p}_{i-1}(w)\, \tilde{t}_i(w)\, \mathrm{d}w \tag{B.22} \\
&= \epsilon + (1 - 2\epsilon) \int_{-\infty}^{z_{(i)}} \operatorname{St}(z; 0, 1, v)\, \mathrm{d}z, \tag{B.23} \\
Z_{2(i)} &= \int \tilde{q}_{i-1}(w)\, \tilde{t}_i(w)\, \mathrm{d}w \tag{B.24} \\
&= \epsilon + (1 - 2\epsilon) \int_{-\infty}^{z_{(i)}} \operatorname{St}(z; 0, v/(v+2), v+2)\, \mathrm{d}z, \tag{B.25} \\
f_{(i)} &= \frac{1}{Z_{2(i)}} \nabla_\mu Z_{1(i)} = y_i \alpha_{(i)} x_i, \tag{B.26} \\
F_{(i)} &= \frac{1}{Z_{2(i)}} \nabla_\Sigma Z_{1(i)} = -\frac{1}{2} \frac{y_i \alpha_{(i)} \bigl\langle x_i, \tilde{\mu}_{(i-1)} \bigr\rangle}{x_i^\top \tilde{\Sigma}_{(i-1)} x_i}\, x_i x_i^\top, \tag{B.27}
\end{align*}
where
$$\alpha_{(i)} = \frac{(1 - 2\epsilon)\, \operatorname{St}(z_{(i)}; 0, 1, v)}{Z_{2(i)} \sqrt{x_i^\top \tilde{\Sigma}_{(i-1)} x_i}} \quad \text{and} \quad z_{(i)} = \frac{y_i \bigl\langle x_i, \tilde{\mu}_{(i-1)} \bigr\rangle}{\sqrt{x_i^\top \tilde{\Sigma}_{(i-1)} x_i}}.$$
Equations (B.23) and (B.25) are analogous to Eq. (5.17) in [37]. By assuming that a regularity condition¹ holds, $\int$ and $\nabla$ can be interchanged in the $\nabla Z_{1(i)}$ of (B.26) and (B.27).

¹This is a fairly standard technical requirement which is often proved using the Dominated Convergence Theorem (see e.g. Section 9.2 of [20]).

Next, by combining with (B.22) and (B.24), we obtain the expectations of $q_i(w)$ from $Z_{1(i)}$ and $Z_{2(i)}$ (similar to Eq. (5.12) and (5.13) in [37]):
\begin{align*}
\mathbb{E}_q[w] &= \frac{1}{Z_{2(i)}} \int \tilde{q}_{i-1}(w)\, \tilde{t}_i(w)\, w\, \mathrm{d}w = \tilde{\mu}_{(i-1)} + \tilde{\Sigma}_{(i-1)} f_{(i)}, \tag{B.28} \\
\mathbb{E}_q[w w^\top] - \mathbb{E}_q[w]\, \mathbb{E}_q[w]^\top &= \frac{1}{Z_{2(i)}} \int \tilde{q}_{i-1}(w)\, \tilde{t}_i(w)\, w w^\top\, \mathrm{d}w - \mathbb{E}_q[w]\, \mathbb{E}_q[w]^\top \\
&= r_{(i)} \tilde{\Sigma}_{(i-1)} - \tilde{\Sigma}_{(i-1)} \Bigl( f_{(i)} f_{(i)}^\top - 2 F_{(i)} \Bigr) \tilde{\Sigma}_{(i-1)}, \tag{B.29}
\end{align*}
where $r_{(i)} = Z_{1(i)} / Z_{2(i)}$ and $\mathbb{E}_q[\cdot]$ denotes the expectation with respect to $q_i(w)$.

Since the mean and variance of $\tilde{q}_i(w)$ are $\tilde{\mu}_{(i)}$ and $\tilde{\Sigma}_{(i)}$, after combining with (B.26) and (B.27) we obtain
\begin{align*}
\tilde{\mu}_{(i)} &= \mathbb{E}_q[w] = \tilde{\mu}_{(i-1)} + \alpha_{(i)} y_i \tilde{\Sigma}_{(i-1)} x_i, \tag{B.30} \\
\tilde{\Sigma}_{(i)} &= \mathbb{E}_q[w w^\top] - \mathbb{E}_q[w]\, \mathbb{E}_q[w]^\top \tag{B.31} \\
&= r_{(i)} \tilde{\Sigma}_{(i-1)} - \frac{\alpha_{(i)} y_i \bigl\langle x_i, \tilde{\mu}_{(i)} \bigr\rangle}{x_i^\top \tilde{\Sigma}_{(i-1)} x_i} \bigl( \tilde{\Sigma}_{(i-1)} x_i \bigr) \bigl( \tilde{\Sigma}_{(i-1)} x_i \bigr)^\top. \tag{B.32}
\end{align*}
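Updates (B.30)–(B.32) can be collected into a single online step. The sketch below is an illustration only: the site parameter `eps` and the convention that the second argument of $\operatorname{St}$ is a squared scale follow the notation above and are assumptions where the text leaves them implicit:

```python
import numpy as np
from scipy.stats import t as student_t

def adf_update(mu, Sigma, x, y, v, eps):
    """One assumed-density-filtering step implementing (B.30)-(B.32)."""
    s = float(x @ Sigma @ x)                            # x_i^T Sigma x_i
    z = y * float(x @ mu) / np.sqrt(s)                  # z_(i)
    Z1 = eps + (1 - 2 * eps) * student_t.cdf(z, df=v)                   # (B.23)
    Z2 = eps + (1 - 2 * eps) * student_t.cdf(z, df=v + 2,
                                             scale=np.sqrt(v / (v + 2)))  # (B.25)
    alpha = (1 - 2 * eps) * student_t.pdf(z, df=v) / (Z2 * np.sqrt(s))
    mu_new = mu + alpha * y * (Sigma @ x)               # (B.30)
    Sx = Sigma @ x
    coef = alpha * y * float(x @ mu_new) / s
    Sigma_new = (Z1 / Z2) * Sigma - coef * np.outer(Sx, Sx)  # (B.32)
    return mu_new, Sigma_new

mu = np.zeros(3)
Sigma = np.eye(3)
x = np.array([1.0, -0.5, 2.0])
mu1, Sigma1 = adf_update(mu, Sigma, x, y=1.0, v=5.0, eps=0.05)
print(mu1)
```

Each observation thus costs only a rank-one correction of the covariance, which is what makes the online scheme practical in high dimensions.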
B.15 Verification in Section 6.2

In this section, we verify the three types of mismatch losses via $\lim_{u\to-\infty} I(u)$. Let us denote $p(u) = \exp_{t_2}\bigl(\frac{u}{2} - G_{t_2}(u)\bigr)$ and $q(u) \propto \exp_{t_2}\bigl(\frac{u}{2} - G_{t_2}(u)\bigr)^{t_2}$, where $p(u) + p(-u) = 1$ and $q(u) + q(-u) = 1$. Using (2.17),
$$\frac{\partial G_{t_2}(u)}{\partial u} = \frac{q(u) - q(-u)}{2} = q(u) - \frac{1}{2},$$
therefore,
$$\frac{\partial p(u)}{\partial u} = \exp_{t_2}\Bigl( \frac{u}{2} - G_{t_2}(u) \Bigr)^{t_2} \frac{\partial}{\partial u} \Bigl( \frac{u}{2} - G_{t_2}(u) \Bigr) = p(u)^{t_2} (1 - q(u)).$$
The first derivative of $l(u)$ is equal to
$$l'(u) = -\frac{\partial \log_{t_1} p(u)}{\partial u} = -p(u)^{-t_1} \frac{\partial p(u)}{\partial u} = -p(u)^{t_2 - t_1} (1 - q(u)).$$
As $u \to -\infty$, $p(u) \to 0$ and $q(u) \to 0$, so that
$$\lim_{u\to-\infty} I(u) = \lim_{u\to-\infty} l'(u)\, u = \lim_{u\to-\infty} -\frac{u (1 - q(u))}{p(u)^{t_1 - t_2}} = \lim_{u\to-\infty} -\frac{u}{p(u)^{t_1 - t_2}}. \tag{B.33}$$
When $t_1 < t_2$, both the numerator and the denominator of (B.33) go to infinity. Therefore, we apply L'Hôpital's rule to (B.33):
\begin{align*}
\lim_{u\to-\infty} I(u) &= \lim_{u\to-\infty} -\frac{u}{p(u)^{t_1 - t_2}} \\
&= \lim_{u\to-\infty} -\frac{1}{(t_1 - t_2)\, p(u)^{t_1 - t_2 - 1}\, p(u)^{t_2} (1 - q(u))} \\
&= \lim_{u\to-\infty} \frac{1}{(t_2 - t_1)\, p(u)^{t_1 - 1} (1 - q(u))} \tag{B.34} \\
&= \lim_{u\to-\infty} \frac{1}{(t_2 - t_1)\, p(u)^{t_1 - 1}}, \tag{B.35}
\end{align*}
where (B.34) is due to L'Hôpital's rule. As $u \to +\infty$, $p(u) \to 1$ and $q(u) \to 1$. Similar to the derivations in (B.35), we have
$$\lim_{u\to+\infty} I(u) = -\lim_{u\to+\infty} q(-u)\, u = -\lim_{u\to+\infty} p(-u)^{t_2}\, u = 0.$$
Based on (B.35), we classify the mismatch losses with $t_1 < t_2$ into three robust types:
• $t_1 > 1$: Robust Loss 0;
• $t_1 = 1$: Robust Loss I;
• $t_1 < 1$: Robust Loss II.

Appendix C: Additional Figures of Section 3.5

In this appendix, we provide the additional figures from the empirical evaluation of t-logistic regression.
Fig. C.1. Experiment on adult9 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.2. Experiment on alpha Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.3. Experiment on astro-ph Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.4. Experiment on aut-avn Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.5. Experiment on beta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.6. Experiment on covertype Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.7. Experiment on delta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.8. Experiment on epsilon Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.9. Experiment on gamma Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.10. Experiment on kdd99 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.11. Experiment on kdda Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.12. Experiment on kddb Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.13. Experiment on longservedio Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.14. Experiment on measewyner Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.15. Experiment on mushrooms Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.16. Experiment on news20 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.17. Experiment on real-sim Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.18. Experiment on reuters-c11 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.19. Experiment on reuters-ccat Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.20. Experiment on web8 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.21. Experiment on webspamtrigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.22. Experiment on webspamunigram Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.23. Experiment on worm Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.24. Experiment on zeta Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.25. Generalization Performance on dna Dataset.
Fig. C.26. Generalization Performance on ocr Dataset.
Fig. C.27. Experiment on dna Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.28. Experiment on letter Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.29. Experiment on mnist Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.30. Experiment on protein Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.31. Experiment on rcv1 Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.32. Experiment on sensitacoustic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.33. Experiment on sensitcombined Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.34. Experiment on sensitseismic Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Fig. C.35. Experiment on usps Dataset. Top: Generalization Performance; Middle: Random Initialization; Bottom: Forgetting Variables.
Appendix D: Additional Figures of Section 6.4

In this appendix, we provide the additional figures from the empirical evaluation of the generalized t-logistic regression.

Fig. D.1. Experiment on adult9 Dataset. Top: Generalization Performance; Bottom: Random Initialization.

Fig. D.2. Experiment on alpha Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.3. Experiment on astro-ph Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.4. Experiment on aut-avn Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.5. Experiment on beta Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.6. Experiment on covertype Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.7. Experiment on delta Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.8. Experiment on gamma Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.9. Experiment on kdd99 Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.10. Experiment on longservedio Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.11. Experiment on measewyner Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.12. Experiment on mushrooms Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.13. Experiment on news20 Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.14. Experiment on real-sim Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.15. Experiment on reuters-c11 Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.16. Experiment on reuters-ccat Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.17. Experiment on web8 Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.18. Experiment on webspamunigram Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.19. Experiment on worm Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.20. Experiment on zeta Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.21. Experiment on dna Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.22. Experiment on letter Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.23. Experiment on mnist Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.24. Experiment on protein Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.25. Experiment on sensitacoustic Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.26. Experiment on sensitcombined Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.27. Experiment on sensitseismic Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Fig. D.28. Experiment on usps Dataset. Top: Generalization Performance; Bottom: Random Initialization.
Appendix E: Additional Tables of Section 3.5

In this appendix, we provide the additional tables from the empirical evaluation of t-logistic regression.
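As background for the timings below, the t = 1.5 column measures optimization with a loss built on the t-exponential family, whose basic ingredients are the t-exponential and t-logarithm. The following minimal sketch shows these two standard functions; the function names and edge-case handling are illustrative, not taken from the dissertation's implementation.

```python
import math

def exp_t(x: float, t: float) -> float:
    """t-exponential: [1 + (1 - t) x]_+ ^ (1 / (1 - t)); reduces to exp(x) at t = 1."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        # For t < 1 the function is clipped to 0; for t > 1 it diverges at the boundary.
        return 0.0 if t < 1.0 else math.inf
    return base ** (1.0 / (1.0 - t))

def log_t(x: float, t: float) -> float:
    """t-logarithm, the inverse of exp_t for x > 0; reduces to log(x) at t = 1."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)
```

For t = 1.5, exp_t decays as a power law rather than exponentially, which is what makes losses built on it heavy-tailed; the extra arithmetic per evaluation (and the normalization the t-family requires) is consistent with the generally larger per-function-evaluation times in the t = 1.5 columns of the tables.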
Table E.1
CPU time spent on binary datasets (total time, average time per function evaluation), in seconds.

Dataset           logistic            t = 1.5             Savage
adult9            (0.30, 0.01)        (1.37, 0.04)        (0.16, 0.01)
alpha             (134.75, 0.36)      (283.94, 0.85)      (164.48, 0.46)
astro-ph          (3.12, 0.04)        (7.06, 0.12)        (3.25, 0.03)
aut-avn           (0.94, 0.03)        (4.48, 0.07)        (0.86, 0.03)
beta              (37.79, 0.79)       (50.17, 1.43)       (85.65, 2.76)
covertype         (2.64, 0.03)        (42.71, 0.49)       (2.27, 0.03)
delta             (94.49, 0.41)       (188.50, 0.83)      (90.03, 0.41)
epsilon           (339.18, 2.49)      (867.48, 2.39)      (243.70, 2.54)
gamma             (123.16, 0.40)      (238.67, 0.85)      (100.31, 0.42)
kdd99             (40.47, 0.29)       (560.90, 4.42)      (28.69, 0.28)
kdda              (1317.03, 5.21)     (4223.01, 12.68)    (1152.17, 6.03)
kddb              (2717.57, 14.01)    (8164.70, 30.02)    (1278.16, 9.47)
longservedio      (0.09, 0.00)        (0.11, 0.01)        (0.05, 0.00)
measewyner        (0.07, 0.00)        (0.14, 0.00)        (0.10, 0.00)
mushrooms         (0.07, 0.01)        (0.15, 0.01)        (0.12, 0.01)
news20            (12.98, 0.28)       (17.32, 0.35)       (40.86, 0.48)
real-sim          (0.60, 0.03)        (1.67, 0.09)        (0.70, 0.04)
reuters-c11       (7.91, 0.72)        (14.56, 1.46)       (30.26, 2.75)
reuters-ccat      (37.24, 0.33)       (83.72, 1.20)       (19.68, 0.21)
web8              (0.58, 0.00)        (0.84, 0.06)        (0.20, 0.01)
webspamtrigram    (1450.52, 12.29)    (3008.55, 7.45)     (799.84, 12.90)
webspamunigram    (20.28, 0.08)       (80.65, 0.35)       (17.16, 0.08)
worm              (38.88, 0.83)       (79.65, 1.85)       (52.63, 0.60)
zeta              (365.85, 1.54)      (297.36, 2.94)      (274.32, 2.59)
Table E.2
CPU time spent on multiclass datasets (total time, average time per function evaluation), in seconds.

Dataset           logistic            t = 1.5             Savage
dna               (0.14, 0.00)        (0.31, 0.01)        (0.18, 0.00)
letter            (2.34, 0.01)        (63.88, 0.20)       (15.15, 0.01)
mnist             (47.47, 0.08)       (99.76, 0.44)       (36.34, 0.09)
protein           (2.89, 0.01)        (10.04, 0.04)       (2.98, 0.01)
rcv1              (3620.79, 3.56)     (2391.85, 16.27)    (726.63, 3.95)
sensitacoustic    (3.46, 0.02)        (37.36, 0.18)       (6.63, 0.02)
sensitcombined    (9.23, 0.04)        (62.84, 0.20)       (13.71, 0.03)
sensitseismic     (3.11, 0.02)        (109.98, 0.31)      (10.97, 0.02)
usps              (6.44, 0.04)        (31.20, 0.06)       (2.78, 0.02)
VITA

Nan Ding was born in Shanghai, China, on February 14, 1986. After completing his work at Weiyu High School in Shanghai, he entered Tsinghua University in Beijing, China. In June 2008, he completed a Bachelor of Engineering in Electronic Engineering. He subsequently entered the Department of Computer Science at Purdue University, West Lafayette, for graduate study. He completed a Master of Science in December 2010 and a Doctor of Philosophy in May 2013. His research interests are statistical machine learning, graphical models, Bayesian inference, and convex optimization.