To PAC and Beyond

Thesis submitted in partial fulfillment of the degree of Doctor of Philosophy by Ran Gilad-Bachrach

Submitted to the Senate of the Hebrew University February 2006


This work was carried out under the supervision of Prof. Naftali Tishby.


Acknowledgements

The work presented here is the result of a period of 10 years I spent at the Hebrew University in Jerusalem. During this time I got married to the adorable Efrat Gilad and together we gave birth to our two incredible sons, Dvir and Yinon. It sounds like a cliché, but without the support of my closest family, especially my parents Yehudit (Dita) and Daniel (Dani) Bachrach, my brother and sisters Yuval, Yael and Nurit, my grandparents and of course Efrat, Dvir and Yinon, I would not have had any chance of succeeding. I do not think there is any way to express gratitude to them in words.

I have been lucky enough to be surrounded by extremely talented people who were my teachers, peers and friends. I take this opportunity to thank them. Ran El-Yaniv opened the doors for me to the fantastic world of scientific research. I am glad to be able to call him both a mentor and a friend. I had the opportunity to collaborate with Eli Shamir, who is a role model for me as a person who is extremely smart, yet modest. Shai Fine is my “older scientific brother” and I thank him for that (Shai - we still have an outstanding bet ...). Amir Navot has been a great peer to work with and a great friend as well. We spent plenty of time together discussing everything. Without his bike riding enthusiasm I would never have ridden a bike down the Kicking Horse ski slopes without brakes. My other roommates, Amir Globerson and Gal Chechik, also shared with us great moments of inspiring discussions.

While at the Hebrew University I spent most of my time at the machine learning lab. I owe something to each person in the lab: Yoseph Barash, Gil Bejerano, Koby Crammer, Ofer Dekel, Gal Elidan, Michael Fink, Nir Friedman, Eyal Krupka, Shahar Mendelson, Elad Schneidman, Shai Shalev-Shwartz, Lavi Shpigelman, Yoram Singer, Yair Weiss, Golan Yona, and all the people at that incredible place. Of course, a great thank you goes to my advisor, Naftali Tishby, who taught me a lot about the scientific method, the beauty of science and how to conduct scientific research. Although we debated a lot, he had a greater influence on my scientific approach than he might think.

I had the best possible teachers from whom I have learned how to teach (which is important for someone who is learning how to learn). In particular I would like to mention Ehud de Shalit, Nati Linial, Israel (Robert) Aumann, Saharon Shelah and Hillel Furstenberg. I thank the Clore foundation for the generous funding they provided me as a Ph.D. student. I also thank the Chorfas foundation, Vatat and the Amirim program for additional support. I thank Eyal Rosenman and the ultimate Frisbee team for the great fun. Special thanks go to Esther Singer for the English editing of this dissertation and to Nitsa Movshovitz-Hadar for inspiring discussions. There are so many people who deserve to be mentioned here; I thank each and every one of you. Finally, I would like to thank again Efrat, who motivated me, challenged me and supported me. Your name should appear first on the title page of this work.

Abstract

The greatest advantage of learning algorithms is their ability to adjust themselves to the required task. While “traditional” algorithms are tailored to perform a single task, learning algorithms acquire the capability to perform it. Learning techniques have been applied successfully in various domains, ranging from optical character recognition and information retrieval to medical diagnostics and fraud detection.

Although there are different frameworks for learning, the most common one is the model of learning from examples. In this framework, the learner is presented with a set of observations and the required action (or label) for each of these observations. The learner needs to learn the map between observations and labels. For example, in an optical character recognition task, an observation can be a bit-map representation of a written character and the label is the name of this character. This framework is intuitive to both the teacher (typically human) and the learner (the machine). However, the length of the learning process makes it infeasible in many cases. The teacher is required to present a large set of examples before an acceptable level of learning is reached. Consider for example the task of teaching a machine to perform medical diagnosis. An example in this case is a full profile of a patient, and the label is a medical diagnosis of his or her condition. Supplying these data is a labor-intensive task which requires experts.

Active learning tries to shorten the learning process by allowing the learner some control over the learning process. This control enables the learner to direct the teacher to the areas in which the learner is less confident. One possible method of applying active learning is via Membership Queries. In this model the learner can construct observations and ask the teacher to label them. This model has proved successful in solving theoretical problems such as constant depth circuit learning [75], decision tree learning [65] and others. However, membership queries work poorly when the teacher is human [67]. This deficiency is due to the fact that many of the queries posed by the learner are not clear to the human teacher, since they lack the consistency

expected from a real world observation. For example, when Lang and Baum [67] tried to apply their algorithm for optical character recognition using membership queries, many of the queries directed by the learner consisted of fragments of different characters and thus could not be labeled by the human oracle.

An alternative active learning scheme is the filtering framework. In this setting, the learner is presented with a set of observations, but without the labels. The learner can select those observations for which the teacher is required to supply the labels. The rationale for this setting comes from the fact that in many real world scenarios unlabeled data are readily available, whereas the labels are hard (or expensive) to obtain. Consider for example a text classification task. The unlabeled data are a set of documents that can be downloaded automatically from the World Wide Web; however, a label can require an expert to read the document. Thus, unlabeled data are “cheap” while labels are “expensive”, and the goal in this framework is to reduce the number of labels used in the course of learning. The filtering model has been applied in various domains, such as part-of-speech tagging [33], text classification [112], etc. In all cases the authors reported a high success ratio. From a theoretical point of view, Freund et al. [46], for example, analyzed the Query By Committee (QBC) algorithm [104]. They proved that the error of this algorithm decreases exponentially faster than that of passive algorithms. However, the algorithm is not practical, since it requires access to random hypotheses.

Most of the literature on active learning either lacks theoretical grounding or is impractical to use. The goal of this dissertation is to close the gap between theory and practice. Much of the work presented here focuses on the QBC algorithm. We extend the theoretical understanding of this algorithm and show that it has exponential learning rates under milder assumptions than were previously known. We present an efficient implementation of the QBC algorithm for learning linear classifiers in the filtering model of active learning. Our construction is based on the observation that the sampling problem can be converted into a problem of sampling from convex bodies or computing the volumes of such bodies; the latter problems have been addressed by various authors [39, 78, ...]. To make the algorithm even more applicable, we show that the computational complexity of applying QBC is independent of the input dimension. Not only does this make the running time more reasonable, it also enables the use of kernels when learning. We report on the success of the kernelized QBC algorithm when applied to a couple of benchmarks.

We also address the problem of applying active learning in the presence of noise. In this setting we do not assume that the teacher is always correct when answering the queries of the learner. This scenario forces the learner to balance the redundancy needed to overcome noise against the need to shorten the learning process. We study this scenario both in the membership queries framework and in the selective sampling (filtering) framework.

All our findings help reduce the gap between the theoretical study of active learning and its practical use. We believe that many applications can benefit from the use of active learning once this field matures. We hope that this work will constitute a step in this direction. In this work we prove and apply at least the first part of an ancient Hebrew phrase:

לא הביישן למד, ולא הקפדן מלמד

“He who is shy does not learn, and he who is pedantic shall not teach.” (Pirkei Avot, an ancient Jewish text)

Contents

Acknowledgements
Abstract

Part I: Introduction

1 Learning
   1.1 Learning from Examples
   1.2 Machine Learning and Artificial Intelligence
   1.3 A Brief History of Machine Learning
   1.4 Probably Approximately Correct (PAC)
   1.5 On-line Learning
   1.6 Active Learning

Part II: Membership Queries

2 Preliminaries
   2.1 The Power of Membership Queries
       2.1.1 Constant Depth Circuits
       2.1.2 Decision Trees
       2.1.3 Intersections of Halfspaces
   2.2 The Limitations of Membership Queries
   2.3 Summary

3 Noise Tolerant Learnability using Dual Representation
   3.1 Learning in the presence of noise
   3.2 The Dual Learning Problem
   3.3 Dense Learning Problems
   3.4 Noise Immunity Scheme
   3.5 A Few Examples
       3.5.1 Monotone Monomials
       3.5.2 Geometric Concepts
       3.5.3 Periodic Functions
   3.6 Regression
       3.6.1 Estimating $VC_\epsilon(C, \mathcal{X}^{**})$
   3.7 VC Dimension of Dual Learning Problems
   3.8 Banach Spaces
   3.9 Summary

Part III: Selective Sampling

4 Preliminaries
   4.1 Empirical Studies of Selective Sampling
       4.1.1 Committee-Based Scores
           4.1.1.1 Part-of-Speech Tagging
           4.1.1.2 Spoken Language Understanding
           4.1.1.3 Ensemble of Active Learners
           4.1.1.4 Other Committee-Based Approaches
       4.1.2 Confidence-Based Scores
           4.1.2.1 Margin Based Confidence
           4.1.2.2 Probability Based Confidence
       4.1.3 Look-ahead Principles
   4.2 Theoretical Studies of Selective Sampling
   4.3 Label Efficient Learning
   4.4 Summary

5 The Query By Committee Algorithm
   5.1 Termination Procedures
       5.1.1 The “Optimal” Procedure
       5.1.2 Random Gibbs Hypothesis
       5.1.3 Bayes Point Machine
       5.1.4 Avoiding the Termination Rule
   5.2 Summary

6 Theoretical Analysis of Query By Committee
   6.1 The Information Gain
   6.2 The Fundamental Theory of QBC
   6.3 Proofs
   6.4 Lower Bound on the Expected Information Gain for Linear Classifiers
       6.4.1 The Class of Parallel Planes
       6.4.2 Concave Measures
       6.4.3 The Function $G(\rho)$
   6.5 Proof of Theorem 6.4
   6.6 Summary

7 The Bayes Model Revisited
   7.1 PAC-Bayesian Techniques
   7.2 Symmetry
   7.3 Incorrect Priors and Distributions
   7.4 Summary

8 Noise Tolerance
   8.1 “Soft” QBC
       8.1.1 The Case of Learning with Noise
       8.1.2 The Case of Stochastic Concepts
       8.1.3 A Variant of the QBC Algorithm
   8.2 Information Gain Revisited
       8.2.1 Observations of the State of a Random Variable
       8.2.2 Information Processing Inequality
   8.3 SQBC Sample Complexity
   8.4 Summary

9 Efficient Implementation Using Random Walks
   9.1 Linear Classifiers
   9.2 Sampling from Convex Bodies
   9.3 A Polynomial Implementation of QBC
   9.4 A Geometric Lemma
   9.5 Summary

10 Kernelizing the QBC
   10.1 Kernels
       10.1.1 Commonly Used Kernel Functions
       10.1.2 The Gram Matrix
       10.1.3 Mercer’s Conditions
   10.2 A New Method for Sampling the Version-Space
   10.3 Sampling with Kernels
   10.4 Hit and Run
   10.5 Generalizing to Unseen Instances
   10.6 Summary and Further Study

11 Empirical Evidence
   11.1 Empirical Study
       11.1.1 Synthetic Data
       11.1.2 Label Efficient Learning over Synthetic Data
       11.1.3 Face Image Classification
   11.2 Summary

Part IV: Discussion

12 Summary
   12.1 Active Learning in Humans
   12.2 Conclusions

List of Publications
Bibliography
Summary in Hebrew
   Abstract (in Hebrew)
   Introduction (in Hebrew)

Part I

Introduction


Chapter 1

Learning

Learning is the processing of information we encounter, which leads to changes or an increase in our knowledge and abilities [68]. The high capacity to learn is one of the key features of human beings, and one which distinguishes us from the rest of the animal kingdom. It is learning that allows us to adapt to changing environments and to solve complicated problems. The quote cited above captures some of the fundamental aspects of learning. Learning is a process. It is a process that converts information into knowledge and abilities. Thus learning is a process that enriches our capabilities.

Machine learning is the art of designing computerized learning processes; i.e., machine learning is the art of designing computer processes which are capable of turning the information we encounter into knowledge and abilities. Machine learning uses the following flow to solve problems:

1. Acquire data
2. Find concise representations of the data
3. Find patterns
4. Turn patterns into knowledge
5. Turn knowledge into actions (optional)

This process has been shown to be successful in many domains such as text classification, speech recognition, machine vision, the study of the genome, etc.

1.1 Learning from Examples

There are many scenarios that incorporate learning. We are interested in learning from examples, and more specifically in supervised learning. Two players are involved here, the teacher and the learner. The teacher has some knowledge that the learner is interested in, and thus the learner watches the teacher while acting. It is hoped that the learner will be able to collect enough information, and will be wise enough to convert this information into knowledge. Therefore, there are two sources of complexity in this process: the amount of time the learner needs to watch the teacher and the amount of “wisdom” the learner is required to have. In the machine learning literature these complexities are referred to as the sample complexity and the computational complexity of the learning process.

We consider the learning from examples framework. In this learning process, the learner has access to examples. Each example is a pair $(x, y)$ where $x$ is an instance from the sample space $\mathcal{X}$ and $y$ is the label of this instance, taken from the output space (or label space) $\mathcal{Y}$. We assume that there exists an underlying target concept $c$ that maps inputs to outputs. The target concept can be a deterministic map or a stochastic one. The goal of the learner is to approximate this target concept.

Much of the research in machine learning focuses on ways to reduce these complexities, both the sample complexity and the computational complexity. In a sense, this work deals with the trade-off between them: we are interested in ways of reducing the sample complexity without sacrificing too much computational complexity. In other words, we are interested in transferring some of the workload from the teacher to the learner. In order to achieve this acceleration in learning we will go beyond the traditional learning models, e.g. PAC (see Definition 1.1), and allow active learning. By active learning we mean that the learner plays an active role in the learning process. While a passive learner only watches the teacher, an active learner can guide the teacher by asking questions.

In this work we study active learning as an extension of passive learning. We show that in many cases active learners significantly outperform passive learners. We study these frameworks both with analytical tools (theory) and with experiments (empirical evidence). As opposed to most of the active learning literature, we have used the same algorithm in our theoretical study as in our empirical study. By doing this we are able to build a bridge over the gap between theory and practice in active learning.

In the remainder of this chapter we briefly discuss fundamental definitions and theorems in

machine learning. A reader who is familiar with this field may wish to skim it for the notation we will be using in this document (see Table 1.1 in Section 1.6 for a summary of the notation).

1.2 Machine Learning and Artificial Intelligence

Machine learning is a sub-division of Artificial Intelligence (AI). Both AI and machine learning try to mimic the way the human brain solves complicated problems. Traditional artificial intelligence defines knowledge as a set of logical rules. These rules are used to infer new unseen cases. Machine learning diverges from this approach in two ways: first, machine learning puts greater emphasis on the learning process, i.e. the process by which we acquire the knowledge. Second, machine learning typically uses statistical and probabilistic properties, whereas traditional artificial intelligence uses logical deductions. In a sense, logic-based artificial intelligence represents the “old school” whereas machine learning represents the “modern school” [98].

The difference between the two approaches can be seen in the following example. Assume you would like to build a machine to perform a certain task, say a medical diagnosis machine. The “logical” approach towards building such a machine would be to contact an expert (a physician in the example of a medical diagnosis machine) and ask for a set of rules which differentiates sick from healthy people. These rules are hard coded in the machine and used to diagnose patients. This approach has several flaws: first, it is usually impossible to define these logical rules. Second, it is difficult to maintain and debug such a set of rules: in a system with thousands of rules, how do you find the one rule which leads to a wrong prediction? How do you correct it without destroying the whole system? Finally, how do you adjust such a system to a changing environment or to a new diagnosis task?

The flaws presented above primarily affect the acquisition process. Machine learning uses a different approach. In the acquisition stage, here called learning, the learner watches an expert at work and collects statistics about various correlations in the data. Once the learner has collected enough information, it can be used to generate insights and make predictions. Learning from examples is useful in a variety of domains such as medical diagnostics, speech recognition, information retrieval, etc. It has the advantage that the training process only requires watching an expert at work, and the result is easier to maintain and more reusable than “logical” machines.

Machine learning, as suggested by its name, focuses on the acquisition stage. Machine learning can be broken into various sub-fields based upon the nature of the acquisition stage (e.g. supervised, semi-supervised, unsupervised) and the task the machine has to perform (e.g. batch, on-line, classification, regression). In this dissertation we focus on the task of supervised learning.

1.3 A Brief History of Machine Learning

Machine learning has been studied under various names for more than half a century. A comprehensive review of the history of machine learning is beyond the scope of this work; here we present the key principles that will be used in the rest of this document. Machine learning attracts researchers from different disciplines: mathematics, computer science, neuroscience, biology, etc. There are three main motivations for research in this field:

1. The study of the brain
2. Learning as a way to solve hard problems
3. The study of “learning” as an abstract concept

Brain researchers have found that the brain is made up of atomic building blocks, the neurons. These neurons are connected together in a network. The ability of our brain to solve complicated problems and adjust itself to changing environments and new tasks prompted researchers to believe that by building artificial neural networks we would be able to learn to solve complex problems. It would also allow us to better understand the way our brain works. This line of research began during the 1940s. McCulloch and Pitts [85] and later Hebb [51] suggested ways in which neural networks could work. These were the opening chapters in a very fruitful and inspiring line of research that has generated hundreds of books.

The main building block of the neural network is the neuron. A neuron has many inputs (synapses) and a single output (axon). The artificial neuron is the perceptron, or the linear classifier [96]. Like the neuron, it has many inputs and a single output. To facilitate the notation we assume that there is a single input which is a vector $x \in \mathbb{R}^d$, such that each component of this vector resembles a synapse. The perceptron calculates a linear threshold function over its input: each perceptron holds a vector of weights $w \in \mathbb{R}^d$ and a threshold $\theta \in \mathbb{R}$, and it computes the function $c_{w,\theta}(x) = \mathrm{sign}(w \cdot x - \theta)$, where $w \cdot x$ is the inner product between the vectors $w$ and $x$. Although the perceptron was defined almost 50 years ago, it is probably the most commonly used tool in machine learning. A large part of this work is devoted to learning perceptrons. We usually refer to perceptrons as linear classifiers. Whenever the threshold $\theta$ is set to zero, i.e. the classification rule is $c_w(x) = \mathrm{sign}(w \cdot x)$, we call the classifier a homogeneous linear classifier. (A minimal code sketch of such a classifier appears at the end of this section.)

Artificial neural networks serve both as a tool to study the brain and as a method to solve problems that are otherwise considered to be hard. Other approaches to solving problems have been suggested as well. Two representative approaches in this category are nearest neighbor rules and window based rules [32, 44, 111]. Both methods assume some metric over the input space, and predictions of the label of a new instance are based on its proximity to some of the points in the training set. In nearest neighbor rules, the predicted label of a new instance is chosen by holding a majority vote among the $k$ nearest neighbors of the instance at hand. In window based approaches, the label is chosen by holding a majority vote among all training instances which are close to the instance at hand. Both approaches have been analyzed and proved to be consistent, i.e., optimal in an asymptotic sense, provided that the right choice of parameters is made.

The most recent motivation for research in machine learning is the attempt to study “learning” as an autonomous concept. Valiant [116] defined the Probably Approximately Correct (PAC) model, which was the first attempt to define learning as a mathematical object. By defining learning in a way which does not assume a certain way of doing it, Valiant was able to raise issues which had never been formulated before, such as “Is it possible to learn?”, “Is it possible to learn everything?” or, more generally, “What can be learned?”. Valiant’s work marks the beginning of machine learning as we know it today. In the following sections we review Valiant’s PAC model, other definitions of learning, and some of the important findings in this field.
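To make the hypothesis class concrete, here is a minimal Python sketch of the linear classifier $c_{w,\theta}$ described above; the weight vector and threshold values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def linear_classifier(w, theta):
    """Return the perceptron hypothesis c_{w,theta}(x) = sign(w . x - theta)."""
    def c(x):
        # sign(0) is ambiguous; we arbitrarily map the boundary to +1.
        return 1 if np.dot(w, x) - theta >= 0 else -1
    return c

# A homogeneous linear classifier is the special case theta = 0.
c = linear_classifier(np.array([0.5, -1.0]), theta=0.0)
print(c(np.array([2.0, 0.5])))   # +1: the point lies on the positive side
print(c(np.array([-1.0, 1.0])))  # -1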

1.4 Probably Approximately Correct (PAC)

Valiant made several important observations which are fundamental ingredients of the PAC model [116]. First of all, learning is a finite process, in the sense that we should be able to benefit from learning after a finite time. Therefore, learning should be possible after seeing only a finite set of examples. Valiant also made a distinction between inaccuracy of the learner and failure of the learning process. Inaccuracy is caused by the fact that the learner sees only a finite sample; however, sometimes the learning process can fail altogether, when the training sequence is atypical. Valiant claims that as long as we have good enough accuracy with high confidence, we can learn.

The PAC model defines learnable concept classes. A class $C$ is learnable if the number of examples needed to learn a concept in this class is finite. Valiant was primarily interested in the

binary setting where $\mathcal{Y} = \{\pm 1\}$. Since there are canonical ways to convert multi-class learning problems into a set of binary learning problems, we will assume the problems are binary unless otherwise specified.

Definition 1.1 (Probably Approximately Correct [116]) Let $\mathcal{X}$ be a sample space and let $C$ be a binary concept class over $\mathcal{X}$. Let the label space $\mathcal{Y}$ be $\{\pm 1\}$. We say that $C$ is PAC learnable if for any $\epsilon, \delta > 0$ there exist $m < \infty$ and an algorithm $L : (\mathcal{X} \times \mathcal{Y})^m \to C$ such that for any probability measure $\mu$ on $\mathcal{X} \times \mathcal{Y}$
$$\Pr_{S \sim \mu^m}\left[\mathrm{error}_\mu(L(S)) > \epsilon + \inf_{c \in C} \mathrm{error}_\mu(c)\right] < \delta$$
where $\mathrm{error}_\mu(c) = \mu\{(x, y) : c(x) \neq y\}$.

A concept class is PAC learnable if a finite sample suffices to learn a hypothesis from the class which is almost the best possible concept in this class in terms of generalization error. Vapnik and Chervonenkis [118] showed that PAC learnable classes have a unique geometric property: a concept class $C$ is PAC learnable if and only if it has a finite Vapnik-Chervonenkis (VC) dimension. (Vapnik and Chervonenkis presented their results more than a decade before the PAC model was defined; Blumer et al. [16] found the connection between their results and the PAC model.) In order to define the VC dimension we need to define the shatter coefficient of a class $C$.

Definition 1.2 Let $C$ be a concept class. The $m$'th shatter coefficient of $C$ is
$$\Pi_C(m) = \max_{x_1, \ldots, x_m \in \mathcal{X}} \left|\{(c(x_1), \ldots, c(x_m)) : c \in C\}\right|$$

The shatter coefficient measures the number of different ways that a concept class can assign labels to $m$ instances. Clearly, since $|\mathcal{Y}| = |\{\pm 1\}| = 2$, we have $\Pi_C(m) \leq 2^m$. This is the rationale behind the definition of the VC dimension:

Definition 1.3 A concept class $C$ has VC dimension $d$ if $d = \max\{m : \Pi_C(m) = 2^m\}$. The VC dimension is infinite if $\Pi_C(m) = 2^m$ for all $m$.

The VC dimension gives an exact characterization of PAC learnable concept classes. Vapnik and Chervonenkis [118] proved the following seminal theorem (this is a rephrased version of the original result).
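For small finite classes, Definitions 1.2 and 1.3 can be checked by brute force. The following sketch uses the class of threshold functions on the line as an assumed illustrative example (the class and grid of thresholds are ours, not from the text):

```python
from itertools import product

def shatter_coefficient(concepts, points):
    """|{(c(x1), ..., c(xm)) : c in C}| for a fixed tuple of points."""
    return len({tuple(c(x) for x in points) for c in concepts})

def is_shattered(concepts, points):
    # The m points are shattered iff all 2^m labelings are realized.
    return shatter_coefficient(concepts, points) == 2 ** len(points)

# Illustration: thresholds on the line, c_a(x) = +1 iff x <= a.
thresholds = [0.1 * i for i in range(11)]
concepts = [(lambda x, a=a: 1 if x <= a else -1) for a in thresholds]

print(is_shattered(concepts, [0.5]))        # True: a single point is shattered
print(is_shattered(concepts, [0.3, 0.7]))   # False: (-1, +1) is unrealizable,
# so the VC dimension of thresholds is 1, in agreement with the theory.
```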

Theorem 1.1 [6, Theorem 4.2 and Theorem 4.8] Let $C$ be a concept class of VC dimension $d$. Let $L$ be an algorithm which, given a sample $S \in (\mathcal{X} \times \mathcal{Y})^m$ of labeled instances, returns a hypothesis $c = L(S) \in C$ which minimizes the empirical error $|\{(x, y) \in S : c(x) \neq y\}|$. Then for any $\delta > 0$ and any probability measure $\mu$ over $\mathcal{X} \times \mathcal{Y}$ the following holds:
$$\Pr_{S \sim \mu^m}\left[\mathrm{error}_\mu(L(S)) > \epsilon + \inf_{c \in C} \mathrm{error}_\mu(c)\right] \leq \delta$$
as long as
$$\epsilon \geq \sqrt{\frac{32}{m}\left(d \ln\frac{2em}{d} + \ln\frac{4}{\delta}\right)}$$
where $\mathrm{error}_\mu(\cdot)$ is as defined in the PAC model, Definition 1.1. Furthermore, if $\inf_{c \in C} \mathrm{error}_\mu(c) = 0$, i.e. the target concept is in the concept class $C$, then
$$\Pr_{S \sim \mu^m}\left[\mathrm{error}_\mu(L(S)) > \epsilon\right] \leq \delta$$
as long as
$$\epsilon \geq \frac{2}{m}\left(d \ln\frac{2em}{d} + \ln\frac{2}{\delta}\right)$$
If the VC dimension of $C$ is $\infty$ then $C$ is not PAC learnable.

Theorem 1.1 shows that the learning rates we can expect are $O^*\left(\sqrt{d/m}\right)$ in the general case and $O^*\left(d/m\right)$ when the target concept is in the concept class. (We use the notation $O^*(\cdot)$ to indicate that we neglect logarithmic factors.) Note that only the constants in the bounds we presented can be improved. We will see in Chapter 6 that when active learning is used we obtain significantly better results.
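To get a feel for the two rates, the bounds of Theorem 1.1 can be evaluated numerically. The sketch below simply transcribes the two inequalities; the parameter values are arbitrary illustrations:

```python
import math

def pac_epsilon_realizable(m, d, delta):
    """Accuracy guaranteed by Theorem 1.1 when the target is in the class:
    eps = (2/m) * (d*ln(2em/d) + ln(2/delta))."""
    return (2.0 / m) * (d * math.log(2 * math.e * m / d) + math.log(2 / delta))

def pac_epsilon_agnostic(m, d, delta):
    """General-case bound: eps = sqrt((32/m) * (d*ln(2em/d) + ln(4/delta)))."""
    return math.sqrt((32.0 / m) * (d * math.log(2 * math.e * m / d) + math.log(4 / delta)))

# The O*(d/m) versus O*(sqrt(d/m)) gap is visible numerically:
for m in (10**4, 10**5, 10**6):
    print(m, pac_epsilon_realizable(m, d=10, delta=0.05),
             pac_epsilon_agnostic(m, d=10, delta=0.05))
```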

1.5 On-line Learning

The on-line learning model [77] is another attempt to define learning. Littlestone tried to capture the fact that learning is a continuous process. In the PAC model there are two phases: the learning phase and the inferring (or generalizing) phase. In the on-line learning model, these two phases are interleaved. Learning takes place in rounds. In round $t$, the teacher presents the instance $x_t$ and the learner predicts that the label of $x_t$ is $\hat{y}_t$. After this, the teacher reveals the label $y_t$ and the learner suffers a loss of $L(y_t, \hat{y}_t)$, where $L(\cdot, \cdot)$ is some non-negative loss function. See Figure 1.1 for an illustration of a single round in the on-line learning model.

[Figure 1.1: An illustration of a single round in the on-line learning model. The teacher presents $x_t$; the learner replies with a prediction $\hat{y}_t$; the teacher then reveals $y_t$ and the learner suffers the loss $L(y_t, \hat{y}_t)$.]

The goal of the learner in this setting is to minimize $\sum_{t=1}^{\infty} L(y_t, \hat{y}_t)$ under the mildest possible assumptions. In most cases one of the following assumptions is made:

1. There exists an underlying target concept $c$ chosen from a class $C$ such that $y_t = c(x_t)$. Under this assumption we can sometimes prove that
$$\sum_{t=1}^{\infty} L(y_t, \hat{y}_t) \leq M < \infty$$
We call $M$ the mistake bound, since it provides an upper bound on the number of mistakes for any single sequence $x_1, x_2, \ldots$ and any concept $c \in C$.

2. There is no restriction on the target concept, but the learner is compared only to a limited reference class $C$. In this case we seek a function $f(\cdot)$ such that for any sequence $(x_1, y_1), (x_2, y_2), \ldots$ we have
$$\sum_{t=1}^{\infty} L(y_t, \hat{y}_t) \leq \inf_{c \in C} \sum_{t=1}^{\infty} f\left(L(c(x_t), \hat{y}_t)\right)$$
These bounds are called regret bounds. They provide a bound on the difference between the cumulative loss of the algorithm studied and that of the best concept in the reference class.

Many of the on-line learning algorithms are very simple, fast, and use a small amount of memory. For example, the perceptron algorithm [96], when applied to a $d$ dimensional problem, uses $O(d)$ memory cells and each prediction is made in $O(d)$ operations. (We assume here that the perceptron is represented in the primal space. When the perceptron is used with kernels, the data must be represented in the dual space, and the memory and CPU usage change dramatically; see [38] for more on this issue.) It is important to note, though, that the constraints on memory and CPU usage are not part of the definition of the on-line learning model.

Since in most of the cases we study here the labels are either $+1$ or $-1$, the natural loss function is the 0-1 loss, which has the value 0 whenever $y_t$ and $\hat{y}_t$ are equal and the value 1 otherwise:
$$L_{0\text{-}1}(y_t, \hat{y}_t) = \frac{1 - y_t \hat{y}_t}{2} = \begin{cases} 0 & \text{if } y_t = \hat{y}_t \\ 1 & \text{if } y_t \neq \hat{y}_t \end{cases}$$
In this case $\sum_{t=1}^{\infty} L_{0\text{-}1}(y_t, \hat{y}_t)$ is a count of the number of prediction mistakes the learning algorithm made. It is known that if the perceptron algorithm is used on a sequence $(x_1, y_1), (x_2, y_2), \ldots$ then $\sum_{t=1}^{\infty} L_{0\text{-}1}(y_t, \hat{y}_t) \leq R^2/\theta^2$, provided that $\|x_t\|_2 \leq R$ for every $t$ and there exists $w \in \mathbb{R}^d$ such that $\|w\|_2 = 1$ and $y_t(w \cdot x_t) \geq \theta$. This is the mistake bound for the perceptron algorithm that was proved in [90].
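The following sketch implements the on-line perceptron with the 0-1 loss and counts its mistakes. The synthetic stream (with $R = \sqrt{2}$ and margin $\theta = 0.2$) is an assumption made for illustration, chosen so that the $R^2/\theta^2$ bound can be checked:

```python
import numpy as np

def perceptron_online(stream):
    """Run the perceptron on a stream of (x_t, y_t) pairs, y_t in {-1, +1}.
    Returns the final weight vector and the number of 0-1 mistakes."""
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)
        y_hat = 1 if w @ x >= 0 else -1      # predict before the label arrives
        if y_hat != y:                        # 0-1 loss equals 1: a mistake
            mistakes += 1
            w += y * x                        # the perceptron update
    return w, mistakes

# A margin-separable stream with R = sqrt(2) and theta = 0.2 (illustrative data):
rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0])                 # the hidden unit-norm concept
xs = rng.uniform(-1, 1, size=(500, 2))
xs = xs[np.abs(xs @ w_star) >= 0.2]           # keep only points with margin
stream = [(x, 1 if w_star @ x >= 0 else -1) for x in xs]
w, mistakes = perceptron_online(stream)
print(mistakes, "<=", 2 / 0.2**2)             # never exceeds R^2/theta^2 = 50
```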

1.6 Active Learning

Stone’s celebrated theorem proves that given a large enough training sequence, even naive algorithms such as $k$-nearest neighbors can be optimal [111]. However, collecting large training sequences poses two main obstacles. First, collecting these sequences is a lengthy and costly task. Second, processing large data-sets requires enormous resources: obviously we need to process the data while training, but in most cases the complexity of inferring the labels of new data items is also affected by the size of the training data. This is the case for the commonly used Support Vector Machines [19] and AdaBoost [45] algorithms. Therefore, reducing the size of the training sequence is of major concern.

Active learning suggests that the size of the training sequence can be reduced considerably if we allow ourselves to go beyond the standard definitions of learning, e.g. PAC and on-line learning, and allow the learner some control over the learning process. In the learning frameworks we have discussed so far, the teacher selected the instances to be presented to the learner; we therefore call these frameworks passive learning. In active learning frameworks, the learner has some influence on the selection of data points. Having control over the learning process allows the learner to focus on the more informative data points and thus increase the learning rate. In many cases, active learning can indeed accelerate the learning rate; we will show that the speedup can be exponential. However, in some cases there is a price to be paid. Since the learner has control over the learning process, it needs to make decisions that passive learners do not make. Therefore, in some cases the computational complexity of learning can increase when moving from passive learning to active learning. At the same time, however, the sample complexity of learning is reduced considerably. This means that we shift the workload from the teacher to the learner and from the generalization (inference) phase to the training phase.

Table 1.1: Summary of the notation used in this dissertation

  Symbol   Description                                   Remarks
  log      base 2 logarithm
  ln       natural logarithm
  X        the sample space
  x        an instance in the sample space
  D        a distribution over a sample space
  Y        the label space                               typically Y = {±1}
  y        a label in the label space
  C        a concept class
  c        a concept in the concept class
  ν        a prior (or posterior) over a concept class
  d        dimension
  m        the size of a training sample
  H(·)     Shannon’s binary entropy                      H(p) = −p log p − (1 − p) log(1 − p)
  ε        error rate
  δ        failure probability
  η        noise rate

This makes perfect sense, since the teacher is typically a human while the learner is a machine; thus, active learners require less human labor but may require more computing effort.

We discuss two active learning frameworks in this work. In Part II we discuss the membership queries framework and in Part III we discuss selective sampling. The difference between these frameworks is in the type of control the learner is assumed to have over the learning process. In the Membership Queries framework [3] the learner is allowed to pose questions to the teacher. These questions are presented as instances, and the teacher is queried for the labels of these instances. The selective sampling framework [28] is more restrictive. The learner is presented with unlabeled instances and may query for the labels of a subset of these instances. This framework subdivides into two varieties: the batch framework, which we call selective sampling whenever this is not confusing, and the on-line framework, which is called label efficient learning [52].

Alternative active learning frameworks do exist. In the Equivalence Query model [3] the learner can present a hypothesis to the teacher. The teacher can either accept this hypothesis as a good one, or reject it while presenting an instance on which it deviates from the target concept. Another model, experiment design (see e.g. [7]), is studied extensively by statisticians. In this model the problem at hand is a regression problem, and the learner is allowed to select the experiment to run. Although this can be viewed as an active learning framework, it is not adaptive, as the learner does not refine the selection of experiments based on the results of previous experiments.

Part II

Membership Queries


Chapter 2

Preliminaries

Active learners have some control over the learning process. Passive learners can observe the training data but cannot alter them, whereas active learners can direct the teacher to what the learner considers to be the most interesting cases. The capability to play an active role in the training process gives the learner much more latitude for action than a passive learner has. It also more closely resembles the way humans learn. Human learning is a bi-directional process [17, 87, 109]. A good teacher needs to adjust his or her mode of instruction to the student’s prior knowledge and state of mind. It is a well established fact that a teacher who is not tuned to feedback from the students will not be able to teach effectively [17, 87, 109].

When trying to design a framework for computerized active learning, we need to define the way bi-directional communication between learner and teacher takes place. The first active learning framework explored here is the Membership Queries (MQ) framework [3, 4]. In this framework, the learner is allowed to direct questions to the teacher.

Definition 2.1 Let $\mathcal{X}$ be a sample space. A membership query is an instance $x \in \mathcal{X}$. The teacher’s response to such a query is the label $y$ associated with $x$.

A learning algorithm makes a membership query, much like humans ask their teachers questions. The membership query oracle is very powerful, since it allows the learning algorithm to query for the label of any instance.
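A membership query oracle is trivial to model in code. The sketch below is a hypothetical interface (the class name and target concept are ours, not the thesis’s); it also counts queries, which is the natural measure of label complexity:

```python
class MembershipOracle:
    """A teacher that answers membership queries for a fixed target concept."""
    def __init__(self, target):
        self.target = target       # the hidden concept c : X -> {-1, +1}
        self.num_queries = 0       # query count = label complexity

    def query(self, x):
        self.num_queries += 1
        return self.target(x)

# Any learner can now actively probe the sample space:
oracle = MembershipOracle(target=lambda x: 1 if sum(x) >= 1 else -1)
print(oracle.query((0.4, 0.9)), oracle.query((0.1, 0.2)), oracle.num_queries)
```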


2.1 The Power of Membership Queries

The power of membership queries has been demonstrated in hundreds of papers. This section looks at a few key articles, all of which show that membership queries can solve problems that are difficult to tackle. Enumerating all the tasks in which membership queries have been used to provide solutions is beyond the scope of this dissertation. The tasks examined here illustrate the variety and diversity of applications of membership queries.

2.1.1 Constant Depth Circuits

There are many ways to represent Boolean functions: truth tables, logical formulas, Karnaugh maps and others. A Boolean circuit is another representation of a Boolean function, one which captures the engineering point of view. A circuit is made of gates that are wired together. The gates are the atoms of this structure. A gate performs a simple task: it receives one or more inputs and generates an output. The output of a gate can be wired to the input of other gates. Each gate can be connected to any other gate, provided that the directed graph which describes the circuit is acyclic (such a graph is called a DAG, which stands for Directed Acyclic Graph). Through the right choice of gates and wiring, a circuit can compute sophisticated Boolean functions. See Figure 2.1 for an illustration of a Boolean circuit.

[Figure 2.1: An illustration of a Boolean circuit. This circuit has three inputs, I1, I2 and I3, and a single output marked by O. The circuit contains two AND gates and a single OR gate, and has a depth of two.]

Boolean circuits play an important role in electronics and in theoretical computer science. Linial, Mansour and Nisan have proved the following about learning such circuits:

Theorem 2.1 [75] Let $c : \{-1, 1\}^n \to \{-1, 1\}$ be a Boolean function which is computable by a circuit of size $S$ and depth $d$ using AND and OR gates. Let $\epsilon, \delta > 0$. Then there exists a learning algorithm which generates a hypothesis $h$ such that with probability $1 - \delta$ the hypothesis $h$ is $\epsilon$-close to $c$. The algorithm works in time $\mathrm{poly}\!\left(n^{(14 \log(S/\epsilon))^{d-1}}, \log(1/\delta)\right)$ and uses membership queries.

In this celebrated result, the Fourier transform of the concept $c$ is analyzed and used to generate the hypothesis $h$. Membership queries are necessary for this algorithm to work. It is not known how such circuits can be learned in $\mathrm{poly}(n)$ time without membership queries.
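The flavor of the Fourier-based approach can be conveyed in a few lines. The sketch below is a schematic low-degree estimator, not the exact algorithm of [75]: it uses membership queries at uniformly random points to estimate every Fourier coefficient of degree at most $k$, then predicts with the sign of the truncated expansion. The sample size, helper names, and toy target are illustrative assumptions:

```python
import itertools, random

def chi(x, S):
    """Parity (character) function chi_S(x) = prod_{i in S} x_i over {-1,+1}^n."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def estimate_low_degree(query, n, k, samples=4000, rng=random):
    """Estimate the Fourier coefficients f_hat(S) = E[f(x) * chi_S(x)] for all
    |S| <= k from membership queries at uniformly random points."""
    xs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(samples)]
    fs = [query(x) for x in xs]               # one membership query per point
    coeffs = {}
    for size in range(k + 1):
        for S in itertools.combinations(range(n), size):
            coeffs[S] = sum(f * chi(x, S) for f, x in zip(fs, xs)) / samples
    return coeffs

def low_degree_hypothesis(coeffs):
    """h(x) = sign of the degree-k truncation of the Fourier expansion."""
    def h(x):
        return 1 if sum(c * chi(x, S) for S, c in coeffs.items()) >= 0 else -1
    return h

# Toy usage: majority of 3 bits, which a small AND/OR circuit computes.
target = lambda x: 1 if sum(x) > 0 else -1
h = low_degree_hypothesis(estimate_low_degree(target, n=3, k=2))
# With high probability this prints True: the degree-2 truncation suffices.
print(all(h(list(x)) == target(list(x)) for x in itertools.product((-1, 1), repeat=3)))
```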

2.1.2 Decision Trees

Decision trees are an important tool in artificial intelligence. The main advantage of decision trees over many other models used in artificial intelligence is their lucidness to humans: a concept that is presented as a decision tree is easily understood by humans. A decision tree has a condition on each of its internal nodes, and each leaf of the tree contains one of the possible outputs. Once an input is presented to a tree, it is matched against the condition at the root of the tree. If the condition is satisfied, we move to the left sub-tree; otherwise we move to the right sub-tree. We then match the input against the root of the chosen sub-tree and, depending on the outcome, move either left or right. This process continues until we reach a leaf of the tree, at which point we report the value at the leaf as the outcome of the tree calculation. See Figure 2.2 for an illustration of a decision tree, and [99] for more about this subject. (A code sketch of this traversal appears at the end of this subsection.)

[Figure 2.2: An illustration of a decision tree. The decision tree in this illustration has six internal nodes, each associated with a condition C1, . . . , C6. It has seven leaves, each associated with either TRUE (+1) or FALSE (-1).]

The problem of learning a decision tree is fundamental in artificial intelligence and machine learning. The main algorithms for learning decision trees include ID3 [94], C4.5 [95], and CART [20]. However, all these algorithms fail to meet PAC requirements. Even in a noise-free environment, these algorithms can learn a tree which is exponentially bigger than the smallest possible tree [59]. This is a major bottleneck in the theoretical analysis of these algorithms. Alternative algorithms have been designed for learning decision trees with accompanying theoretical analysis. These algorithms include an algorithm by Kushilevitz and Mansour [65] which is based on Fourier analysis, and an algorithm by Bshouty [22] which is based on monotonicity. These algorithms use


membership queries, which enables them to provide performance guarantees.
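The traversal described above is straightforward to implement. The following sketch (the node structure and example conditions are our own illustration) evaluates a decision tree on an input:

```python
class Node:
    """An internal node holds a predicate; a leaf holds a label (+1 or -1)."""
    def __init__(self, condition=None, left=None, right=None, label=None):
        self.condition, self.left, self.right, self.label = condition, left, right, label

def evaluate(node, x):
    # Descend until a leaf is reached, as described above:
    # satisfied condition -> left subtree, otherwise -> right subtree.
    while node.label is None:
        node = node.left if node.condition(x) else node.right
    return node.label

tree = Node(condition=lambda x: x[0] > 0,
            left=Node(label=+1),
            right=Node(condition=lambda x: x[1] > 0,
                       left=Node(label=+1), right=Node(label=-1)))
print(evaluate(tree, (0.5, -1.0)), evaluate(tree, (-0.5, -1.0)))  # +1 -1
```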

2.1.3 Intersections of Halfspaces

Intersections of halfspaces, or polytopes, are very interesting geometric objects. Learning such objects is interesting both for the geometric representation itself and because learning DNF formulas reduces to this problem (see e.g. [61] for the reduction of Boolean formulas to geometric concepts). State-of-the-art algorithms for learning intersections of halfspaces [62] require $O\!\left(\left(\frac{nt}{\rho}\right)^{t \log\left(t \log \frac{1}{\rho}\right)}\right)$ instances, where $n$ is the dimension, $t$ is the number of halfspaces and $\rho$ is a margin term. However, once membership queries are allowed, Kwek and Pitt [66] showed that $\mathrm{poly}(n, t)$ instances suffice for learning in this setting.

2.2 The Limitations of Membership Queries

Membership queries have provided the theorists of machine learning with a phenomenal tool, both for proposing algorithms and for analytical purposes. However, when it comes to most real world problems, membership queries fall short. This failure was demonstrated by Lang and Baum in their paper entitled “Query Learning can Work Poorly When a Human Oracle is Used” [67]. In this paper, the authors tried to apply an algorithm for learning 2-layer neural networks presented earlier


by Baum [12]. The main idea behind this algorithm is to take two instances with opposite labels and use membership queries on instances along a path connecting them. By doing so, the algorithm can find the exact transition point where the label changes. Lang and Baum [67] tried to apply Baum’s algorithm [12] to the task of recognizing handwritten digits. In this task, a bitmap that is a digital representation of a handwritten character needs to be identified as one of the digits 0–9. The authors expected that the novel learning algorithm would generate extremely accurate hypotheses by identifying the exact boundaries between the different digits. Unexpectedly, the experiment failed. The cause of this failure was that for many of the queries the algorithm generated, the teacher could not provide any answer. Figure 2.3 presents a demonstration of this problem.

[Figure 2.3: Handwritten character recognition using membership queries [67]. The lower left and right corners are images of the digits “7” and “5”. The rest of the images represent combinations of these two digits. Note that some of these images are neither “7” nor “5”; some of them do not look like any digit.]

Two images of the digits “7” and “5” were used to generate a handful of queries for images which are combinations of the original images. However, many of


these queries are neither “7” nor “5”, and some do not resemble any digit at all. This led Lang and Baum to the conclusion that query learning can work poorly when a human oracle is used, as the title of their paper suggests.

The reason for this failure lies in the fact that not all images are valid representations of handwritten digits. The computer views such an image as an array of numbers representing gray levels, but most of these arrays do not represent any digit at all. This phenomenon is not unique to the problem of handwritten digit recognition. On the contrary, we expect it to occur in most applications in which the oracle is human. Consider for example the task of medical diagnosis. When a computer generates medical files and lab results, they will most likely lack the consistency of the medical files and lab results of real human beings. Moreover, a physician who needs to “label” these instances may need to see the patient or to conduct other medical examinations, but since such a patient does not exist, the whole process will fail.

Another limitation of membership queries is the fact that there are problems for which even membership queries will not allow us to learn in a reasonable time. [5] showed that under common cryptographic assumptions (it suffices to assume that there is a one-way function, i.e. a one-to-one function that is easy to compute but hard to invert) there are problems with a finite VC dimension but no polynomial learning algorithm.
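For completeness, here is a sketch of the interpolation idea behind Baum’s algorithm as described above: a binary search with membership queries along the segment between a positively and a negatively labeled instance. The function is our reconstruction of the idea, not the authors’ code; on image data, the intermediate blends it queries are exactly the unlabelable images Lang and Baum observed.

```python
def find_transition(query, x_pos, x_neg, tol=1e-6):
    """Binary search along the segment x(t) = (1-t)*x_pos + t*x_neg,
    issuing membership queries until the label transition point is
    bracketed to within tol. Returns the parameter t of the boundary."""
    lo, hi = 0.0, 1.0            # query(x(lo)) = +1, query(x(hi)) = -1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        x = [(1 - mid) * p + mid * n for p, n in zip(x_pos, x_neg)]
        if query(x) == +1:
            lo = mid             # still on the positive side of the boundary
        else:
            hi = mid
    return lo
```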

2.3 Summary

Membership queries are a powerful tool for the analysis and development of machine learning. They have inspired many authors and provide a way to evaluate the limits of learning algorithms. However, when it comes to real world applications, membership queries usually fall short. This said, there are vital tasks to which membership queries can be applied. Recently, learning algorithms which use membership queries have been applied successfully to verification problems [42]. In these cases, the teacher (oracle) is not a human being but rather a machine itself, which makes it possible to overcome the problem of using membership queries with human oracles. In the next chapter we present a method of overcoming noise while learning. This method uses membership queries at its core. It enables us to study fundamental problems in learning in the presence of noise.


Chapter 3

Noise Tolerant Learnability using Dual Representation

Much of the research in machine learning and neural computation assumes the existence of a perfect teacher, one who gives correct answers to the learning algorithm. However, in many cases this assumption is faulty, since different sorts of noise may prevent the teacher from providing the correct answers. This noise can be caused by noisy communication, human errors, measuring equipment and many other sources of distortion. In some cases, problems which are efficiently learnable without noise become hard to learn when noise is introduced [11]. In other cases, it is possible to learn efficiently even in the presence of noise (see e.g. [58]). However, no simple parameters are known to distinguish between classes that are learnable in the presence of noise and those which become hard to learn.

In this chapter we introduce a noise cleaning procedure. Our procedure is capable of generating a clean sample even when the data source is corrupted with noise. In order to generate the noise free sample we exploit the structure of the dual learning problem. In the dual learning problem the teacher has an instance in mind and the goal of the learner is to approximate it by having access to the labels several classifiers assign to it. For any instance whose label we would like to query, we generate an approximation set consisting of many instances which are close to it. We query for the labels of the instances in the approximation set, assuming we have access to a Membership Query oracle, and use a majority vote to label the instance we are interested in.

In the study below we show that the noise cleaning procedure works as long as the dual learning problem is learnable and dense. Thus any learning problem for which these criteria hold


can be learned efficiently in the presence of noise. We show that these assumptions are valid for a variety of learning problems, such as smooth functions, general geometric concepts, and monotone monomials. We are particularly interested in the analysis of smooth function classes. We show that there is a uniform upper bound on the fat-shattering dimension of both the primal and dual learning problems, derived from a geometric property of the class called type. We also show how the dual learning problem is related to the dual Banach space, an important tool in functional analysis. The work presented in this chapter is based on joint research with Shai Fine, Shahar Mendelson and Naftali Tishby.
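The noise cleaning step itself is a plain majority vote. The sketch below assumes the approximation set has already been constructed (how to construct it via the dual learning problem is the subject of this chapter), and the function name is our own:

```python
from collections import Counter

def denoised_label(noisy_query, approximation_set):
    """Majority vote over an approximation set of instances that the target
    concept is believed to label identically: with enough instances, the
    noise of rate eta < 1/2 is out-voted with high probability."""
    votes = Counter(noisy_query(x) for x in approximation_set)
    return votes.most_common(1)[0][0]   # ties are broken arbitrarily
```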

3.1 Learning in the presence of noise

In many learning models (e.g. PAC) it is assumed that the learner has access to an oracle (a teacher) which picks an instance from the sample space and returns this instance and its correct label. In real world problems, the existence of such oracle is doubtful: human mistakes, communication errors and various other problems make the existence of such an oracle unfeasible. In these cases, a realistic assumption would be that the learner has access to some sort of a noisy oracle. This oracle may have internal mistakes which prevent it from generating the correct labels constantly, or even influence its ability to randomly select points from the sample space. The VC dimension [118, 16] completely characterizes PAC learnability of Boolean functions in terms of the size of the sample needed. A class with a finite VC dimension can be learned from a finite sample whose size depends on the VC dimension, the accuracy and the required confidence level. However, the computational complexity of learning is not controlled by the VC dimension. In fact there are classes with a finite VC dimension but learning these classes is NP complete [60]. Things become even more complicated when we no longer assume the existence of a perfect oracle, i.e. a noise free oracle. We weaken this assumption, and assume that when querying the oracle for the label of x we obtain the true label with a probability of 1−η. We further assume that the oracle is consistent in the sense that if the label of x was requested twice, then the oracle will produce the same result. This model is called the “persistent random classification noise model: [37]. Classes with a finite VC dimension are learnable in the presence of persistent random classification noise in terms of the sample size needed (see e.g. [6] chapter 4). However, the computational complexity of the learning task can change dramatically. If learning in the noise free case is un-


If learning in the noise-free case is unfeasible, then it will remain so in the noisy case. However, there are cases in which the noise-free problem is efficiently learnable, while learning in the noisy environment is unfeasible [11, 13]. The gap between the noise-free case and the noisy case appears not only in the PAC model, but also in other models such as the online learning model [76].

Here we present a procedure which converts noisy oracles into noise-free oracles. In order for our procedure to work, the dual learning problem needs to be learnable and dense. These criteria characterize learning problems which are efficiently learnable in the presence of noise. The procedure we introduce is fairly simple. Given an instance whose label the learner would like to know, we generate an approximation set. This set consists of instances to which we have reason to believe the target concept assigns the same label as it assigns to the instance we are interested in. We use the majority vote among the labels of the instances in the approximation set to deduce the label of the instance we are interested in. The approximation set is generated by using the dual learning problem. In the dual learning problem the instances and hypotheses switch roles: we learn an instance by looking at the labels different hypotheses assign to it. For this scheme to work, we need to be able to learn efficiently in the dual learning problem, and we need the dual learning problem to be dense. For this purpose, we work in a Bayesian setting in which there is a known probability measure over the concept class from which the target is chosen. When these conditions hold, noise can be filtered out by a simple procedure, which makes learning in the presence of noise possible.

More formally, the main result is the following: Let $C$ be a concept class endowed with a probability measure $\nu$. Assume that the target concept $c^* \in C$ was selected according to $\nu$. Further assume that both the primal and dual learning problems are efficiently learnable in the noise-free model and that the dual learning problem is dense. Then the noisy oracle can be converted into a noise-free oracle and learning in the presence of noise can take place.

3.2 The Dual Learning Problem

A learning problem may be characterized by a tuple $\langle \mathcal{X}, C \rangle$ where $\mathcal{X}$ is the sample space and $C$ is the concept class. Learning can be viewed as a two-player game: one player, the teacher, picks a target concept, while his counterpart, the learner, tries to identify this concept. Different learning models differ in the way the learner interacts with the teacher (PAC sampling oracle, statistical queries, membership queries, equivalence queries, etc.) and in the method used to evaluate performance.

Every learning problem has a dual learning problem [92], which may be characterized by the tuple $\langle C, \mathcal{X} \rangle$. In this representation the learning game is reversed: first the teacher chooses an instance $x \in \mathcal{X}$, and then the learner tries to approximate this instance by querying the value $x$ assigns to different concepts. We view an instance $x$ as the evaluation function $\delta_x$ on $C$ such that $\delta_x(c) = c(x)$. We denote the set of these evaluation functions by $\mathcal{X}^{**} = \{\delta_x : x \in \mathcal{X}\}$. To clarify this notion we present two dual learning problems:

• Let $\mathcal{X}$ be the interval $[0,1]$ and set $C$ to be the class of all intervals $[0,a]$ where $a \in [0,1]$. If $x$ is an instance and $c_a$ is the hypothesis $[0,a]$, then $c_a(x) = 1$ if and only if $a \ge x$. Turning to the dual learning problem, note that $\delta_x(c_a) = 1$ if and only if $x \le a$. Hence, the dual learning problem is equivalent to learning intervals of the form $[x,1]$ where $x \in [0,1]$.

• Let $\mathcal{X} = \mathbb{R}^n$, and let $C$ be the class of linear separators, i.e., $c_w \in C$ is the concept which assigns to each $x \in \mathbb{R}^n$ the label $\mathrm{sign}(x \cdot w)$. The dual learning problem is again a problem of learning linear separators, and hence this problem is dual to itself.
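
To make the first example concrete, here is a small sketch (all names are illustrative, not from the thesis) that evaluates an instance as a function over the concepts and recovers the threshold structure of the dual class:

```python
def c(a):
    """Primal concept [0, a]: labels an instance x in [0, 1]."""
    return lambda x: 1 if x <= a else 0

def delta(x):
    """Dual 'instance' delta_x: labels a concept c_a by c_a(x)."""
    return lambda concept: concept(x)

concepts = [c(a / 10) for a in range(11)]        # c_0, c_0.1, ..., c_1
labels = [delta(0.35)(ca) for ca in concepts]
print(labels)   # 0 while a < 0.35, 1 once a >= 0.35: the dual concept [0.35, 1]
```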

The VC-dimension of the dual learning problem (also called the co-VC dimension) obeys the inequalities:

$$\lfloor \log_2 d \rfloor \le d^* \le 2^{d+1} \qquad (3.1)$$

where $d$ is the VC-dimension of the primal problem and $d^*$ is the co-VC dimension (see lemma 3.1 on page 33). As can be seen, the gap between the complexities of the primal and dual learning problems can be exponential. However, in both examples presented here, and in most realistic cases, this gap is only polynomial. Therefore, our assumption that the dual learning problem is efficiently learnable holds in many, if not most, of the interesting cases (see Troyansky's thesis [114] for a survey of dual representations). In section 3.6 we broaden the discussion to handle regression problems, in which the concepts assign a real value to each instance rather than a Boolean value as in the classification case. There, we replace the notion of VC-dimension with the fat-shattering dimension. We show that for classes consisting of sufficiently smooth functions, both the fat-shattering and the co-fat-shattering dimensions have an upper bound which is polynomial in the learning parameters and enables efficient learning of the dual problem.

3.3 Dense Learning Problems

A learning problem is dense [40] if every hypothesis has many hypotheses which are close, but not equal, to it:

Definition 3.1 Let $\mathcal{X}$ be a sample space, $C$ a concept class and $D$ a distribution over the instances. The learning problem defined by the triplet $\langle \mathcal{X}, C, D \rangle$ is dense if for every $c \in C$ and every $\epsilon > 0$ there exists $c'$ such that

$$0 < \Pr_{x \sim D}\left[c(x) \neq c'(x)\right] < \epsilon$$

The density property is distribution dependent: for every learning problem there exists a distribution under which the resulting learning problem is not dense. In fact, if the distribution is supported on a finite set, the problem cannot be dense. If a learning problem is dense, any hypothesis can be approximated by an infinite number of hypotheses; thus finite concept classes are not dense according to definition 3.1. We would like to extend the definition of a dense learning problem to finite hypothesis classes as well. We replace the demand of approximating $h$ for every $\epsilon$ by the same demand for every polynomially small $\epsilon$. In definition 3.1 the requirement is that each hypothesis can be approximated by an infinite number of hypotheses; in the finite case we replace this infinity assumption by a super-polynomial number of approximating hypotheses.

Definition 3.2 Let $\mathcal{X}_n$ be a sample space, $C_n$ a concept class and $D_n$ a distribution over the instances. The sequence of learning problems $\{\langle \mathcal{X}_n, C_n, D_n \rangle\}_{n=1}^{\infty}$ is dense if for every polynomial $p(n)$ there exists $N$ such that for every $n > N$ and every $c \in C_n$ there exists $c'$ such that

$$0 < \Pr_{x \sim D_n}\left[c(x) \neq c'(x)\right] < 1/p(n)$$
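
As a quick sanity check of definition 3.1 for the intervals class under the uniform distribution, the following Monte-Carlo sketch (illustrative names only) estimates the disagreement mass between $[0,a]$ and the perturbed $[0, a+\epsilon/2]$; the two differ exactly on $(a, a+\epsilon/2]$, a set of measure $\epsilon/2$, so the estimate falls strictly between $0$ and $\epsilon$:

```python
import random

def disagreement(a, a_prime, n=100_000):
    """Monte-Carlo estimate of Pr_{x ~ U[0,1]} [ c_a(x) != c_{a'}(x) ]."""
    return sum((x <= a) != (x <= a_prime)
               for x in (random.random() for _ in range(n))) / n

eps, a = 0.01, 0.5
print(disagreement(a, a + eps / 2))   # ~ 0.005, i.e. strictly between 0 and eps
```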

3.4 Noise Immunity Scheme

We now present our noise immunity scheme. This scheme immunizes any learning problem against noise, provided the requirements of learnability and density with respect to its dual learning problem are satisfied. The main idea is to generate a noise-free oracle and use it for learning. Let $x$ be an instance whose label we would like to know. Since we have access to a noisy oracle, querying for the label of $x$ does not produce good enough results. Furthermore, since the noise is persistent, repeated sampling of the label will not provide any additional information. However, if there are enough instances in the sample space to which the target concept assigns the same label as it assigns to $x$, we can sample the labels of these instances and use a majority vote to deduce the label of $x$. The problem is to identify these instances. For this purpose, we use the dual learning problem. The requirements of learnability and density ensure that with high probability, for any instance $x$, the dual learning algorithm will find many instances $x'$ to which almost all concepts assign the same label as to $x$. Since the learner of the primal learning problem knows the dual target $x$ and the probability measure on $C$ (the Bayesian assumption), it can provide a clean sample to the dual learning problem. Hence, the dual learning problem is noise-free and therefore far easier. The algorithm is detailed in algorithm 1. In the following theorem we prove the efficiency of the noise cleaning algorithm.

Theorem 3.1 Assume that the dual learning problem is dense. With probability $1-\delta$ the noise cleaning algorithm (algorithm 1) will return the correct label of $x$. The computational complexity of the algorithm is $\mathrm{poly}\left(d^*, \log\frac{1}{\delta}, \frac{1}{|1-2\eta|}\right)$.

As stated in theorem 3.1, the noise cleaning algorithm is polynomial with respect to its parameters, and with high probability it returns the true label, which can then be used to learn the original learning problem.

Proof of theorem 3.1: We begin by showing that the function approx, when called with $\hat\epsilon$, $\hat\delta$ and $x$, will return $x'$ such that with probability $1-\hat\delta$, $\Pr_{c\sim\nu}[c(x) \neq c(x')] < \hat\epsilon$. Note that this is simply the definition of learning in the dual learning problem. Note also that since the dual learning problem is dense (if the learning problem is finite, this holds for large enough $n$ as defined in definition 3.2), it follows that $x' \neq x$ with probability 1. By the choice of parameters the noise cleaning algorithm makes, it follows that with probability $1-\frac{\delta}{2}$, for every $1 \le i \le k$ the instance $x_i$ is indeed a good approximation of $x$ in the sense that

$$\Pr_{c\sim\nu}\left[c(x_i) \neq c(x)\right] < \frac{1}{4} \qquad (3.2)$$

Assume that (3.2) holds for all $i$. When we query for the label of $x_i$, the correct label is returned with probability $1-\eta$; this is independent of whether $c(x_i) \neq c(x)$ or $c(x_i) = c(x)$. Hence, with probability greater than

$$\frac{3}{4}(1-\eta) + \frac{1}{4}\eta = \frac{3}{4} - \frac{\eta}{2}$$

we will obtain the correct label.

Algorithm 1 Noise Cleaning Algorithm
Inputs:
• Confidence parameter $1-\delta$.
• VC-dimension of the dual learning problem $d^*$.
• A bound on the noise level $\eta$.
• An instance $x$.
Output:
• A label $y$ of $x$.
Algorithm:
1. By simulating the dual learning problem $k = \frac{2}{(1-2\eta)^2}\ln\frac{2}{\delta}$ times, generate an ensemble $S = \{x_1, \ldots, x_k\}$ by applying the approx function (see below) to the point $x$ with accuracy $\frac{1}{4}$ and confidence $\frac{\delta}{2k}$.
2. Use MQ to get a label $y_i$ for each $x_i$.
3. Let $y$ be the majority vote over the $y_i$'s.

Function approx
Inputs:
• A point $x$.
• Required accuracy $\hat\epsilon$.
• Required confidence $1-\hat\delta$.
Output:
• A point $x'$.
Algorithm:
1. Let $m = \mathrm{poly}\left(\frac{1}{\hat\epsilon}, \log\frac{1}{\hat\delta}, d^*\right)$.
2. Using the prior $\nu$ over the concept class $C$, generate a sample $c_1, \ldots, c_m$.
3. Assign to every $c_i$ the label $c_i(x)$.
4. Apply the learning algorithm of the dual learning problem to the labeled sample generated in steps 2 and 3 to generate $x'$.

Using Hoeffding's inequality and the fact that all $x_i$ are chosen independently, we see that the probability that the majority vote will fail to predict the label of $x$ is smaller than $e^{-2\left(\frac{1}{4}-\frac{\eta}{2}\right)^2 k}$, and by the choice of $k$ this is smaller than $\frac{\delta}{2}$. The procedure presented can fail in two cases: the $x_i$'s generated do not form a good approximation set, or alternatively, too many of the labels of the $x_i$'s are corrupted. Each of these events can occur with probability less than $\delta/2$, and thus the whole process succeeds with probability $1-\delta$. The computational complexity of the algorithm follows easily from its definition.
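
For concreteness, the following is a minimal Python sketch of the noise cleaning procedure. The names `dual_learner` (a learning algorithm for the dual problem), `prior_sample` (sampling from $\nu$), `noisy_mq` (the persistent noisy oracle, returning labels in {0, 1}) and the sample size `m` are hypothetical stand-ins, not part of the original construction:

```python
import math

def approx(x, eps_hat, delta_hat, dual_learner, prior_sample, m):
    """Learn an approximation x' of x via the dual problem: sample m
    concepts from the prior and label each concept c by c(x) -- a
    noise-free dual sample, since the primal learner knows x itself."""
    labeled = [(c, c(x)) for c in (prior_sample() for _ in range(m))]
    return dual_learner(labeled, eps_hat, delta_hat)

def clean_label(x, delta, eta, dual_learner, prior_sample, noisy_mq, m):
    """Noise cleaning by majority vote over an approximation set
    (a sketch of algorithm 1); labels are assumed to be in {0, 1}."""
    k = math.ceil(2.0 / (1.0 - 2.0 * eta) ** 2 * math.log(2.0 / delta))
    ensemble = [approx(x, 0.25, delta / (2.0 * k), dual_learner, prior_sample, m)
                for _ in range(k)]
    votes = sum(noisy_mq(xi) for xi in ensemble)   # one noisy query per x_i
    return 1 if votes > k / 2.0 else 0             # majority vote
```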

In the following sections we present a variety of problems to which the paradigm just presented is applicable. We demonstrate it both on continuous classes, such as neural networks, and on finite classes, such as monotone monomials. However, there are classes to which the paradigm cannot be applied. Consider, for example, the class of parity functions: although this class is dual to itself, and thus has a moderate co-VC dimension, it is not dense and therefore fails to meet the requirements.

3.5 A Few Examples

In this section we discuss a few classes to which the noise cleaning algorithm can be applied, i.e. classes which satisfy the properties required by theorem 3.1.

3.5.1 Monotone Monomials

The first problem we present is a Boolean learning problem of a discrete nature: the problem of learning monotone monomials. In this problem the sample space is $\{0,1\}^n$ and the concepts are conjunctions (e.g. $x(1) \wedge x(3) \wedge x(4)$). The hypothesis class is $C = \{c_I : I \subseteq [1 \ldots n]\}$ where $c_I(x) = \bigwedge_{i\in I} x(i)$. To simplify the notation we identify each instance with a subset $x \subseteq [1 \ldots n]$ and each concept with a subset $c \subseteq [1 \ldots n]$, such that $c(x) = 1 \iff c \subseteq x$. The dual learning problem for this class is learning monotone monomials with the reverse order, i.e., $x(c) = 1 \iff x \supseteq c$. Both the primal and the dual learning problems have the same VC-dimension, $n$.

Instead of showing that the dual class is dense, we give a direct argument showing that the label of each instance can be approximated. Let $Z_x = \{z : x \subseteq z \subseteq [1 \ldots n]\}$. Since the concept class is monotone, if $c(x) = 1$ then $c(z) = 1$ for every $z \in Z_x$. On the other hand, if $c(x) = 0$ then there exists some $i \in c \setminus x$; half of the instances $z \in Z_x$ have $i \notin z$, implying that $c(z) = 0$ for each such $z$. Thus $\Pr_{z\in Z_x}[c(z) = 0] \ge 1/2$ with respect to the uniform distribution on $Z_x$. Hence, if $c(x) = 1$ then $\Pr_{z\in Z_x}[c(z) = 0] = 0$, whereas if $c(x) = 0$ then $\Pr_{z\in Z_x}[c(z) = 0] \ge 1/2$. This allows us to distinguish between the two cases. In order to do the same thing in the presence of noise, we have to require that $Z_x$ is big enough. From the definition of $Z_x$ it follows that $|Z_x| = 2^{n-|x|}$. It suffices to require that $|x| \le pn$ for some $p < 1$ with high probability, since in this case $|Z_x|$ is exponentially large. This condition holds with high probability for the uniform distribution and many other distributions. Note that in this case there is no need for a Bayesian assumption, i.e., we do not assume the existence of a distribution on the concept class. Moreover, the dual learning problem reduces in this case to a simple sampling procedure over $Z_x$. However, we have used a slightly relaxed definition of density, in which for most of the instances there exists a sufficient number of approximating instances.
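
The sampling procedure over $Z_x$ is simple enough to spell out. A sketch, with instances represented as sets of indices and `noisy_mq` an assumed noisy oracle returning labels in {0, 1}; the threshold is our own midpoint choice between the two separated vote fractions:

```python
import random

def clean_label_monomial(x, n, noisy_mq, eta, k=500):
    """Estimate c(x) for a monotone-monomial target from a persistently
    noisy MQ oracle with noise rate eta < 1/2, by sampling Z_x.

    If c(x) = 1, every z in Z_x = {z : x <= z <= [1..n]} has c(z) = 1, so
    the expected fraction of 1-votes is 1 - eta.  If c(x) = 0, at least
    half of Z_x has c(z) = 0, so the expected fraction is at most 1/2.
    """
    free = [i for i in range(1, n + 1) if i not in x]
    votes = sum(noisy_mq(frozenset(x) | {i for i in free if random.random() < 0.5})
                for _ in range(k))
    threshold = 3.0 / 4.0 - eta / 2.0      # midpoint of 1/2 and 1 - eta
    return 1 if votes / k > threshold else 0
```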

3.5.2 Geometric Concepts

In contrast to the previous example, when dealing with geometric concepts the sample space and the concept class are both infinite. For the sake of simplicity, let the sample space be $\mathbb{R}^2$ and assume that the concept class consists of axis-aligned rectangles. In this case, the VC-dimension of the primal problem is 4 and the dual problem has VC-dimension 3. Moreover, if a "smooth" probability measure is defined on the concept class, it is easily seen that each instance is approximated by all the instances within distance $r$ of it (where $r$ depends on the defined measure and the learning parameters). Therefore, this class is dense. This example can be extended to a large variety of problems, such as neural networks, general geometric concepts [23] and high-dimensional rectangles. We describe two methods of doing so below.

Using the Bayesian Assumption: The first method uses the Bayesian assumption. Each geometric concept divides the instance space into two sets; the boundary between these sets is the decision boundary of the concept. Assume that for every instance there is a ball around it which does not intersect the decision boundary of "most" of the concepts. Denote by $\nu$ a probability measure on $C$, and assume that for every $\delta > 0$ there exists $r = r(\delta, x) > 0$ such that

$$\Pr_{c\sim\nu}\left[B(x,r) \cap \partial c \neq \emptyset\right] < \delta \qquad (3.3)$$

where $B(x,r)$ is the ball of radius $r$ centered at $x$ and $\partial c$ is the decision boundary of $c$. If (3.3) holds, then all the points in $B(x,r)$ can be used to predict the label of $x$, and therefore to verify the correct label of $x$.

Geometric Concepts without the Bayesian Assumption: A slightly different approach can be used when there is no Bayesian assumption but the distribution over the sample space is non-singular. Given $\delta > 0$, for every concept $c$ there exists a distance $r_c > 0$ such that the measure of all points within distance $r_c$ of the decision boundary of $c$ does not exceed $\delta$. If $0 < r = \inf_{c\in C} r_c$, then with high probability (over the instance space) a random point $x$ lies at distance greater than $r$ from the decision boundary of the target concept $c$. Hence, the ball of radius $r$ around $x$ can be used to select approximating instances.

3.5.3 Periodic Functions

In this example we present a case where the approximating instances are not in the near neighborhood of $x$. Let $\mathcal{X} = \mathbb{R}$ and set

$$C = \left\{ \mathrm{sign}\left(\sin\left(\frac{2\pi x}{p}\right)\right) : p \text{ is prime} \right\}$$

Since the number of primes is countable, the probability measure on $C$ is induced via a measure on $\mathbb{N}$. Note that $C$ consists of periodic functions, but each function has a different period. Given a point $x \in \mathbb{R}$ and a confidence parameter $\delta$, there is a finite set of concepts $A$ such that $\nu(A) \ge 1-\delta$. Since the set $A$ is finite, the elements of $A$ have a common period. Therefore, there is some $t$ such that for every $c \in A$ and every $m \in \mathbb{N}$, $c(x) = c(x + mt)$. It is reasonable to assume that the noise in the primal learning problem is not periodic (because the elements of the class do not have a common period); therefore, it is possible to find many points which agree with $x$ with high probability, yet are far away from a metric point of view. Moreover, using the same idea, given any sample $c_1, \ldots, c_k \in C$, it is possible to construct an infinite number of points $x_i$ which agree with $x$ on the given sample.
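
A sketch of this construction under the stated assumptions, with a short list of primes standing in for the high-probability set $A$ (names are illustrative):

```python
import math

def concept(p):
    """The concept sign(sin(2*pi*x / p)) for a prime p."""
    return lambda x: 1 if math.sin(2 * math.pi * x / p) >= 0 else -1

primes = [2, 3, 5, 7]               # a finite set A with nu(A) >= 1 - delta
t = math.prod(primes)               # a common period of every concept in A
x = 0.4
far_points = [x + m * t for m in range(1, 6)]   # far away, yet agree with x

for c in map(concept, primes):
    assert all(c(x) == c(xp) for xp in far_points)
```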

3.6 Regression

So far in this discussion we have focused on binary classification problems. In this section we extend the discussion to regression, where the target concept is a real-valued function. For every given $x$ we attempt to find "many" instances $x_i$ such that with high probability $f(x_i)$ is "almost" $f(x)$. When the concepts are continuous functions, it is natural to search for the desired $x_i$ near $x$. However, if there is no a-priori bound on the modulus of continuity of the concepts, it is not obvious when $x_i$ is "close enough" to $x$. Moreover, in certain examples the desired instances are not necessarily close to $x$, but may be found "far away" from it (e.g. periodic functions, as presented in section 3.5.3). Algorithm 1 needs to be adjusted for the regression setting. We present the modified algorithm in algorithm 2. The following theorem proves its correctness.

Theorem 3.2 Assume that the dual learning problem is dense. Assume also that the learning problem is bounded, i.e. $\forall c, x \; |c(x)| \le 1$. With probability $1-\delta$ the noise cleaning algorithm for regression (algorithm 2) will return $y$ such that $|y - c(x)| < \epsilon$.

Proof: The proof is very similar to the proof of theorem 3.1. We begin by showing that the function approx, when called with $\hat\epsilon$, $\hat\delta$ and $x$, will return $x'$ such that with probability $1-\hat\delta$ the $L_1(\nu)$ distance between $\delta_x$ and $\delta_{x'}$ is smaller than $\hat\epsilon$. This is simply the definition of learning in the dual learning problem. Note also that by the assumption that the dual learning problem is dense, it follows that $x' \neq x$ with probability 1. By the choice of parameters the noise cleaning algorithm makes, it follows that with probability $1-\frac{\delta}{3}$, for every $1 \le i \le k$ the instance $x_i$ is indeed a good approximation of $x$ in the sense that

$$\left\|\delta_x - \delta_{x_i}\right\|_{L_1(\nu)} < \hat\epsilon \qquad (3.4)$$

Assume that (3.4) holds for all $i$. Then, by Markov's inequality, for every $\gamma > 0$

$$\Pr_{c'\sim\nu}\left[\left|c'(x) - c'(x_i)\right| > \frac{\hat\epsilon}{\gamma}\right] < \gamma$$

Thus, using the parameters in the algorithm and applying the union bound (the union bound is very loose in this case; however, for the sake of brevity we use it here), we obtain that with probability $1-\frac{2}{3}\delta$ (over the choice of the target concept and the internal randomization of the function approx) the following property holds:

$$\forall i \quad \left|c(x) - c(x_i)\right| \le \epsilon \qquad (3.5)$$

Assuming that (3.5) holds, for each $i$ the noisy oracle returns $y_i = c(x_i)$ with probability $1-\eta$; the remaining values are corrupted by noise. It suffices that more than half of the returned values are correct: this guarantees that the median is not more than $\epsilon$ away from the true value (due to (3.5)). Using Hoeffding's inequality and the fact that all $x_i$ are chosen independently, we see that the probability that at least half of the values are corrupted is smaller than $e^{-2\left(\frac{1-2\eta}{2}\right)^2 k}$, and by the choice of $k$ this is smaller than $\frac{\delta}{3}$. This completes the proof.

3.6.1 Estimating $VC_\varepsilon(C, \mathcal{X}^{**})$

In general, the question of learnability of the dual problem may be divided into two parts. The first is to construct a learning algorithm $L$ which assigns to each sample $S_m = \{f_1, \ldots, f_m\}$ a point $x'$ such that for every $f_i$, $|f_i(x) - f_i(x')| < \varepsilon$. The second part is to show that the class of functions $\mathcal{X}^{**} = \{\delta_x : x \in \mathcal{X}\}$ on $C$ satisfies some compactness condition (e.g., a finite fat-shattering dimension $VC_\varepsilon(C, \mathcal{X}^{**})$). We provided an answer to the first problem in the previous section. We now address the second.

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be infinite, and let $(C, \|\cdot\|)$ be a subset of a Banach space (see section 3.8) consisting of functions on $\mathcal{X}$. Furthermore, assume that $C$ has a reproducing kernel (see definition 3.4 on page 33). In this case the dual learning problem is always a linear learning problem, since for any $x \in \mathcal{X}$ the functional $\delta_x(c) = c(x)$ is in $C^*$ and thus $\mathcal{X}^{**} \subseteq C^*$. We will show that if $C$ is a bounded subset of a Banach space with a reproducing kernel and $\mathcal{X}^{**}$ is a bounded subset of $C^*$, then the fat-shattering dimension $VC_\varepsilon(C, \mathcal{X}^{**})$ is finite for every $\varepsilon > 0$, provided that $C$ has a non-trivial type, i.e. a type greater than 1. Classical representatives of spaces with non-trivial type are the Sobolev spaces $W^{k,p}$ (cf. [48] for basic information regarding Sobolev spaces, or [1] for a comprehensive survey). For example, $W^{1,2}(0,1)$ is the space of continuous functions $f : [0,1] \to \mathbb{R}$ whose derivative $f'$ belongs to $L_2$ with respect to the Lebesgue measure. The inner product in this space is defined by $f \cdot g = \int_0^1 (fg + f'g')\,dx$.

Mendelson [86] explored the relation between the type of a Banach space and the fat-shattering dimension; his result is stated as theorem 3.3 below.


Algorithm 2 Noise Cleaning Algorithm for Regression
Inputs:
• Confidence parameter $1-\delta$.
• Fat-shattering dimension of the dual learning problem $d^*$.
• A bound on the noise level $\eta$.
• An instance $x$.
• Required accuracy $\epsilon$.
Output:
• An approximation $y$ of $c(x)$.
Algorithm:
1. By simulating the dual learning problem $k = \frac{2}{(1-2\eta)^2}\ln\frac{3}{\delta}$ times, generate an ensemble $S = \{x_1, \ldots, x_k\}$ by applying the approx function (see below) to the point $x$ with accuracy $\frac{\epsilon\delta}{3k}$ and confidence $\frac{\delta}{3k}$.
2. Use MQ to get the value $y_i$ of each $x_i$.
3. Let $y$ be the median of the $y_i$'s.

Function approx
Inputs:
• A point $x$.
• Required accuracy $\hat\epsilon$.
• Required confidence $1-\hat\delta$.
Output:
• A point $x'$.
Algorithm:
1. Let $m = \mathrm{poly}\left(\frac{1}{\hat\epsilon}, \log\frac{1}{\hat\delta}, d^*\right)$.
2. Using the prior $\nu$ over the concept class $C$, generate a sample $c_1, \ldots, c_m$.
3. Assign to every $c_i$ the value $c_i(x)$.
4. Apply the learning algorithm of the dual learning problem to the labeled sample generated in steps 2 and 3 to generate $x'$. When learning in the dual learning problem we require that, with confidence $1-\hat\delta$, the returned instance $x'$ is $\hat\epsilon$-close to $x$ in the $L_1(\nu)$ norm.
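
The earlier classification sketch carries over to this setting, with the median replacing the majority vote and the tighter accuracy parameter of algorithm 2; the helper names (`dual_learner`, `prior_sample`, `noisy_mq`, `m`) are again hypothetical:

```python
import math
import statistics

def clean_value(x, delta, eps, eta, dual_learner, prior_sample, noisy_mq, m):
    """Noise cleaning for regression (a sketch of algorithm 2)."""
    k = math.ceil(2.0 / (1.0 - 2.0 * eta) ** 2 * math.log(3.0 / delta))
    ensemble = []
    for _ in range(k):
        labeled = [(c, c(x)) for c in (prior_sample() for _ in range(m))]
        # accuracy eps*delta/(3k) and confidence delta/(3k), as in step 1
        ensemble.append(dual_learner(labeled, eps * delta / (3.0 * k),
                                     delta / (3.0 * k)))
    values = [noisy_mq(xi) for xi in ensemble]
    return statistics.median(values)   # robust as long as a majority is clean
```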

Theorem 3.3 (Theorem 1.5 in [86]) Let $X$ be an infinite-dimensional Banach space with type $p$. The fat-shattering dimension $VC_\varepsilon(B(X), B(X^*))$ is finite if and only if the type of $X$ is greater than 1. Furthermore, if $p' < p$ then there are constants $K$ and $\kappa$ such that

$$\kappa\left(\frac{1}{\varepsilon}\right)^{\frac{p}{p-1}} \le VC_\varepsilon\left(B(X), B(X^*)\right) \le K\left(\frac{1}{\varepsilon}\right)^{\frac{p'}{p'-1}}$$

The following is a simplified version of theorem 3.3:

Corollary 3.1 Let $C$ be a bounded subset of a Banach space of functions over an infinite set $\mathcal{X}$. Assume the Banach space has a non-trivial type, i.e. a type greater than 1. Assume further that the evaluation functionals $\delta_x \in \mathcal{X}^{**}$ are uniformly bounded. Then $VC_\varepsilon(\mathcal{X}, C) < \infty$ for every $\varepsilon > 0$.

Proof: $\mathcal{X}^{**}$ is bounded, hence w.l.o.g. we may assume it is a subset of the unit ball of a Banach space $X$. $C$ is a bounded subset of the dual space, hence w.l.o.g. we may assume $C \subseteq B(X^*)$. Hence, by our assumptions, for every $\varepsilon > 0$

$$VC_\varepsilon(\mathcal{X}, C) \le VC_\varepsilon\left(B(X), B(X^*)\right) < \infty$$

Corollary 3.1 provides a bound on the sample complexity of learning a problem based on the type of the Banach space. If the type of the Banach space is non-trivial, then the sample needed for learning the problem is polynomial. Moreover, if the Banach space $X$ has non-trivial type, then $X^*$ has non-trivial type as well (see [91]); thus the dual learning problem also has polynomial sample complexity. Note that in both cases, i.e. the complexity of the primal and of the dual learning problem, the fact that the spaces are bounded is essential. The computational complexity of these learning problems is domain specific; however, Mendelson [86] showed that learning subsets of Hilbert spaces with reproducing kernels can be done efficiently. Finally, we turn to an examination of the density of the dual learning problem. Let $x \in \mathcal{X}$ be an instance and $c_1, \ldots, c_m$ be any finite set of concepts. In most cases of interest, there are infinitely many $x' \in \mathcal{X}$ such that $c_i(x) = c_i(x')$ for all $1 \le i \le m$. Hence, the problem is naturally dense.

3.7 VC Dimension of Dual Learning Problems

For the sake of completeness we present here the following lemma:


Lemma 3.1 Let $C$ be a concept class defined over the sample space $\mathcal{X}$. For every $x \in \mathcal{X}$ we define the function $\delta_x$ such that $\delta_x(c) = c(x)$. Let $\mathcal{X}^* = \{\delta_x : x \in \mathcal{X}\}$ be a concept class defined over $C$. Finally, let $d$ be the VC dimension of $C$ and $d^*$ be the VC dimension of $\mathcal{X}^*$. Then

$$\lfloor \log_2 d \rfloor \le d^* \le 2^{d+1}$$

Proof: Let $x_0, \ldots, x_{m-1}$ be a sample shattered by $C$, and let $c_0, \ldots, c_{2^m-1}$ be concepts in $C$ which shatter this sample: for every choice of $\bar y \in \{0,1\}^m$ there is $0 \le i < 2^m$ such that $c_i$ assigns the labels $\bar y$ to $x_0, \ldots, x_{m-1}$. (For the sake of this lemma it simplifies the notation to assume that the concepts assign the values $\{0,1\}$ rather than $\{\pm 1\}$ to the instances.) Consider the $m \times \lfloor \log_2 m \rfloor$ table $T$ such that the $j$'th row in $T$ is simply the binary representation of $j$. Let $\bar y$ be a column of $T$. There exists $0 \le i < 2^m$ such that $c_i$ assigns the labels $\bar y$ to $x_0, \ldots, x_{m-1}$. W.l.o.g. assume that $c_0, \ldots, c_{\lfloor \log_2 m\rfloor}$ generate the labelings described in $T$. Therefore $x_0, \ldots, x_{m-1}$ shatter $c_0, \ldots, c_{\lfloor\log_2 m\rfloor}$, and hence $d^* \ge \lfloor \log_2 m \rfloor$ for any $m \le d$; thus $d^* \ge \lfloor \log_2 d \rfloor$. By switching the roles of the instances and the concepts and applying the same argument, we obtain $d \ge \lfloor \log_2 d^* \rfloor$ and thus $2^{d+1} \ge d^*$.
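
The table construction in the lower-bound argument is easy to reproduce; an illustrative sketch for $m = 8$ (a power of two, so all labelings are realized exactly):

```python
import math

m = 8
w = math.floor(math.log2(m))                     # floor(log2 m) concepts
# Row j of T is the binary representation of j; column b is the labeling
# that concept c_b induces on the instances x_0, ..., x_{m-1}.
T = [[(j >> b) & 1 for b in range(w)] for j in range(m)]

# The rows are pairwise distinct, so all 2^w labelings of the w concepts
# are realized by some instance: x_0, ..., x_{m-1} shatter c_0, ..., c_{w-1}.
assert len({tuple(row) for row in T}) == 2 ** w
```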

3.8 Banach Spaces

This section provides a brief introduction to Banach spaces.

Definition 3.3 A space $X$ endowed with a norm $\|\cdot\|$ is a Banach space if it is complete with respect to the distance measure $d(x_1, x_2) = \|x_1 - x_2\|$.

A Banach space has a dual space, which consists of all bounded linear functionals over $X$. The dual space is denoted by $X^*$ and is itself a Banach space under the norm $\|x^*\| = \sup_{\|x\|=1} |x^*(x)|$. Any Banach space is naturally embedded in its dual-dual space via the duality map $x \mapsto \delta_x$ given by $\delta_x(x^*) = x^*(x)$.

Definition 3.4 Let $X$ be a Banach space consisting of functions over some space $\Omega$. We say that $X$ has a reproducing kernel if for every $\omega \in \Omega$ the evaluation functional $\delta_\omega$ is norm continuous, i.e. for every $\omega \in \Omega$ there exists some $\kappa_\omega$ such that $|\delta_\omega(f)| = |f(\omega)| \le \kappa_\omega \|f\|$.

Another important property of a Banach space is its type.

Definition 3.5 A Banach space $X$ has type $p$ if there is some constant $\kappa$ such that for every $x_1, \ldots, x_n \in X$,

$$\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\left\|\sum_i \sigma_i x_i\right\|\right] \le \kappa \left(\sum_i \|x_i\|^p\right)^{1/p} \qquad (3.6)$$

where the $\sigma_i$'s are i.i.d. random variables taking the values $+1$ or $-1$ with probability $1/2$ each. It follows that the type of a Banach space is always in the range $[1,2]$. If the space is a Hilbert space, then its type is exactly 2. The basic facts concerning the notion of type may be found, for example, in [74] or in [91].
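
As a quick check of the last claim, a standard derivation shows that a Hilbert space satisfies (3.6) with $p = 2$ and $\kappa = 1$:

```latex
% In a Hilbert space, independence and E[\sigma_i \sigma_j] = \delta_{ij} give
\mathbb{E}\Big\|\sum_i \sigma_i x_i\Big\|^2
   = \sum_{i,j}\mathbb{E}[\sigma_i\sigma_j]\,\langle x_i, x_j\rangle
   = \sum_i \|x_i\|^2 ,
% and by Jensen's inequality
\mathbb{E}\Big\|\sum_i \sigma_i x_i\Big\|
   \le \Big(\mathbb{E}\Big\|\sum_i \sigma_i x_i\Big\|^2\Big)^{1/2}
   = \Big(\sum_i \|x_i\|^2\Big)^{1/2},
% i.e. (3.6) holds with p = 2 and \kappa = 1.
```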

3.9 Summary

In this chapter we have presented a noise-immunizing scheme for learning. Our scheme utilizes the structure of the learning problem, mainly by exploiting properties of the dual learning problem. Having access to a membership query oracle, we were able to devise a conversion scheme which confers noise robustness on many learning problems. In this presentation we focused on random classification noise; however, our method appears to extend to many other noise models (e.g. malicious noise). In section 3.6 we generalized our scheme to handle real-valued functions. In this setting the dual learning problem is related to the dual Banach space; hence, the study of the dual learning problem is very natural. We used the type of the Banach space as a measure of the complexity of the learning problem and showed that if the type is non-trivial, then both the primal and the dual learning problems are feasible. Our construction provides a set of sufficient conditions for noise-tolerant learnability. However, we believe that the essence of these conditions reflects fundamental principles which may turn out to be a step towards a complete characterization of noise-tolerant learnability.

Part III

Selective Sampling


Chapter 4

Preliminaries

In the selective sampling framework [28, 29], the learner is presented with a large set of unlabeled instances from which it can choose the instances for which the teacher will be asked to provide labels. The selective sampling framework differentiates between two features of the process: the complexity of obtaining a random unlabeled instance and the complexity of labeling it. In the PAC framework [116] these two features are merged; however, in many applications collecting unlabeled instances is an easy, almost cost-free task, while labeling these instances is costly and lengthy. Consider for example the task of text classification. In many cases, collecting random instances can be done automatically, without human involvement, for instance by retrieving documents from the Internet. However, labeling these texts may be lengthy (one needs to read the document) and may require experts. This situation is not unique to text classification, and applies to a variety of tasks including medical diagnostics, speech recognition, natural language processing and others.

Selective sampling (sometimes called query filtering) is an active learning framework in which the learner sees unlabeled instances and selects those instances for which the teacher will be asked to provide labels. This framework has several advantages over membership queries. First and foremost, the selective sampling framework is applicable in many cases where membership queries are not (see section 2.2 on page 16). Furthermore, selective samplers are tuned to the underlying distribution of the instances. This is significant, as the learner can focus on the more probable instances.

There are two types of selective sampling settings: batch and online. In the batch setting, a large set of unlabeled instances is provided and the learner selects the instances to be labeled by repeatedly searching for informative instances in this set. By contrast, in the online setting,


unlabeled instances are presented in a sequential manner. Whenever an unlabeled instance is presented, the learner needs to decide whether to query for its label or not. In this online setting, the learner cannot rewind the stream of unlabeled instances, and hence cannot defer the query for the label of an instance to later in the process (unless this instance is presented again). To further understand the difference between the batch and online settings, consider for example the greedy algorithm presented by Dasgupta [34] (see algorithm 3 on page 47). At each round, this algorithm searches the entire batch of unlabeled instances for the most informative instance and queries for its label. Therefore, for each query it needs to scan the whole batch of unlabeled instances. While this is reasonable when the size of the batch is moderate, in other cases it may be unfeasible. Note that in some cases there is a constant stream of unlabeled instances; hence the batch is, in a sense, infinite in its size. Whereas the membership queries framework does not have significant implications for real-world applications, the selective sampling framework has been applied in many domains with great success. Several key examples are presented below.

4.1 Empirical Studies of Selective Sampling

Many algorithms operate in the selective sampling framework [28]. These algorithms have been applied in many domains including text classification, part-of-speech tagging, etc. The core of these algorithms is typically a scoring function. This function assigns a score to unlabeled instances based on the labels seen so far. The score is designed to measure the benefit of labeling the instance, i.e. the additional information or the reduction in uncertainty a label will provide. The score is used in two ways. In the batch setting, all the unlabeled instances are scored and the next query point is chosen as the one with the highest score (a greedy strategy). In the online setting, the instances are scored one at a time and the next query point is selected by thresholding the score or by a random criterion. Score functions take different forms, but most of them fall into three categories: committee-based scores, scores based on the confidence level of a single classifier, and look-ahead principles.

4.1.1 Committee-Based Scores

Committee-based scores use several learners that learn in parallel. In most cases the committee consists of different learning algorithms, so that when seeing the same training sequence, each learner generates a different hypothesis. Another possibility is to use the same learning algorithm for all learners, where each learner sees only a subset of the training data collected so far. The goal is to have broad diversity in the committee, much like in Bagging [21]. When a new instance is introduced, each learner in the committee predicts its label. If all committee members agree on the predicted label, then most likely labeling this instance will not provide any additional information. However, if there is considerable disagreement among committee members, labeling this instance is guaranteed to provide new information, at least for those learners who made a wrong prediction. The committee principle has led to many active learning algorithms. The leading algorithm using this principle is the Query By Committee algorithm [104], discussed in chapter 5 on page 51. Several other algorithms are summarized below.
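
The disagreement test at the heart of these methods fits in a few lines. The sketch below uses the vote-entropy variant (suitable for multi-class problems, as in the part-of-speech example below); `committee` is a hypothetical list of predictors, not any specific system from the literature:

```python
import math
from collections import Counter

def vote_entropy(committee, x):
    """Entropy of the committee's votes on x: zero when all members agree,
    maximal when the votes split evenly -- a common disagreement score."""
    votes = Counter(h(x) for h in committee)
    n = len(committee)
    return -sum((c / n) * math.log2(c / n) for c in votes.values())

def should_query(committee, x):
    """Query for a label only when the committee disagrees on x."""
    return vote_entropy(committee, x) > 0.0
```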

4.1.1.1 Part-of-Speech Tagging

Dagan and Engelson [33] used active learning in the domain of Natural Language Processing (NLP). Specifically, they were interested in the task of part-of-speech tagging. In this task, a sentence in a natural language is presented to the machine, and the machine needs to assign a grammatical role to each word in the sentence. Typically, algorithms for tackling this problem use either expert knowledge, which is coded into the algorithm, or a large annotated training sequence which is used to train a learning algorithm. Both methods require a vast amount of work from human experts, and thus it is hard to adapt these algorithms to many languages. Dagan and Engelson suggested the use of active learning for this task. They argue that obtaining the raw data (texts in this case) is almost cost-free, whereas annotating it is costly and lengthy; thus selective sampling is a natural match for this task. Part-of-speech tagging is a complicated procedure. One reason for this complication is the inherent ambiguity of natural languages. A sentence such as "We saw the park with the binoculars" has more than a single valid interpretation. Moreover, words can have multiple meanings: "this is my head", "he is the head of the group" and "we should head south" all use the word "head" but with different meanings. Therefore, it is common for part-of-speech taggers to use probabilistic models that assign probabilistic scores to the possible grammatical analyses of a sentence, rather than trying to find the single "correct" grammatical structure.

Figure 4.1: Active vs. Passive Learning for Part-of-Speech Tagging [33]. Accuracy appears on the x-axis and the number of tagged words used while training on the y-axis.

The complexity of the task led Dagan and Engelson [33] to suggest a heuristic based on the committee principle. The base learners (which form the committee) are a special kind of Hidden Markov Model (HMM). A committee is constructed on the basis of the training sequence seen so far and a random choice of the free parameters. Since this is a multi-class problem (there are many possible tags for a word), an entropy-based criterion is used to measure the disagreement between committee members. Some of the results obtained are presented in Figure 4.1, which compares the accuracy of an active and a passive learning algorithm. The difference between the two is apparent. For example, reaching an accuracy level of 90% required only ∼4000 words for the active algorithm, whereas the passive algorithm needed ∼12500 words. Obtaining an accuracy level of 91% required ∼7000 words for the active algorithm and ∼25000 for the passive algorithm.

4.1.1.2 Spoken Language Understanding

Tur et al. [115] used active learning as part of a spoken language understanding system. The system is part of an automatic operator: people can call the operator and ask for a variety of services, for example "What is my balance?". The operator needs to react to these requests. This is done by applying an automatic speech recognizer that identifies the spoken words. The transcribed words are fed into an "understanding" unit that assigns the task the operator will perform. As in many of the tasks presented here, there is a constant feed of data, as people keep calling the operator; however, labeling these requests is a labor-intensive task. Tur et al. [115] used a committee-based approach to accelerate learning in the understanding component of this system. The committee consisted of two learners, an SVM [19] and AdaBoost [45], trained over the same training sequence. Both SVM and AdaBoost are able to provide a confidence measure together with their predictions, namely the margin. Tur et al. [115] used this confidence and selected as the next instances to be labeled those instances on which SVM and AdaBoost disagreed while giving low confidence to their predictions. The results are presented in figure 4.2. The experiment compares the committee-based approach to a confidence-based approach and to random sampling (i.e. passive learning). For most training sizes, the gain from using committee-based active learning is 1-2% over passive learning. They noted that the committee-based approach seems preferable to the confidence-based approach.

4.1.1.3 Ensemble of Active Learners

Ensemble methods, such as Bagging [21] and Boosting [45], are successful tools for passive learning. Baram, El-Yaniv and Luz [10] presented an ensemble method for active learning. Their novel approach combines different active learning algorithms using a master algorithm. The assumption behind this master algorithm is that any active learning algorithm will fail on some data sets. The goal of the master algorithm is to find the best performing active learning algorithm on the specific data set at hand and use it for training. In order to do so, the authors had to come up with a method to evaluate the performance of active learners. In the passive learning model this can be achieved by using leave-one-out estimates or a hold-out set, but in the active learning model these approaches cannot be used, as the labeled training set is heavily biased towards difficult instances; the error estimates are therefore typically much worse than the actual performance. Baram et al. [10] use an entropy criterion as a scoring function. Given a training sequence, each learner is asked to label a set of unseen instances. These labels are viewed as a grouping of the instances, where each group consists of the instances to which the learner assigns the same label. The score of the learner is the entropy of this partition (this criterion appears to assume that the classes are equally sized; however, empirically it works well even when this is not the case [10]).

Figure 4.2: Active vs. Passive Learning for Spoken Language Understanding [115]. The x-axis shows the number of labeled instances while the y-axis shows accuracy. In this figure three filtering methods are compared: a random selection of instances to be labeled, committee-based active learning and confidence-based active learning.

Once the algorithm uses a certain active learner to query for the next query point, the entropy-based scoring function is used to evaluate the benefit obtained from the label. Therefore, at any point we can evaluate the active learners based on the previous decisions they made. However, we still need to decide which learner will make the next query. The fundamental problem here is the exploration vs. exploitation dilemma: on the one hand, we would like to give a fair chance to all learners, but on the other hand, giving a poor learner too many opportunities to make queries might undermine the performance of the whole ensemble. To rectify this, Baram et al. [10] used the analogy to the multi-armed bandit problem and applied the algorithms suggested by Auer et al. [8] to solve it. The approach suggested by Baram et al. [10] proved successful in many of the experiments the authors conducted. This again demonstrates the efficiency of committee-based approaches. Note, however, that this time active learners are used in the ensemble, whereas in all the other algorithms presented here (and all other algorithms we are familiar with), the committee consists of passive learners.

4.1.1.4 Other Committee-Based Approaches

Many committee-based approaches have been devised. McCallum and Nigam [84] used a probabilistic model to sample a committee. An interesting property of the approach presented in [84] is the combination of semi-supervised models with active learning: in their algorithm, an expectation maximization (EM) algorithm is used to label the instances which are not yet labeled, and thus the learner can train over a larger training sequence. Liere [72] used a straightforward committee-based approach where the core classifiers are linear threshold functions (Winnow [76] and Perceptron [90]). Krogh and Vedelsby [63] used committee-based methods with neural networks. Muslea et al. [89] introduced a committee-based active learning algorithm which uses multiple views of the data [15]. Another interesting approach was presented by Mamitsuka and Abe [81], who generated a committee by training the same learning algorithm over random subsets of the training data.

4.1.2 Confidence-Based Scores

Confidence-based scores use a single base classifier which is able not only to predict the labels of unseen instances, but also to assign confidence levels to its predictions. Many of the classifiers used today have this capability. In SVM [19] and AdaBoost [45, 101] the margin can be used as a measure of confidence. Other classifiers, such as Bayesian networks [56], have an internal probabilistic structure which can be used to measure confidence. These and other confidence scores have been used to devise active learning algorithms. An overview of some of them is presented below.

4.1.2.1 Margin-Based Confidence

Tong and Koller [113], Campbell et al. [24] and Schohn and Cohn [102] introduced a simple active learning scheme based on large-margin principles. They suggested training an SVM [19] over the training sequence seen so far and choosing the next query point to be a point with the smallest possible margin. Such a point is close to the decision boundary induced by the SVM; thus its label is likely to shift the decision boundary considerably, making it an informative instance. See figure 4.3 for a comparison of this simple scheme with various other active learning schemes and with passive learning. This example shows that the simple approach significantly outperforms the passive SVM learner on a text classification task, while performing comparably to other, more sophisticated active learners. Schohn and Cohn [102] reported that in some cases an active learner that uses only a subset of the training sequence outperforms the passive learner that uses the fully labeled set. The same surprising result was reported by Tur et al. [115]. This is usually explained by the tendency of active learners to avoid querying the labels of outliers.
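
The "smallest margin" rule is a one-liner given any classifier that exposes a signed decision value. The sketch below uses scikit-learn's SVC purely as an illustration; the library choice is ours, not the original authors':

```python
import numpy as np
from sklearn.svm import SVC

def next_query_index(X_labeled, y_labeled, X_pool):
    """Train an SVM on the labels collected so far and pick the pool
    instance with the smallest absolute margin (closest to the boundary)."""
    clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
    return int(np.argmin(np.abs(clf.decision_function(X_pool))))
```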

4.1.2.2 Probability-Based Confidence

Lewis and Gale [71] used a probability-based confidence active learner. They used a logistic-regression-based classifier and queried for the instances for which the probability of the leading class was smallest. They report [71] that in some cases active learning reduced the number of labels needed 500-fold compared to passive learning.

4.1.3 Look-ahead Principles

The ultimate criterion for selecting query points is the reduction in generalization error. However, we do not have access to this quantity, and thus estimates of it need to be used. Several methods have been suggested to utilize such principles.

Cohn, Ghahramani and Jordan [30] designed an active learning algorithm for parametric models such as neural networks and mixtures of Gaussians. Their algorithm minimizes the variance of the estimates of the parameters. For any instance $x$, the expected reduction in variance is calculated and used as a score for choosing the next query point. Cohn et al. [30] showed that in certain models, such as mixtures of Gaussians, this quantity can be calculated efficiently.

Figure 4.3: Margin-Based Active Learners [113]. This figure presents several active learning methods based on SVM large-margin principles, applied to a text classification task. The three active learning algorithms (Ratio, MaxMin and Simple) perform similarly, outperforming the passive learning algorithm (Random). Note that the active learner used 100 labels to obtain the same accuracy as the passive learner, which was trained over a set of 1000 instances (full).

Roy and McCallum [97] designed an algorithm which estimates the future error based on a sampling technique. The future error is calculated over the set of unlabeled instances available to the learner. The learner uses a probabilistic model through which it can estimate the log-loss or the 0-1 loss (assuming that the current probabilistic model is accurate). The next query point is selected to be the one that will reduce this loss the most.

Tong and Koller [113] introduced a look-ahead algorithm called MaxMin. Similarly to Query By Committee (see chapter 5 on page 51), MaxMin tries to estimate the reduction in the size of the version space. Given an instance $x$, the algorithm calculates $r_x^+$ and $r_x^-$, the radii of the largest balls contained in the version space when $x$ is added to the training set with the label $+1$ or $-1$ respectively. The radius of the largest ball in the version space, assuming $x$ is labeled, gives a lower bound on the volume of the version space, and thus an estimate of the reduction in this volume. The next query point is selected to be the point which maximizes $\min(r_x^+, r_x^-)$: a point for which $\min(r_x^+, r_x^-)$ is large is expected to bisect the version space most evenly, and thus to reduce its volume. Another algorithm of [113], called Ratio, queries the label of the instance $x$ for which $\min\left(\frac{r_x^+}{r_x^-}, \frac{r_x^-}{r_x^+}\right)$ is maximized. See figure 4.3 for the results of applying both MaxMin and Ratio to a text classification task: both algorithms significantly outperform passive learning in this task. Note that in the experiment reported in [113] the Simple active learning algorithm (subsection 4.1.2.1 on page 43) is equally good; however, in other experiments reported by Tong and Koller [113], Ratio and MaxMin were better than Simple. Another approach using a look-ahead principle was introduced by Zhang and Chen [119], who used it for information retrieval of visual objects.

4.2 Theoretical Studies of Selective Sampling

The theoretical study of selective sampling is still in its infancy; only a few authors have studied its theoretical aspects. Freund et al. [46] were the first to show that selective sampling can reduce the number of labels needed exponentially. Their results are discussed and extended in chapter 6 on page 59. Recently, Dasgupta [34, 35] proved some positive and negative results about selective sampling. The negative results show that there are cases where selective sampling cannot significantly reduce the number of labels needed. Consider for example the class of indicating functions. In this case, the sample space $\mathcal{X}$ is a finite set and the concept class is $C = \{c_x : x \in \mathcal{X}\}$ where

$$c_x(x') = \begin{cases} 1 & \text{if } x = x' \\ -1 & \text{otherwise} \end{cases}$$

It is easy to see that in this case $\Omega(|\mathcal{X}|)$ labels are needed on average in order to achieve an accuracy of $O(1/|\mathcal{X}|)$. This is similar to the number of labels a passive learner would use. Dasgupta [34] showed that the situation just demonstrated is not unique to the class of indicating functions. He proved the following lemma:

Lemma 4.1 (Claim 1 in [34]) Let $C$ be the class of linear separators in $\mathbb{R}^2$. For any set of $m$ distinct instances on the unit sphere there are hypotheses in the concept class which cannot be identified without querying all $m$ labels.

Lemma 4.1 shows that if we take $m$ points on the unit sphere and assume the uniform distribution over these points, we may need $m$ labels in order to achieve an accuracy of $O(1/m)$. This is not a great saving over passive learning. Moreover, this is not unique to the case discussed in lemma 4.1:

Lemma 4.2 (Claim 2 in [34]) For any $d \ge 2$ and $m \ge 2d$ there is a sample space $\mathcal{X}$ of size $m$ and a concept class $C$ of VC-dimension $d$ over the domain $\mathcal{X}$ with the following property: if a concept $c$ is chosen uniformly from $C$, then the average number of labels needed in order to identify $c$ is greater than $m/8$.

Lemmas 4.1 and 4.2 show that a small VC dimension does not guarantee that the concept class can be learned with few labels. Dasgupta [34] showed that even when we restrict ourselves to the class of non-homogeneous linear classifiers, no selective sampling algorithm can guarantee a significant reduction in the number of labels needed.

Alongside the negative results, Dasgupta provided an encouraging positive result. He studied the Greedy Strategy for selective sampling (see algorithm 3). This algorithm receives a batch of unlabeled instances and finds the next query point in a greedy fashion. Whenever it needs to decide on the next query point, it goes over all the instances that have not yet been labeled; for each such instance it calculates the measure of the hypotheses which label it $+1$ and the measure of the hypotheses which label it $-1$, and chooses to query for the label of the instance for which these two measures are most nearly equal.

Algorithm 3 Greedy Strategy for Selective Sampling [34]
Inputs:
• A sample $S$.
• A concept class $C$ defined over $S$.
• A distribution $\pi$ over $C$.
Output:
• A hypothesis $h$.
Algorithm:
1. Let $V_1 = C$.
2. For $t = 1, \ldots, |S|$
   (a) For every $x \in S$ let $V_t^+(x) = \{c \in V_t : c(x) = 1\}$ and $V_t^-(x) = \{c \in V_t : c(x) = -1\}$.
   (b) If $\max_x \min\left(\pi\left(V_t^+(x)\right), \pi\left(V_t^-(x)\right)\right) = 0$ then break the loop.
   (c) Query for the label $y$ of the $x$ for which $\min\left(\pi\left(V_t^+(x)\right), \pi\left(V_t^-(x)\right)\right)$ is maximized.
   (d) Let $V_{t+1} = \{c \in V_t : c(x) = y\}$.
3. Endfor
4. Return any $c \in V_t$.
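
A direct transcription of the greedy strategy for a finite class; `pi` (a dictionary of prior weights over concepts) and `oracle` (answering label queries) are assumed helpers:

```python
def greedy_selective_sampling(S, C, pi, oracle):
    """Greedy strategy (algorithm 3): repeatedly query the instance that
    splits the current version space V most evenly under the prior pi."""
    V = set(C)
    for _ in range(len(S)):
        def split(x):
            plus = sum(pi[c] for c in V if c(x) == 1)
            minus = sum(pi[c] for c in V if c(x) == -1)
            return min(plus, minus)
        x = max(S, key=split)
        if split(x) == 0:            # no instance splits V: stop
            break
        y = oracle(x)                # query for the label of x
        V = {c for c in V if c(x) == y}
    return next(iter(V))             # any surviving concept is consistent
```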

The target concept remains in the version space; any other concept which disagrees with the target concept on an instance in the sample will be removed from the version space during the process. Thus it is guaranteed that the concept returned by the greedy strategy is consistent with the target concept on the sample. Dasgupta proved the following property of the greedy strategy:

Theorem 4.1 (Theorem 3 in [34]) Let $\pi$ be any distribution over $C$ and let $Q$ be any query strategy. Let

$$\mu_{\mathrm{greedy}} = \mathbb{E}_{c\sim\pi}\left[\text{number of queries needed by the greedy strategy to identify } c\right]$$

and

$$\mu_Q = \mathbb{E}_{c\sim\pi}\left[\text{number of queries needed by } Q \text{ to identify } c\right]$$

Then

$$\mu_{\mathrm{greedy}} \le 4\,\mu_Q \ln\frac{1}{\min_{c\in C}\pi(c)} \qquad (4.1)$$

Theorem 4.1 proves that the average number of label queries made by the greedy strategy is comparable to that of the best possible strategy. To see this, assume that $C$ has a VC dimension $d < \infty$ and let $m = |S|$. From Sauer's lemma [100], the number of different hypotheses in $C$ when restricted to $S$ is at most $(em)^d$. Assuming that $\pi$ is uniform over these hypotheses, (4.1) gives

$$\mu_{\mathrm{greedy}} \le 4\mu_Q \ln\frac{1}{1/(em)^d} = 4\mu_Q\, d\ln(em)$$

hence the average number of queries needed by the greedy strategy exceeds that of the best possible strategy by a factor of at most $O(d\ln m)$.

Another significant theoretical result was proven by Dasgupta, Kalai and Monteleoni [36]. They suggested a modification of the well-known Perceptron algorithm [90] (see algorithm 4). This very simple modification learns homogeneous linear classifiers. It has several advantages over most of the algorithms discussed so far: first, it is a very simple algorithm, and second, it works in a streaming (online) fashion, as opposed to the batch fashion used in the greedy algorithm. Both properties make it attractive even for extremely large data sets. However, since the dimensionality $d$ of the data is used explicitly in this algorithm, it cannot be used efficiently with kernels, especially kernels which map the data into infinite-dimensional spaces (see Chapter 10 for more on kernels). Dasgupta et al. [36] proved the following property of the Perceptron-based active learning algorithm:

Theorem 4.2 (Theorem 3 in [36]) Let $\epsilon, \delta > 0$. Let $L = O\left(d \log\frac{1}{\epsilon\delta}\left(\log\frac{d}{\delta} + \log\log\frac{1}{\epsilon}\right)\right)$ and $R = O\left(\frac{d}{\delta} + \log\log\frac{1}{\epsilon}\right)$. Assume that the underlying distribution over the sample space is the uniform distribution over the unit sphere in $\mathbb{R}^d$ and that the target concept is a homogeneous linear classifier. With probability $1-\delta$, the Perceptron-based active learning algorithm will use $L$ labels, will make $O(L)$ errors while learning, and will return a hypothesis which is $\epsilon$-close to the target concept.

Theorem 4.2 shows that under the uniform-distribution assumption, the Perceptron-based active learning algorithm uses $O\left(\log\frac{1}{\epsilon}\right)$ labels in order to return a hypothesis which is $\epsilon$-close to the target concept. This is an exponential improvement over any passive learner, which would need $O(1/\epsilon)$ labels in the same setting.

Algorithm 4 Perceptron-Based Active Learning [36]
Inputs:
• Dimension $d$.
• Maximum number of labels $L$.
• A patience parameter $R$.
Output:
• A homogeneous linear classifier $v$.
Algorithm:
1. Let $v_1 = y_1 x_1$ for the first example $(x_1, y_1)$.
2. Let $s_1 = 1/\sqrt{d}$.
3. For $t = 1 \ldots L$
   (a) Wait for the next instance $x$ such that $|x \cdot v_t| \le s_t$ and query for its label. Call this example $(x_t, y_t)$.
   (b) If $y_t (x_t \cdot v_t) < 0$ then
       i. $v_{t+1} = v_t - 2(x_t \cdot v_t)\, x_t$.
       ii. $s_{t+1} = s_t$.
   (c) else
       i. $v_{t+1} = v_t$.
       ii. If no prediction mistakes were made on the last $R$ instances for which a label was queried, then $s_{t+1} = s_t / 2$; else $s_{t+1} = s_t$.
4. Endfor
5. Return $v_t$.
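
A compact simulation of algorithm 4 under assumed helpers: `stream` is an iterator yielding instances on the unit sphere and `oracle` returns the true label in {-1, +1}. Resetting the streak after each halving is our reading of the patience rule, not an explicit statement in [36]:

```python
import numpy as np

def active_perceptron(stream, oracle, d, L, R):
    """A sketch of the Perceptron-based active learner (algorithm 4)."""
    x = next(stream)
    v = oracle(x) * x                        # step 1: v_1 = y_1 x_1
    s = 1.0 / np.sqrt(d)                     # step 2: initial threshold
    streak = 0                               # correct answers since last mistake
    for _ in range(L):
        # step 3(a): wait for an instance inside the query margin
        x = next(xt for xt in stream if abs(np.dot(xt, v)) <= s)
        y = oracle(x)
        if y * np.dot(x, v) < 0:             # step 3(b): mistake, update v
            v = v - 2.0 * np.dot(x, v) * x
            streak = 0
        else:                                # step 3(c): correct
            streak += 1
            if streak >= R:                  # patience reached: halve s
                s /= 2.0
                streak = 0
    return v
```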

50

Chapter 4. Preliminaries

4.3

Label Efficient Learning

The online selective sampling framework is called label efficient learning [52]. In this setting, the learner sees a stream of unlabeled instances. When a new instance is introduced, the learner can either predict its label or query for it. The learner tries to minimize the number of prediction mistakes and at the same time minimize the number of queries. Note that unlike the passive online learning framework [77], the true label is not revealed to the learner unless a query for label was made. This model has been studied by several authors. Cesa-Bianchi et al. [27], following Helmbold and Panizza [52] studied label efficient learning in use of experts advice framework. In this model it is assumed that there are many experts where one of them makes very few or even no prediction mistakes. The task of the learner is to find this expert. They showed that if the learner has a limited budget of queries for labels he is allowed to make at any given time frame then it is still possible to predict almost as well as the best expert, as long as the budget for queries grows to infinity at not too slow a rate. Cesa-Bianchi, Conconi and Gentile [26] studied learning linear classifiers in the label efficient setting. The algorithm presented by the authors uses the margin of the linear classifier with respect to the prediction it makes as a criterion to select the right instances to query for their label. In addition, they were able to analyze a slightly modified version of this algorithm in which, if the label of an instance is predicted with too small a margin, then a query for label is made for the label of the next instance in the sequence. They were able to show that when certain conditions apply, the number of prediction mistakes can be logarithmic with respect to the number of instances in the sequence.

4.4

Summary

As we saw, there are many selective sampling algorithms but only few have theoretical grounding. Unfortunately, the algorithms that have theoretical grounding lack practical implementation. In the next chapter we introduce the Query By Committee (QBC) algorithm. Much of the presentation from this point and on is about providing both theoretical grounding and practical implementation for the QBC algorithm.

Chapter 5

The Query By Committee Algorithm The Query By Committee (QBC) algorithm was presented by Seung et al. [104] and analyzed in [46, 110]. The algorithm assumes the existence of some underlying probability measure over the hypotheses class. At each stage, the algorithm operates on the version-space: the set of hypotheses that were correct so far. Upon receiving a new instance the algorithm has to decide whether to query for its label are not. This is done by randomly selecting two hypotheses from the version-space. A query for label is made only if these two hypotheses predict different labels for the instance under consideration. The algorithm is presented as Algorithm 5 on page 52. The QBC algorithm is, as its name suggests, a committee-based algorithm (see section 4.1.1 on page 38). The committee is formed of all the possible classifiers in the sense that any classifier which may be the target concept is considered. Alternatively, QBC can be viewed as though it used a look-ahead principle (see section 4.1.3 on page 43). To see this, Let V be the current version space and let x be an instance. Denote by V + the set of hypotheses in the version space which predict a label +1 for x. Similarly, let V − be the set of hypotheses in the version space which predict the label of−1 for x. It follows, that QBC will query for the label of x with the probability of 2ν (V + ) ν (V − ). QBC tends to query for instances which split the version space evenly. QBC works in an online fashion; it sees an instance once and makes its decision whether to query for its label or not. Although this limits the algorithm, the probabilistic way in which QBC makes its decision utilizes the online setting to remain tuned to the underlying distribution of the inputs. Thus, the 51

52

Chapter 5. The Query By Committee Algorithm

Algorithm 5 Query By Committee [104] Inputs: • Required accuracy ǫ. • Required confidence 1 − δ. • A prior ν over the concept class. Output: • A hypothesis h. Algorithm: 1. Let V1 = C. 2. Let k ← 0. 3. Let l ← 0. 4. For t = 1, . . . (a) Receive an unlabeled instance xt . (b) Let l ← l + 1.

(c) Select c1 and c2 randomly and independently from the restriction of ν to Vt .

(d) If c1 (x) 6= c2 (x) then i. ii. iii. iv.

(e) else

Query for the label c (x). Let k ← k + 1. Let l ← 0. Let Vt+1 ← {c ∈ Vt : c (xt ) = yt }.

i. Let Vt+1 ← Vt .

(f) If‡ l ≥ tk

i. Choose a hypothesis h according to the termination rule. ii. Return h.



Step 4f is the termination procedure of QBC. The exact choices of tk and the returned hypothesis are discussed in section 5.1. For a short summary, see table 5.1.

5.1. Termination Procedures

53

probability that QBC will query for the label of an instance depends on two factors: the “evenness” of the split induced on the version space, and the probability of observing this instance.

5.1

Termination Procedures

According to the definition of the QBC algorithm by Seung et al [104], once the algorithm reaches a steady version space, i.e. once the algorithm has not queried for a label for a long consecutive set of instances, QBC terminates and returns a random hypothesis from the version space. Below we study this rule as well as some alternative procedures (step 4f in algorithm 5). We also prove the correctness of the algorithm in the sense that the hypothesis returned by QBC is indeed a good one. To simplify the presentation we discuss the “original” termination procedure later.

5.1.1

The “Optimal” Procedure

Assume that the QBC algorithm queries for the labels of k instances. At this point the learner has a posterior over the hypotheses class which is the restriction of the prior to the version space V . Given this information, the optimal classifier that the learner can return is the Bayes classifier which is defined:

   +1 if Prc∼ν|V [c (x) = +1] ≥ 1/2 cBayes (x) =   −1 if Prc∼ν|V [c (x) = −1] > 1/2

where c ∼ ν|V means that c is chosen according to the restriction of ν to V . The first optimal procedure we suggest works as follows: if the QBC algorithm did not query for a label for the last tk consecutive instance after making the k’th query, then QBC terminates and returns the Bayes classifier cBayes as its hypothesis. The following proves the correctness of this procedure, together with the right choice for tk : Theorem 5.1 Assume that QBC is used with tk =

2 ǫ

ln π

2

(k+1)2 6δ

instead of the value defined in

algorithm 5. Let the Bayes classifier cBayes be defined using the version space V used by QBC when terminating. Then with a probability of 1 − δ over the sample and the internal randomness of QBC, h h ii Ec∼ν|V Pr cBayes (x) 6= c (x) ≤ ǫ x

Proof: Assume that QBC made k queries for labels to generate the version space V . Assume that QBC did not query for any additional label for tk consecutive instances after making the k’th

54

Chapter 5. The Query By Committee Algorithm

query. Let cBayes be the Bayes classifier, then    +1 if Prc∼ν|V [c (x) = +1] ≥ 1/2 cBayes (x) =   −1 if Prc∼ν|V [c (x) = −1] > 1/2

Arrange x and c such that cBayes (x) 6= c (x). From the definition of the Bayes classifier it

follows that if we pick a random hypothesis c′ from the distribution ν|V then with a probability h i of at least 1/2 we will have c′ (x) 6= c (x). Therefore, if we denote by cBayes (x) 6= c (x) the indicating function then

Ec′ ∼ν|V [c′ (x) 6= c (x)] ≥ for any c and x.

i 1h cBayes (x) 6= c (x) 2

h i Assume that Ec∼ν|V,x cBayes (x) 6= c (x) > ǫ. Thus,

Ec,c′ ∼ν|V,x [c′ (x) 6= c (x)] >

ǫ 2

this means that the probability that QBC will not query for the label of the next instance is at h i  most 1 − 2ǫ . Hence, if Ec∼ν|V,x cBayes (x) 6= c (x) > ǫ the probability that QBC will not query for a label for the next tk consecutive instance is at most  ǫ ǫ tk 1− ≤ e− 2 tk 2

by choosing tk =

2 ǫ

ln π

2

(k+1)2 6δ

we get that the probability that QBC will not query for tk consec-

utive labels when the Bayes classifier is not “good enough” is

6δ . π 2 (k+1)2

By summing over k the

proof is completed. Corollary 5.1 Assume that QBC is used with tk =

2 ǫ

ln π

2

(k+1)2 6δ

instead of the value defined in

algorithm 5. Let the Bayes classifier cBayes be defined using the version space V used by QBC when terminating. Then with a probability of 1 − δ over the sample and the internal randomness of QBC,

h h ii Ec∼ν Pr cBayes (x) 6= c (x) ≤ ǫ x

Proof: From theorem 5.1 we have that h h ii Ec∼ν|V Pr cBayes (x) 6= c (x) ≤ ǫ x

therefore

h h ii = Ec∼ν Pr cBayes (x) 6= c (x) x

= ≤

h h ii EV,c∼ν|V Pr cBayes (x) 6= c (x) x h h ii Ev Ec∼ν|V Pr cBayes (x) 6= c (x) x ǫ

5.1. Termination Procedures

5.1.2

55

Random Gibbs Hypothesis

Another possible solution to the generalization phase is to use a random Gibbs hypothesis. In this procedure, whenever the QBC decides to terminate the learning process, a random hypothesis is drawn out of the version space and is used for making further predictions. This is the “original” termination procedure suggested in [46]. We suggest two possible analyses for this procedure: an average analysis and analysis of the “typical” case. Theorem 5.2 Assume that QBC is used with tk =

4 ǫ

ln π

2

(k+1)2 6δ

instead of the value defined in

algorithm 5. Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when terminating. Then with a probability of 1 − δ over the sample and the internal randomness of QBC,

h  i EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ ǫ x

Note that since the Gibbs hypothesis is a random hypothesis, the error in theorem 5.2 is averaged over this randomness. Proof: From corollary 5.1 we have that using the choice of tk that ii h h Ec∼ν Pr cBayes (x) 6= c (x) ≤ ǫ/2 x

Since Haussler, Kearns and Schapire [49] proved that the average error of the Gibbs classifier is at most twice as large as the error of the Bayes classifier, the statement of the theorem follows. Theorem 5.2 shows that the average error of the Gibbs hypothesis is not large. In the next theorem we show that this is also the typical case. Theorem 5.3 Assume that QBC is used with tk =

8 ǫδ

ln π

2

(k+1)2 3ǫδ

instead of the value defined in

algorithm 5. Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when terminating. Then with a probability of 1 − δ over the choice of the sample, the target hypothesis and the internal randomness of QBC,   Pr cGibbs (x) 6= c (x) ≤ ǫ x

Proof: This follows immediately from theorem 5.2 and the Markov inequality. From the choice of tk we have that with a probability of 1 − δ/2 h  i EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ ǫδ/2 x

(5.1)

56

Chapter 5. The Query By Committee Algorithm

Therefore, from the Markov inequality, if (5.1) holds, we have with a probability of 1 − δ/2 that   Pr cGibbs (x) 6= c (x) ≤ ǫ x

Note that using a direct argument (instead of using the previous theorems as building blocks) we can get tk =

2 ǫδ

ln π

2

(k+1)2 3ǫδ

which is better by a factor of 4. Since this is of minor significance

we do not dwell on this argument.

5.1.3

Bayes Point Machine

The Gibbs sampler does not use a single classifier to make predictions. Rather, it randomly selects a hypothesis. Still another alternative is to use the Bayes Point Machine (BPM) [54, 47] to generate future predictions. The BPM uses a hypothesis in the version space that is the closest to the Bayes optimal classifier. When the concept class is the class of linear classifiers, then this is just the center of gravity of the version space. Gilad-Bachrach, Navot and Tishby [47] proved the following theorem:

Theorem 5.4 (Theorem 1 in [47]) Let the concept class be the class of linear classifiers and let the prior ν be log-concave, V is a version space and cBPM and cBayes are the Bayes Point Machine and Bayesian classifiers respectively, then Pr

c∼ν|V

h i   c (x) 6= cBPM (x) ≤ (e − 1) Pr c (x) 6= cBayes (x) c∼ν|V

From this, we derive the following theorem:

Theorem 5.5 Assume that QBC is used with tk =

2(e−1) ǫ

ln π

2

(k+1)2 6δ

instead of the value defined

in algorithm 5. Let the concept class be the class of linear classifiers and let the prior ν be logconcave. Let the Bayes Point Machine classifier cBPM be defined using the version space V used by QBC when terminating. Then with a probability of 1 − δ over the sample and the internal randomness of QBC, h  i Ec∼ν Pr cBPM (x) 6= c (x) ≤ ǫ x

Proof: The proof follows immediately from corollary 5.1 and theorem 5.4.

5.2. Summary

5.1.4

57

Avoiding the Termination Rule

In many applications, the training process need not terminate. We can assume that there is a constant stream of instances, and for each instance QBC makes one of two possible actions: it can either decide to query for the label of the instance or decide not to query for its label. Note that if QBC did not query for the label, this is because the random hypotheses QBC drew, predicted the same label for the instance. Therefore, we have a natural way to predict the label of this instance using the two random hypotheses. Using this “non-stop” rule makes QBC closer in spirit to label efficient algorithms (see section 4.3 on page 50). In the following theorem we show that the number of prediction mistakes QBC makes is proportional to the number of queries it makes. Therefore, if we can guarantee the number of queries to be small, we immediately obtain a bound on the number of prediction mistakes. Indeed, in chapter 6 we show that when certain conditions apply, the number of queries is small. Theorem 5.6 For any instance x and at any stage of the learning, the probability that QBC will make a prediction mistake on x is exactly half the probability it will query for the label of x. Proof: Let V be the current version space. Let x be an instance for which QBC needs to decide whether to predict its label or query for it. The probability that QBC will query for the label is 2 Pr [c (x) = 1] Pr [c (x) = −1] c∼ν|V

c∼ν|V

where the probability that it will make a prediction mistake is 2

2

Pr [c (x) = 1] Pr [c (x) = −1] + Pr [c (x) = −1] Pr [c (x) = 1]

c∼ν|V

c∼ν|V

c∼ν|V

c∼ν|V

which is Pr [c (x) = 1] Pr [c (x) = −1]

c∼ν|V

c∼ν|V

Therefore, the probability that QBC will query for a label for a given instance is exactly twice the probability it will make a prediction mistake.

5.2

Summary

In this chapter we have presented the Query By Committee algorithm. We have shown several possible termination rules for this algorithm. In table 5.1 the different options are listed

58

Chapter 5. The Query By Committee Algorithm

Table 5.1: Possible Termination Procedures for QBC. In the following table, the possible termination procedures for QBC are listed, together with the guarantee they provide. Procedure

tk π 2 (k+1)2 6δ

Bayes classifier

2 ǫ

ln

Gibbs classifier

4 ǫ

ln π

Gibbs classifier

8 ǫδ

ln π

BPM (linear classifiers) no termination (with probability 1)

2(e−1) ǫ

2

(k+1)2 6δ

2

(k+1)2 3ǫδ

ln π -

2

(k+1)2 6δ

Guarantee − δ) h (with prob. 1ii h Ec∼ν Prx cBayes (x) 6= c (x) ≤ ǫ

   EGibbs,c∼ν Prx cGibbs (x) 6= c (x) ≤ ǫ   Prx cGibbs (x) 6= c (x) ≤ ǫ

   Ec∼ν Prx cBPM (x) 6= c (x) ≤ ǫ

Ec∼ν [number of queries] = 2Ec∼ν [number of prediction mistakes]

Chapter 6

Theoretical Analysis of Query By Committee In chapter 5 we introduced the QBC algorithm and studied its basic properties. We now turn to the fundamental properties of this algorithm. We follow the guidelines of Freund et al [45] while introducing some corrections and extensions. The QBC algorithm assumes a Bayesian setting. It assumes the existence of a prior distribution ν over the concept class. It assumes that the teacher selects the target concept from the concept class using this prior. Furthermore, it is assumed that ν is known to the learner; however the learner does not have access to the random choices made by the teacher. In chapter 7 on page 85 and chapter 8 we lift some of these assumptions. We begin this chapter by introducing and discussing the information gain in section 6.1. In section 6.2 we present theorems about the fundamental properties of QBC. These theorems show that when there is a lower bound on the expected information gain, the error rate of the hypotheses learned by QBC decreases exponentially with respect to the number of queries for labels it makes. The proofs of these theorems are provided in section 6.3. In section 6.4 we study the class of parallel planes and argue that there is a lower bound on the expected information gain for this class once the prior over the concept class is concave. We prove this argument in section 6.5. We wrap up with a short discussion in section 6.6. The theorems presented in this chapter are significant for an understanding of the Query By Committee algorithm. However, the proofs of these theorems are long and involve non-trivial techniques. Therefore, some readers may wish to skip these proofs (i.e. sections 6.3 and 6.5). 59

60

Chapter 6. Theoretical Analysis of Query By Committee

Doing so will not prevent the reader from following the rest of this essay.

6.1

The Information Gain

The key tool in analyzing the QBC algorithm is what is known as instantaneous information gain. It was introduced in [49] as a tool for analyzing the progress of learning algorithms. Let V be a version space and x be an instance. x induces a split of the version space such that V + consists of all the concepts which label x with the label +1 and V − consists of all concepts which label x with the label −1. Assume that there exists some prior ν over C. If the observed label for x is +1 we know that the concept we are learning is in V + and thus we say that the information we have gained from the instance x and its label is log (ν (V ) /ν (V + )). Equally if the label of x is −1 we say that the information gained is log (ν (V ) /ν (V − )). Definition 6.1 Let V be a version space and ν be some probability measure defined over V . Let x be an instance and y be the label of this instance. The instantaneous information gain from the pair (x, y)is log

ν ({c ∈ V }) ν ({c ∈ V : c (x) = y})

In the setting of selective sampling, we have an instance x but we do not have its label. At this point we can look at the expected information gain. The probability that the label of x will be +1 is exactly ν (V + ) /ν (V ), in which case the instantaneous information case will be log ν (V ) /ν (V + ). Equally, the probability that the label of x will be −1 is exactly ν (V − ) /ν (V ) in which case the instantaneous information gain will be log ν (V ) /ν (V − ) and thus the expected information gain from the label of x is ν (V + ) ν (V ) ν (V − ) ν (V ) log + log + ν (V ) ν (V ) ν (V ) ν (V − ) which is exactly H (ν (V + ) /ν (V )) where H (·) is Shanon’s binary-entropy [105]. Definition 6.2 Let V be a version space and ν be some probability measure over C. Let x be an instance then the expected information gain from the label of x is X ν ({c ∈ V : c (x) = y}) ν (V ) log ν (V ) (ν ({c ∈ V : c (x) = y})) y The expected information gain from an instance x is the entropy of the split it induces on the version space. We see that the most informative instances are the ones for which the split they induce are even. On the other hand, an instance for which the label is known a-priori in the sense that all the hypotheses in the version space agree on its label, has zero expected information gain.

6.2. The Fundamental Theory of QBC

61

If the instances for which QBC queries for labels, all have high expected information gain, then the volume, in the probabilistic sense of the version space is guaranteed to shrink fast. Therefore, we are guaranteed that QBC will focus on the target concept and its close neighbors. In the next section we prove this intuition but before doing so we need to define the expected information gain from the next query QBC will make. We have defined the expected information gain from an instance (definition 6.2). We need to define the expected information gain from the next QBC query. This value takes into account both the information of an instance, and also the probability of observing this instance and the probability that QBC will query for its label. Definition 6.3 Let V be a version space and ν be a probability measure over C. Let D be a distribution over the sample space X . Let ν + (x) = ν ({c ∈ V : c (x) = 1}) /ν (V ) and ν − (x) = ν ({c : c (x) = −1}) /ν (V ). The expected information gain from the next QBC query for label is R + 2ν (x) ν − (x) H (ν + (x)) dD (x) R G (ν, D) = 2ν + (x) ν − (x) dD (x)

To explain definition 6.3 we note that for an instance x, the value 2ν + (x) ν − (x) is the prob-

ability that QBC will query for the label of x. Note that the expected information gain from the next QBC query is a function of both the prior ν over the concept class and the distribution D over the sample space. Finally we use the definitions we have introduced here to define “good” concept classes and distribution, i.e. the concept classes and distributions for which we will be able to prove the properties of QBC. Definition 6.4 The concept class C endowed with the prior ν and distribution D over the sample space has a uniform lower bound g over the expected information gain, if for any set of instance x1 , . . . , xm and any concept c ∈ C G (ν|V, D) ≥ g where V is the version space induced by x1 , . . . , xm and the labels c (x1 ) , . . . , c (x2 ).

6.2

The Fundamental Theory of QBC

In this section we prove the basic properties of the Query By Committee algorithm. We show that if we can lower bound the expected information gain from the next query QBC will make, then we can guarantee that the QBC algorithm will make very few queries while learning. The following theorem shows this for various termination rules of the QBC algorithm.

62

Chapter 6. Theoretical Analysis of Query By Committee

Theorem 6.1 Let C be a concept class with a VC-dimension d. Let ν be a prior over C and let D be a distribution over the sample  space. Letg > 0 be a uniform lower bound over the expected g log 1+

information gain and let g˜ =

g 16 log 16 −g g

4



k ≥ max

. Let δ > 0 and let  2 d+1 2 8 ln , log g2 δ g˜ δ

Then with a probability of 1 − 2δ, QBC will use at most k queries for label and m0 =

d g˜k/(d+1) 2 e

unlabeled instances when learning and will return a hypothesis with the following properties (depending on the termination rule used): 1. If QBC is used with the Bayes optimal classification rule, it will return a hypothesis (the Bayes optimal hypothesis) such that h h ii Ec∼ν Pr cBayes (x) 6= c (x) ≤ ǫ x

for any 2

ǫ>

2ek −˜gk/(d+1) π 2 (k + 1) 2 ln d 6δ

2. If QBC is used with the Gibbs average termination rule, it will return an hypothesis such that

for any

h  i EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ ǫ x ǫ>

4ek −˜gk/(d+1) π 2 (k + 1)2 2 ln d 6δ

3. If QBC is used with the Gibbs “typical” error termination rule, it will return a hypothesis such that   Pr cGibbs (x) 6= c (x) ≤ ǫ x

for any ǫ>

  2  (k+1)2 8ek ln dπ 24ek + δd

g ˜k d+1

ln 2



2−˜gk/(d+1)

4. If QBC is used with Bayes point machine termination rule, it will return a hypothesis such that

h  i Ec∼ν Pr cBPM (x) 6= c (x) ≤ ǫ x

for any

2

ǫ>

2 (e − 1) ek −˜gk/(d+1) π 2 (k + 1) 2 ln d 6δ

6.2. The Fundamental Theory of QBC

63

Theorem 6.1 shows that the error rate of the hypothesis returned by QBC decreases exponentially with respect to the number of queries allowed. This is true for all termination rules considered here. Note that in passive learning models, the error rate decreases as 1/polynomial with respect to the size of the sample (see e.g. [6] theorem 4.2). Hence, when there is a lower bound on the expected information gain, learning is exponentially faster when using QBC. It is interesting to note that if we look at the error rate achieved as a function of the size of the sample used m0 , the error rate decreases as 1/polynomial and thus, QBC does not “waste” much in the instances where it did not query for their label. The proof of the theorem is provided in section 6.3. The following theorem deals with the situation when QBC is being used without any termination rule. It shows that when there is a uniform lower bound on the expected information gain, both the number of queries and the number of prediction mistakes is logarithmic with respect to the length of the sequence processed.

Theorem 6.2 Let C be a concept class with a VC-dimension d together with a prior ν and a distribution D. Let g > 0 be a uniform lower bound over the expected information gain and let g˜ =

 g log 1 +

g 16 log 16 g −g

4



Denote by k (x1 , . . . , xm ) the number of queries QBC makes when processing x1 , . . . , xm (note that this also depends on the target concept and the internal randomness of QBC), and let B (m) = max



 em  8 d+1 em , 2 ln log g˜ d g d



+

2d e

Then for m > 0 Ex1 ,...,xm ,c,QBC [k (x1 , . . . , xm )] ≤ B (m) while the expected number of prediction mistakes is bounded by 12 B (m) Moreover, for δ > 0 Pr QBC,c,x1 ,x2 ,...

"

#  B m2 ⌈log log m⌉ (⌈log log m⌉ + 1) ∃m k (x1 , . . . , xm ) ≥ ≤δ δ

and "

∃m mistakes (x1 , . . . , xm ) ≥ Pr QBC,c,x1 ,x2 ,...

1 2B

#  m2 ⌈log log m⌉ (⌈log log m⌉ + 1) ≤δ δ

where mistakes (x1 , . . . , xm ) is the number of prediction mistakes QBC makes on x1 , . . . , xm .

64

Chapter 6. Theoretical Analysis of Query By Committee   It is important to note that B m2 ≤ 2B (m) and thus B m2 = O (log m). This shows

that both the number of queries and number of mistakes QBC makes grows as a function of the logarithm1 of the number of instances processed. The following theorem is a key in proving the properties of the QBC algorithm.

Theorem 6.3 Let C be a concept class with a VC-dimension d together with a prior ν and a distribution D. Let g > 0 be a uniform lower bound over the expected information gain and let   g log 1 + 16 logg16 −g g g˜ = 4 Denote by k = k (x1 , . . . , xm ) the number of queries QBC makes when processing x1 , . . . , xm (note that this also depends on the target concept and the internal randomness of QBC). Then for m > 0 Pr x1 ,x2 ,...,xm

6.3

∼Dm ,c∼ν,

QBC



k ≥ max



 em  8 d+1 em , 2 ln log g˜ d g d



<

2d em

Proofs

As was previously mentioned, the analysis of the QBC algorithm uses the information gain as its core. Below, we note a few properties of the information gain. Let x1 , x2 , . . . be a sample, let c be a classifier and let ν be a prior. The instantaneous information gain from x1 and its label c (x1 ) is log

ν (c′ ∈ V ) ν (c′ ∈ V : c′ (x1 ) = c (x1 ))

If we have already seen the labels of x1 , . . . , xi−1 then the instantaneous information gain from xi and its label c (xi ) is log

ν (c′ : ∀j < i c′ (xj ) = c (xj )) ν (c′ : ∀j ≤ i c′ (xj ) = c (xj ))

and thus the information gained from x1 , . . . , xm and their labels, c (x1 ) , . . . , c (xm ) is simply the sum of the instantaneous information gain which is exactly log

1 ν (c′ : ∀j ≤ m c′ (xj ) = c (xj ))

i.e. the volume of the version space left after observing this labeled sample. The first lemma shows that for any sample of size m, the information from having the labels of all points in the sample is of order log (m). Lemma 6.1 (lemma 3 in [46]) 1 We

disregard a term of order (log log m)2 here.

6.3. Proofs

65

Let C be of VC-dimension d, and let S = {x1 , . . . , xm } be a sample such that m ≥ d. Let I (S, c) = log then

1 ν ({c′ : ∀x ∈ S c′ (x) = c (x)})

h  em i d Pr I (S, c) > (d + 1) log < c∼ν d em

 em d d

Proof: From Saur’s lemma [100] it follows that S breaks C into at most r ≤

equivalent

classes, where we say that c ∼ c′ if c (S) = c′ (S). Let E1 , . . . , Er be the different equivalent 1 for i’s such that c ∈ Ei . Using this notation we can write classes, then I (S, c) is simply log ν(E i)

for any α > 0 Pr [I (S, c) > α] =

c∼ν

X

i : log

1 >α ν (Ei )

ν (Ei ) ≤

X

i : log

1 >α ν (E i )

2−α ≤

 em d d

2−α

(6.1)

 and plug in (6.1) to get the stated result. Let α = (d + 1) log em d

Lemma 6.1 shows that if we get a fully labeled sample, then the information we have collected

from this sample is typically only logarithmic with respect to the size of the sample. Clearly, the information from a partly labeled sample can not exceed this bound (this follows from the information processing inequality [31] for example). Next we show that the information from the sub-sample that QBC collects grows linearly with respect to the number of queries QBC makes. Lemma 6.2 Assume that the expected information gain for the next query of QBC is lower bounded by g > 0. Let k > 0, and let V (k) be the version space induced by the first k queries of QBC, then Pr c,QBC,x1 ,x2 ,...

"

g g 1 < k log 1 + log ν (V (k)) 4 16 log 16 g −g

!#

≤ e−kg

2

/8

This lemma amends lemma 1 in [46]. It shows that the information gained by QBC grows linearly with respect to the number of queries it makes. Proof: Let gi be the expected information gain from the i’th instance for which QBC queried for labels. From the definition of the expected information gain we have that 0 ≤ gi ≤ 1. Since there is a lower bound on the expected information gain, we have that Ec,QBC,x1 ,x2 ,... [gi ] ≥ g. The gi ’s form a martingale, thus using the martingale method (see e.g. [69] lemma 4.1) we have " k # X 2 kg gi < Pr ≤ e−kg /8 2 c,QBC,x1 ,x2 ,... i=1 Assume that

Pn

i=1

gi ≥

kg 2 .

At least

kg 4

of the gi ’s have gi ≥

g 4.

Recall that gi is the expected

information gain and thus gi = H (pi ) where pi is the probability of observing the label +1 for the

66

Chapter 6. Theoretical Analysis of Query By Committee

corresponding instance, given the previously queries instances and their labels. From lemma 6.3 on page 70 we have that for every i such that gi ≥ g4 that " !# # " g g g g pi ∈ ,1 − = , 1/ 1 + 16 log 16 16 log 16 16 log 16 16 log 16 g g g g −g This means that for each i such that gi ≥ g4 , the instantaneous information gained from the   instance and its label is at least log 1 + 16 logg16 −g , since there are at least kg/4 instances for g

which gi ≥ g4 . Therefore the sum of the information gained from all query instances is at least ! g g k log 1 + 4 16 log 16 g −g which is linear with respect to k. We are now ready to prove theorem 6.3 Proof: of theorem 6.3 From lemma 6.1 we know that with a probability of 1 −

d em

the information gained from  querying the labels of all m instances is at most (d + 1) log em d . Let k be the number of queries 2

QBC made. From lemma 6.2 we know that with a probability of 1 − e−kg /8 , the information   gained from the queries QBC made is at least k g4 log 1 + 16 logg16 −g . If k ≥ g82 ln em d we have g

that 1 − e−kg

2

/8

≥1−

d em ,

since the information gained from labeling all the instances is greater

than the information of any subset of the instances, and thus with a probability of 1 − !  g g em  k log 1 + ≤ (d + 1) log 4 d 16 log 16 g −g and thus

 em  4  k ≤ (d + 1) log d g log 1 +

g 16 log 16 g −g

while

k≤

2d em



em 8 ln g2 d

We are now ready to prove theorem 6.1 Proof: of theorem 6.1 The correctness of the algorithm, i.e. the fact that the hypothesis returned is indeed close to the target concept with high probability, was already proved in chapter 5. We need only to analyze the number of labeled and unlabeled instances used. k ≥ k≥

d+1 g ˜

8 g2

ln δ2 implies that e−kg

2

/8

≤ δ/2. From the choice of m0 and the assumption that

log δ2 we have that d/em ≤ δ/2 and thus from theorem 6.3 we have that with a probability

6.3. Proofs

67

of 1 − δ, the number of queries QBC will make on a sample of size m0 is at most  em  d+1 0 log g˜ d by the choice of m0 = de 2g˜k/(d+1) we have that with a probability of 1 − δ, the number of queries QBC will make on m0 instances is at most k. Assume that QBC did not query for labels for more than k labels out of the m0 instances. Therefore, for any t < m0 /k there must have been a sequence of t consecutive instances for which QBC did not query for labels. From this point we will look at each termination condition separately. 1. Let ǫ be such that 2

ǫ>

2k π 2 (k + 1) ln m0 6δ

(6.2)

then from corollary 5.1 we have that if QBC did not make any query for labels for a sequence of tk =

2 ǫ

ln π

2

(k+1)2 6δ

consecutive instances then h h ii Ec∼ν Pr cBayes (x) 6= c (x) ≤ ǫ x

with a probability of 1 − δ. However, by the choice of ǫ in (6.2) we have that tk ≤ m0 /k and thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence (note that for any j < k we have that tj < tk and thus QBC may terminate before the k’th query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC will make will be smaller than k and the algorithm will terminate before the m0 unlabeled instance, returning a Bayes classifier with an error rate smaller than ǫ as defined in (6.2). By choice of m0 we have that this is true for any ǫ such that 2

ǫ>

2ek −˜gk/(d+1) π 2 (k + 1) 2 ln d 6δ

2. Let ǫ be such that ǫ>

4k π 2 (k + 1)2 ln m0 6δ

(6.3)

then from theorem 5.2 we have that if QBC did not make any query for labels for a sequence of tk =

4 ǫ

ln π

2

(k+1)2 6δ

consecutive instances than h  i EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ ǫ x

with a probability of 1 − δ. However, by the choice of ǫ in (6.3) we have that tk ≤ m0 /k and thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence

68

Chapter 6. Theoretical Analysis of Query By Committee (note that for any j < k we have that tj < tk and thus QBC may terminate before the k’th query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC will make will be smaller than k and the algorithm will terminate before the m0 unlabeled instance, returning a Gibbs classifier with an error rate smaller than ǫ as defined in (6.2). Through the choice of m0 we have that this is true for any ǫ such that 2

ǫ>

4ek −˜gk/(d+1) π 2 (k + 1) 2 ln d 6δ

3. Let ǫ be such that

2

(k+1) 8k ln m0 π 24k ǫ> δm0

2

(6.4)

then from theorem 5.3 we have that if QBC did not make any query for labels for a sequence of tk =

8 ǫδ

ln π

2

(k+1)2 3ǫδ

consecutive instances then   Pr cGibbs (x) 6= c (x) ≤ ǫ x

with a probability of 1 − δ. However, by the choice of ǫ in (6.3) we have that tk ≤ m0 /k and thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence (note that for any j < k we have that tj < tk and thus QBC may terminate before the k’th query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC will make will be smaller than k and the algorithm will terminate before the m0 unlabeled instance, returning a Gibbs classifier with an error rate smaller than ǫ as defined in (6.2). By choice of m0 we have that this is true for any ǫ such that   2   (k+1)2 g ˜k 8ek ln dπ 24ek + d+1 ln 2 ǫ> 2−˜gk/(d+1) δd 4. Let ǫ be such that ǫ>

2 (e − 1) k π 2 (k + 1)2 ln m0 6δ

(6.5)

then from theorem 5.5 we have that if QBC did not make any query for labels for a sequence of tk =

2(e−1) ǫ

ln π

2

(k+1)2 6δ

consecutive instances than h  i Ec∼ν Pr cBPM (x) 6= c (x) ≤ ǫ x

with a probability of 1 − δ. However, by the choice of ǫ in (6.5) we have that tk ≤ m0 /k and thus we are guaranteed with a probability of 1 − δ that indeed there will be such a sequence (note that for any j < k we have that tj < tk and thus QBC may terminate before the k’th query for labels). Therefore, with a probability of 1 − 2δ, the number of queries that QBC

6.3. Proofs

69

will make will be smaller than k and the algorithm will terminate before the m0 unlabeled instance, returning a Bayes classifier with an error rate smaller than ǫ as defined in (6.2). By choice of m0 we have that this is true for any ǫ such that 2

2e (e − 1) k −˜gk/(d+1) π 2 (k + 1) 2 ln d 6δ

ǫ>

Proof: of theorem 6.2 From theorem 6.3 we have that Pr

x1 ,x2 ,...,xm ∼Dm ,c∼ν,QBC



k ≥ max



 em  8 d+1 em , 2 ln log g˜ d g d



<

2d em

Since the number of queries is at most m we have that the expected number of queries is at most max



 em  8 d+1 em , 2 ln log g˜ d g d



+

2d e

Using Theorem 5.6 we conclude that the expected number of prediction mistakes is at most 1 max 2



 em  8 em d+1 , 2 ln log g˜ d g d



+

d e

We now show that this is not only the average case but also the typical case. Let δ > 0. For any t = 1, 2, . . . we have using the Markov inequality that  t   B 22 t (t + 1)  δ k x1 , . . . , x 2t > ≤ Pr 2 δ t (t + 1) c,QBC,x1 ,x2 ,... By summing over t and using the fact that

Pr

c,QBC,x1 ,x2 ,...



P∞

1 t=1 t(t+1)

∃t k x1 , . . . , x 2t 2



= 1 we have that

 t  B 22 t (t + 1) ≤δ > δ t

Let m > 0, and let t = ⌈log log m⌉. Clearly m ≤ 22 and thus for any sequence of instances k (x1 , . . . , xm ) ≤ k x1 , . . . , x22t t



using this fact and the fact that m2 ≥ 22 we have that # "   B m2 ⌈log log m⌉ (⌈log log m⌉ + 1) ≤δ Pr ∃t k x1 , . . . , x22t > δ c,QBC,x1 ,x2 ,... Analyzing the probability of having too many prediction mistakes can be done in the same way as we analyzed the probability of having too many queries for label.

70

Chapter 6. Theoretical Analysis of Query By Committee

Lemma 6.3 Let H (p) be the binary entropy of p. If H (p) ≥ α > 0 then p≥

α 4 log α4

 Proof: Let p∗ = α/ 4 log α4 . Since p∗ < 1/2 we have that for any p < p∗ that H (p) < H (p∗ )

thus it suffices to show that H (p∗ ) ≤ α. Using the fact that p∗ < 1/2 once again, we have that −p∗ log p∗ ≥ − (1 − p∗ ) log (1 − p∗ ) and thus H (p∗ ) = ≤

−2p∗ log p∗

=



= ≤ where the last inequality follows since

6.4

−p∗ log p∗ − (1 − p∗ ) log (1 − p∗ ) α α 4 log 4 log α 2 log α 4  4  log log α α 1+ 2 log α4

α

log log z log z

≤ 1 for z ≥ 2.

Lower Bound on the Expected Information Gain for Linear Classifiers

In previous sections we showed that whenever there is a uniform lower bound on the expected information gain, the QBC will learn fast. We showed how the error rate of the hypotheses generated by QBC decreases exponentially with respect to the number of queries made. In order to make these results meaningful, we now provide such uniform lower bounds for the concept class of parallel planes [46]. The following is the main result in this section. Theorem 6.4 Let C be the class of d dimensional parallel planes. Let ν be a prior distribution over C which is ρ-concave for ρ > −1. Let D be a distribution over the sample space X = IRd × IR such that for any x0 ∈ IRd there is a section [b1 (x0 ) , b2 (x0 )] (which may be empty) such that D (x, θ | x = x0 ) is uniform over [a (x0 ) , b (x0 )]. The expected information gain from the next query of the QBC algorithm is uniformly lower bounded by G (ρ)

=

22+ρ (1 + ρ) (2 + ρ) (3 + ρ) ln 2

2−2−ρ 2−3−ρ 2−2−ρ 2−3−ρ − ln 2 − ln 2 + 2 2 3+ρ 2+ρ (3 + ρ) (2 + ρ)

6.4. Lower Bound on the Expected Information Gain for Linear Classifiers ∞ n X Γ (ρ + 1) (−1) + Γ (ρ − n + 1) n! n=0

2−n−3

2−n−3 2 − 2 + n + 3 ln 2 (n + 3) (n + 3) 1

71

!!

and this bound is tight. The theorem we have just stated needs some explanation. We need to define the class of parallel planes, define ρ-concave measures and understand the function G (ρ). Note however, that this theorem is an extension of theorem 2 in [46] where Freund et. al. proved the same lower bound on the expected information gain for the class of parallel planes; however this theorem assumed a uniform distribution over the class of classifiers which is a special case of the theorem presented here, as any uniform distribution over convex bodies is log-concave, i.e. 0-concave.

6.4.1

The Class of Parallel Planes

The class of parallel planes [46] is a close relative of the class of linear separators. Each concept in this class is represented by a d dimensional vector w. An instance is a pair (x, b) and the classification rule is cw (x, b) = sign (w · x + b) Note that this is different from the class of non-homogeneous linear separators as in the later; the bias b is a part of the concept. As is the case with linear classifiers, the vector w can be scaled down. To see this, note that if w ∈ IRd and (x, θ) ∈ IRd × IR then cw (x, b) = cw/λ (λx, λb) for any λ > 0. Therefore, we will always assume that the w’s are in the d-dimensional unit ball.

6.4.2

Concave Measures

We provide a brief introduction to concave measures. See [9, 25, 93, 18] for more information about concave measures. We begin by defining concave measures. Definition 6.5 A probability measure ν over IRd is ρ-concave if for any measurable sets A and B and every 0 ≤ λ ≤ 1 the following holds: ρ

ρ 1/ρ

ν (λA + (1 − λ) B) ≥ [λν (A) + (1 − λ)ν (B) ] A few facts about ρ-concave measures:

• If ν is ρ-concave with ρ = ∞ then ν(λA + (1 − λ)B) ≥ max(ν(A), ν(B)). • If ν is ρ-concave with ρ = −∞ then ν(λA + (1 − λ)B) ≥ min(ν(A), ν(B)), in this case ν is called quasi-concave.

72

Chapter 6. Theoretical Analysis of Query By Committee • If ν is ρ-concave with ρ = 0 then ν(λA + (1 − λ)B) ≥ ν(A)λ ν(B)1−λ , in this case ν is called log-concave.

Many common probability measures are log-concave, for example uniform measures over compact convex sets, normal distributions, chi-square and others. ρ-concave measures are always unimodals. Moreover, any uni-modal measure is ρ-concave, at least for ρ = −∞. The parameter ρ provides a hierarchy for the class of uni-modal measures since if ν is ρ-concave, and ρ′ < ρ than ν is ρ′ -concave as well. Thus, the larger the ρ , the assumption of being ρ-concave is more restrictive. The following lemma shows that if a measure ν is ρ-concave then any restriction of ν to a convex body is ρ-concave as well. This makes the parameter ρ suitable for our discussion, since after the QBC has made several queries for labels we will be looking at the posterior, which is the restriction of the original prior to the version-space. The lemma shows that if the prior was ρ-concave then the posterior will be ρ-concave as well. Lemma 6.4 Let ν be a ρ-concave probability measure and let K be a convex body such that ν (K) > 0. Let νK be the restriction of ν to K such that νK (A) = ν (A ∩ K) /ν (K) then νK is ρ-concave. Proof: Let A and B be measurable sets and let 0 ≤ λ ≤ 1. Let x ∈ λ (A ∩ K)+(1 − λ) (B ∩ K). It follows that x ∈ λA + (1 − λ) B, and since K is convex we have that x ∈ K and thus we conclude that λ (A ∩ K) + (1 − λ) (B ∩ K) ⊆ (λA + (1 − λ) B) ∩ K and hence ν (K) νK (λA + (1 − λ) B) =

6.4.3

ν ((λA + (1 − λ) B) ∩ K)



ν (λ (A ∩ K) + (1 − λ) (B ∩ K))



[λν (A ∩ K)ρ + (1 − λ) ν (B ∩ K)ρ ]

=

ν (K) [λνK (A)ρ + (1 − λ) ν (B)ρ ]

1/ρ

1/ρ

The Function G (ρ)

The function G (ρ) as defined in theorem 6.4 might look frightening. Recall that G (ρ) is defined to be G (ρ)

=

22+ρ (1 + ρ) (2 + ρ) (3 + ρ) ln 2

2−2−ρ 2−3−ρ 2−2−ρ 2−3−ρ − ln 2 − ln 2 + 2 2 3+ρ 2+ρ (3 + ρ) (2 + ρ)

6.5. Proof of Theorem 6.4

73

Figure 6.1: The function G (ρ) from theorem 6.4 on page 70.

1 0.9 0.8 0.7

G(ρ)

0.6 0.5 0.4 0.3 0.2 0.1 0 -1

0

1

∞ n X Γ (ρ + 1) (−1) + Γ (ρ − n + 1) n! n=0

2 ρ

3

4

2−n−3 − + ln 2 2 2 n+3 (n + 3) (n + 3) 2−n−3

1

5

!!

Figure 6.1 shows a plot of this function. When ρ = −1 we have that G (ρ) = 0, however it climbs fast and reaches

1 9

+

7 18 ln 2

≈ 0.67 when ρ = 0. The function is monotone increasing and

approaches 1 as ρ grows to infinity.

6.5

Proof of Theorem 6.4

We now turn to the information gain of ρ-concave measures. Proof: of theorem 6.4. The first step to take is to come up with a simplified notation for the expected information gain. Recall that the expected information gain is G (ν, D) =

E(x,b)∼D [2ν {w : w · x + b > 0} ν {w : w · x + b < 0} H (ν {w : w · x + b < 0})] E(x,b)∼D [2ν {w : w · x + b > 0} ν {w : w · x + b < 0}] (6.6)

We will show that for any x0 in the support of D Eb∼Dx0 [2ν {w : w · x0 + b > 0} ν {w : w · x0 + b < 0} H (ν {w : w · x0 + b < 0})] ≥ G (ρ) (6.7) Eb∼Dx0 [2ν {w : w · x0 + b > 0} ν {w : w · x0 + b < 0}]

74

Chapter 6. Theoretical Analysis of Query By Committee

where b ∼ Dx0 means that (x, b) is sampled from the distribution D conditioned on x = x0 . Once we prove (6.7), we have that G (ν, D) ≥ G (ρ) as well. This follows since for any two positive functions f and g and any probability measure over x is Ex [f (x)] f (x) ≥ min x g (x) Ex [g (x)] Therefore, from here on, we will be trying to prove (6.7). Fix x0 and denote F (b) = ν {w : w · x0 + b < 0} we see that we can rewrite (6.7) as Eb∼Dx0 [F (b) (1 − F (b)) H (F (b))] Eb∼Dx0 [F (b) (1 − F (b))] Note that F (b) is the Cumulative Density Function (CDF) of ν when projected along the vector x0 . Since ν is ρ-concave, F is ρ-convex (see e.g. [93]). Furthermore, according to the assumptions of this theorem, the distribution Dx0 is uniform over the segment (b1 , b2 ), thus we write R b2 F (b) (1 − F (b)) H (F (b)) db G (F ) = b1 R b2 b1 F (b) (1 − F (b)) db

from now on we will study G (F ) for ρ-convex functions. W.l.o.g. assume that 0 ∈ (b1 , b2 ) and that 0 is the median of the CDF F , i.e. F (0) = 1/2. Note that G (F ) = =



R b2 b1

R0

F (b) (1 − F (b)) H (F (b)) db R b2 F (b) (1 − F (b)) db b1

Rb F (b) (1 − F (b)) H (F (b)) db + 0 2 F (b) (1 − F (b)) H (F (b)) db R0 R b2 b1 F (b) (1 − F (b)) db + 0 F (b) (1 − F (b)) db ! R0 R b2 b1 F (b) (1 − F (b)) H (F (b)) db 0 F (b) (1 − F (b)) H (F (b)) db (6.8) , min R0 R b2 F (b) (1 − F (b)) db F (b) (1 − F (b)) db b1 0 b1

Due to the symmetry around zero of G (F ) we conclude from (6.8) that it is enough to look at one “tail” of F . Hence our study will focus on F functions which have the following properties: 1. F is defined over [−∞, 0]. 2. F is monotone increasing. 3. F (−∞) = 0 and F (0) = 1/2. 4. F is ρ-convex. We call such an F function a ρ-admissible CDF function.

We begin by showing that for any ρ, there exists a ρ-admissible CDF function Fρ such that G (F ) = G (ρ). We break down our discussion into three cases, depending on the value of ρ:

6.5. Proof of Theorem 6.4

75 1 2

1. When ρ < 0 we use Fρ (b) =

1/ρ

(1 + b)

. Trivially, Fρ is ρ-admissible. In Lemma 6.5 we

show that indeed G (Fρ ) = G (ρ). 2. When ρ > 0 we use Fρ (b) =

1 2

1/ρ

(1 + b)

which is defined in the range of [−1, 0]. Again,

proving that Fρ is ρ-admissible is trivial. Lemma 6.6 shows that G (Fρ ) = G (ρ). 3. When ρ = 0 we use F0 =

1 b 2e .

Clearly, this is a 0-admissible function (recall that the

0-concave function is a log-concave function). Showing that G (F0 ) = G (0) is done in Lemma 6.7. We have shown that for any ρ there is a ρ-concave measure for which the expected information gain is G (ρ). Thus, if we prove that for any F which is ρ-admissible, G (F ) ≥ G (ρ) we will have that G (ρ) is a tight bound for G (F ). For a ρ-admissible CDF F , let f = F ρ (we use the convention here that if ρ = 0 we use f = ln F ). Since F is ρ-convex, we have that f is convex, i.e. f (λb1 + (1 − λ) b2 ) ≥ λf (b1 ) + (1 − λ) f (b2 ). Note that for Fρ as defined above, if we set fρ = (Fρ )ρ we get a linear function. We will now claim that this is the worse case. Note that if f is convex and f + Ψ is convex then for any ǫ ∈ [0, 1] we have that f + ǫΨ is convex as well (lemma 6.8).  We use the notation G (f ) = G f 1/ρ . Taking the F´erchet derivative of G (·) we have that Z ∞  Z 0 ▽f G (b) ψ (b) db + ǫ2 O G (f + ǫψ) = G (f ) + ǫ ψ 2 (b) db (6.9) −∞

−∞

We now turn to ▽f G (x). Before we do this, we will rewrite G (f ). Recall that   R ∞ 1/ρ (b) 1 − f 1/ρ (b) H f 1/ρ (b) db −∞ f  R∞ G (f ) = 1/ρ (b) 1 − f 1/ρ (b) db −∞ f

 Denote by K (b) = f 1/ρ (b) 1 − f 1/ρ (b) . Using this notation f 1/ρ (b) =

1 2





1−4K(b) . 2

We

introduce (yet) another notation

Q (b) = H and thus we have that Q (K (b)) = H



1−

G (f ) = now we have that





1−

1−4K(b) 2

R



1 − 4b 2





 = H f 1/ρ (b) . Finally we write

K (b) Q (K (b)) db R K (b) db

▽f G (b) = ▽K G (b) ▽f K (b)

(6.10)

76

Chapter 6. Theoretical Analysis of Query By Committee

where ▽K G (b) = ▽f K (b) =

  ∂ 1 Q (K (b)) + K (b) Q (K (b)) − G (f ) ∂K (b) K (b) db  1/ρ−1  f 1 − 2f 1/ρ ρ

R

(6.11) (6.12)

We are interested in studying the behavior of ▽f G (b). By considering the places in which ▽f G (b) = 0 we will be able to tell where is it positive and where is it negative. By (6.10) we can study the terms ▽K G (b) and ▽f K (b) separately. First consider ▽f K (b). From 6.12 we see that ▽f K (b) = 0 only when f 1/ρ = 1/2 i.e. F = 1/2. The behavior of these terms is determined by the value of ρ. For positive ρ’s ▽f K (b) > 0 unless f 1/ρ = 1/2. On the other hand, if ρ < 0 then ▽f K (b) < 0 whenever f 1/ρ 6= 1/2. Now we consider ▽K G (b). Looking at (6.11) we note that Q (K (b)) + K (b)

∂ Q (K (b)) ∂K (b)

is monotone increasing. Thus there is a point b0 which Freund et al [46] referred to as the pivot point, such that for any b < b0 we have that ▽K G (b) < 0 while for b > b0 we have that ▽K G (b) > 0. We saw that ▽K G (b) < 0 for b < b0 and ▽K G (b) > 0 for b > b0 . We also saw that ▽f K (b) < 0 when ρ < 0 and ▽f K (b) > 0 when ρ > 0. We will now have to treat three cases separately: the first case we will consider is ρ < 0, the second is ρ > 0 and last we consider the case ρ = 0. 1. Assume ρ < 0. In this case ▽f K (b) > 0 and thus    > 0 when b < b0 ▽f G (b) = ▽K G (b) ▽f K (b)   < 0 when b > b0

Let F be a ρ-admissible function and let f = F 1/ρ . We will show that unless f is linear,   ρ there is some ψ and ǫ > 0 such that Fˆ = (f + ǫψ) is ρ-admissible and G (F ) > G Fˆ . Assume that f is non-linear, we consider two cases: the first is when the non-linearity is

inspected in the range [−∞, b0 ]. The second case we consider is when f is linear on [−∞, b0 ]. (a) Assume that f is non-linear on [−∞, b0 ]. Let −∞ < b1 < b0 such that f is non-linear on [b1 , b0 ]. Since f is convex we have that for any b ∈ [b1 , b0 ] f (b) ≥ Let ψ (b) =

    

b0 − b b − b1 f (b0 ) + f (b1 ) b0 − b1 b0 − b1 0

b−b1 b0 −b1 f

(b0 ) +

b0 −b b0 −b1 f

when b ∈ / [b1 , b0 ] (b1 ) − f (b) when b ∈ [b1 , b0 ]

6.5. Proof of Theorem 6.4

77 1/ρ

We note the following: f + ψ is convex and monotone such that (f + ψ)

is ρ-

admissible. Furthermore, ψ (b) = 0 for b > b0 while for b < b0 we have ψ (b) ≤ 0 and this inequality is strict at least on some parts of the range [b1 , b0 ] (See sub-figure 1(a) in figure 6.2 on page 78 for an illustration). Finally, since ψ has finite support R0 ψ 2 (b) db < ∞. Using all these facts we conclude that −∞ Z

0

−∞

▽f G (x) Ψ (b) db < 0

and since G (f + ǫψ) = G (f ) + ǫ

Z

0

−∞

▽f G (b) ψ (b) db + ǫ2 O

Z

0

ψ 2 (b) db

−∞



we have that for small enough ǫ G (f + ǫψ) < G (f ) (b) Assume that f is linear for b < b0 but still it is non-linear. Therefore for b < b0 we have that f (b) = βb + α for some α and β. Consider the following ψ:    0 when b < b0       1/ρ ψ (b) = − f (x) , βb + α − f (b) when b0 ≤ x < 0 min 21      1 1/ρ  when b = 0 2

We note the following: f + ψ is monotone and convex2 . Since (f + ψ) ≤ 1/ρ

that (f + ψ)

 1 1/ρ 2

we have 1/ρ

is ρ-admissible, and for any 0 ≤ ǫ ≤ 1 the same holds, i.e. (f + ǫψ)

is ρ-admissible (See sub-figure 1(b) in figure 6.2 on page 78 for an illustration). ψ has the following properties: clearly ψ (b) = 0 when b < b0 while ψ (b) ≥ 0 when b > b0 and this inequality is somewhere strict (since f is non-linear). Since ψ has final support R0 ψ 2 (b) db < ∞ and thus using the same argument as we used in the previous −∞ scenario

G (f + ǫψ) < G (f ) for small enough ǫ. The two cases we considered here show that unless f is linear G (f ) > G (f + ǫψ) for some 1/ρ

ǫ and ψ such that (f + ǫψ)

is ρ-admissible. This shows that the minimum of G (·) is

achieved for linear f ’s. We already computed G (f ) for linear f in Lemma 6.5 (we assumed 2 Note

here that (f + ψ)ρ may be a singular CDF, it may have a positive mass on the point b = 0.

78

Chapter 6. Theoretical Analysis of Query By Committee

Figure 6.2: Illustrations for the proof of theorem 6.4 The four different cases considered in the proof. Sub figure 1(a) and 1(b) demonstrate the cases where ρ < 0 while sub-figures 2(a) and 2(b) demonstrate positive ρ’s. In each figure a non-linear f is presented together with the modified f + ψ as described in the proof of theorem 6.4.

4.5 4

b1

4.5 f f+ψ

3.5

3.5

b0

3

2.5

2

2

1.5

1.5 −80

−60

−40 1(a)

−20

0

1 −100

0.5

0.5

0.4

0.4

0.3

b0

0.3

0.2

−80

−60

−40 1(b)

−20

0

b0

0.2

0.1 0 −100

b0

3

2.5

1 −100

f f+ψ

4

0.1

f f+ψ −80

−60

that f (b) =

−40 2(a)

 1 ρ 2

−20

0

0 −100

f f+ψ −80

−60

−40 2(b)

−20

0

(1 − b) but by a simple change of argument the same result holds for any

admissible linear f ). And thus we conclude that for any F which is ρ-admissible for ρ < 0 G (F ) ≥ G (ρ)

2. We now consider the case when ρ > 0. Let F be ρ-admissible and assume that F is defined over the range [b1 , 0] for some finite b1 . Let f = F ρ and assume that f is non-linear. Again we will consider two cases: the first case we will consider is when f is non linear on [b0 , 0] and the second case is when f is linear on [b0 , 0] but still not linear.

6.5. Proof of Theorem 6.4

79

(a) Assume that f is non linear on [b0 , 0]. Let ψ be as follows    0 when b < b0 ψ (b) =   b−b0 f (0) + −b f (b0 ) when b ≥ b0 −b0 −b0

We note the following: f + ψ is monotone increasing and convex. Since (f + ψ) ≤  1/ρ 1 1/ρ we have that (f + ψ) is ρ-admissible, and for any 0 ≤ ǫ ≤ 1 the same holds, 2 1/ρ

i.e. (f + ǫψ)

is ρ-admissible (See sub-figure 2(a) in figure 6.2 on page 78 for an

illustration). ψ has the following properties: clearly ψ (b) = 0 when b < b0 while ψ (b) ≤ 0 when b > b0 and this inequality is somewhat strict (since f is non-linear). R0 Since ψ has final support −∞ ψ 2 (b) db < ∞ and thus using the same argument as we used in previous scenarios, recalling that ▽f G (b) > 0 when b > b0 and thus G (f + ǫψ) < G (f ) for small enough ǫ. (b) Assume that f is linear on [b0 , 0] but non linear on [b1 , b0 ]. Assume that f (b) = βb + α for b ∈ [b0 , 0]. We define ψ (b) = βb + α − f (b) 1/ρ

We note the following: f + ψ is monotone increasing and convex and (f + ǫψ)

is

ρ-admissible for any 0 ≤ ǫ ≤ 1 (See sub-figure 2(b) in figure 6.2 on page 78 for an illustration). ψ has the following properties: ψ (b) = 0 when b > b0 while ψ (b) ≥ 0 when b < b0 and this inequality is somewhat strict (since f is non-linear). Since ψ R0 has final support −x1 ψ 2 (b) db < ∞ and thus using the same argument as we used in previous scenarios, recalling that ▽f G (b) < 0 when b > b0 and thus G (f + ǫψ) < G (f ) for small enough ǫ. The two cases we considered here show that unless f is linear G (f ) > G (f + ǫψ) for some 1/ρ

ǫ and ψ such that (f + ǫψ)

is ρ-admissible. This shows that the minimum of G (·) is

achieved for linear f ’s. We already computed G (f ) for linear f in Lemma 6.5 (we assumed ρ that f (b) = 21 (1 + b) for b ∈ [0, 1] but by a simple change of argument the same result

holds for any admissible linear f ). We also assumed that f has finite support, but since this holds for any finite range and from the continuity of G (·) this result holds for any f . Thus

80

Chapter 6. Theoretical Analysis of Query By Committee we conclude that for any F which is ρ-admissible for ρ < 0 G (F ) ≥ G (ρ) 3. The case where ρ = 0 can be treated in the same way as we treated the cases when ρ > 0 or ρ < 0. However, this argument can be avoided since if F is 0-concave (i.e. log-concave), it is also ρ-concave for any ρ < 0. Therefore G (F ) ≥ sup G (ρ) ρ<0

on the other hand, in Lemma 6.7 we show a log-concave F for which G (F ) = G (0). Combining these facts together completes the proof.

Lemma 6.5 Let Fρ (b) =

1 2

1/ρ

(1 − b)

for b ≤ 0 and −1 < ρ < 0. Then G (Fρ ) = G (ρ) where G (ρ)

is as defined in theorem 6.4. The CDF

1 2

1/ρ

(1 − b)

is the “typical” ρ-concave function when ρ is negative. In Lemma 6.5

we calculate the information gain for these CDFs. Proof: This is a pure calculation. Let F = Fρ then G (F )

R0

−∞

=

F (b) (1 − F (b)) H (F (b)) db R0 −∞ F (b) (1 − F (b)) db

(6.13)

We treat the numerator and denominator separately. Z

0

−∞

F (b) (1 − F (b)) H (F (b)) db

    1 1 1 1/ρ 1/ρ 1/ρ 1 − (1 − b) H db (1 − b) (1 − b) 2 2 −∞ 2     Z ∞ 1 1/ρ 1 1 1/ρ = 1 − b1/ρ H db x b 2 2 2 1 Z 1/2 = − b (1 − b) H (b) H (x) 2ρ bρ−1 ρdb

=

Z

0

0

Z 1/2 = −ρ2ρ bρ (1 − b) H (b) db 0 "Z 1/2 ρ2ρ b1+ρ (1 − b) ln (b) db = ln 2 0 # Z 1/2 2 ρ + b (1 − b) ln (1 − b) db 0

Now we look at the two integral terms in the last expression: Z

1/2 0

b1+ρ (1 − b) ln bdb =

Z

0

1/2

 b1+ρ − b2+ρ ln bdb

(6.14)

6.5. Proof of Theorem 6.4

81

=



b2+ρ b3+ρ − 2+ρ 3+ρ

−3−ρ



−2−ρ

1/2 Z ln b − 0



 b1+ρ b2+ρ db − 2+ρ 3+ρ 0 1/2 b3+ρ b2+ρ 2 − 2 (2 + ρ) (3 + ρ) 1/2

=

2 2 ln 2 − ln 2 − 3+ρ 2+ρ

=

2−3−ρ 2−2−ρ 2−2−ρ 2−3−ρ − ln 2 − ln 2 + 2 2 3+ρ 2+ρ (3 + ρ) (2 + ρ)

0

Looking at the second term in (6.14)

Z

0

1/2

2

bρ (1 − b) ln (1 − b) db

=

Z

1

1/2

ρ

b2 (1 − b) ln bdb

Using Taylor expansion, we can write ρ

(1 − b) =

∞ n X Γ (ρ + 1) (−1) n b Γ (ρ − n + 1) n! n=0

where Γ (·) is the gamma function. Using the Taylor expansion we have Z

1

1/2

b2 (1 − b)ρ ln (b) db

= = = =

Z

∞ X Γ (ρ + 1) (−1)n n+2 b ln (b) db 1/2 n=0 Γ (ρ − n + 1) n! ∞ n Z 1 X Γ (ρ + 1) (−1) bn+2 ln (b) db Γ (ρ − n + 1) n! 1/2 n=0  1 ∞ n  n+3  X 1 b Γ (ρ + 1) (−1) ln b − Γ (ρ − n + 1) n! n + 3 n + 3 1/2 n=0 1

∞ n X Γ (ρ + 1) (−1) Γ (ρ − n + 1) n! n=0

2−n−3 2 − 2 + n + 3 ln 2 (n + 3) (n + 3) 2−n−3

1

Looking at the denominator of (6.13) we have

Z

F (b) (1 − F (b)) db

= = = =

  1 1 1/ρ 1/ρ (1 − b) 1 − (1 − b) db 2 −∞ 2   Z ∞ 1 1 1/ρ 1 − b1/ρ db b − 2 2 1  Z ∞ 1 2/ρ 1 1/ρ db b − b 4 2 1 ∞  b2/ρ+1 b1/ρ+1 − 4 (2/ρ + 1) 2 (1/ρ + 1) Z

0

1

1 1 − 8 = 2 ρ +2 ρ +4   1 1 = ρ − 2 + 2ρ 8 + 4ρ   3+ρ = ρ 4 (1 + ρ) (2 + ρ)

!

82

Chapter 6. Theoretical Analysis of Query By Committee Finally we can write

G (F ) =

2−3−ρ 2−2−ρ 2−2−ρ 2−3−ρ − ln 2 − ln 2 + 3+ρ 2+ρ (3 + ρ)2 (2 + ρ)2

ρ2ρ ln 2

∞ X Γ (ρ + 1) (−1)n + Γ (ρ − n + 1) n! n=0

2−n−3

2−n−3 2 − 2 + n + 3 ln 2 (n + 3) (n + 3) 1

!!





3+ρ 4 (1 + ρ) (2 + ρ)



which is, by simple algebra G (ρ). Lemma 6.6 Let Fρ (b) =

1 2

1/ρ

(1 + b)

for b ∈ [−1, 0] and ρ > 0. Then G (Fρ ) = G (ρ)

where G (ρ) is as defined in theorem 6.4. Proof: Recall that G (Fρ ) =

R0

−∞

This is a pure calculation Z

0

−1

Fρ (b) (1 − Fρ (b)) H (Fρ (b)) db

= = =

Fρ (b) (1 − Fρ (b)) H (Fρ (b)) db R0 −∞ Fρ (b) (1 − Fρ (b)) db     1 1 1 1/ρ 1/ρ 1/ρ 1 − (1 + b) H db (1 + b) (1 + b) 2 2 −1 2     Z 1 1 1/ρ 1 1 1/ρ 1 − b1/ρ H db b b 2 2 0 2 Z 1/2 − b (1 − b) H (b) 2ρ bρ−1 ρdb Z

0

0

= =

−ρ2 ρ2ρ ln 2

ρ

Z

1/2

0

"Z

bρ (1 − b) H (b) db

1/2

b

1+ρ

0

(1 − b) ln (b) db +

Z

1/2

0

ρ

b (1 − b) ln (1 − b) db

From (6.14) in Lemma 6.5 we know that this equals ρ2ρ ln 2

2−2−ρ 2−3−ρ 2−3−ρ 2−2−ρ ln 2 − ln 2 + − 2 2 3+ρ 2+ρ (3 + ρ) (2 + ρ) ∞ n X Γ (ρ + 1) (−1) + Γ (ρ − n + 1) n! n=0

2−n−3 2 − 2 + n + 3 ln 2 (n + 3) (n + 3) 2−n−3

1

Looking at the denominator in the definition of information gain we have Z

0 −1

Fρ (b) (1 − Fρ (b)) db

  1 1 1/ρ 1/ρ 1 − (1 + b) db (1 + b) 2 −1 2   Z 1 1 1/ρ 1 1/ρ = 1− b db b 2 0 2 Z Z 1 1 1/ρ 1 1 2/ρ b − b db = 2 0 4 0

=

Z

0

2

!!

#

6.5. Proof of Theorem 6.4

83 1 1   1 1 b1+1/ρ − b1+2/ρ 2 (1 + 1/ρ) 4 (1 + 2/ρ) 0 0 1 1 = − 2 + 2/ρ 4 + 8/ρ   3+ρ = ρ 4 (1 + ρ) (2 + ρ) =

By simple algebra the result stated in the lemma is obtained. Lemma 6.7 Let F0 (b) = 12 eb for b ≤ 0 and −1 < ρ < 0. Then G (F0 ) = G (0) =

7 1 + 9 18 ln 2

where G (ρ) is as defined in theorem 6.4. Proof: Recall that G (F0 ) =

R0

−∞

This is a pure calculation Z

0

−∞

F0 (b) (1 − F0 (b)) H (F0 (b)) db

=

F0 (b) (1 − F0 (b)) H (F0 (b)) db R0 −∞ F0 (b) (1 − F0 (b)) db Z

0

−∞ 1/2

=

Z

    1 b 1 b 1 b e 1− e H e db 2 2 2 (1 − b) H (b) db

0

= = = =

− − 

Z

1/2

0

Z

2

(1 − b) log (1 − b) + b (1 − b) log (b) db 1

1/2

b2 log (b) db −

7 1 − 72 ln 2 24 7 1 + 24 48 ln 2



Z

1/2

b log (b) db + 0

0

−∞

F0 (x) (1 − F0 (x)) dx

= = = = =

  1 x 1 x e 1 − e dx 2 −∞ 2 Z 0 Z 1 0 2x 1 ex dx − e dx 2 −∞ 4 −∞ 0  1 x0 1 1 2x (e |−∞ − e 2 4 2 −∞   1 1 1 (1 − 0) − −0 2 4 2 3 8

Z

0

1/2 0

b2 log (b) db

    1 1 1 1 − − − + − − 8 16 ln 2 24 72 ln 2

Looking at the denominator we have

Z

Z

84

Chapter 6. Theoretical Analysis of Query By Committee Thus we conclude G (F0 ) =

1 24

+

7 48 ln 2 3 8

=

7 1 + 9 18 ln 2

Lemma 6.8 Let f be a concave function such that f +Ψ is concave as well. Then for any ǫ ∈ [0, 1] the function f + ǫΨ is concave. ˆ is convex as well. Then for any ǫ ∈ [0, 1] the Let fˆ be a convex function such that fˆ + Ψ ˆ is convex. function fˆ + ǫΨ Proof: Let f be concave and Ψ be such that f + Ψ is concave as well. Let x1 and x2 be two points, let γ ∈ [0, 1] and let ǫ ∈ [0, 1]. (f + ǫΨ) (λx1 + (1 − λ) x2 ) =

ǫ (f + Ψ) (λx1 + (1 − λ) x2 ) + (1 − ǫ) f (λx1 + (1 − λ) x2 )



ǫ (λ (f + Ψ) (x1 ) + (1 − λ) (f + Ψ) (x2 )) + (1 − ǫ) (λf (x1 ) + (1 − λ) f (x2 ))

=

λ (f + ǫΨ) (x1 ) + (1 − λ) (f + ǫΨ) (x2 )

This proves the first part of the lemma. To see that the same works for convex functions, let fˆ ˆ be convex as well. We apply to the first part of the lemma with f = −fˆ be convex and let fˆ + Ψ     ˆ to get the stated result. and Ψ = −fˆ + −Ψ

6.6

Summary

In this chapter we have studied the fundamental properties of Query By Committee. First we defined information gain which is a function of a concept class, a prior over this class and a distribution over the sample space. We showed that when there is a lower bound on the expected information gain, QBC learns exponentially fast with respect to the number of queries it makes. This pace is exponentially better than any passive learner, as these learners learn in a polynomial rate. Next we demonstrated cases in which there is a lower bound on the expected information gain. We studied the class of parallel planes. We showed that the expected information gain is lower bounded when there is a prior which is ρ-concave, and the distribution over the sample space is of a special type. Freund et al [46] showed that this lower bound on the class of parallel planes can be translated into a lower bound on the expected information gain when learning homogeneous linear classifiers when the prior is uniform and the distribution over the sample space is uniform (theorem 4 in [46]).

Chapter 7

The Bayes Model Revisited The QBC algorithm and its analysis as presented in chapter 6 assumed that there is a known prior over the concept class. This assumption is usually referred to as the Bayesian assumption. However, in many cases, the knowledge of this prior is not present. In this chapter we show how this assumption can be weakened and in some cases lifted. We look at three different scenarios and use different tools in each one of them.

7.1

PAC-Bayesian Techniques

McAllester [83] presented the PAC-Bayesian theory. In his work, the Bayesian assumption is regarded as a way to present prior knowledge or preferences. We use the same technique here to show how the Bayesian assumption can be lifted in some cases.

Theorem 7.1 Let C = {c1 , c2 , . . .} be a countable concept class with VC-dimension d. Let P w1 , w2 , . . . be a set of positive weights such that wi ≥ 0 and wi = 1. Let D be a distribution over the sample space such that there exists a lower bound g > 0 on the expected information P gain of QBC when learning with the prior ν such that ν (S) = i∈S wi , and the distribution D. Assume that QBC is used with tk =

8 ǫδ

ln π

2

(k+1)2 3ǫδ

instead of the value defined in algorithm 5.

Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when terminating.   g   g log 1+ 16 log 16 −g g 2 2 d+1 8 Let δ > 0, let g˜ = and let , let k ≥ max , log ln 2 4 g δ g ˜ δ ǫ>

  2  (k+1)2 8ek ln dπ 24ek + δd

85

g ˜k d+1

ln 2



2−˜gk/(d+1)

86

Chapter 7. The Bayes Model Revisited Then for any ci ∈ C, with a probability of 1 − wδi over the choice of the sample and the internal

randomness of QBC; the algorithm will use at most k queries and will return a hypothesis cGibbs such that   Pr cGibbs (x) 6= c (x) ≤ ǫ x Proof: From theorem 7.3 it follows that in the conditions as described above h h  ii  Pr EQBC,x1 ,x2 ,... Pr cGibbs (x) 6= c (x) > ǫ < δ c∼ν x

(7.1)

Let ci be the target concept and define pi such that pi

= =

h  i  EQBC,x1 ,x2 ,... Pr cGibbs (x) 6= ci (x) > ǫ x h  i  Pr Pr cGibbs (x) 6= ci (x) > ǫ QBC,x1 ,x2 ,... x

i.e. pi is the probability that QBC will fail when learning the target concept ci . Using this definition in (7.1) we have that X i

wi pi ≤ δ

and therefore, for any i we have that pi ≤ δ/wi . The number of queries used follows employing the same argument. Theorem 7.1 shows how the Bayesian assumption can be lifted and converted into a weight or significance assigned to each concept in the class. Although we assumed that the concept class is finite, it is possible to extend this result to general classes using the same techniques as presented in [83].

7.2

Symmetry

In this section we lift the Bayesian assumption when learning linear classifiers with a uniform distribution over the sample space. This is based on the perfect symmetry in this class. Assume that QBC is learning homogeneous d dimensional linear classifiers. Each concept is represented as a unit vector w ∈ IRd and each instance is a unit vector x ∈ IRd where the classification rule is cw (x) = sign (w · x). Freund et al [46] showed that there is a uniform lower bound on the expected information gain of QBC when learning this class once there is a uniform distribution over the sample space and a uniform prior over the concept class. Using the results presented in Chapter 6 this implies fast learning rates for the QBC algorithm in this setting. Here we show that the Bayes assumption can be lifted in this case. This is due to the symmetry of this problem.

7.2. Symmetry

87

In Theorems 6.1 and 6.2 we showed that the error rate of QBC decreases exponentially fast when there is a lower bound on the expected information gain. We showed it for several variant of the QBC algorithm and several methods for evaluating success. The argument presented here applies to all these cases. Instead of repeating these theorems we will state the following theorem in general terms. Theorem 7.2 Assume that C is the class of d-dimensional homogeneous linear classifiers. Let the sample space X be the unit sphere in IRd and assume that D is the uniform distribution over X . When QBC is applied in this setting, all the results presented in theorems 6.1 and 6.2 apply for any concept in the class and not only on average (or with a probability) over the choice of the concept. Proof: Let cw and cw′ be two homogeneous linear classifiers such that w and w′ are unit vectors. Let T be the rotation transformation such that T (w) = w′ . We will use the fact that the uniform distribution over the unit sphere is rotation invariant and thus if S is a set in the unit sphere then the measure of S equals the measure of T (S). The QBC algorithm is a random algorithm. We assume that it gets 3 inputs: a sequence of unlabeled instances, an oracle that is capable of providing the labels of instances and a sequence of random bits. By providing the algorithm with a sequence of random bits as an input, we can look at the QBC algorithm as a deterministic algorithm. For a concept cw let ∆ (cw ) ⊆ X ∗ × {0, 1}∗ be the set of inputs on which QBC fails when learning the concept cw . Note that the definition of “failure” varies, as shown in theorems 6.1 and 6.2 however the result we present here applies to all these definitions. ∗

Let T be a rotation. For {(x1 , x2 , . . .) , (r1 , r2 , . . .)} ∈ X ∗ × {0, 1} we define T ({(x1 , x2 , . . .) , (r1 , r2 , . . .)}) = {(T (x1 ) , T (x2 ) , . . .) , (r1 , r2 , . . .)} and extend this definition such that T (∆ (cw )) = {T ({(x1 , x2 , . . .) , (r1 , r2 , . . .)}) : {(x1 , x2 , . . .) , (r1 , r2 , . . .)} ∈ ∆ (cw )} The main observation is that if w′ = T (w) then ∆ (cw′ ) = T (∆ (cw )). We define the measure ∞ µ over X ∗ × {0, 1}∗ to be the product measure of D∞ with B 12 where B (·) here stands for the Bernoulli measure. Since µ is rotation invariant, because D is rotation invariant, then µ (∆ (cw′ )) = µ (T (∆ (cw ))) = µ (∆ (cw ))

88

Chapter 7. The Bayes Model Revisited

this implies that the probability of failing to learn the concept cw equals the probability of failing to learn the concept cw′ . Assume that the probability of failure of the QBC algorithm when averaging over the target concept is bounded by δ. Recall that the probability of failure is   [QBC fails] [QBC fails] = Ecw ∼Uniform ∗ Pr Pr X ∼D ∗ ,{0,1}∗ cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗ = Ecw ∼Uniform [µ (∆ (cw ))] Z = u (w) µ (∆ (cw )) dw where u (w) is the density of the uniform distribution. Since u (w) is constant, and as we saw µ (∆ (cw )) is constant as well it follows that ∀w

Pr

cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗

[QBC fails] = µ (∆ (cw ))

Finally, since Pr

cw ∼Uniform,X ∗ ∼D ∗ ,{0,1}∗

[QBC fails] ≤ δ

we have that ∀w µ (∆ (cw )) ≤ δ

7.3

Incorrect Priors and Distributions

The QBC algorithm assumes the knowledge of both a prior over the concept class and a distribution over the target concept. In this section we discuss the case in which we have an estimate of the prior and distribution, but these estimates need not be accurate. We show that if these estimates are reasonably close to the true priors, then the QBC algorithm will tolerate the incorrect priors. Thus, the exponential learning rates of QBC which were demonstrated in the fundamental theorem of QBC (Theorem 6.1) remain1 . First we define a measure of proximity between probabilities. Definition 7.1 A probability measure µ is λ far from a probability measure µ′ if for any measurable set A, λ−1 µ′ (A) ≤ µ (A) ≤ λµ′ (A) 1 In

this section we revisit theorem 5 in [46].

7.3. Incorrect Priors and Distributions

89

Using this definition we note that if QBC was used with the assumption that the prior over the concept class is ν which is λc far from the true prior ν ′ and the distribution over the sample space is assumed to be D which is λx far from the true distribution D′ , then the performance of QBC does not degrade by much. The following theorem is the equivalent of the fundamental theorem of the QBC algorithm (Theorem 6.1). It shows that even if QBC is used with incorrect priors, it still has exponential learning rates. Theorem 7.3 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be a distribution over the sample space X which is λx far from D′ . Let g = G (ν, D) > 0 be a lower ! b g b g log 1+ g 16 log 16 −b bg −4 −2 bound on the expected information gain of QBC. Let b g = λc λx g and g˜ = 4

and



k ≥ max

8 2 d+1 2 ln , log gb2 δ g˜ δ



Let δ > 0 then if the true prior and distribution are ν ′ and D′ , while QBC assumes that the prior is ν then with a probability of 1 − 2δ, QBC will use at most k queries for labels and m0 =

d g˜k/(d+1) 2 e

unlabeled instances when learning and will return a hypothesis with the following properties (depending on the termination rule used): 1. If QBC is used with the Bayes optimal classification rule, it will return a hypothesis (the Bayes optimal hypothesis) such that h h ii Ec∼ν Pr cBayes (x) 6= c (x) ≤ ǫ x

for any 2

ǫ>

2ek −˜gk/(d+1) π 2 (k + 1) 2 ln dλ2c 6δ

2. If QBC is used with the Gibbs average termination rule, it will return an hypothesis such that

for any

h  i EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ ǫ x 2

ǫ>

4ek −˜gk/(d+1) π 2 (k + 1) 2 ln dλ2c 6δ

90

Chapter 7. The Bayes Model Revisited 3. If QBC is used with the Gibbs “typical” error termination rule, it will return a hypothesis such that   Pr cGibbs (x) 6= c (x) ≤ ǫ x

for any ǫ>

   2 (k+1)2 + 8ek ln dπ 24ek δdλ2c

g ˜k d+1

ln 2



2−˜gk/(d+1)

4. If QBC is used with Bayes point machine termination rule, it will return a hypothesis such that

h  i Ec∼ν Pr cBPM (x) 6= c (x) ≤ ǫ x

for any

2

ǫ>

2 (e − 1) ek −˜gk/(d+1) π 2 (k + 1) 2 ln dλ2c 6δ

Proof: In lemma 7.1 we show that even though the QBC uses wrong priors, the expected −2 information gain from the next query is uniformly lower bounded by λ−4 c λx g. In lemma 7.3 we

show that if QBC did not query for labels for a while, then the hypothesis it will use will be a good approximation of the target concept. Using these two lemmas, and following the proof technique of the fundamental theorem of the QBC algorithm (theorem 6.1 on page 62) the proof is completed. Lemma 7.1 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be a distribution over the sample space X which is λx far from D′ . Let g = G (ν, D) > 0 be lower bound on the expected information gain of QBC. If the true prior and distribution are ν ′ and D′ , while −2 QBC assumes that the prior is ν then the expected information gain is at least λ−4 c λx g.

Proof: First we apply to lemma 7.2 to obtain the following   R ν (V + (x)) ν (V − (x)) ν (V + (x)) dD (x) ν(V ) ν(V ) H ν(V ) g = R ν(V + (x)) ν(V − (x)) ν(V ) ν(V ) dD (x)   − ′ + ′ R ν (V (x)) ν (V (x)) ν (V + (x)) 2 dD (x) λc ν ′ (V ) ν ′ (V ) H ν(V ) ≤ R ′ − ′ + ν (V (x)) ν (V (x)) λ−2 c ν ′ (V ) ν ′ (V ) dD (x)   R ν ′ (V + (x)) ν ′ (V − (x)) ν (V + (x)) dD (x) ν ′ (V ) ν ′ (V ) H ν(V ) = λ4c R ν ′ (V + (x)) ν ′ (V − (x)) ν ′ (V ) ν ′ (V ) dD (x)

Since D is λx far from D′ we have that with a probability of 1 ′ λ−1 x dD (x) ≤ dD (x) ≤ λx dD (x)

7.3. Incorrect Priors and Distributions

91

and thus

g



λ4c

R

ν ′ (V + (x)) ν ′ (V − (x)) ν ′ (V ) ν ′ (V ) H

λx ≤

=

λ4c

λ4c λ2x

R R

R



ν (V + (x)) ν(V )



dD (x)

ν ′ (V + (x)) ν ′ (V − (x)) ν ′ (V ) ν ′ (V ) dD (x)

ν ′ (V + (x)) ν ′ (V − (x)) ν ′ (V ) ν ′ (V ) H

λ−1 x

R

which completes the proof.

ν (V + (x)) ν(V )

ν ′ (V + (x)) ν ′ (V − (x)) ′ ν ′ (V ) ν ′ (V ) dD

ν ′ (V + (x)) ν ′ (V − (x)) ν ′ (V ) ν ′ (V ) H

R







dD′ (x)

(x)  ν (V + (x)) dD′ (x) ν(V )

ν ′ (V + (x)) ν ′ (V − (x)) ′ ν ′ (V ) ν ′ (V ) dD

(x)

Lemma 7.2 Let ν be λc far from γ ′ . Let x be an instance and V be a version space. Denote by V + (x) (and V − (x)) the concepts in the version space that assign x with the label +1 (or −1) respectively. Then λ−2 c

ν ′ (V + (x)) ν ′ (V − (x)) ν (V + (x)) ν (V − (x)) ν ′ (V + (x)) ν ′ (V − (x)) ≤ ≤ λ2c ′ ′ ν (V ) ν (V ) ν (V ) ν (V ) ν ′ (V ) ν ′ (V )

′ Proof: First note that ν (V + (x)) ≤ λc ν ′ (V + (x)) and ν (V ) ≥ λ−1 c ν (V ) and thus

λc−2

ν (V + (x)) ν ′ (V + (x)) ν ′ (V + (x)) ≤ ≤ λ2c ′ ν (V ) ν (V ) ν ′ (V )

′ 2 Let z ∈ [0, 1] and let z ′ be such that λ−2 c ≤ z /z ≤ λc . It is easy to verify that

λc−2 z ′ (1 − z ′ ) ≤ z (1 − z) ≤ λ2c z ′ (1 − z ′ ) By setting z =

ν (V + (x)) ν(V )

and z ′ =

ν ′ (V + (x)) ν ′ (V )

we complete the proof.

Lemma 7.3 Let there be a prior ν over C which is λc far from the prior ν ′ over C. Let D be a distribution over the sample space X which is λx far from D′ . Assume the true prior and distribution are ν ′ and D′ , while QBC assumes that the prior is ν and the distribution is D then 1. Assume that QBC is used with tk =

2 ǫ

ln π

2

(k+1)2 6δ

instead of the value defined in algorithm 5.

Let the Bayes classifier cBayes be defined using the version space V used by QBC when terminating and the prior ν. Then with a probability of 1 − δ over the sample and the internal randomness of QBC, h h ii Ec∼ν ′ |V Pr cBayes (x) 6= c (x) ≤ λ2c ǫ x

92

Chapter 7. The Bayes Model Revisited 2. Assume that QBC is used with tk =

4 ǫ

ln π

2

(k+1)2 6δ

instead of the value defined in algorithm 5.

Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when terminating and the prior ν. Then with a probability of 1 − δ over the sample and the internal randomness of QBC, h  i EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ λ2c ǫ x 3. Assume that QBC is used with tk =

8 ǫδ

ln π

2

(k+1)2 3ǫδ

instead of the value defined in algorithm 5.

Let the Gibbs classifier cGibbs be defined using the version space V used by QBC when terminating and the prior ν. Then with a probability of 1 − δ over the choice of the sample, the target hypothesis and the internal randomness of QBC,   Pr cGibbs (x) 6= c (x) ≤ γ 2 ǫ x 4. Assume that QBC is used with tk =

2(e−1) ǫ

ln π

2

(k+1)2 6δ

instead of the value defined in al-

gorithm 5. Let the concept class be the class of linear classifiers and let the prior ν be log-concave. Let the Bayes Point Machine classifier cBPM be defined using the version space V used by QBC when terminating. Then with a probability of 1 − δ over the sample and the internal randomness of QBC, h  i Ec∼ν Pr cBPM (x) 6= c (x) ≤ λ2c ǫ x

Proof: 1. Assume that QBC made k queries for labels to generate the version space V . Assume that QBC did not query for any additional label for tk consecutive instances after making the k’th query. Let cBayes be the Bayes classifier, then    +1 if Prc∼ν|V [c (x) = +1] ≥ 1/2 cBayes (x) =   −1 if Prc∼ν|V [c (x) = −1] > 1/2

Arrange x and c such that cBayes (x) 6= c (x). From the definition of the Bayes classifier it follows that if we pick a random hypothesis c′ from the distribution ν|V then with a probabilh i ity of at least 1/2 we will have c′ (x) 6= c (x). Therefore, if we denote by cBayes (x) 6= c (x) the indicating function then

Ec′ ∼ν|V [c′ (x) 6= c (x)] ≥

i 1h cBayes (x) 6= c (x) 2

7.3. Incorrect Priors and Distributions

93

for any c and x. h i Assume that Ec∼ν ′ |V,x cBayes (x) 6= c (x) > ǫλ2c . Thus, ǫλ2c

h i < Ec∼ν ′ |V,x cBayes (x) 6= c (x)  h i cBayes (x) 6= c (x) Pr′ = Ex c∼ν |V h  i  Prc∼ν ′ |V cBayes (x) 6= c (x) ∩ (c ∈ V )  = Ex  Prc∼ν ′ |V [c ∈ V ] h  i  Prc∼ν|V cBayes (x) 6= c (x) ∩ (c ∈ V )  ≤ Ex λ2c Prc∼ν|V [c ∈ V ] h i = λ2c Ec∼ν|V,x cBayes (x) 6= c (x)

and therefore Ec,c′ ∼ν|V,x [c′ (x) 6= c (x)] >

ǫ 2

this means that the probability that QBC will not query for the label of the next instance h i  is at most 1 − 2ǫ . Hence, if Ec∼ν ′ |V,x cBayes (x) 6= c (x) > ǫλ2c the probability that QBC will not query for a label for the next tk consecutive instance is at most  by choosing tk =

2 ǫ

ln π

2

(k+1)2 6δ

1−

ǫ tk ǫ ≤ e− 2 tk 2

we get that the probability that QBC will not query for tk

consecutive labels when the Bayes classifier is not “good enough” is

6δ . π 2 (k+1)2

By summing

over k the proof is completed. 2. The proof for the Gibbs classifier follows the same pattern as theorem 5.2. From item 1 in lemma 7.3 we have that using the choice of tk that h h ii Ec∼ν Pr cBayes (x) 6= c (x) ≤ λ2c ǫ/2 x

Since Haussler, Kearns and Schapire [49] proved that the average error of the Gibbs classifier is at most twice as large as the error of the Bayes classifier, the statement of the theorem follows. 3. This follows immediately from the previous item and the Markov inequality. From the choice of tk we have that with a probability of 1 − δ/2 h  i EGibbs,c∼ν Pr cGibbs (x) 6= c (x) ≤ λ2c ǫδ/2 x

(7.2)

94

Chapter 7. The Bayes Model Revisited Therefore, from the Markov inequality, if (7.2) holds, we have with a probability of 1 − δ/2 that   Pr cGibbs (x) 6= c (x) ≤ λ2c ǫ x 4. The proof follows immediately from item 1 in lemma 7.3 and theorem 7.1 on page 85.

7.4

Summary

In this chapter we revisited the Bayesian assumption underlying the QBC algorithm and its analysis. We showed that this assumption is not as strong as it might appear at first glance. In many cases it may be lifted, as we showed in section 7.1 and section 7.2 or weakened, as we showed in section 7.3. We conclude that the knowledge of the prior from which the target concept was chosen, or even the existence of such a prior, is not critical for the QBC to exhibit fast learning rates.

Chapter 8

Noise Tolerance In the discussion of the Query By Committee algorithm so far, we have made several assumptions. We studied the Bayesian assumption in Chapter 7. In this chapter we revise yet another assumption we made; namely that we learn in a noise free environment. We assumed that there is a target concept which is a deterministic function of the inputs; i.e. we assumed that there exists a concept c such that for any given input x, the concept c assigns the “true” label c (x) in a deterministic way. However, this assumption is doubtful for various reasons. First, many concepts we may wish to learn are non-deterministic by nature. Moreover, noise that can be caused by human errors or communication problems might corrupt the labels we see (see the more extensive discussion about noise on Chapter 3). Version-space based algorithms such as QBC are sensitive to noise. A single misclassified instance will cause the target concept to be eliminated from the version-space and thus lead to poor results. Therefore, if QBC is ever to be used on real data, it must be made less sensitive to such effects. In this dissertation we provide two methods for coping with this problem. In the current chapter we present a “soft” version of the QBC algorithm and analyze it. We show that √ k)

when certain conditions apply, the error of the hypothesis return decreases as e−O(

where k

is the number of queries for label made. A more practical approach is presented in Chapter 10 where kernels are used to overcome noise as well as some other practical problems. The advantage of the method we present here is its theoretical soundness. However it does not (yet) have any practical implementation. In section 8.1 we introduce the “soft” version of the QBC algorithm. In section 8.2 we revise the notation of Information Gain to suit the new setting. In section 8.3 we use the newly proposed 95

96

Chapter 8. Noise Tolerance

way of measuring information to analyze the “soft” QBC. We wrap-up in section 8.4. Note that Sollitch and Saad [110] conducted a preliminary study of the impact of noise on active learning, although their work mainly focused on the behavior of the algorithm when the sample size grows to infinity and less on the practical scenario. The work presented in this chapter is based on collaboration with Scott Axelrod, Shai Fine, Shahar Mendelson and Naftali Tishby.

8.1

“Soft” QBC

We begin our discussion by defining the model in which we are working. Noise and uncertainty can come in different forms. The first model we consider is when the target concept is deterministic. The noise in this case corrupts the communication channel between the learner and the oracle that answers the learner’s queries. In this case, the source of noise is external. The second case we consider is when the target concept is non-deterministic in itself. In this case the noise is internal.

8.1.1

The Case of Learning with Noise

In many cases, the concepts to be learned are deterministic, but noise corrupts our observations. Noise can differ in nature; it can be random classification noise, where the noise is equal over all the sample space and independent of the target concept. In other cases the noise may tend to have greater impact near the decision boundary (see [37] for a comprehensive survey about learning in the presence of noise). We use the following notation: Let the set of labels Y be finite and let W be a parameterization of the concept class. For each w ∈ W and x ∈ X a distribution p (y|w, x) is defined where the underlying concept to be learned cw is such that cw (x) = arg max p (y|w, x) y∈Y

Therefore, given any ǫ > 0, the objective of our learning process is to find some w ∈ W such that Pr [cw (x) 6= c∗ (x)] < ǫ

x∼D

where c∗ is the target concept.

8.1. “Soft” QBC

8.1.2

97

The case of stochastic concepts

A scenario we shall not pursue any further is when the concepts we are trying to learn are stochastic. Thus, there is no perfect mapping between instances and labels. In this case one might think of various criteria for generalization. Some of the possibilities are to minimize the loss with respect to different Lp norms or to apply the Kullback Leibler divergence. For the sake of simplicity we will not present all the possibilities in this direction. However, we note that the algorithm we present below can be easily adjusted to include various such criteria, and the proofs follow the same path as those presented here.

8.1.3

A variant of the QBC algorithm

Here, we focus on the noise model as described in 8.1.1, i.e. the noise is external to the system and corrupts the communication channel between the teacher and the learner. Let ǫ > 0 and δ > 0 be the accuracy and confidence parameters specified by the user. The version of the QBC algorithm which is capable of managing noise is presented in Algorithm 6. We will present two facts about this “soft” version of the QBC algorithm (for brevity, henceforth the abbreviation SQBC). First, we show that the hypothesis returned by the algorithm is indeed a good approximation of the target concept. Second, we show that if SQBC is allowed to issue k √

queries for labels then the generalization error of the hypothesis it returns is e−O( k) . Recall that  √  a passive learner using k queries will have a generalization error of O 1/ k in the same setting (see e.g. [6] Theorem 5.2).

Theorem 8.1 Let ǫ > 0 and δ > 0. Assume that cw∗ is the target concept and that cw is the concept the SQBC returned. Then the probability that cw is not a good approximation for cw∗ , i.e., that

  ∗ Pr arg max p (y|w, x) 6= arg max p (y|w , x) > ǫ

x∼D

y∈Y

y∈Y

is less than δ, when the probability is with respect to the internal randomness of SQBC, the random sample used for learning, the random labels and the random target concept (Bayesian assumption). The proof is similar to the ones presented in Chapter 5 where we studied the QBC algorithm. Proof: Define the set of “bad” pairs of parameters     W = (w1 , w2 ) s.t. Pr arg max p (y|w1 , x) 6= arg max p (y|w2 , x) > ǫ x∼D

y∈Y

y∈Y

The algorithm fails if the target concept cw∗ and the hypothesis cw returned by SQBC are such that (w∗ , w) form a “bad” pair. Recall that w∗ is randomly picked from the prior ν. We

98

Chapter 8. Noise Tolerance

Algorithm 6 “Soft” Query By Committee (SQBC) Inputs: • Required accuracy ǫ. • Required confidence 1 − δ. • A prior ν over the parameter class W . Output: • A hypothesis cw . Algorithm: 1. Let ν1 = ν. 2. Let k ← 0. 3. Let l ← 0. 4. For t = 1, . . . (a) Receive an unlabeled instance xt . (b) Let l ← l + 1.

(c) Select w1 ∼ νt and w2 ∼ νt .

(d) If arg maxy∈Y p (y|w1 , x) 6= arg maxy∈Y p (y|w2 , x) then i. ii. iii. iv.

Query for the label y of x. Let k ← k + 1. Let l ← 0. Let νt+1 be the posterior over W given all the labels seen so far; i.e. for U ⊆ W using Bayes rule we have νt+1 (U ) =

ν (U ) Pr [y1 , . . . , yt+1 are the observed labels given that w∗ ∈ U ] Pr [y1 , . . . , yt+1 are the observed labels]

(e) else i. Let νt+1 ← νt . (f) If l ≥ tk where tk =

i. Select w ∼ νt+1 ii. Return cw .

2 ǫδ

log 2k(k+1) then δ

8.2. Information Gain Revisited

99

allow the teacher (“the adversary”) extra power and allow it to choose the target concept only at the end of the learning process with the only restriction being that the concept is chosen using the posterior over the labels it presented while the QBC was learning. Hence, we may assume that the selection of w∗ was made using the posterior defined by the given labels. Therefore, both the algorithm and the teacher use the same probability measure which is the posterior to select w and w∗ respectively. There are two possible sources for failure in this case. First, SQBC may terminate when W is too big in a probabilistic sense. The second case of failure is when W is small but nevertheless, the target concept and the hypothesis returned by SQBC form a “bad” pair. We show that the probability of any of these cases is less than δ/2. Let νt be the posterior. If νt2 (W ) ≤ δ/2 then we are done, since the probability that w and w∗ form a bad pair is bounded by δ/2. On the other hand, assuming that νt2 (W ) > δ/2, then we argue that the probability of observing a long sequence of instances for which SQBC will not issue a query for label is small. Under this assumption, the probability of selecting a triplet x, w1 , w2 ∼ D × νt × νt such that arg maxy∈Y p (y|w1 , x) 6= arg maxy∈Y p (y|w2 , x) is greater than ǫδ/2. Hence, the probability of tk consecutive instances without a query after seeing k labels is bounded by

and by setting tk =

 t ǫδ k 1− ≤ e−ǫδtk /2 , 2 2 ǫδ

log 2k(k+1) it follows that the probability that the QBC does not query for δ

tk consecutive instances is less than δ/2k(k + 1). Summing over the possible values of k we get that the probability of failure is bounded by δ/2. The theorem above shows that the SQBC algorithm is sound. We now explore the number of queries used in the learning processes. We use the same technique as Freund et. al. [46] and analyze the information gained by the algorithm during the process. However, before we do this we need to adapt the notion of information gain to the new setting we are dealing with. In the next section we introduce the refined information gain and study some of its properties. Later in this chapter we use these tools to study the sample complexity of SQBC.

8.2

Information Gain Revisited

A fundamental problem in learning theory is bounding the information gained by an example about the unknown target concept. This problem is most critical in the context of active learning, when the learner has to select the most informative examples to be labeled in order to minimize

100

Chapter 8. Noise Tolerance

the number of labels required. The Mutual Information allows one to measure the average knowledge one gains about another random variable, B, by knowing the value of one random variable A. However, in concrete learning cases one is interested in a more precise measure; namely, how much does a specific value a tells us about B. Here we present an information measure which quantifies the amount of information an observation a of the random variable A gives about the state of the random variable B. We show that with high probability this specific mutual information is bounded by the logarithm of the covering number of B (see definitions 8.1 and 8.2), and establish a version of the Information Processing Inequality suitable for this quantity. Later we will use the information a label contains about the target concept to measure the information gain by the SQBC algorithm. The mutual information measures the amount of information one random variable contains about another random variable [31]. If a random variable A takes values in the set A and the random variable B takes values in the set B, the mutual information is defined by the following formula: I (A; B) =

Z

p (a, b) log

A×B

and it can be rewritten as

p (a, b) d (a × b) p (a) p (b)

1

I (A; B) =

Z

p (a) A

 p (b|a) db da p (b|a) log p (b) B

Z

(8.1)

A reasonable question is what a specific observation a ∈ A can tell us about the other variable B. Let’s consider for example that one is interested in knowing whether it rained over night. The observation one gets can be the moisture on the ground in the morning. If the ground is dry then we can be pretty sure that it wasn’t raining. If, however, the ground is wet, then it might have rained, but it is also possible that the sprinklers were working during the night and caused the ground to be wet. Clearly, different observations, or values of the same variable, can provide different amounts of information. There exists a natural definition for the specific information value we are after. Indeed, by looking at (8.1) we come up with the definition: Z p (b|a) p (b|a) log I (a; B) = db p (b) B

(8.2)

This should be read as the information that the observation a ∈ A gives about the random variable B. This quantity has some nice properties: 1 We

assume that p(a|b) belongs to an appropriate L1 space.

8.2. Information Gain Revisited

101

1. It is non-negative, since from (8.2) one can see that I (a; B) is a Kullback Leibler divergence [64] between two distributions. 2. I (a; B) is a measurable function due to Fubini’s theorem. 3. The expected value of the information from an observation is the mutual information, i.e., EA [I (a; B)] = I (A; B). Before proceeding we need to define some notations. We begin by defining a distance measure between two instances of a random variable B. Definition 8.1 The distance2 between two instances b1 and b2 of a random variable B with respect to the random variables A1 , . . . , Am over A1 , . . . , Am is ρm (b1 , b2 ) = max sup |p (ai |b1 ) − p (ai |a2 )| 1≤i≤m ai ∈Ai

Given a distance measure, one can define the covering number which counts the number of balls of radius ǫ needed to cover the space: Definition 8.2 If B is a random variable over B and ρ is a (pseudo)-metric on B, then for any ǫ > 0 the ǫ-covering number is the smallest number of balls of radius ǫ (with respect to the distance measure ρ) needed to cover B. We denote this value by N (B, ǫ, ρ). Note that in the deterministic case, when p (ai |b) is either zero or one, this definition takes a simple form: ρ (b1 , b2 ) is zero if the two states assign the same values to the observations and it is 1 otherwise. Here, for every radius ǫ < 1, the ǫ covering numbers are simply the number of equivalence classes. Hence, if the observations are labels assigned to different sample points and d if B has a VC-dimension d, then by Sauer’s Lemma it follows that N (B, ǫ, ρm) ≤ em . d

8.2.1

Observations of the State of a Random Variable

Let us assume that we are interested in the random variable B which takes values b ∈ B. We have some observations of the random variables Ai . Each random variable Ai receives values ai ∈ Ai . Q We assume that the Ai ’s are mutually independent given B, i.e. the p (a1 , . . . , am |b) = p (ai |b). This is often the case in learning from examples. To see this, let W be a parameterization of a

concept class. Let x1 , . . . , xm be a fixed set of instances, then for any w ∈ W, we have that p (y1 , . . . , ym |w, x1 , . . . , xm ) =

Y

p (yi |w, x1 , . . . , xm ) =

Y

p (yi |w, xi )

2 Actually ρ m is a semi-distance since it is possible that b1 6= b2 while ρm (b1 , b2 ) = 0. This has no significance throughout the paper.

102

Chapter 8. Noise Tolerance

We are interested in measuring the contribution of the labels y1 , . . . , ym to our knowledge about the random variable W . Haussler and Opper [50] have studied this question and presented the relationship between the information and metric entropy. However, they studied the average case; i.e. “what is the amount of information regarding the state of the world that a general set of observations captures?”. The question we are interested in is “how much information does a specific set of observations capture on the state of a random variable?”. Another difference between Haussler and Opper results and the result presented here is the distance measure used. Haussler and Opper used the Hellinger distance measure whereas we use an infinity norm. This allows us to use the results of Alon et. al. [2] which bound the metric entropy with respect to this norm using the Pollard dimension and the Fat-Shattering dimension of the space. The first result we present shows that the information from a set of observations is essentially bounded in the sense that with high probability it is bounded by the covering number. Theorem 8.2 Let m > 2 and let A1 , . . . , Am be a set of observed random variables. Let B be a random variable. Assume that there is some γ > 0 such that for any ai ∈ Ai and any b ∈ B, p (ai |b) ≥ γ. Denote by a(m) = (a1 , . . . , am ) then       2γ 1 (m) Pr I a ; B ≤ log N B, 2 , ρm + 2 + log ≥1−δ m δ a(m) ∼A(m) where ρm is as defined in Definition 8.1. Note that in the deterministic case, the assumption that there is a positive lower bound on p (ai |θ) is not necessary. In fact, if B has VC-dimension d then with a probability of at least 1 − δ,  1 I a(m) ; B ≤ d log em d + 2 + log δ which is similar to the bound presented in [46, Lemma 3]. Proof: of Theorem 8.2 Recall that



I a

(m)



;B =

Z

   p a(m) |b (m)  db, log p b|a p a(m)

hence, by Jensen’s inequality (or annealed approximation ) 

I a

(m)



; B ≤ log

Z

  (m)  |b (m) p a  p b|a db. p a(m)

(8.3)

Taking the expected value of the integral in (8.3) with respect to the observations and applying Fubini’s Theorem, it follows that " "Z #  #   (m) p a(m) |b |b (m) p a  db = Eb∼B Ea(m) ∼A(m) |b  Ea(m) ∼A(m) p b|a p a(m) p a(m)

(8.4)

8.2. Information Gain Revisited

103

Let B1 , . . . , Br be a disjoint cover of B (i.e., B = ∪Bi and if i 6= j then Bi each Bi has diameter smaller than

2γ m2

with respect to the metric ρm . Thus,

b, b′ ∈ Bi =⇒ ∀j, aj ∈ Aj

|p (aj |b) − p (aj |b′ )| ≤

T

Bj = ∅), such that

2γ m2

Using this definition we rewrite the expected value in (8.4) as " " # # Z r X p a(m) |b p a(m) |b  =  db p (b|Bi ) EA(m) |b P (B i ) EB EA(m) |b p a(m) p a(m) Bi i=1

(8.5)

(8.6)

We shall bound the integral on Bi for each 1 ≤ i ≤ r separately. Let i be such that P (Bi ) > 0.

Note that for each b, b′ ∈ Bi we have that   p a(m) |b′

Y

p (ai |b′ )  Y 2γ p (ai |b) − 2 , ≥ m =

thus,   = p a(m) ≥ ≥ =

Z

Z

  p (b′ ) p a(m) |b′ db′

  p (b′ ) p a(m) |b′ db′ B  Z i Y 2γ p (ai |b) − 2 db′ p (b′ ) m Bi  Y 2γ p (ai |b) − 2 . P (Bi ) m

 Q Since p a(m) |b = p (ai |b) it follows that  p a(m) |b p (ai |b) 1 Y  ≤ 2γ . (m) P (Bi ) p a p (ai |b) − m 2

(8.7)

Recall that p (ai |b) ≥ γ, hence

γ 1 p (ai |b) 2γ ≤ 2γ = 2 1 − p (ai |b) − m2 γ − m2 m2

(8.8)

Clearly, for m > 2, 2

e− m ≤ 1 −

2 2 2 + 2 ≤ 1 − 2. m m m

Hence, p (ai |b) 1 2 1 m 2γ ≤ 2 ≤ −2 = e , m 1 − p (ai |b) − m e 2 2 m and using (8.7)

Therefore,

 p a(m) |b e2  ≤ . P (Bi ) p a(m)

(8.9)

104

Chapter 8. Noise Tolerance

EB EA(m) |b

"

# p a(m) |b  p a(m)



X

P (Bi )

i : P (Bi )>0

e2 P (Bi )

≤ re2

Now, recall the definition of r in (8.5) and conclude that " " # # p a(m) |b p a(m) |b   EA(m) EB|a(m) = EB EA(m) |b p a(m) p a(m)   2γ ≤ N B, 2 , ρ e2 m

(8.10) (8.11)

(8.12)

By Markov’s inequality,       p(a(m) |b) 2γ 1 2 ≤ δ, PA(m) EB|a(m) , ρ N B, ≥ m e δ m2 p(a(m) ) thus, by (8.3)       2γ 1 (m) PA(m) I a ; B ≥ log N B, 2 , ρm + 2 + log ≤ δ, m δ as claimed. An immediate consequence of the proof of theorem 8.2 is a bound on the mutual information as presented in the following corollary: Corollary 8.1 Assume that the conditions of Theorem 8.2 hold. Then   2γ I (A1 , . . . , Am ; B) ≤ log N B, 2 , ρm + 2. m

8.2.2

Information Processing Inequality

A fundamental property of mutual information is the Information Processing Inequality. The information processing inequality asserts that when data are processed, the mutual information can only decrease. More formally, for any function g the following holds I (A; B) ≥ I (g (A) ; B)

(8.13)

As a corollary, if A1 , . . . , Am , B are random variables then for any J ⊆ [1, m] I (A1 , . . . , Am ; B) ≥ I (AJ ; B) where AJ = {Aj }j∈J . Nevertheless, as we move to the setting of information from observations, the situation is more complex. A subset of the observation could contain more information on the target variable than all the observations. However it is possible to prove a slightly weaker version of the information processing inequality.

8.2. Information Gain Revisited

105

Theorem 8.3 Information Processing Inequality Let m > 2 and put A1 , . . . , Am to be a set of observed random variables which are mutually independent given the random variable B. Assume further that each Ai can take only a finite set of values, and that there is some γ > 0 such that for any ai ∈ Ai and any b ∈ B, p (ai |b) ≥ γ. Then, for any τ Pr

a(m) ∼A(m)

[∃J s.t. I (aJ ; B) ≥ τ + 1] ≤

h   i 1 I a(m) ; B ≥ τ m log γ a(m) ∼A(m) Pr

where aJ = {aj }j∈J . Theorem 8.3 shows that in a sense, the information processing inequality is valid for the setting described here with high probability. In the proof of this theorem we make a specific use of the fact that γ > 0. However, in the deterministic case this assumption is superfluous since the information is monotonic, thus ∀J ⊆ {1, . . . , m}

I (a1 , . . . , am ; B) ≥ I (aJ ; B)

Before we prove this theorem we derive an immediate corollary Corollary 8.2 In the setting of theorem 8.3, Let δ > 0 then # "   m log γ1 2γ <δ Pr ∃J s.t. I (aJ ; B) ≥ N B, 2 , ρm + 3 + log m δ a(m) ∼A(m) Corollary 8.2 follows from Theorem 8.2 and Theorem 8.3 by choosing τ =N

  m log γ1 2γ B, 2 , ρm + 2 + log m δ

We now turn to prove Theorem 8.3. Proof: of Theorem 8.3. Assume there exists J ⊆ {1, . . . , m} such that I (aJ ; Θ) > τ + 1

(8.14)

and let J = {1, . . . , m} \ J. We will examine all the possible values of aJ . Note that by Fubini’s Theorem EAJ |aJ

h  i I a(m) ; B = =

"

"

 ## p a(m) |b  EAJ |aJ EB|a(m) log p a(m) " "  ## p a(m) |b  EB|aJ EAJ |b log p a(m)

106

Chapter 8. Noise Tolerance

At the same time,  p a(m) |b  log p a(m)

p (aJ |b) p aJ |b  = log p (aJ ) p aJ



 p aJ |b p (aJ |b)  + log = log p (aJ ) p aJ

hence

h  i = EAJ |aj I a(m) ; B



I (aJ ; B) + I AJ ; B|aJ I (aJ ; B)



(8.15)

Note that the second term in (8.15) is a mutual information and thus non-negative. Define QaJ = Pr

AJ |aJ

h   i I a(m) ; B ≥ τ

  For every a(m) we have that I a(m) ; B ≤ m log γ1 since Ai is finite and hence p a(m) |b ≤ 1  and on the other hand p a(m) ≥ γ m . Therefore, from (8.15) it follows that I (aJ |B)

≤ ≤

Thus, if I (aJ ; B) ≥ τ + 1 then QaJ ≥ Pr

A(m)

h   i I a(m) ; B ≥ τ

≥ =

h  i EAJ |aJ I a(m) ; B   1 τ + m log QaJ γ

1 m log

1 γ

. Therefore,

h   i I a(m) ; B ≥ τ and ∃J I (aJ ; B) ≥ τ + 1 A(m) h   i Pr I a(m) ; B ≥ τ | ∃J s.t. I (aJ ; B) ≥ τ + 1 × Pr

A(m)

Pr [∃J s.t. I (aJ ; B) ≥ τ + 1]

A(m)



1 Pr [∃J s.t. I (aJ ; B) ≥ τ + 1] m log γ1 A(m)

Thus we obtain h   i 1 Pr [∃J s.t. I (aJ ; B) ≥ τ + 1] ≤ Pr I a(m) ; B ≥ τ m log (m) (m) γ A A

8.3

SQBC Sample Complexity

In order to analyze the SQBC algorithm we are about to use the information of observation as a replacement for the information gain used in the analysis of QBC. Note that this is a natural

8.3. SQBC Sample Complexity

107

extension as though the concepts were deterministic, i.e. no noise in the system, in which case the information gain is equivalent to the information of observation. Let x ¯ = {x1 , x2 , . . .} be a sequence of instances. For the sake of our discussion we will assume that this sequence is fixed. The label yi of the instance xi tells us something about the target concept c. Using the terminology of the previous section, yi is an observation of the state of the target concept which is the random variable3 C. We apply the same technique as in Chapter 6. We will show that with high probability, for any subset J ⊆ {1, . . . , m}, the information from {yj }j∈J is not too high. We will argue that when certain conditions apply, SQBC queries for labels with high information content and thus it will not issue too many queries. This will lead to large gaps between consecutive queries for labels which will lead SQBC to terminate successfully as proved in Theorem 8.1 on page 97. In the following we rework the definition of information gain and its derivatives. Definition 8.3 For a sequence of instances x ¯ = {x1 , x2 , . . .} ∈ X ∞ , the Information Gain from a set of labels yJ = {yj }j∈J (where yj is the label of the instance xj ) is I (yj ; C | xJ ). The Expected Information Gain from the next query for a label is    Ej ∗ ,yj∗ ,{xj }j>max J I yJ∪{j ∗ } ; C xJ∪{j ∗ } − I (yJ ; C | xJ )

where the expectation is taken with respect to the sequence of instances {xj }j>max J , the choice of the next query point (i.e. the choice of j ∗ ) and the label yj ∗ of xj ∗ . Unlike the deterministic case, the information gain is not guaranteed to be non-negative. This, and other properties of the noisy setting make the analysis more involved than the deterministic - noise free case. The main result we would have liked to establish is presented in Theorem 8.4 on page 108. However, we encountered a technical difficulty in the course of the proof, when trying to show that the information gain is, with high probability, linear in the number of queries. A close analysis of the proof of the analogous result in [46] reveals a similar gap which was overlooked by the authors. Though it is possible to close the gap in the noise-free case (as we did in chapter 6) we are still in the process of adjusting the proof to our setup. Hence, the proof of Theorem 8.4 is presented under the assumption that conjecture 8.1 holds. Conjecture 8.1 Assume there exist a lower bound g > 0 on the expected information gain from the query the SQBC algorithm makes at any step. Then, for any δ > 0 there exist constants Kδ 3 There

is a slight abuse of notation here, since C is the concept class and not a random variable.

108

Chapter 8. Noise Tolerance

and g˜ > 0 such that if k > Kδ and if J is the set of size k of indexes of the queries that the SQBC algorithm made, then Pr

x ¯,¯ y ,SQBC

[I (yJ ; C | xJ ) < k˜ g] ≤ δ

Conjecture 8.1 is the equivalent of lemma 6.2 on page 65. The next theorem is the main result in this section. It proves that when certain conditions √

apply, if SQBC is allowed to issue k queries for label, then it will reach an accuracy of e−O(

k)

.

Theorem 8.4 Assume that Conjecture 8.1 holds. Let W be a set of parameters of a concept class, such that for w ∈ W the probability of observing the label y for the instance x when the target concept is parameterized by w is p (y | w, x ). Assume that there exists γ > 0 such that p (y | w, x ) ≥ γ for all y, w, x. Assume that {p (y | w, x ) | w ∈ W } has a Pollard dimension d. Let ν be a prior over W and let D be a distribution over X . Assume that there is g > 0 such that for any finite sample S ∈ (X × Y)∗ the expected information gain of SQBC from the next query given the sample S, is lower bounded by g. Let δ > 0 and let k ≥ Kδ (Kδ and g˜ are as defined in Conjecture 8.1). Then with a probability of 1 − 3δ, SQBC will issue at most k queries for labels and use m0 =

γdδ (k˜g )/(36/|Y|d) e e log γ1

unlabeled instances when learning and will return a hypothesis with Pr SQBC,w∗ ,x1 ,x2 ,...





    ∗ Pr arg max p y|wSQBC , x 6= arg max p (y|w , x) > ǫ < δ x y y

for any ǫ>

2ke log γ1 γδ 2 d

log

2k (k + 1) −√(k˜g )/(18|Y|) e δ

In the statement of theorem 8.4 we used the Pollard dimension. Here is a definition of the Pollard dimension (see e.g. [2]). Definition 8.4 Let F be a set of functions from some space Z to IR. F has a Pollard-dimension d if the class C = {sign (f ) : f ∈ F } has a VC-dimension d. An alternative definition for the Pollard dimension is to say that if F has a Pollard-dimension d, if d is maximal such that there exist z1 , . . . , zd ∈ Z such that for any y1 , . . . , yd ∈ {±1} there exists f ∈ F with yi f (zi ) > 0 for all i.

8.3. SQBC Sample Complexity

109

Alon et al. [2] showed that if F has a Pollard-dimension d then d log(2em/(dǫ))  4m N (F, ǫ, ρm ) ≤ 2 ǫ2 where N (·, ·, ·) is the covering number and ρm is the l∞ distance measure when F is restricted to m points. Note that it is possible to use the Fat-Shattering-Dimension of Alon et al. [2] here, instead of the Pollard-dimension to obtained slightly better bounds. For the sake of clarity we avoid using the Fat-Shattering-Dimension here. Proof: of Theorem 8.4 Assume that SQBC made k queries. Given that Conjecture 8.1 holds, then with a probability of 1 − δ, the information SQBC gained is at least k˜ g. Let J be the indexes of the queries QBC made. From the information processing inequality, Theorem 8.3 and Corollary 8.2, we know that with a probability of 1 − δ

  m log γ1 2γ I (yJ ; W | xJ ) ≤ log N W, 2 , ρm + 3 + log m δ

Alon et al. [2] proved that log N

 3   em 2γ + log 2 W, 2 , ρm ≤ |Y| d log2 m γd

and therefore |Y| d log2



em3 γd



+ log 2 + 3 + log

m log γ1 δ

≥ k˜ g

Applying a coarse upper bound on the left hand side of the above inequality we have ! em log γ1 2 18 |Y| d log ≥ k˜ g γdδ and therefore, with a probability of 1 − 2δ, if SQBC made k queries for labels, then m≥ If

m k

γδd √(k˜g )/(18|Y|) e e log γ1

> tk is as defined in the SQBC algorithm then the SQBC algorithm will bail out and

as we proved in theorem 8.1 on page 97, when this happens, the returned hypothesis is a good approximation of the target concept with high probability. Therefore it suffices to require that 2 2k (k + 1) γδd √(k˜g )/(18|Y|) m > tk = log e ≥ k ǫδ δ ke log γ1 which holds whenever ǫ>

2ke log γ1 γδ 2 d

log

2k (k + 1) −√(k˜g )/(18|Y|) e δ

110

8.4

Chapter 8. Noise Tolerance

Summary

The main question we have attempted to address in this chapter is whether active learning in general and QBC in particular can be applied in the presence of noise and uncertainty. Although the discussion presented here is incomplete, there is a reason to believe that active learning can be applied in the realistic setting where noise and uncertainty exist. Nevertheless, we would like to mention several key issues that are lacking in the discussion in this chapter. First, we were not able to prove Conjecture 8.1 on page 107. We believe that this conjecture, with some minor amendments, is true. However, at this point we were not able to prove it. Second, we do not show here any concept class which has a lower bound on the expected information gain as required in Theorem 8.4. Finally, we do not have any practical implementation of the SQBC algorithm. Nevertheless, in Chapter 10 we present an alternative method of overcoming noise using kernels. Kernels provide a practical method of applying QBC for real world applications. However, the theoretical justification of this method is weaker. The revised concept of information gain and information from observations presented in section 8.2 are of interest in themselves. Measuring the information of an observation on a target random variable can play an important role in diverse applications.

Chapter 9

Efficient Implementation Using Random Walks The Query By Committee algorithm (Algorithm 5 on page 52) is a very simple and straightforward algorithm. Whenever a new instance is presented it draws two random hypotheses from the version space. If these two hypotheses predict different labels for the instance, then the algorithm queries for the true label. However, this description belies the difficulty of implementing this algorithm since drawing random hypotheses from the version space is indeed a non-trivial task. In this chapter we show how QBC can be implemented in polynomial time when learning linear classifiers. The main ingredient in our implementation is a reduction of the problem of sampling the version space to the problem of sampling convex bodies. We show that the sophisticated techniques developed for sampling from convex bodies provide a solution to the missing components in the QBC algorithm. The work presented in this chapter is based on a collaboration with Shai Fine and Eli Shamir.

9.1

Linear Classifiers

The question we address in this chapter is “how can the QBC algorithm be used to learn linear classifiers?”. We assume that the concept class we are interested in is the class of homogeneous  linear classifiers. The sample space is X = IRd and the concept class is C = cw : w ∈ IRd such

that cw (x) = sign (w · x). The class of linear classifiers is frequently used in modern machine

learning. This class is very powerful once the inputs are mapped from the input space to some 111

112

Chapter 9. Efficient Implementation Using Random Walks

feature space using a non-linear map. In some cases, the inputs are mapped to an infinite dimension Hilbert space, without affecting the computational complexity of learning in this class (see more about this in Chapter 10). An important property of homogeneous linear classifiers is that they are scale free in the sense that if w ∈ IRd and λ > 0 then cw is equivalent to cλw . This is due to the fact that cw (x) = sign (w · x) = sign (λw · x) = cλw (x) Therefore, we may assume that the concept class C is defined solely on the unit ball, i.e. C =  cw : w ∈ IRd and kwk ≤ 1 . A key observation is that when learning homogeneous linear classifiers, the Version Space is a

bounded convex body at all stages as the following lemma shows. m

Lemma 9.1 Let C be the class of homogeneous linear classifiers. Let S = {xi , yi }i=1 a finite sample (possibly empty). Then the version space induced by S is a bounded convex body.  Proof: Recall that the class of homogeneous linear classifiers is defined as C = cw : w ∈ IRd and kwk ≤ 1 .

Therefore, C is isomorphic to the unit ball and thus bounded and convex. The concept cw is in the version space if ∀i yi (w · xi ) ≥ 0 and thus the version space is the intersection of the unit ball with m linear constraints. Since

all these constraints are convex, then the version space is convex. Furthermore, since the version space is a subset of the unit ball, it is bounded. Therefore, the problem of random sampling the version space is reduced to the problem of random sampling from convex bodies. In the following section we discuss methods of solving the later problem.

9.2

Sampling from Convex Bodies

The problem of sampling from convex bodies has been studied for the last two decades in the field of computational geometry. Given a convex body K, the task is to return a point x ∈ K sampled from the uniform distribution over K. Any efficient sampling algorithm has many applications. For example, Bertsimas and Vempala [14] showed how convex optimization problems can be solved efficiently given such a sampling algorithm.


Elekes [41]¹ proved that it is impossible to sample uniformly from convex bodies. Soon after, Dyer, Frieze and Kannan [39] showed that it is possible to sample approximately uniformly from convex bodies: given a bounded convex body $K$ and an accuracy parameter $\epsilon > 0$, it is possible to sample $x$ from $K$ such that for any set $A \subseteq K$
$$\left|\Pr_x[x \in A] - U_K(A)\right| < \epsilon$$
where $U_K$ is the uniform measure over $K$. We use the notation $\Pr_x[A]$ to denote the probability that the sampling algorithm returns a point in the set $A$. The algorithm presented by Dyer, Frieze and Kannan was polynomial, but its running time was $O\left(d^{>20}\right)$, where $d$ is the dimension of the body. Nevertheless, in a series of improvements the efficiency of sampling algorithms was significantly improved, and the most recent algorithm performs the sampling task in $O^*\left(d^3\right)$ operations² [79]. Although clear advances have been made, current algorithms are still not practical, as the constants involved are too high; this is, however, an active research field and we expect better algorithms to follow.

Describing these sampling algorithms is well beyond the scope of this dissertation. Although many different algorithms have been suggested, all use Markov Chain Monte Carlo (MCMC) methods at their core. For these MCMC methods to work, the convex body must be well rounded: not only should $K$ be bounded, i.e. contained in a ball of radius $R$, it must also contain a ball of radius $r$. The following theorem summarizes the essentials of sampling from convex bodies.

Theorem 9.1 Let $K \subseteq \mathbb{R}^d$ be a convex body such that there exists a ball of radius $R$ which contains $K$ and a ball of radius $r$ which is contained in $K$. Then there exists a sampling algorithm such that for any $\epsilon > 0$ the algorithm returns $x \in K$ such that for any measurable subset $S$ of $K$
$$\left|\Pr[x \in S] - U_K(S)\right| < \epsilon$$
and the algorithm runs in $\mathrm{poly}\left(d, \log\frac{R}{r}, \log\frac{1}{\epsilon}\right)$ time.

The proof of this theorem can be found, for example, in [79]. We note that the convex body $K$ is assumed to be given via a separation oracle: given a point $x$, the oracle either returns the answer "$x$ is in $K$" or returns a hyperplane $w$ such that
$$w \cdot x > \max_{z \in K}(w \cdot z)$$
This oracle must be able to compute its answer in polynomial time.

¹ Elekes was interested in the problem of computing the volume of a convex body; however, the problems of sampling from convex bodies and computing their volumes are closely related.
² The notation $O^*$ indicates that logarithmic factors are ignored.
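For the version space described above, such a separation oracle is easy to realize, since a violated constraint immediately supplies the separating hyperplane. A hedged sketch (ours; names and conventions are assumptions, not code from the thesis):

import numpy as np

def separation_oracle(w, X, y):
    """Separation oracle for the version space V of Lemma 9.1.
    Returns (True, None) if w is in V, and otherwise (False, h) with a
    hyperplane h satisfying h . w > max_{z in V} h . z."""
    norm = np.linalg.norm(w)
    if norm > 1.0:
        # the unit-ball constraint is violated: h = w/||w|| gives
        # h . w = ||w|| > 1 >= h . z for every z in the ball
        return False, w / norm
    violated = np.nonzero(y * (X @ w) < 0)[0]
    if violated.size == 0:
        return True, None
    i = violated[0]
    # every z in V satisfies y_i * (z . x_i) >= 0, so h = -y_i * x_i
    # gives h . z <= 0 < h . w
    return False, -y[i] * X[i]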


9.3 A Polynomial Implementation of QBC

Freund et al. [46] showed that QBC learns a homogeneous linear classifier exponentially faster than passive learners (see also Chapter 6). In Section 7.3 on page 88 we showed that we do not need to sample exactly from the correct prior and distribution. However, we required there that the approximation be close in a multiplicative sense, whereas the sampling algorithm discussed in Theorem 9.1 has an additive discrepancy. Furthermore, the complexity of the sampling algorithm depends on the ratio between the radii of a bounding ball and a bounded ball. In this section we show how these ingredients can be combined into a polynomial time implementation of QBC.

The polynomial implementation is presented as Algorithm 7. It has the same structure as the original QBC algorithm (Algorithm 5 on page 52), but uses the sampling techniques presented above. Next we prove the efficiency of this algorithm. Efficiency here means two things: the computational complexity, for which we show that the algorithm runs in polynomial time, and the sample complexity, for which we show that the implementation enjoys exponential learning rates, similar to those of QBC.

Theorem 9.2 Let the target concept be a uniformly distributed homogeneous linear classifier. Assume that the distribution over the sample space is uniform. Let $1-\delta$ be a confidence parameter. Then with probability $1-\delta$ the following holds:

1. The expected information gain of the queries the polynomial QBC makes is at least $g/2$, where $g$ is the expected information gain of the original QBC algorithm.

2. There exists
$$\tilde{g} = \frac{g}{32\log\frac{32}{g}}\log\left(1+\frac{g}{8}\,2^{-g/2}\right) > 0$$
such that for any $k = \Omega\left(\tilde{g}^{-2}\log(1/\delta)\right)$ and
$$\epsilon = \Omega\left(\frac{\tilde{g}\,k\log(dk)}{\delta d^2}\,2^{-gk/d}\right)$$
the polynomial QBC implementation will return a hypothesis $h$ such that $\Pr_x\left[h(x) \ne c(x)\right] \le \epsilon$.

3. It will use $k$ labels and $m_0 = d\,2^{O(gk/d)}$ unlabeled instances.

4. Each iteration of the algorithm runs in $\mathrm{poly}\left(k, \frac{1}{\epsilon}, \frac{1}{\delta}\right)$ time.

Proof: We begin by analyzing the computational complexity of the proposed algorithm.


Algorithm 7 Polynomial Implementation of QBC

Inputs:
• Required accuracy $\epsilon$.
• Required confidence $1-\delta$.
• The dimension of the problem $d$.

Output:
• A hypothesis $h$.

Algorithm:
1. Let $V_1 = C$.
2. Let $k \leftarrow 0$.
3. Let $l \leftarrow 0$.
4. For $t = 1, \ldots$
  (a) Receive an unlabeled instance $x_t$.
  (b) Let $l \leftarrow l+1$.
  (c) Select $c_1$ and $c_2$ uniformly from $V_t$ using a sampling algorithm with additive accuracy $\epsilon_t$, where $\epsilon_t = \frac{g\delta}{240k(k+1)t_k}$ and $g$ is a lower bound on the expected information gain of QBC when learning linear separators with the correct priors.
  (d) If $c_1(x_t) \ne c_2(x_t)$ then
    i. Query for the label $y_t = c(x_t)$.
    ii. Let $k \leftarrow k+1$.
    iii. Let $l \leftarrow 0$.
    iv. Let $V_{t+1} \leftarrow \{c \in V_t : c(x_t) = y_t\}$.
  (e) Else let $V_{t+1} \leftarrow V_t$.
  (f) If $l \ge t_k$, where $t_k = \frac{80}{\epsilon\delta^2}\ln\frac{10k(k+1)}{\delta}$:
    i. Choose a hypothesis $h$ uniformly from $V_t$ using a sampling algorithm with additive accuracy $\delta/40$.
    ii. Return $h$.


We showed in Theorem 9.1 that sampling from convex bodies can be done in $\mathrm{poly}\left(d, \log\frac{R}{r}, \log\frac{1}{\epsilon}\right)$ time, where $R$ is the radius of a ball containing $V_t$ and $r$ is the radius of a ball contained in $V_t$. Clearly, $V_t$ is a subset of the unit ball, and thus we may assume that $R = 1$. We would like to show that $r$ is not too small. Let $V^*$ be the version space induced by the labels of all $m_0$ instances. Clearly $V^* \subseteq V_t$ for all $t$; therefore, if there is a ball of radius $r$ in $V^*$, the same ball is contained in $V_t$ as well, and so we study $V^*$. In Lemma 6.1 we showed that for any sequence of $m_0$ instances, the probability that the target concept is such that the probabilistic volume of the version space it induces is smaller than $\left(\frac{em_0}{d}\right)^{-(d+1)}$ is at most $\frac{d}{em_0}$. Therefore, if $m_0 > \frac{10d}{e\delta}$ then with probability $1-\delta/10$ the measure of the version space is at least $\left(\frac{em_0}{d}\right)^{-(d+1)}$ at all times, and thus its volume is at least $\mathrm{Vol}(B_d)\left(\frac{em_0}{d}\right)^{-(d+1)}$, where $B_d$ is the $d$-dimensional unit ball and $\mathrm{Vol}(B_d)$ is its volume. In Lemma 9.2 on page 119 we show that for any compact convex body $K$, such as the version space, there exists a ball of radius $r$ inside the body with
$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\,d^d R^{d-1}}$$
where $R$ is the radius of a ball containing $K$. Since the version space is a subset of the unit ball, we can use $R = 1$ in our case. Since the volume of the version space is at least $\mathrm{Vol}(B_d)\left(\frac{em_0}{d}\right)^{-(d+1)}$, we conclude that with probability $1-\delta$ the version space contains a ball of radius $r$ such that
$$r \ge \frac{d}{(em_0)^{d+1}}$$
Using the bound on $r$ and the bound on $m_0$, we obtain that the complexity of each iteration is
$$\mathrm{poly}\left(d,\ gk,\ \log\frac{1}{\epsilon},\ \log\frac{1}{\delta}\right)$$
We now turn to prove that the hypothesis returned by this implementation of QBC is indeed a good approximation of the target concept. The proof is very similar to the proof of Theorem 7.3 on page 89, where we considered the QBC algorithm with incorrect priors; for the sake of completeness we present the two main ingredients of the proof. First we show that there is a lower bound on the expected information gain from the next query; then we show that if the algorithm terminated, the hypothesis it returns is a good approximation of the target concept with high probability.

We begin by analyzing the expected information gain. Let $V$ be the current version space and let $\gamma$ be the additive accuracy we require from the sampling algorithm. We have that
$$g \le \frac{\int U_V\left(V^+(x)\right) U_V\left(V^-(x)\right) H\left(U_V\left(V^+(x)\right)\right) dD(x)}{\int U_V\left(V^+(x)\right) U_V\left(V^-(x)\right) dD(x)}$$


where $U_V(\cdot)$ is the uniform distribution restricted to $V$. When sampling $c$ from $V$ we are guaranteed that for any measurable set $A$:
$$\left|U_V(A) - \Pr_c[c \in A]\right| \le \gamma$$
and thus we have
$$\Pr_c\left[c \in V^+(x)\right]\Pr_c\left[c \in V^-(x)\right] \le \left(U_V\left(V^+(x)\right)+\gamma\right)\left(U_V\left(V^-(x)\right)+\gamma\right) = U_V\left(V^+(x)\right)U_V\left(V^-(x)\right) + \gamma\left(U_V\left(V^+(x)\right)+U_V\left(V^-(x)\right)+\gamma\right) = U_V\left(V^+(x)\right)U_V\left(V^-(x)\right) + \gamma(1+\gamma)$$
and since $\gamma < 1$ we have
$$\Pr_c\left[c \in V^+(x)\right]\Pr_c\left[c \in V^-(x)\right] \le U_V\left(V^+(x)\right)U_V\left(V^-(x)\right) + 2\gamma$$
Repeating the same argument we have
$$\Pr_c\left[c \in V^+(x)\right]\Pr_c\left[c \in V^-(x)\right] \ge U_V\left(V^+(x)\right)U_V\left(V^-(x)\right) - \gamma$$
Let $q$ be the probability that the polynomial QBC will query for the label of the next instance it sees. Clearly
$$q = 2\int \Pr_c\left[c \in V^+(x)\right]\Pr_c\left[c \in V^-(x)\right] dD(x)$$
If $q$ is very small then the algorithm will most likely terminate. Recall that it terminates when no query is made for $t_k$ consecutive instances; the probability of this is exactly $(1-q)^{t_k}$. Assume that
$$q \le \frac{\delta}{20k(k+1)t_k}$$
Since $0 < q \le 1/2$ we have $e^{-2q} \le 1-2q+2q^2 \le 1-q$, and therefore
$$(1-q)^{t_k} \ge e^{-2qt_k} \ge e^{-\delta/10k(k+1)} \ge 1 - \frac{\delta}{10k(k+1)}$$
Hence the probability that the algorithm will not terminate once $q$ is this small is at most $\delta/10k(k+1)$. By summing over $k$ we get that with probability $1-\delta/10$ the algorithm will not make another query after it reaches the state where $q \le \frac{\delta}{20k(k+1)t_k}$.


Assume now that $q > \frac{\delta}{20k(k+1)t_k}$. It follows that the probability that QBC, when sampling from the true posterior, would query for the label of the next instance is at least $q - 2\gamma$. Since the expected information gain of QBC is at least $g$, we have
$$2\int U_V\left(V^+(x)\right)U_V\left(V^-(x)\right)H\left(U_V\left(V^+(x)\right)\right)dD(x) \ge g(q-4\gamma)$$
and thus
$$2\int \Pr_c\left[c \in V^+(x)\right]\Pr_c\left[c \in V^-(x)\right]H\left(U_V\left(V^+(x)\right)\right)dD(x) \ge g(q-4\gamma) - 2\gamma \ge gq - 6\gamma$$
Thus the expected information gain of the polynomial QBC is at least
$$\frac{gq - 6\gamma}{q} = g - \frac{6\gamma}{q}$$
By choosing $\gamma = \frac{g\delta}{240k(k+1)t_k}$ and using the fact that $q > \frac{\delta}{20k(k+1)t_k}$, we conclude that the expected information gain is at least $g/2$.

The lower bound on the expected information gain proves that the number of queries the polynomial QBC algorithm makes on a sample of size $m_0$ is $O\left(\frac{d}{\tilde{g}}\log m_0\right)$ (see the arguments in the proof of the fundamental properties of the QBC algorithm, Theorem 6.1). Therefore, the algorithm will reach, with high probability, a sequence of $t_k$ consecutive instances for which it did not query for a label. We now argue that when this happens, if the algorithm returns a random hypothesis then it is likely to be a good approximation of the target concept. Let $W \subseteq C \times C$ be the set
$$W = \{(c_1, c_2) : D(x : c_1(x) \ne c_2(x)) > \epsilon\}$$
Let $p$ be the probability that if we choose $c_1$ using the sampling algorithm while $c_2$ is chosen using the true prior, then $(c_1, c_2) \in W$. If $p \le \delta/10$, then whenever the polynomial QBC terminates and returns a random hypothesis, the hypothesis is a good approximation of the target concept with high probability. Now assume that $p > \delta/10$; we will show that with high probability the QBC algorithm will not terminate. Since $p > \delta/10$,
$$\Pr_{c_1}\left[U_V(c_2 : (c_1, c_2) \in W) > \delta/20\right] > \delta/20$$
For each $c_1$ such that $U_V(c_2 : (c_1, c_2) \in W) > \delta/20$, sampling $c_2$ from $V$ hits the set $W$ with high probability, since
$$\Pr_{c_2}\left[(c_1, c_2) \in W\right] > \frac{\delta}{20} - \epsilon^* = \frac{\delta}{40}$$
and thus
$$\Pr_{c_1, c_2}\left[(c_1, c_2) \in W\right] > \frac{\delta^2}{80}$$
By the definition of the set $W$, the probability that the polynomial QBC will query for the label of the next instance is at least $\epsilon\delta^2/80$. Therefore the probability of $t_k$ consecutive instances without a query, assuming that $p > \delta/10$, is less than
$$\left(1-\frac{\epsilon\delta^2}{80}\right)^{t_k} \le e^{-\epsilon\delta^2 t_k/80}$$
which by the choice of $t_k$ is $\delta/10k(k+1)$. By summing over $k$ we get that the probability that QBC will terminate while $p \ge \delta/10$ is at most $\delta/10$.

The remaining specifics of this proof are identical to the proof of Theorem 6.1, which analyzes the original QBC algorithm, and have thus been omitted. Note that there are several possible causes of failure; however, we showed that the probability of each of them is less than $\delta/10$. Using the union bound, the probability of failure is less than $\delta$.
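To summarize the construction, the following is a minimal sketch of the main loop of Algorithm 7 (ours, in Python). The convex-body sampler of Theorem 9.1 is abstracted as the argument sampler, and the constants follow the algorithm box above; the index is shifted from k to k + 1 only to avoid a degenerate logarithm at k = 0 in this illustration.

import numpy as np

def polynomial_qbc(stream, label_oracle, eps, delta, g, sampler):
    """Sketch of Algorithm 7. `stream` yields unlabeled instances,
    `label_oracle(x)` returns the true label of x, and `sampler(S, acc)`
    draws an approximately uniform hypothesis from the version space
    induced by the labeled set S, with additive accuracy acc."""
    S = []               # labeled sample; defines the version space V_t
    k, l = 0, 0          # number of queries, length of the quiet run
    for x in stream:
        l += 1
        t_k = 80.0 / (eps * delta ** 2) * np.log(10.0 * (k + 1) * (k + 2) / delta)
        eps_t = g * delta / (240.0 * (k + 1) * (k + 2) * t_k)
        c1, c2 = sampler(S, eps_t), sampler(S, eps_t)
        if np.sign(c1 @ x) != np.sign(c2 @ x):
            S.append((x, label_oracle(x)))    # query; the version space shrinks
            k, l = k + 1, 0
        elif l >= t_k:
            return sampler(S, delta / 40.0)   # t_k quiet rounds: return h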

9.4 A Geometric Lemma

While sampling from convex bodies can be done in polynomial time, for this to happen we need to prevent the body from becoming singular, i.e. we need the ratio between the radius of a bounding ball and that of a bounded ball to be moderate. We use the following lemma to bound this ratio.

Lemma 9.2 Let $K$ be a compact convex body in $\mathbb{R}^d$ which is bounded by a ball of radius $R$, and let $\mathrm{Vol}(K)$ be the volume of this body. Then there exists a ball of radius $r$ inside $K$ such that
$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\,d^d R^{d-1}}$$

Proof: Recall that John's theorem [57] states that there exists an ellipsoid $E \subseteq K$ such that $K \subseteq dE$, where $dE$ is the ellipsoid $E$ blown up by a factor $d$ around its origin. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ be the lengths of the principal axes of $E$. Since $\lambda_d$ is the smallest, we can place a ball of radius $\lambda_d$ inside $E$, centered at $E$'s origin; thus there is a ball of radius $r = \lambda_d$ inside $K$. The lengths of the principal axes of $dE$ are $d\lambda_1, \ldots, d\lambda_d$, and thus the volume of $dE$ is $\mathrm{Vol}(B_d)\prod_{i=1}^d d\lambda_i$, where $B_d$ is the $d$-dimensional unit ball. Since $K \subseteq dE$ we have
$$\mathrm{Vol}(K) \le \mathrm{Vol}(dE) = \mathrm{Vol}(B_d)\prod_{i=1}^d d\lambda_i$$


and therefore
$$r \ge \lambda_d \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\,d^d\prod_{i=1}^{d-1}\lambda_i}$$
Finally, since $K$ is contained in a ball of radius $R$ and $E$ is contained in $K$, we have $\lambda_1, \ldots, \lambda_d \le R$, and thus
$$r \ge \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\,d^d R^{d-1}}$$

The significance of this lemma is that it shows that
$$\log\frac{R}{r} = O\left(d\log d + d\log R - \log\mathrm{Vol}(K)\right)$$
Therefore, as expected, $\log\frac{R}{r}$ is moderate as long as $K$ occupies a non-negligible portion of the bounding ball of radius $R$. While the constants in Lemma 9.2 are not tight, it is clear that any bound on $r$ must be $O\left(\frac{\mathrm{Vol}(K)}{R^{d-1}}\right)$. To see this, let $R > r > 0$ and let $o_1, \ldots, o_d$ be a set of orthogonal vectors such that the lengths of $o_1, \ldots, o_{d-1}$ are $R$ and the length of $o_d$ is $r$. Let $K$ be an ellipsoid with $o_1, \ldots, o_d$ as its principal axes. Clearly, the minimal ball bounding $K$ has radius $R$ and the maximal ball contained in $K$ has radius $r$. Furthermore, the volume of $K$ is $R^{d-1} r\,\mathrm{Vol}(B_d)$, and therefore
$$r = \frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B_d)\,R^{d-1}}$$
which is identical to the bound of Lemma 9.2 up to the factor $d^d$.

9.5 Summary

In this chapter we showed that the QBC algorithm can be implemented in polynomial time for learning homogeneous linear classifiers. We reduced the problem of implementing the QBC algorithm to the problem of sampling from convex bodies, and used polynomial algorithms for sampling from such bodies. While these algorithms are polynomial, they are still far from practical; we discuss this issue further in the next chapter.

Chapter 10

Kernelizing the QBC

In this chapter we take another step towards making it possible to use QBC for real world tasks. In Chapter 9 we saw that algorithms for sampling from convex bodies can be used to sample from the version space when learning linear classifiers. While this provided us with a polynomial time algorithm, it is not sufficient, because it assumes that the task at hand can be carried out by a linear classifier. This problem is not unique to active learning and the QBC algorithm; the same problem is found in classical models such as the Perceptron [90] and, more generally, in neural networks. The universal way of overcoming it is to add a non-linear phase to the model, typically by mapping the input data with a non-linear activation function: the idea is to map the data to a new space in which they are more likely to be linearly separable. A further improvement on the idea of mapping the data to a new space and learning there was made by Vapnik and others [117, 19], who showed that in many cases it is not necessary to map the data explicitly; rather, this can be done implicitly by using kernels (see section 10.1 for more about kernels). This observation led to a stream of algorithms utilizing kernels: SVM [19], kernel PCA [103] and others (see e.g. [107] and the references therein). Kernels have proved successful in many applications, ranging from speaker identification [43] to predicting the arm movements of monkeys [108].

In this chapter we show how kernels can be used together with the QBC algorithm. To this end, we need to modify the algorithm to enable the use of kernels. The algorithm we present in this chapter uses the same skeleton as QBC, but replaces sampling from the high dimensional version space by sampling from a low dimensional projection of it. By doing so, we obtain an algorithm


which can cope with large-scale problems and at the same time supports the use of kernels. Although the algorithm uses linear classifiers at its core, the use of kernels makes it much broader in scope. This new sampling method is presented in section 10.2. Section 10.3 gives a detailed description of the kernelized version, the Kernel Query By Committee (KQBC) algorithm. The last building block is a method for sampling from convex bodies; we suggest the hit and run [79] random walk for this purpose in section 10.4. A Matlab implementation of KQBC is available at http://www.cs.huji.ac.il/labs/learning/code/qbc.

Other algorithms have been suggested for sampling from the version space; most notable is the billiard-walk based sampling of Herbrich et al. [53], who considered the problem of sampling the version space when kernels are used. The added value of our method is two-fold. First, we extend the theoretical reasoning behind the sampling approach. Second, we suggest using hit and run (see section 10.4) instead of the billiard walk, since hit and run is easier to use and is guaranteed to mix rapidly to the correct, i.e. uniform, distribution.

10.1 Kernels

We begin with a brief introduction to kernels; the reader who is familiar with this subject may wish to skip this section. Kernels are widely used in modern machine learning: they make it possible to use a unified learning algorithm for solving a diversity of problems by plugging in different kernels. In this section we give a brief introduction to the main definitions and properties of kernels.

Definition 10.1 A function $K : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is a kernel function if there exist a Hilbert space $\mathcal{H}$ and a function $\varphi : \mathcal{X} \to \mathcal{H}$ such that $K(x_1, x_2) = \varphi(x_1)\cdot\varphi(x_2)$.

10.1.1 Commonly used Kernel Functions

Here is a list of some commonly used kernel functions:

1. The polynomial kernel: for $\mathcal{X} = \mathbb{R}^d$ we define the kernel function $K(x_1, x_2) = (x_1\cdot x_2 + c)^p$ for $c \ge 0$ and $p \ge 1$.

2. The Gaussian/radial kernel: for $\mathcal{X} = \mathbb{R}^d$ we define the kernel function $K(x_1, x_2) = e^{-\|x_1-x_2\|^2/2\sigma^2}$ for $\sigma \ne 0$.

3. The sigmoid kernel: for $\mathcal{X} = \mathbb{R}^d$ we define the kernel function $K(x_1, x_2) = \tanh(\kappa\,x_1\cdot x_2 + \theta)$ for a variety of choices of $\kappa$ and $\theta$.

4. The ridge kernel [106]: the ridge kernel is an extension that can be applied to any kernel. Let $K$ be a kernel function; then we define the kernel function $\hat{K}(x_1, x_2) = K(x_1, x_2) + \Delta\,\delta_{x_1,x_2}$, where $\Delta \ge 0$ and $\delta$ is the Kronecker delta, i.e. $\delta_{x_1,x_2} = 1$ if $x_1 = x_2$ and $0$ otherwise.

Other kernels exist for a variety of sample spaces: string kernels [70], spike kernels [108], Fisher kernels [55] and many others.

10.1.2 The Gram Matrix

Many learning algorithms, e.g. SVM [19], need only inner products between instances for training and generalization. In these cases, it suffices to provide the algorithm with the Gram matrix, which contains all the inner products between instances:

Definition 10.2 Let $x_1, \ldots, x_m$ be instances in a sample space $\mathcal{X}$, and let $K$ be a kernel function over this space. Then the Gram matrix is a symmetric real-valued $m\times m$ matrix whose $(i,j)$ entry is $K(x_i, x_j)$.

It follows that any Gram matrix must be positive semi-definite; in other words, if $G$ is a Gram matrix then for any vector $w \in \mathbb{R}^m$: $wGw^\top \ge 0$.
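To make these definitions concrete, here is a small sketch (ours, in Python; every name in it is our own choice, not the thesis's code) that evaluates the Gaussian kernel of section 10.1.1, builds the corresponding Gram matrix with an optional ridge term, and checks positive semi-definiteness numerically.

import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    """The Gaussian/radial kernel K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def gram_matrix(X, kernel, ridge=0.0):
    """Gram matrix G[i, j] = K(x_i, x_j), with an optional ridge term
    Delta on the diagonal (the ridge kernel of section 10.1.1)."""
    m = len(X)
    G = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    return G + ridge * np.eye(m)

X = np.random.randn(6, 4)                        # six instances in R^4
G = gram_matrix(X, gaussian_kernel, ridge=0.1)
print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)   # PSD check: prints True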

10.1.3 Mercer's Conditions

Mercer's conditions provide necessary and sufficient conditions for a function $K : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ to be a valid kernel function.


Theorem 10.1 A function $K : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is a kernel function iff for any $g(x)$ such that $\int g^2(x)\,dx < \infty$ it holds that
$$\int K(x_1, x_2)\,g(x_1)\,g(x_2)\,dx_1\,dx_2 \ge 0$$

10.2 A New Method for Sampling the Version-Space

The Query By Committee algorithm [104] provides a general framework that can be used with any concept class. Whenever a new instance is presented, QBC generates two independent predictions for its label by sampling two hypotheses from the version space; if the two predictions differ, QBC queries for the label of the instance at hand (see Algorithm 5 on page 52). The main obstacle in implementing QBC is the need to sample from the version space (step 4c): it is not clear how to do this with reasonable computational complexity. As is the case for much research in machine learning, we first focus on the class of linear classifiers and then extend the discussion by using kernels. In the linear case, the dimension of the version space is the input dimension, which is typically large for real world problems; thus direct sampling is practically impossible. We overcome this obstacle by projecting the version space onto a low dimensional subspace.

Assume that the learner has seen the labeled sample $S = \{(x_i, y_i)\}_{i=1}^k$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm1\}$. The version space is defined to be the set of all classifiers which correctly classify all the instances seen so far:
$$V = \{w : \|w\| \le 1 \text{ and } \forall i\ y_i(w\cdot x_i) > 0\} \qquad (10.1)$$
QBC assumes a prior $\nu$ over the class of linear classifiers. The sample $S$ induces a posterior over the class of linear classifiers, which is the restriction of $\nu$ to $V$. Thus, the probability that QBC will query for the label of an instance $x$ is exactly
$$2\Pr_{w\sim\nu|V}[w\cdot x > 0]\,\Pr_{w\sim\nu|V}[w\cdot x < 0] \qquad (10.2)$$
where $\nu|V$ is the restriction of $\nu$ to $V$. From (10.2) we see that there is no need to explicitly select two random hypotheses. Instead, we can use any stochastic approach that queries for the label with the same probability as in (10.2). Furthermore, if we can sample $\hat{y} \in \{\pm1\}$ such that
$$\Pr[\hat{y} = 1] = \Pr_{w\sim\nu|V}[w\cdot x > 0] \qquad (10.3)$$
and
$$\Pr[\hat{y} = -1] = \Pr_{w\sim\nu|V}[w\cdot x < 0] \qquad (10.4)$$
we can use it instead, by querying for the label of $x$ with probability $2\Pr[\hat{y} = 1]\Pr[\hat{y} = -1]$.

Based on this observation, we introduce a stochastic algorithm which returns $\hat{y}$ with probabilities as specified in (10.3) and (10.4); this procedure can replace the sampling step in the QBC algorithm. Let $S = \{(x_i, y_i)\}_{i=1}^k$ be a labeled sample and let $x$ be an instance for which we need to decide whether to query for its label. We denote by $V$ the version space as defined in (10.1) and by $T$ the space spanned by $x_1, \ldots, x_k$ and $x$. QBC draws two random hypotheses from $V$ and queries for the label of $x$ only if these two hypotheses predict different labels for $x$. Our procedure does the same thing, but instead of sampling the hypotheses from $V$ we sample them from $V \cap T$.

One main advantage of this new procedure over the original QBC is that it samples from a space of low dimension, and therefore its computational complexity is much lower. This is true since $T$ is a space of dimension $k+1$ at most, where $k$ is the number of label queries QBC has made so far. Hence, the body $V \cap T$ is a low-dimensional convex body¹ and thus sampling from it can be done efficiently; the input dimension plays only a minor role in the sampling algorithm. Another important advantage is that it allows us to use kernels, and therefore gives a systematic way to extend QBC to the non-linear scenario. The use of kernels is described in detail in section 10.3.

The following theorem proves that sampling from $V \cap T$ indeed produces the desired results: it shows that if the prior $\nu$ (see Algorithm 5 on page 52) is uniform, then sampling hypotheses uniformly from $V$ or from $V \cap T$ generates the same results.

Theorem 10.2 Let $S = \{(x_i, y_i)\}_{i=1}^k$ be a labeled sample and $x$ an instance. Let $V$ be the version space $V = \{w : \|w\| \le 1 \text{ and } \forall i\ y_i(w\cdot x_i) > 0\}$ and let $T = \mathrm{span}(x, x_1, \ldots, x_k)$. Then
$$\Pr_{w\sim U(V)}[w\cdot x > 0] = \Pr_{w\sim U(V\cap T)}[w\cdot x > 0]$$
and
$$\Pr_{w\sim U(V)}[w\cdot x < 0] = \Pr_{w\sim U(V\cap T)}[w\cdot x < 0]$$
where $U(\cdot)$ is the uniform distribution.

¹ From the definition of the version space $V$ it follows that it is a convex body; see Lemma 9.1 on page 112.


Before we prove this theorem, we prove two lemmas.

Lemma 10.1 Let $V$ and $T$ be as defined in Theorem 10.2, and let $P_T$ be the orthogonal projection onto $T$. Then $P_T(V) = V \cap T$.

Proof: Let $w \in V \cap T$. Since $w \in T$ we have $w = P_T(w)$; combined with the fact that $w \in V$, we conclude that $w \in P_T(V)$, and thus $V \cap T \subseteq P_T(V)$. On the other hand, let $w \in P_T(V)$; it suffices to show that $w \in V$. Let $\hat{w} \in V$ be such that $P_T(\hat{w}) = w$. Since $P_T$ is a projection, $\|w\| \le \|\hat{w}\| \le 1$. Moreover, since $\hat{w} - w \in T^\perp$ and $x_i \in T$, we have
$$\hat{w}\cdot x_i = w\cdot x_i + (\hat{w} - w)\cdot x_i = w\cdot x_i$$
and thus $y_i\,w\cdot x_i = y_i\,\hat{w}\cdot x_i > 0$, so $w \in V$, which completes the proof.

Next we show that $V$ is almost a product space.

Lemma 10.2 Let $w \in V \cap T$. Then $P_T^{-1}(w) \cap V = \left\{w + v : v \in T^\perp,\ \|v\| \le \sqrt{1-\|w\|^2}\right\}$, where $P_T^{-1}(w) = \{v : P_T(v) = w\}$.

Proof: Let $v \in T^\perp$ be such that $\|v\| \le \sqrt{1-\|w\|^2}$. Then $\|v+w\|^2 = \|v\|^2 + \|w\|^2 \le 1$. For any $(x_i, y_i)$ we have $y_i\,(w+v)\cdot x_i = y_i\,w\cdot x_i$ since $v \perp x_i$, and therefore $v+w \in V$. Furthermore, $P_T(w+v) = w$ since $v \in T^\perp$, and thus $v+w \in P_T^{-1}(w) \cap V$. Therefore
$$P_T^{-1}(w) \cap V \supseteq \left\{w + v : v \in T^\perp,\ \|v\| \le \sqrt{1-\|w\|^2}\right\}$$
On the other hand, let $u \in P_T^{-1}(w) \cap V$. Clearly $P_T(u) = w$, and therefore $u = w + v$ for some $v \in T^\perp$. Since $w \perp v$ and $\|u\| \le 1$, it follows that $\|v\| \le \sqrt{1-\|w\|^2}$, and thus $u \in \left\{w + v : v \in T^\perp,\ \|v\| \le \sqrt{1-\|w\|^2}\right\}$. Finally
$$P_T^{-1}(w) \cap V \subseteq \left\{w + v : v \in T^\perp,\ \|v\| \le \sqrt{1-\|w\|^2}\right\}$$


This completes the proof.

We are now ready to present the proof of the main theorem.

Proof (of Theorem 10.2): First note that for any $u \in V$,
$$\mathrm{sign}(u\cdot x) = \mathrm{sign}(P_T(u)\cdot x) \qquad (10.5)$$
Let $\nu$ be the push-forward probability measure $P_T(U(V))$; i.e., if $A$ is a measurable set then $\nu(A)$ is the measure under $U(V)$ of $P_T^{-1}(A)$. From (10.5) it follows that
$$\Pr_{w\sim\nu}[w\cdot x > 0] = \Pr_{w\sim U(V)}[w\cdot x > 0] \quad\text{and}\quad \Pr_{w\sim\nu}[w\cdot x < 0] = \Pr_{w\sim U(V)}[w\cdot x < 0]$$
Clearly, $\nu$ is continuous with respect to the Lebesgue measure and hence has a density; denote it by $d\nu$. From Lemma 10.1 it follows that for any $w \notin V \cap T$ the density $d\nu(w)$ is zero. From Lemma 10.2 it follows that for any $w \in V \cap T$ the density $d\nu(w)$ depends solely on $\|w\|$. Finally, since for any $\lambda > 0$, $\mathrm{sign}(w\cdot x) = \mathrm{sign}(\lambda w\cdot x)$, it follows that
$$\Pr_{w\sim\nu}[w\cdot x > 0] = \Pr_{w\sim U(V\cap T)}[w\cdot x > 0] \quad\text{and}\quad \Pr_{w\sim\nu}[w\cdot x < 0] = \Pr_{w\sim U(V\cap T)}[w\cdot x < 0]$$
This completes the proof.

Theorem 10.2 establishes the soundness of the sampling algorithm presented here: although we sample from a low-dimensional projection of the version space, the results are identical.
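Theorem 10.2 also lends itself to a quick numerical sanity check. The sketch below (ours; a toy verification, not part of the thesis) estimates Pr[w · x > 0] by rejection sampling, once over the full version space and once over its intersection with T; under the uniform prior the two estimates agree up to Monte Carlo noise.

import numpy as np

rng = np.random.default_rng(0)

def sample_ball(dim, n):
    """n points drawn uniformly from the unit ball of R^dim."""
    g = rng.standard_normal((n, dim))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * (rng.random(n) ** (1.0 / dim))[:, None]

d = 6
X = rng.standard_normal((3, d))             # three labeled instances
y = np.array([1.0, -1.0, 1.0])
x = rng.standard_normal(d)                  # the new instance

# estimate over V: uniform in the ball, rejected against the constraints
W = sample_ball(d, 200000)
keep = np.all(y * (W @ X.T) > 0, axis=1)
p_V = np.mean(W[keep] @ x > 0)

# the same estimate inside T = span(x, x_1, x_2, x_3)
Q, _ = np.linalg.qr(np.vstack([x, X]).T)    # orthonormal basis of T
A = sample_ball(Q.shape[1], 200000)         # coefficients, uniform in a ball
WT = A @ Q.T                                # the corresponding points of T
keep_T = np.all(y * (WT @ X.T) > 0, axis=1)
p_VT = np.mean(WT[keep_T] @ x > 0)
print(p_V, p_VT)                            # close, up to sampling noise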

10.3 Sampling with Kernels

In this section we show how the new sampling method presented in section 10.2 can be used together with kernels. QBC uses the random hypotheses for one purpose alone: to check the labels they predict for instances. In our new sampling method the hypotheses are sampled from $V \cap T$, where $T = \mathrm{span}(x, x_1, \ldots, x_k)$. Hence, any hypothesis is represented by $w \in V \cap T$ of the form
$$w = \alpha_0 x + \sum_{j=1}^k \alpha_j x_j \qquad (10.6)$$


The label $w$ assigns to an instance $x'$ is
$$w\cdot x' = \left(\alpha_0 x + \sum_{j=1}^k \alpha_j x_j\right)\cdot x' = \alpha_0\, x\cdot x' + \sum_{j=1}^k \alpha_j\, x_j\cdot x' \qquad (10.7)$$

Note that (10.7) uses only inner products, hence we can use kernels. Using these observations, we can sample a hypothesis by sampling $\alpha_0, \ldots, \alpha_k$ and defining $w$ as in (10.6). However, since the $x_i$'s do not form an orthonormal basis of $T$, sampling the $\alpha$'s uniformly is not equivalent to sampling the $w$'s uniformly. We overcome this problem by using an orthonormal basis of $T$. The following lemma shows how an orthonormal basis for $T$ can be computed using only inner products.

Lemma 10.3 Let $x_0, \ldots, x_k$ be a set of vectors, let $T = \mathrm{span}(x_0, \ldots, x_k)$ and let $G = (g_{i,j})$ be the Gram matrix with $g_{i,j} = x_i\cdot x_j$. Let $\lambda_1, \ldots, \lambda_r$ be the non-zero eigenvalues of $G$, with corresponding eigenvectors $\gamma_1, \ldots, \gamma_r$. Then the vectors $t_1, \ldots, t_r$ with
$$t_i = \sum_{l=0}^k \frac{\gamma_i(l)}{\sqrt{\lambda_i}}\, x_l$$
form an orthonormal basis of the space $T$.

orthonormal basis, kwk = kαk. Furthermore, we can check the label w assigns to xj by w · xj =

X i

α (i) ti · xj =

X i,l

γi (l) α (i) √ xl · xj γi

which is a function of the Gram matrix. Therefore, sampling from $V \cap T$ boils down to the problem of sampling from convex bodies, where instead of sampling a vector directly we sample the coefficients of the orthonormal basis $t_1, \ldots, t_r$. Keep in mind that we do not need to recalculate this basis for every new instance. Instead, if we have the basis $t_1, \ldots, t_r$ for $\mathrm{span}(x_1, \ldots, x_k)$ and we encounter a new instance $x_0$, we can simply compute
$$t_\perp = x_0 - \sum_{i=1}^r (x_0\cdot t_i)\, t_i$$
If $t_\perp$ is zero then $x_0 \in \mathrm{span}(x_1, \ldots, x_k)$ and we do not need to extend the basis; otherwise we extend the basis with the vector $t_{r+1} = t_\perp/\|t_\perp\|$. The computational complexity of this process is $O(r^2)$, which is $O(k^2)$ at most. We now go back to prove Lemma 10.3.


Proof (of Lemma 10.3): First note that $t_1, \ldots, t_r \in T$ and thus $\mathrm{span}(t_1, \ldots, t_r) \subseteq T$. Also note that the dimension of $T$ is $r$. Indeed, if the dimension of $T$ were greater than $r$, there would exist an orthonormal basis $\tau_1, \ldots, \tau_k$ of $T$ with $k > r$. We can express the vectors $\tau_1, \ldots, \tau_k$ in terms of the $x_i$'s as $\tau_i = \sum_j \tau_i(j)\,x_j$. Let $\Theta = (\theta_{i,j})$ be the matrix with $\theta_{i,j} = \tau_i(j)$; then
$$\left(\Theta G\Theta'\right)_{i,j} = \sum_{s,l} \tau_i(l)\,\tau_j(s)\; x_l\cdot x_s = \tau_i\cdot\tau_j = \delta_{i,j}$$
where the last equality follows since $\tau_1, \ldots, \tau_k$ are orthonormal. It follows that $\Theta G\Theta' = I_{k\times k}$. Since $k > r$, this contradicts the assumption that $\mathrm{rank}(G) = r$; therefore the dimension of $T$ is at most $r$.

To complete the proof, it suffices to show that $t_1, \ldots, t_r$ are indeed orthonormal, i.e. that $t_i\cdot t_j = \delta_{i,j}$:
$$t_i\cdot t_j = \left(\sum_{l=0}^k \frac{\gamma_i(l)}{\sqrt{\lambda_i}}\,x_l\right)\cdot\left(\sum_{l=0}^k \frac{\gamma_j(l)}{\sqrt{\lambda_j}}\,x_l\right) = \frac{1}{\sqrt{\lambda_i\lambda_j}}\sum_{l,s}\gamma_i(l)\,\gamma_j(s)\;x_l\cdot x_s = \frac{1}{\sqrt{\lambda_i\lambda_j}}\,\gamma_i' G\gamma_j = \frac{\lambda_j}{\sqrt{\lambda_i\lambda_j}}\left(\gamma_i\cdot\gamma_j\right) = \delta_{i,j}$$
where the last equality follows since the eigenvectors $\gamma_1, \ldots, \gamma_r$ are orthonormal.
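As a concrete illustration of Lemma 10.3 and of the incremental basis update above, the following sketch (ours, in Python; the function names are not from the thesis) computes the coefficient representation of t_1, ..., t_r from the Gram matrix alone and verifies orthonormality; only inner products, i.e. kernel evaluations, are used.

import numpy as np

def orthonormal_coeffs(G, tol=1e-10):
    """Given the Gram matrix G of x_0, ..., x_k, return a matrix C whose
    i-th row holds the coefficients of t_i, i.e.
    t_i = sum_l C[i, l] * x_l  with  C[i, l] = gamma_i(l) / sqrt(lambda_i)."""
    lam, gamma = np.linalg.eigh(G)     # eigen-decomposition of G
    keep = lam > tol                   # keep the non-zero eigenvalues only
    return (gamma[:, keep] / np.sqrt(lam[keep])).T

def basis_inner_products(C, G, j):
    """The inner products t_i . x_j for all i, from G alone:
    t_i . x_j = sum_l C[i, l] * (x_l . x_j)."""
    return C @ G[:, j]

# sanity check: orthonormality of the t_i is equivalent to C G C' = I_r
X = np.random.randn(5, 3)
G = X @ X.T
C = orthonormal_coeffs(G)
print(np.allclose(C @ G @ C.T, np.eye(C.shape[0])))   # prints True

In the next section we discuss one possible method of sampling from this convex body.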

10.4 Hit and Run

Hit and run [79] is a method of sampling from a convex body $K$ using a random walk. Let $z \in K$. A single step of hit and run begins by choosing a random point $u$ on the unit sphere; the algorithm then moves to a random point selected uniformly from $\ell \cap K$, where $\ell$ is the line passing through $z$ and $z + u$. Hit and run has several advantages over other random walks for sampling from convex bodies. First, its stationary distribution is indeed the uniform distribution; moreover, it mixes fast [79] and does not require a "warm" starting point [80]. What makes it especially suitable for practical use is


the fact that it does not require any parameter tuning other than the number of random steps. It is also very easy to implement. Current proofs [79, 80] show that $O^*\left(d^3\right)$ steps are needed for the random walk to mix. However, the constants in these bounds are very large. Nevertheless, our experiments show that in practice hit and run mixes much faster than that (see Chapter 11 on page 132). We have used it to sample from the body $V \cap T$. The number of steps we used was very small, ranging from a couple of hundred to a couple of thousand. Our empirical study shows that this suffices to obtain impressive results.
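For the body we sample from, the unit ball intersected with the halfspaces y_i (w · x_i) ≥ 0, the chord ℓ ∩ K has a closed form, which makes a hit-and-run step very cheap. Below is a minimal sketch (ours, in Python; the Matlab code linked above is the reference implementation).

import numpy as np

def hit_and_run(z, X, y, steps=500, rng=None):
    """Hit-and-run inside K = {w : ||w|| <= 1, y_i (w . x_i) >= 0},
    starting from an interior point z of K."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(steps):
        u = rng.standard_normal(len(z))
        u /= np.linalg.norm(u)                  # random direction on the sphere
        # chord limits from the ball: ||z + t u||^2 <= 1
        b = z @ u
        disc = np.sqrt(b ** 2 - (z @ z - 1.0))
        lo, hi = -b - disc, -b + disc
        # tighten with every halfspace: y_i (z + t u) . x_i >= 0
        for xi, yi in zip(X, y):
            a, c = yi * (u @ xi), yi * (z @ xi)   # the constraint reads c + t a >= 0
            if a > 0:
                lo = max(lo, -c / a)
            elif a < 0:
                hi = min(hi, -c / a)
        z = z + rng.uniform(lo, hi) * u         # uniform point on the chord
    return z

In KQBC the walk runs over the coefficients of the orthonormal basis of section 10.3, so the effective dimension is at most k + 1, not the input dimension.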

10.5 Generalizing to Unseen Instances

We saw how the QBC learning process can be conducted efficiently even when kernels are being used. We now look at the generalization phase. In Chapter 5, where the QBC algorithm is presented, we discussed several options for the generalization phase of QBC. One option is to work in an online fashion, in which there is no clear distinction between the learning and the generalization phases (see Theorem 5.6 on page 57): the learner predicts the label of each instance it sees and at the same time decides whether to query for the label or not. As we saw in previous sections, this does not introduce any difficulty when kernels are being used. In the other settings presented in Chapter 5, the learning phase stops once a certain stopping criterion is met, at which point QBC returns a hypothesis. We have discussed several options for the choice of the returned hypothesis, and we would like to verify which of them can be used together with kernels.

The first hypothesis we consider is the Bayes optimal hypothesis. This hypothesis is not necessarily a linear classifier and thus, in general, does not have a simple representation. Since this is a problem even when kernels are not being used, we will certainly face the same problem once kernels are used.

The second kind of hypothesis we consider is the Gibbs hypothesis. There are two possibilities here. First, we can draw a random hypothesis whenever we would like to label an instance; using the techniques presented in the previous sections of this chapter, this can be combined with kernels. An alternative way to use the Gibbs hypothesis is to draw a single hypothesis from the version space and use it for all future predictions. This cannot be done when kernels are used, because the random hypothesis needs to be sampled from the full version space.


Note that when we projected the version space onto the space $T$, we used $T = \mathrm{span}(x, x_1, \ldots, x_k)$; we assumed that we know the instance $x$ whose label we would like to predict. When $x$ is not known, it is not clear which subspace to project onto.

The final option we considered in Chapter 5 was to use the Bayes Point Machine (BPM) classifier, which in our case is the center of gravity of the version space. It is easy to verify that, under the assumption that the prior is uniform, the center of gravity always lies in the span of the instances for which we queried for labels. Furthermore, using the same arguments as we used throughout this chapter, it is easy to show that if $V$ is the version space and $T$ is the span of the instances for which we queried for labels, then the center of gravity of $V$ coincides with the center of gravity of $V \cap T$. Thus the BPM classifier can be used even in the kernelized setting.
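In practice, under the uniform prior, the center of gravity can be estimated by averaging approximately uniform samples from V ∩ T; a minimal sketch (ours), reusing the hit_and_run routine sketched in section 10.4:

import numpy as np

def bayes_point(z0, X, y, samples=200, steps=300):
    """Estimate the center of gravity of the version space (the BPM
    classifier) by averaging hit-and-run samples; z0 is an interior
    starting point."""
    pts, z = [], z0
    for _ in range(samples):
        z = hit_and_run(z, X, y, steps=steps)   # sketched in section 10.4
        pts.append(z)
    return np.mean(pts, axis=0)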

10.6 Summary and Further Study

In this chapter we presented two main ideas. First, we showed how kernels can be used to enhance the ability of QBC to deal with tasks where the target classifier is not necessarily linear. Kernels can also be used to overcome noise via the ridge trick: for any two instances $x_1, x_2$ and a kernel $K$ we define
$$\hat{K}(x_1, x_2) = K(x_1, x_2) + \Delta\,\delta_{x_1,x_2}$$
where $\Delta > 0$ and $\delta_{x_1,x_2}$ is one if $x_1$ and $x_2$ are identical and zero otherwise. Using the kernel $\hat{K}$, every task becomes linearly separable.

The second issue we dealt with in this chapter is practical methods for sampling the version space. We suggested the hit-and-run algorithm for this purpose and discussed its adequacy. In the following chapter we present the empirical results obtained when applying the techniques presented here to several learning tasks.

Chapter 11

Empirical Evidence

11.1 Empirical Study

In this chapter we present the results of applying the kernelized version of the Query By Committee (KQBC) algorithm with the hit-and-run random walk (see Chapter 10) to two learning tasks. The first task involves classification of synthetic data; the second is a real world problem.

11.1.1 Synthetic Data

In our first experiment we study the task of learning a linear classifier in a $d$-dimensional space. The target classifier is the vector $w^* = (1, 0, \ldots, 0)$; thus the label of an instance $x \in \mathbb{R}^d$ is the sign of its first coordinate. The instances are normally distributed, $N(\mu = 0, \Sigma = I_d)$. In each trial we use 10000 unlabeled instances and let KQBC select the instances for which to query labels. We also apply a Support Vector Machine (SVM) to the same data; the linear kernel is used for both KQBC and SVM. Since SVM is a passive learner, it is trained on prefixes of the training data of different sizes. The results are presented in figure 11.1. The difference between KQBC and SVM is notable: when both are applied to a 15-dimensional linear discrimination problem (figure 11.1b), SVM and KQBC have error rates of about 6% and 0.7% respectively after 120 labels. After such a short training sequence the difference already reaches an order of magnitude, and the same qualitative results emerge for all problem sizes. As expected, the generalization error of KQBC decreases exponentially fast as the number of queries increases, whereas the generalization error of SVM decreases only at an inverse-polynomial rate (the rate is $O^*(1/k)$ where $k$ is the number of labels).

[Figure 11.1 appears here: three panels, (a) 5 dimensions, (b) 15 dimensions and (c) 45 dimensions, each plotting the generalization error (%) against the number of queries k for Kernel Query By Committee and Support Vector Machine, together with the fitted curves $48\cdot 2^{-0.9k/5}$, $53\cdot 2^{-0.76k/15}$ and $50\cdot 2^{-0.67k/45}$ respectively.]

Figure 11.1: Results on the synthetic data. The generalization error (y-axis, in percent, logarithmic scale) versus the number of queries (x-axis). Plots (a), (b) and (c) represent the synthetic task in 5, 15 and 45 dimensional spaces respectively. The generalization error of KQBC is compared to that of SVM. The results presented here are averaged over 50 trials. Note that the error rate of KQBC decreases exponentially fast, as proved in the fundamental theorem of the QBC algorithm (Theorem 6.1 on page 62).


This should not come as a surprise, since the fundamental theorem of the QBC algorithm (Theorem 6.1 on page 62) proves that this is the expected behavior.

11.1.2 Label Efficient Learning over Synthetic Data

We conducted another experiment using the same synthetic setting as in section 11.1.1: the sample space is $\mathbb{R}^5$ with a uniform distribution, and the target concept is the vector $(1, 0, 0, 0, 0)$. In this experiment we tested KQBC in the label efficient setting (see section 4.3 on page 50). We generated 2500 instances and presented them to KQBC one by one; for each instance, KQBC either queried for its label or predicted it. We counted both the number of queries and the number of prediction mistakes, and repeated this process 50 times. The results are presented in figure 11.2a. As predicted by Theorem 5.6 on page 57, the number of label queries is exactly twice the number of prediction mistakes. Also, following the theoretical analysis presented in Chapter 6, both quantities grow logarithmically with the number of instances. We used this setting to check the effect of the number of hit-and-run steps on the performance of KQBC. The results of KQBC with 1000, 100, 50, 10, 5 and 2 hit-and-run steps per random hypothesis are presented in sub-figures a-f of figure 11.2. When the number of random steps drops, KQBC tends to query for fewer labels, which causes an increase in the number of prediction mistakes. However, the results with 50, 100 and 1000 random steps are practically equivalent and match our predictions for uniformly sampled hypotheses. We conclude that hit and run mixes very fast: much faster than the bounds in [79] suggest.

11.1.3 Face Image Classification

The setting of the second experiment is more realistic. In this task we used the AR face dataset [82], a collection of face images in which the people wear different accessories, have different facial expressions, and are lit from different directions. We selected a subset of 1456 images from this dataset. Each image was converted to gray-scale and resized to 85 × 60 pixels, i.e. each image is represented as a 5100-dimensional vector; see figure 11.3 for sample images. The task is to distinguish male from female images. For this purpose we split the data into a training sequence of 1000 images and a test sequence of 456 images. To test statistical significance we repeated this process 20 times, each time splitting the dataset anew into training and test sequences.


Figure 11.2: KQBC for label efficient learning. The results of applying KQBC to the synthetic data. The number of instances is on the x-axis; the y-axis shows the average number of queries and prediction errors. Each subplot corresponds to a different number of hit-and-run steps used to generate a new hypothesis from the version space.



Figure 11.3: Examples of face images used for the face recognition task.


Figure 11.4: The generalization error of KQBC and SVM on the faces dataset (averaged over 20 trials). The generalization error (y-axis) vs. the number of queries (x-axis) for KQBC (solid) and SVM (dashed). When SVM is applied solely to the instances selected by KQBC (dotted line), the results are better than SVM's but worse than KQBC's.

We applied both KQBC and SVM to this dataset. We used the Gaussian kernel, so that the inner product between two images is $K(x_1, x_2) = \exp\left(-\|x_1-x_2\|^2/2\sigma^2\right)$, with $\sigma = 3500$, the value favored by SVM. The results are presented in figure 11.4, from which it is apparent that KQBC outperforms SVM. When the budget allows for 100-140 labels, KQBC has an error rate 2-3 percent lower than that of SVM; with 140 labels, KQBC outperforms SVM by 3.6% on average. This difference is significant: in 90% of the trials KQBC outperformed SVM by more than 1%, and in one case KQBC was 11% better. We also used KQBC as an active selection method for SVM, training SVM on the instances selected by KQBC. The generalization error obtained by this combined scheme was better than that of the passive SVM but worse than that of KQBC. Another interesting way to view these results is to look at the images for which KQBC queried for labels.


Figure 11.5: Images selected by KQBC. The last six faces for which KQBC queried for a label. Note that three of the images are saturated, and that in two of them the subject wears a scarf covering half of the face.

In figure 11.5 we see the last images for which KQBC queried for labels. It is apparent that the selection made by KQBC is non-trivial: all the images are either highly saturated or partly covered by scarves or sunglasses. We conclude that KQBC indeed performs well even when kernels are used.

11.2 Summary

In this chapter we demonstrated the kernelized version of the QBC algorithm in several experiments. In all of them, KQBC significantly outperformed SVM. We also tested KQBC in the label efficient setting, i.e. the online setting, and showed that it performs well there too.

Part IV

Discussion


Chapter 12

Summary

"No learning occurs if the learner is not active" [17, pg. 110]

The title of this work, "To PAC and Beyond", represents the main theme of this dissertation. The PAC model [116] is a very successful one. Valiant defined learning in mathematical language, and thus enabled the scientific community to study this concept using tools from different scientific fields. It allowed us to articulate questions such as

• Is everything learnable?
• Is anything learnable?
• What can we learn?

The definition of the PAC model marks the beginning of the machine learning field of research. Although many of the important results in this field [44, 90, 96, 100, 111, 118] were obtained earlier, Valiant took the pioneering step of placing all these works in the context of learning. Nevertheless, the PAC model has its limitations. In this work we went beyond PAC by allowing learners to be active. We see learning as a game played between the learner and the teacher, and we showed that the assumption that the learner is passive is restrictive: when the learner is given the freedom to actively participate in the learning process, it learns much faster. After a short introduction, we studied the membership queries framework in Part II. In Chapter 3 we presented a novel method for tolerating noise in the learning process using membership queries and the dual representation of the learning problem. In Part III we studied active learning in the selective sampling framework.


In Chapter 5 we presented the Query By Committee algorithm of Seung et al. [104] and discussed possible termination rules for the algorithm, corresponding to different modes of use. In Chapter 6 we presented a theoretical analysis of the QBC algorithm, showing that active learners can enjoy an exponential speedup in their learning rates when certain conditions apply. In Chapter 7 we showed that QBC can tolerate incorrect assumptions about priors, and in Chapter 8 we presented a method which makes QBC more resistant to noise. We discussed efficient implementations of QBC in Chapter 9 and extended it to enable the use of kernels in Chapter 10; an empirical study of QBC was presented in Chapter 11. These constitute encouraging steps forward in the ability to study active learning from various points of view, and (almost) close the gap between theory and practice in this field.

12.1 Active Learning in Humans

Our prime focus in this work is machine learning. Nevertheless, the findings can be connected to learning in humans, since active learning is as important to humans as it is to machines. Research on human learning and on machine learning is conducted from very different points of view: investigators studying human learning primarily try to teach teachers how to teach, whereas researchers in the field of machine learning attempt to teach learners how to learn. Indeed, "learning to learn" is not the title of any class in school or university. In the introduction to his course on computer organization, Charles Lin tries to address this issue [73]. Lin entitled his essay "Active Learning" to convey the idea that you know you have learned something when you are able to teach it. Thus a student should convince himself that he is able to teach what he has learned, and whenever the student is not confident that he can do so, he should ask the teacher or a peer, or seek the answer somewhere else.

Researchers in the fields of human learning and early childhood development use "active learning" slightly differently from the way we have used it in this dissertation: any learning process in which the learner takes part is considered active. According to Piaget, a child plays a very active role in the growth of intelligence [109, pg. 13]; examples include game playing, counting, etc. Furthermore, for Piaget intelligence meant exploring the environment [109, pg. 27]; thus intelligence is about actively extending knowledge. Both Piaget and Vygotsky explicitly argue that the child plays an active role in the acquisition of knowledge, as opposed to behaviorist theory, which suggests that learning is determined by external variables (stimulus and reinforcement) [17, pg. 27]. Constructivist theory argues that learning is an internal process which external stimuli can trigger [88].


The active role of children in learning takes several forms. According to the leading theories, the child constructs a hypothesis and revises it when needed; constructing a theory is an active process [17, pg. 8]. While this is an internal process, active learning has an external manifestation as well [17, pg. 9]: a child needs to be able to manipulate objects in order to understand what these objects are and what they can do. It is also necessary to involve children actively in the learning process in order to motivate them and make them engage in it. The type of "active learning" we are interested in is different: for us, a child is considered active if his behavior and questions cause a change in the learning process itself. Therefore, a natural question is how much a child can gain (in knowledge) by actively directing the teacher. To the best of my knowledge, no study addresses this issue explicitly. Nevertheless, there is no doubt in our minds that, implicitly, many theories of early childhood development and human learning see the child as a "director" of the learning process. For instance, Montessori, Erikson, Piaget and Vygotsky place great emphasis on the significance of observing the students when planning a curriculum [87]. For example, according to Montessori, teachers should be trained to "teach little and observe much" [87, pg. 31], because observation is the key to determining what children are interested in and need to learn [87, pg. 33].

12.2 Conclusions

The study of active learning in machines is taking its first steps, and in this work we attempted to contribute to the growth of this field, studying both its empirical and theoretical aspects. At the same time, we argue that active learning is important for human learning as well. Those of us who are involved in learning should keep this in mind and use this powerful tool while learning.

List of Publications

In order to keep this document reasonably sized, only a subset of the work I have done during my studies is presented in this dissertation. Here is a complete list of my publications.

Journals

• R. Bachrach, R. El-Yaniv, and M. Reinstadtler, ”On the competitive theory and practice of online list accessing algorithms”, Algorithmica, vol. 32, no. 2, pp. 201-245, 2002. An extended abstract of this paper appeared in a conference: R. Bachrach and R. El-Yaniv, ”Online list accessing algorithms and their applications, recent empirical evidence”, in Proceedings of the 8th Symposium on Discrete Algorithms (SODA), pp 53-62, 1997.

• R. Bachrach, S. Fine, and E. Shamir, ”Query by committee, linear separation and random walks”, Theoretical Computer Science, vol. 284, no. 1, 2002. An extended abstract of this paper appeared in a conference:

R. Bachrach, S. Fine, and E. Shamir, ”Query by committee, linear separation and random walks”, in Proceedings of the 4th European Conference on Learning Theory (EUROCOLT), pp 34-49, 2001.

Refereed Conferences

• R. Gilad-Bachrach, A. Navot and N. Tishby, "Query By Committee made real", in Proceedings of the 19th Conference on Neural Information Processing Systems (NIPS), 2005.

• R. Gilad-Bachrach, A. Navot and N. Tishby, "Bayes and Tukey meet at the center point", in Proceedings of the 17th Conference on Learning Theory (COLT), 2004.

• R. Gilad-Bachrach, A. Navot and N. Tishby, "Margin based feature selection - theory and algorithms", in Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.


• R. Gilad-Bachrach, A. Navot, and N. Tishby, "An information theoretic tradeoff between complexity and accuracy", in Proceedings of the 16th Conference on Learning Theory (COLT), pp. 595-609, 2003.

• K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby, "Margin analysis of the LVQ algorithm", in Proceedings of the 16th Conference on Neural Information Processing Systems (NIPS), 2002.

Book Chapters

• R. Gilad-Bachrach, A. Navot and N. Tishby, "Connections with some classic IT problems". In Information Bottlenecks and Distortions: The Emergence of Relevant Structure from Data, N. Tishby and T. Gideon (eds.), MIT Press (in preparation).

• R. Gilad-Bachrach, A. Navot and N. Tishby, "Large margin principles for feature selection". In Feature Extraction, Foundations and Applications, I. Guyon, S. Gunn, M. Nikravesh and L. Zadeh (eds.), Springer (forthcoming 2006).

Technical Reports

• S. Fine, R. Gilad-Bachrach, E. Shamir, and N. Tishby, "Noise tolerant learning using early predictors", technical report 1999-22, Leibniz Center, the Hebrew University, 1999.

• S. Fine, R. Gilad-Bachrach, S. Mendelson, and N. Tishby, "Noise tolerant learning via the dual learning problem", technical report 2000-14, Leibniz Center, the Hebrew University. Presented at NCST99, 2000.

• S. Axelrod, S. Fine, R. Gilad-Bachrach, S. Mendelson, and N. Tishby, "The information of observations and applications for active learning with uncertainty", technical report 2001-81, Leibniz Center, the Hebrew University, 2001.

• R. Gilad-Bachrach, A. Navot, and N. Tishby, "Kernel query by committee (KQBC)", technical report 2003-88, Leibniz Center, the Hebrew University, 2003.

• R. Gilad-Bachrach, "Dimensionality reduction for online learning algorithms using random projections", technical report, Leibniz Center, the Hebrew University, 2005.

Bibliography

[1] R. A. Adams. Sobolev Spaces, volume 69 of Pure and Applied Mathematics. Academic Press, 1975.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.
[3] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.
[4] D. Angluin. Queries revisited. Theoretical Computer Science, 313(2):175-194, 2004.
[5] D. Angluin and M. Kharitonov. When won't membership queries help? In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, 1991.
[6] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2001.
[7] A. C. Atkinson and A. N. Donev. Optimum Experiment Designs. Oxford University Press, 1992.
[8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[9] M. Bagnoli and T. Bergstrom. Log-concave probability and its applications. http://www.econ.ucsb.edu/~tedb/Theory/logconc.ps, 1989.
[10] Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research (JMLR), 5:255-291, March 2004.
[11] P. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. In Proceedings of the 4th European Conference on Computational Learning Theory, 1999.
[12] E. B. Baum. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, 2(1), 1991.
[13] S. Ben-David, N. Eiron, and P. Long. On the difficulty of approximating maximum agreement. Journal of Computer and System Sciences, 66(3):496-514, May 2003.
[14] D. Bertsimas and S. Vempala. Solving convex programs by random walks. In STOC, pages 109-115, 2002.

[15] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92-100, 1998.
[16] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989.
[17] E. Bodrova and D. J. Leong. Tools of the Mind: The Vygotskian Approach to Early Childhood Education. Prentice-Hall, 1996.
[18] C. Borell. Convex set functions in d-space. Periodica Mathematica Hungarica, 6:111-136, 1975.
[19] B. Boser, I. Guyon, and V. Vapnik. Optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144-152, 1992.
[20] L. Breiman, J. Friedman, R. A. Olshen, and C. Stone. Classification and Regression Trees. Chapman & Hall, 1984.
[21] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[22] N. Bshouty. Exact learning via the monotone theory. In Proceedings of the 34th Annual Symposium on Foundations of Computer Science, 1993.
[23] N. H. Bshouty, S. A. Goldman, H. D. Mathias, S. Suri, and H. Tamaki. Noise-tolerant distribution-free learning of general geometric concepts. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing, 1996.
[24] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.
[25] A. Caplin and B. Nalebuff. Aggregation and social choice: a mean voter theorem. Econometrica, 59(1):1-23, 1991.
[26] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In Proceedings of the 16th Annual Conference on Learning Theory (COLT), pages 373-387, 2003.
[27] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152-2162, 2005.
[28] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2, 1990.
[29] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[30] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.

[31] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.
[32] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
[33] I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the 12th International Conference on Machine Learning, 1995.
[34] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, 2004.
[35] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, 2005.
[36] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), 2005.
[37] S. E. Decatur. Efficient Learning from Faulty Data. PhD thesis, Harvard University, 1995.
[38] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems (NIPS), 2005.
[39] M. Dyer, A. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. Journal of the Association for Computing Machinery, 38(1):1–17, 1991.
[40] B. Eisenberg and R. L. Rivest. On the sample complexity of PAC learning using random and chosen examples. In Proceedings of the Third Annual Conference on Computational Learning Theory, pages 154–162. Morgan Kaufmann, 1990.
[41] G. Elekes. A geometric inequality and the complexity of computing volume. Discrete and Computational Geometry, 1986.
[42] S. Fine, A. Freund, I. Jaeger, Y. Mansour, Y. Naveh, and A. Ziv. Harnessing machine learning to improve success rate of stimuli generation. IEEE Transactions on Computers, to appear, 2006.
[43] S. Fine, J. Navratil, and R. Gopinath. A hybrid GMM/SVM approach to speaker identification. In The International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001.
[44] E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, 1951.
[45] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[46] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
[47] R. Gilad-Bachrach, A. Navot, and N. Tishby. Bayes and Tukey meet at the center point. In Proceedings of the 17th Conference on Learning Theory (COLT), pages 549–563, 2004.
[48] D. Gilbarg and N. S. Trudinger. Elliptic Partial Differential Equations of Second Order. Springer Verlag, 1998.
[49] D. Haussler, M. Kearns, and R. E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83–113, 1994.
[50] D. Haussler and M. Opper. Mutual information, metric entropy, and cumulative relative entropy risk. Annals of Statistics, 25(6), December 1997.
[51] D. O. Hebb. The Organization of Behavior. John Wiley, New York, 1949.
[52] D. Helmbold and S. Panizza. Some label efficient learning results. In Proceedings of the 10th Annual Conference on Computational Learning Theory (COLT), pages 218–230, 1997.
[53] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines: Estimating the Bayes point in kernel space. In Proceedings of the IJCAI Workshop on Support Vector Machines, pages 23–27, 1999.
[54] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.
[55] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7:95–114, 2000.
[56] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.
[57] F. John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays Presented to R. Courant on his 60th Birthday, pages 187–204. Interscience Publishers, Inc., New York, N.Y., 1948.
[58] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the 25th ACM Symposium on the Theory of Computing, pages 392–401, 1993.
[59] M. Kearns. Boosting theory towards practice: Recent developments in decision tree induction and the weak learning framework. Abstract accompanying an invited talk at AAAI, 1996.
[60] M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.
[61] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.

[62] A. R. Klivans and R. Servedio. Learning intersections of halfspaces with a margin. In Proceedings of the 17th Annual Conference on Learning Theory (COLT), 2004.
[63] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems (NIPS), pages 231–238, 1995.
[64] S. Kullback. Information Theory and Statistics. Wiley, 1959.
[65] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, pages 455–464, 1991.
[66] S. Kwek and L. Pitt. Intersections of halfspaces with membership queries. Algorithmica, 1998.
[67] K. J. Lang and E. B. Baum. Query learning can work poorly when a human oracle is used. In Proceedings of the International Joint Conference on Neural Networks, pages 335–340, 1992.
[68] BBC Learning. How we learn - definition of learning. http://www.bbc.co.uk/learning/returning/betterlearner/learningstyle/a_whatislearning_01.shtml, 2004.

[69] M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, 2001.
[70] C. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004.
[71] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In W. B. Croft and C. J. van Rijsbergen, editors, Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR), pages 3–12, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
[72] R. Liere. Active Learning with Committees: An Approach to Efficient Learning in Text Categorization Using Linear Threshold Algorithms. PhD thesis, Oregon State University, 1999.
[73] C. Lin. Active learning. http://www.cs.umd.edu/class/spring2003/cmsc311/Notes/Learn/active.html, 2003.

[74] J. Lindenstrauss and L. Tzafriri. Classical Banach Spaces, volume 2. Springer Verlag, 1979.
[75] N. Linial, Y. Mansour, and N. Nisan. Constant-depth circuits, Fourier transform and learnability. Journal of the ACM, 40(3):607–620, 1993.
[76] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science, pages 68–77, 1987.

[77] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, University of California, Santa Cruz, 1989.
[78] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures and Algorithms, 4(4):359–412, 1993.
[79] L. Lovász and S. Vempala. Hit and run is fast and fun. Technical Report MSR-TR-2003-05, Microsoft Research, 2003.
[80] L. Lovász and S. Vempala. Hit-and-run from a corner. In Proceedings of the 36th ACM Symposium on the Theory of Computing (STOC), 2004.
[81] H. Mamitsuka and N. Abe. Efficient data mining by active learning. In S. Arikawa and A. Shinohara, editors, Progress in Discovery Science: Final Report of the Japanese Discovery Science Project. Springer-Verlag GmbH, 2002.
[82] A. M. Martinez and R. Benavente. The AR face database. CVC Technical Report 24, 1998.
[83] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, 1998.
[84] A. K. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In J. W. Shavlik, editor, Proceedings of the 15th International Conference on Machine Learning (ICML), pages 350–358, Madison, US, 1998. Morgan Kaufmann Publishers, San Francisco, US.
[85] W. S. McCulloch and W. Pitts. A logical calculus of ideas immanent in neural activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[86] S. Mendelson. Learnability in Hilbert spaces with reproducing kernels. Journal of Complexity, 18(1):152–170, 2002.
[87] C. G. Mooney. Theories of Childhood: An Introduction to Dewey, Montessori, Erikson, Piaget & Vygotsky. Redleaf Press, 2000.
[88] N. Movshovitz-Hadar. Personal communication, 2006.
[89] I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), pages 435–442, 2002.
[90] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.
[91] G. Pisier. Probabilistic methods in the geometry of Banach spaces. In Probability and Analysis, number 1206 in Lecture Notes in Mathematics, pages 167–241. Springer Verlag, 1986.
[92] L. Pitt and M. K. Warmuth. Prediction-preserving reducibility. Journal of Computer and System Sciences, 41:430–467, 1990.

[93] A. Prékopa. Logarithmic concave measures with applications to stochastic programming. Acta Sci. Math. (Szeged), 32:301–315, 1971.
[94] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
[95] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[96] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.
[97] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 441–448. Morgan Kaufmann, San Francisco, CA, 2001.
[98] S. Russell. Stuart Russell on the future of artificial intelligence. Ubiquity, 4(43), 2004. http://www.acm.org/ubiquity/interviews/v4i43_russell.html.

[99] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2002.
[100] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.
[101] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.
[102] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 839–846. Morgan Kaufmann, San Francisco, CA, 2000.
[103] B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 327–352. MIT Press, 1999.
[104] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287–294, 1992.
[105] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, July and October 1948.
[106] J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In Proceedings of the 12th Annual Conference on Learning Theory (COLT), pages 278–285, 1999.
[107] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[108] L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia. Spikernels: Predicting arm movements by embedding population spike rate patterns in inner-product spaces. Neural Computation, 17(3):671–690, March 2005.

[109] D. G. Singer and T. A. Revenson. A Piaget Primer: How a Child Thinks. The Penguin Group, revised edition, 1996.
[110] P. Sollich and D. Saad. Learning from queries for maximum information gain in imperfectly learnable problems. Advances in Neural Information Processing Systems, 7:287–294, 1995.
[111] C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.
[112] S. Tong and D. Koller. Active learning for structure in Bayesian networks. In International Joint Conference on Artificial Intelligence, 2001.
[113] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research (JMLR), 2:45–66, November 2001.
[114] L. Troyansky. Faithful Representations and Moments of Satisfaction: Probabilistic Methods in Learning and Logic. PhD thesis, Hebrew University, Jerusalem, 1998.
[115] G. Tur, R. E. Schapire, and D. Hakkani-Tür. Active learning for spoken language understanding. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003.
[116] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[117] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[118] V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
[119] C. Zhang and T. Chen. An active learning framework for content-based information retrieval. IEEE Transactions on Multimedia, 4(2):260–268, 2002.

[Hebrew back matter: the Hebrew title page and the Hebrew abstract (תקציר) of the dissertation, which parallel the English title page, abstract, and introductory chapter. The Hebrew text (title: "למידה מונחה ומעבר לה") is mis-encoded in this extraction and is not reproduced here.]