C ONNECTIONS BETWEEN AUTOMATIZABILITY AND LEARNABILITY

by

Daniel Ivan

A thesis submitted in conformity with the requirements for the degree of Master of Science Graduate Department of Computer Science University of Toronto

c 2005 by Daniel Ivan Copyright

Abstract Connections between automatizability and learnability Daniel Ivan Master of Science Graduate Department of Computer Science University of Toronto 2005 Proof systems are a method of proving unsatisfiability of CNF formulas, based on the syntactic form of the CNF, rather than the semantic one. Automatizability of a proof system implies an algorithm that finds a proof in that proof system in time polynomial in the size of the shortest proof. Given a set of values of instances from a concept class, we say that we PAC-learn the concept class if we can find a function that, with high probability, evaluates random instances from that concept class with small error. We describe algorithms for the PAC-learnability of decision trees, DNF formulas and small degree polynomials. We show a connection between automatizability and PAC-learnability of DPLL versus decision trees, resolution versus DNF formulas, and polynomial calculus versus small degree polynomials. In the end, we give a somewhat simpler proof of [AR01] that resolution is not automatizable under complexity assumptions.

ii

Acknowledgements I would like to thank my supervisor Toniann Pitassi for her support and guidance while I worked on this thesis. Many thanks to my second reader, Alasdair Urquhart, for his useful suggestions and remarks that improved the thesis. Last, but not least, I would like to thank Charles Rackoff for the many fruitful discussions that we had.

iii

Contents

1

2

3

Introduction

1

1.1

Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

1.2

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2

1.3

Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3

Proof Systems and Algorithms

5

2.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

2.2

Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

2.3

Tree-like Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4

DPLL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5

Polynomial Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.6

Proof System Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.7

Proof System Automatizability . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.7.1

Resolution automatizability . . . . . . . . . . . . . . . . . . . . . . . 17

2.7.2

DPLL Automatizability . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.7.3

Polynomial Calculus Automatizability . . . . . . . . . . . . . . . . . . 24

Learning Models and Algorithms

26

3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2

Probably Approximately Correct (PAC) learning model . . . . . . . . . . . . . 28

3.3

Occam learning model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 iv

3.4

Relationship between PAC and Occam . . . . . . . . . . . . . . . . . . . . . . 31

3.5

Learning decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.6

Learning DNF formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.7 4

3.6.1

Monotone disjunctions . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.6.2

Non-monotone disjunctions . . . . . . . . . . . . . . . . . . . . . . . 42

3.6.3

k-DNF formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.6.4

General DNF formulas . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Learning degree d polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . 54 ǫ

Resolution is not automatizable, unless SAT ⊆ DT IM E(2n )

57

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2

Fixed parameterized problems and hardness . . . . . . . . . . . . . . . . . . . 58

4.3

Reduction from minimum hitting set to automatizability of resolution . . . . . 59

5

On the automatizability of resolution refinements

72

6

Conclusion, Open Questions

75

Bibliography

77

v

List of Figures 2.1

The ordered DPLL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2

Resolution automatizability algorithm, [BSW01] . . . . . . . . . . . . . . . . 17

2.3

The Algorithm A′DPLL (F, n, n′ , z, τ ) . . . . . . . . . . . . . . . . . . . . . . . 20

2.4

The Algorithm ADPLL (F, n) for DPLL automatizability . . . . . . . . . . . . . 23

2.5

Groebner basis algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1

The Algorithm A′D.T. (S, n, n′ , z, τ ) . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2

The Algorithm AD.T. (S, n) Occam learns the sample set S . . . . . . . . . . . 40

3.3

The Algorithm Anmd (S, n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4

The Algorithm AbDNF (S, k, n) . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5

The Algorithm A′DNF (S, w, z, n, τ ) . . . . . . . . . . . . . . . . . . . . . . . . 48

3.6

The Algorithm ADNF (S, n) Occam learns the sample set S . . . . . . . . . . . 53

vi

Chapter 1 Introduction

1.1

Motivation

Proof systems are a powerful way of syntactically proving unsatisfiability of propositional formulas in conjunctive normal form (CNF). The result of a proof system applied to a CNF formula is a proof, which acts as a witness of unsatisfiability of the propositional formula. The more powerful the proof system, the shorter the refutation proof is. The existence of a propositional proof system that admits polynomial size proofs for all tautologies is equivalent to NP=co-NP. Some proof systems are so powerful that the gap between the size of the proof of some CNFs is polynomial in the size of the propositional formula, as opposed to the exponential time required to semantically prove the formula unsatisfiable. These are very interesting proof systems, especially if we are able to find an algorithm that given a propositional formula, automatically outputs a refutation proof of it, in “short” time. A proof system is automatizable if there exists an algorithm that given an unsatisfiable CNF formula, outputs a refutation proof in that proof system, in time polynomial in the size of the shortest refutation proof of the CNF in that proof system. Much attention has been given so far to finding which proof systems are automatizable and which are not, [KP95]. An automatizability algorithm for some proof system would give us a way to prove in polynomial time 1

C HAPTER 1. I NTRODUCTION

2

unsatisfiability of CNF formulas that have short proofs in that proof system. Intuitively, the more powerful a proof system is, the less likely it is to be automatizable. We know already that T C 0 -Frege, Frege and bounded-depth Frege proof systems are not automatizable, assuming widely believed cryptographic assumptions, [BDG+ 99], [BPR00]. The method used in showing that these proof systems are not automatizable was to prove that they do not have feasible interpolation (for an introduction to feasible interpolation, see [Kra97]). However, this method does not work for resolution because it has feasible interpolation, [Kra97], [Hak85]. Despite the considerable effort that has been dedicated to studying the automatizability of resolution, not much advance has been done so far. [BSW01] gives an upper bound for this problem, which is quasi-polynomial in the size of the shortest proof of the formula, and [AR01] shows that neither resolution nor tree-like resolution is automatizable, unless the class W[P] from the hierarchy of parameterized problems is fixed-parameter tractable by randomized algorithms with one-sided error. We would like however to have a super-polynomial lower bound for the automatizability of resolution, or better upper bounds, as for example quasi-polynomial in the number of variables. We are therefore, looking at different ways to tackle the problem of automatizability. PAClearnability is a learning model in which, given some examples of a concept, we are trying to find a concept from the same concept class and in reasonable amount of time, such that with high probability, its error on randomly chosen examples is small. In particular, we analyze PAC-learnability of decision trees and DNF formulas. We show that there is a connection between automatizability of DPLL and PAC-learning decision trees on one hand, and between automatizability of resolution and DNF formulas.

1.2

Overview

In the next section we give some definitions that will be used throughout the whole thesis. Chapter 2 defines the notions of proof system and automatizability of a proof system; it also

C HAPTER 1. I NTRODUCTION

3

presents the DPLL, resolution and polynomial calculus proof systems, and describes automatizability algorithm for them. Chapter 3 defines the probably approximately correct (PAC) learning model and algorithms for PAC-learning decision trees, DNF formulas and small deǫ

gree polynomials in this model. In Chapter 4 we prove that unless SAT ⊆ DT IM E(2n ), resolution is not automatizable.

1.3

Definitions

In the following x will generally denote a boolean variable, ranging over {0, 1}. We shall identify 1 with True and 0 with False. A literal over a variable x is either x (denoted also as x1 ) or x (denoted also as x0 ). A clause is a disjunction of literals. A CNF (conjunctive normal form) formula is a conjunction of clauses. A term is a conjunction of literals. A DNF (disjunctive normal form) formula is a disjunction of terms. Definition 1 A decision tree is a rooted binary tree in which every internal node is labeled with a variable, the edges leaving a node correspond to whether the variable is set to 0 or 1, and the leaves are labeled with either 0 or 1. Every path from the root to a leaf may be viewed as a partial assignment. For a decision tree T and v ∈ {0, 1}, we write the set of paths (partial assignments) that lead from the root to a leaf labeled v as Brv (T ). The size of a decision tree is the number of internal nodes of the associated binary tree. Definition 2 An n-partial assignment is a function τ : {x1 , . . . , xn } → {0, 1, ∗}. In the following we will rather write τi instead of τ (xi ), for 1 ≤ i ≤ n. Notation 1 Let τ be an n-partial assignment. We will denote with |τ | the size of the n-partial assignment: |τ | = n. We will denote with |τ |∗ the number of unset variables in τ : |τ |∗ = |i ∈ {1, . . . , n} such that τi = ∗}|

C HAPTER 1. I NTRODUCTION

4

Definition 3 Let 1 ≤ j ≤ 2n, and b ∈ {0, 1}. If τ is a partial assignment, then τ j,b is another partial assignment such that • τij,b = τi , for all 1 ≤ i ≤ n, i 6= j, and • τjj,b = b.

Chapter 2 Proof Systems and Algorithms 2.1

Introduction

Definition 4 A proof system for a language L is a polynomial time algorithm V such that for all x, x ∈ L if and only if there exists a string p such that V (x, p) accepts. Notice however that the definition above does not say how large the proof (the string p) can be. Hence the inputs (x, p) to the algorithm V can actually be exponential in the size of x if the proof length is exponential in the size of x. For a particular language L the complexity of a proof system V characterizes how large the shortest proof p of the fact that x ∈ L has to be, as a function of the size of x, for any x in the language. Definition 5 The complexity of proof system V is a function f : N → N defined by f (n) =

max

x∈L,|x|=n

min

p s.t. V (x,p) accepts

|p|

We say that V is polynomially-bounded if and only if its complexity is a polynomial function of n. However, the complexity of a particular proof system says nothing about how costly it is to find short proofs. This problem is analyzed in the section 2.7 which deals with the automatizability 5

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

6

of a proof system. In the following we are concerned with the problem of unsatisfiability of propositional formulas. In our case the language L contains all the unsatisfiable formulas. The proof system is an algorithm V , that given a formula F and a string p, verifies in polynomial time whether p is a proof of the fact that F is an unsatisfiable formula.

2.2

Resolution

Definition 6 [BSW01] Let F = {C1 , C2 , . . . , Cm } be a CNF formula over n variables. A resolution derivation π of a clause A from F is a sequence of clauses π = (D1 , D2 , . . . , DS ) such that the last clause is A and each line Di is either some initial clause Cj ∈ F or is derived from the previous lines using the following derivation rules: • resolution rule:

E∨ℓ

• weakening rule1 :

E∨F

F ∨¬ℓ

E E∨F

where ℓ is a literal over the variables {x1 , x2 , . . . , xn } and E, F are arbitrary clauses over the same set of variables. A resolution refutation is a resolution derivation of the empty clause ∅. The graph Gπ of a derivation π is a DAG with the clauses of the derivation as nodes, and for a derivation step edges are added from the assumption clauses to the consequence clauses. A derivation π is called tree-like if Gπ is a tree; we can make copies of the original clauses in F in order to make a proof tree-like. The size of the derivation π is the number of lines (clauses) in it, denoted Sπ . S(F)(ST (F)) is the minimal size of a (tree-like) refutation of F. The width of a clause C, denoted w(C), is defined to be the number of literals appearing in it. The width of a set (sequence) of clauses is the maximal width of a clause in the set (sequence), 1

The weakening rule is not essential, as even without it the resolution proof system is complete with respect to refutations, but we add it just for the sake of simplicity.

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

7

i.e. w(F) = maxC∈F {w(C)} (w(π) = maxC∈π {w(C)}, respectively for a sequence of clauses π). The width of deriving a clause A from the formula F, denoted w(F ⊢ A) is defined by minπ {w(π)}, where the minimum is taken over all derivations π of A from F. We also use the notation F ⊢w A to mean that A can be derived from F in width w. We will be mainly interested in the width of resolution refutations, namely in w(F ⊢ ∅). The width and size of a resolution refutation are not independent, as proven by [BSW01]. Before showing this rigorously (Theorem 5), we prove a few lemmas. F⌈x←b is created from F by disposing of all initial clauses that include the literal xb and removing the literal xb from all other initial clauses where it appears. Lemma 1 [BSW01] Let A be a clause. Let F be a CNF, and x be a variable in F. For b ∈ {0, 1}, if F⌈x←b ⊢w−1 A, then F ⊢w A ∨ xb . Proof. Let F ′ be the set of initial clauses containing the literal xb , and let π be a width w − 1 resolution derivation of A from F⌈x←b . Add the literal xb to all clauses in π and call the new derivation π ′ . Then π ′ is a legal resolution derivation. If C ∈ F ′ , then C ∨ xb is an initial clause of F. If C ∈ F \ F ′ , then C ∨ xb can be derived from C by a single weakening step. Finally, if C was derived from E, F via a resolution step, then C ∨ xb is the resolution consequence of E ∨ xb , F ∨ xb . The width of each clause in π ′ is larger by 1 than the matching clause in π, hence the lemma follows. ✷ Lemma 2 [BSW01] For b ∈ {0, 1}, define Fxb as the set of all clauses in F containing the literal xb . If F⌈x←b ⊢k−1 ∅ and F⌈x←b ⊢k ∅, then w (F ⊢ ∅) ≤ max (k, w (Fxb )). Moreover, if the refutations of F⌈x←b and F⌈x←b are tree-like, then the refutation of F will be tree-like. Proof. According to Lemma 1, if F⌈x←b ⊢k−1 ∅ then F ⊢k xb . We now resolve the clause xb with all clauses in Fxb and derive F⌈x←b . This part will have width w (Fxb ). Finally, by the assumption we can refute F⌈x←b in width k.

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

8

For the tree-like refutations, derive xb from F as many times as clauses in Fxb are, and resolve each xb with one clause in Fxb . ✷ Definition 7 Let A be a clause. Let π be a (tree-like) resolution derivation, let x be a variable in the derivation, and b ∈ {0, 1}. Then π⌈x←b represents a subsequence of π obtained by disposing from π all the clauses that include the literal xb , and removing the literal xb from all the other clauses where it appears. Lemma 3 Let π be a (tree-like) resolution derivation F ⊢ A, let ℓ be a literal of a variable in F, and b ∈ {0, 1}. Then π⌈ℓ←b is a (tree-like) resolution derivation of F⌈ℓ←b ⊢ A⌈ℓ←b . Proof. Let

E E∨F

be a weakening step in π. Then obviously, for any literal ℓ and any b ∈ {0, 1}, E⌈ℓ←b E⌈ℓ←b ∨F ⌈ℓ←b

is a legal weakening step in π⌈ℓ←b . Let E ∨ ℓ′

F ∨ ¬ℓ′ E∨F

be a resolution step in π. We show that for any literal ℓ and any b ∈ {0, 1}, (E ∨ ℓ′ )⌈ℓ←b (F ∨ ¬ℓ′ )⌈ℓ←b (E ∨ F )⌈ℓ←b is a legal derivation rule in π⌈ℓ←b . ′

• If ℓ′ = ℓb , for some b′ ∈ {0, 1}, then this is a resolution step over the literal ℓ in π, and in π⌈ℓ←b this is ′ ′ E ∨ ℓb ⌈ℓ←b F ∨ ¬ℓb ⌈ℓ←b . (E ∨ F ) ⌈ℓ←b

If b = b′ then this will be transformed into the weakening step will be transformed into the weakening step

E . E∨F

F , E∨F

and if b = b′ then it

Both are legal derivations in π⌈ℓ←b .

The next cases assume that ℓ and ℓ′ are literals over different variables.

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

9

• If E ∨ F contains the literal ℓb . Then this clause will be removed from π⌈ℓ←b . • If E ∨ F contains the literal ℓb , then at least one of E and F contain ℓb , and none of them contains the literal ℓb . The derivation is therefore2 F ′ ∨ ℓb ∨ ¬ℓ′ E ′ ∨ ℓb ∨ ℓ′ . E ′ ∨ F ′ ∨ ℓb By removing the literal ℓb from all the three clauses we obtain E ′ ∨ ℓ′ F ′ ∨ ¬ℓ′ , E′ ∨ F ′ which is a legal resolution step. Note that nothing changes if F does not contain the literal ℓb , i.e., F = F ′ ∨ ¬ℓ′ . ✷ Lemma 4 Let w be a non-negative integer. Call a clause large if it contains at least w literals. Let F be a CNF formula with at most n variables, that has a resolution refutation with at most  w −k > z. Then F has a z large clauses. Let k be a non-negative integer such that 1 − 2n resolution refutation of width at most k + max (w, w(F)).

Proof. By induction on n. Base Case. If n = 0, then F has no variable, but it has a resolution refutation. Therefore, F contains the empty clause, and the proof of the lemma follows as well. Induction Case. Let n ≥ 1. We claim that there exists a literal that appears in at least a fraction w 2n

of the large clauses. This follows essentially from Corollary 34, where we replace DNF

with CNF and term with clause. Setting ℓ to 1 will satisfy all the clauses ℓ appears in, i.e., this w restriction satisfies at least z 2n large clauses.

Setting ℓ to 1 in the resolution refutation of F, by Lemma 3, we obtain a resolution refutation of F⌈ℓ←1 that has at most n−1 variables. Let z ′ be the number of large clauses of the refutation 2

assume w.l.o.g. that both E and F contain the literal ℓb

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS of F⌈ℓ←1 . Then, since

w 2n



w 2(n−1)

10

, we have that

−(k−1)   w −(k−1) w w  , < 1− ≤ 1− z ≤z 1− 2n 2n 2(n − 1) ′

(2.1)

and by the induction hypothesis, it follows that:

i) F⌈ℓ←1 has a resolution refutation of width at most k − 1 + max (w, w (F)). Setting ℓ to 0 in the resolution refutation of F, by Lemma 3, we obtain a refutation of F⌈ℓ←0 . Since the clauses of the resolution refutation of F⌈ℓ←0 are at most as wide as the clauses of the resolution refutation of F, we have that ii) F⌈ℓ←0 has a resolution refutation of width at most k + max (w, w (F)). From i) and ii), by Lemma 2, it follows that there is a resolution refutation of F of width at most   max k + max (w, w (F)) , w (F) = k + max (w, w (F)) ,

(2.2)

and the proof of the lemma follows.

✷ Theorem 5 [BSW01] Every resolution refutation of F of size S can be converted to one of width at most O

p



n log S + w (F) .

Proof. Let n be the number of variables of F. By Lemma 4, for any non-negative integer w,  w −k > z, then F has a if z is the number of large clauses, and there exists a k such that 1 − 2n

resolution refutation of width at most k + max (w, w (F)). Since z is at most equal to the size  w −k of the resolution refutation of F, we determine w and k such that 1 − 2n > S.  w k 1 1− < 2n S 2n w   − ·k − ( ) w 1 w 2n 1− < 2n S  w 1 exp −k ≈ 2n S

kw ≈ 2n ln S.

(2.3) (2.4) (2.5) (2.6)

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

11

We want to minimize w + k. If kw ≈ 2n ln S then k + w is minimum for k ≈ w. Hence, set √  √ w = 2n ln S and k = O n log S . The width of the new resolution refutation of F is at

most

O

p



n log S + w (F) .

(2.7) ✷

Corollary 6 Any resolution refutation of F requires size at least !! 2 [w (F ⊢ ∅) − w (F)] . exp Ω n

(2.8)

Proof. Immediate from the previous theorem. ✷

2.3

Tree-like Resolution

Theorem 7 [BSW01] Every tree-like resolution refutation of a CNF F of size at most S can be converted to another tree-like resolution refutation of width at most ⌈log2 S⌉ + w (F) . Proof. By induction on the size of the proof. If S = 0, then ∅ ∈ F, and the theorem follows. Assume S ≥ 1. Consider a tree-like resolution refutation of size S of a set of clauses F, and let x be the last variable resolved on to derive ∅. One of the subtrees has size at most

S 2

and the

other has size less than S. Let b ∈ {0, 1} be such that the tree-like resolution derivation F ⊢ xb has size at most S2 , and the tree-like resolution derivation F ⊢ xb has size less than S. Setting x to b in F ⊢ xb , by Lemma 3, we obtain a tree-like resolution refutation of F⌈x←b that has size at most S2 . By the induction hypothesis it follows that i) the tree-like resolution refutation of F⌈x←b can be converted into another tree-like reso  lution refutation of width at most log2 S2 + w (F) = ⌈log2 S⌉ − 1 + w (F).

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

12

Setting x to b in F ⊢ xb , by Lemma 3, we obtain another tree-like resolution refutation of F⌈x←b that has size less than S. By the induction hypothesis it follows that ii) the tree-like resolution refutation of F⌈x←b can be converted in another tree-like resolution refutation of width at most ⌈log2 S⌉ + w (F). From i) and ii), by Lemma 2, it follows that there is a resolution refutation of F of width at most   max ⌈log2 S⌉ + w (F) , w (F) = ⌈log2 S⌉ + w (F) ,

(2.9)

and the proof of the lemma follows.

✷ Corollary 8 Any tree-like resolution refutation of F requires size at least 2Ω(w(F ⊢∅)−w(F )) Proof. Immediate from the previous theorem.

2.4



DPLL

DPLL is a proof system for proving unsatisfiability of CNF (conjunctive normal form) propositional formulas which is based on a system devised by Davis, Logemann and Loveland [DLL62]. The proof system uses backtrack search to look for a satisfying assignment. Given a CNF propositional formula F, the DPLL algorithm can be described recursively as follows. First check whether F is trivially satisfiable (has no clauses) or is trivially unsatisfiable (contains an empty clause) and if so stop. Otherwise, select a literal ℓ and apply the algorithm recursively to search for a satisfying assignment for the formula F⌈ℓ←0 . If the search succeeds, then we have an assignment for F. Otherwise, repeat the algorithm with the formula F⌈ℓ←1 . If neither of these searches finds a satisfying assignment then F is not satisfiable. A particular DPLL algorithm is specified by a splitting rule, which is a subroutine that for each recursively constructed formula, determines the splitter (next literal to split on) and the

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

13

assignment to try first. In general, the splitting rule may depend on the details of the structure of the original formula, and on the results of the computation in other recursive calls. For a particular formula F, different splitting rules may result in vastly different running times. If the splitter for some given formula F is a literal ℓ such that ℓ is contained in a unit clause (a clause C of size one), then the ℓ = 0 branch falsifies C, and thus terminates immediately. Effectively, the algorithm fixes ℓ = 1. A splitting rule is said to use unit propagation if, for any formula F that has a unit clause (clause of size one) the splitter is chosen to be a literal in such a clause. The simplest such splitting rule is: fix an ordering of the variables x1 , . . . , xn . For a sub-formula F ′ obtained by fixing some variables, if there is a unit clause, the splitter is the first literal belonging to such a clause. Otherwise select the first unfixed variable. The algorithm obtained from this splitting rule is called ordered DPLL, and it is presented in the Figure 2.1, for a CNF propositional formula F. 1

2

Refute(F) while(F contains a clause of size 1)

3

set variable to make that clause true

4

simplify all clauses using this assignment

5

6

if F has no clause then output “F is satisfiable” and HALT

7

if F does not contain an empty clause then

8

choose smallest-numbered unset literal ℓ

9

run Refute(F⌈ℓ←0 )

10

run Refute(F⌈ℓ←1 )

Figure 2.1: The ordered DPLL algorithm

The execution of a DPLL algorithm A on a formula F can be represented by a labeled rooted

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

14

binary tree, denoted TA (F ), in the following way. Each node corresponds to a recursive call. Each internal node is labeled by its splitter and the two out-edges correspond to the possible assignments. For any node, the path from the root to that node defines a restriction (partial assignment) of the variables, and the recursive call at that node is applied to the sub-formula obtained by applying the restriction to the original formula. Each leaf is either a success leaf, i.e., all the original clauses are satisfied by the associated restriction, or a failure leaf, i.e., at least one original clause is falsified by the restriction. Each failure leaf is labeled by one of the original clauses that it falsifies. F is unsatisfiable if and only if all leaves are failure leaves, in which case the tree as labeled above (with internal nodes labeled by splitters and leaves labeled by falsified clauses) is called a DPLL-refutation of F. It is easy to see that a formula F has a DPLL refutation if and only if it is unsatisfiable. Thus DPLL-refutations form a complete and sound proof system. The size of the refutation is defined to be the number of nodes of the tree.

2.5

Polynomial Calculus

The polynomial calculus is a refutation proof system (also known as the Groebner proof system) that works on a system of polynomial equations over a field. In order to use the polynomial calculus, we must first translate CNF formulas into a system of polynomials. Define R(x) = x and R(¬x) = 1 − x. A clause (ℓ1 ∨ ℓ2 ∨ . . . ∨ ℓk ) is translated into the polynomial Pk i=1 R (ℓi ) = 0. The system of polynomials E(f ) corresponding to a CNF formula, f , is the set of equations we obtain by translating each clause in f .

Definition 8 (Polynomial Calculus proof system) Given a set of variables x1 , . . . , xn and a field F [x1 , . . . , xn ], a polynomial calculus refutation of the set of polynomial equations P is a sequence of polynomials such that the last line is the polynomial 1 and each line is either an initial polynomial equation or is derived from the previous lines using the following inference

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

15

rules: f g αf + βg and f x·f where α, β ∈ F are any scalars, x is any variable, and f, g ∈ F [x1 , . . . , xn ] are polynomial equations over F . The refutation has degree d if all the polynomials in it have degree at most d. In order to get 0/1 solutions, we also add the equations x2 = x to the set of axioms P , for each variable x. The effect of the axioms x2 = x is, intuitively, to allow each line in the proof to be multi-linear, i.e., if a variable appears in any term with an exponent greater than 1, we replace the exponent by 1 and combine like terms. Define m(p) to be the multi-linear version of polynomial p. In the representation of polynomials, we require that they be represented explicitly, i.e., that zero coefficients be listed as coefficients, so a polynomial of degree d is represented as the vector  P n of its coefficients, whose elements are indexed by the i=d i=0 i multi-linear terms of degree

i ≤ d, i.e., since the exponent of each variable in each term is either 0 or 1, a term of degree at most d is determined by a subset of variables with at most d members. We write p1 , . . . , pk ⊢ q

to mean that there is a polynomial calculus proof of polynomial q from p1 , . . . , pk where every line has degree at most d.

2.6

Proof System Hierarchy

Definition 9 A propositional proof system U p-simulates V if there is a polynomial-time function f such that ∀p ∀x V accepts (x, p) if and only if U accepts (x, f (p)). Definition 10 Two propositional proof systems U and V are polynomially equivalent (or pequivalent) if U p-simulates V and V p-simulates U .

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

16

Theorem 9 DPLL and tree-like resolution are polynomially equivalent. Proof. To prove this theorem, we need to show how to efficiently convert a DPLL proof for a given unsatisfiable formula F into a tree-like resolution refutation of F, and conversely, how to convert a tree-like resolution refutation of F into a DPLL proof of F. To construct a tree-like resolution refutation from a DPLL tree, arrange the clauses to be resolved in the same order as the leaves of the DPLL tree. Then resolve with respect to the clauses and variables which the DPLL tree branched on last. This will construct a tree-like resolution refutation with the same topology as the original DPLL tree. To show that the (shortest) size of a DPLL proof is polynomial in the (shortest) size of a tree-like resolution refutation, we first point out that, by Lemma 5.1 of [Urq95], a tree like resolution refutation of minimal size is regular. We construct a DPLL tree from a tree-like resolution refutation of minimal size (which is regular) as follows. Label the branches of the resolution tree with the truth values which falsify the clauses at the leaves of the branches such that if some variable x was resolved at that level of the resolution tree then the branch leading to sub-trees whose leaves contain x should be labeled x = 1 and the branch leading to sub-trees whose leaves contain x should be labeled x = 0. This will construct a DPLL tree with the same topology as the original tree-like resolution refutation. Thus the DPLL proof system can be viewed naturally as a restricted version of resolution (namely the tree-like resolution). ✷

2.7

Proof System Automatizability

Definition 11 (automatizability/quasi-automatizability [BPR00]) We say that a proof system P is automatizable if there exists a deterministic procedure D that takes as input a formula F and returns an P -refutation of F (if one exists) in time polynomial in the size of the shortest P -refutation of F.

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

17

If the algorithm D runs in time quasi-polynomial in the size of the shortest P -refutation of F, then we say that the proof system P is quasi-automatizable. In the following we are concerned with the automatizability results for DPLL, resolution and polynomial calculus.

2.7.1 Resolution automatizability Theorem 10 [BSW01] Resolution is automatizable in time nO(



n log S )+w(F )

, where S is the

size of the shortest resolution refutation. Proof. Let F be an unsatisfiable set of clauses over n variables x1 , . . . , xn , and let S be the size of the shortest resolution refutation proof of F. Let A be the algorithm from Figure 2.2 for automatizability of resolution. The algorithm iteratively derives all the clauses with width less than a parameter w, for increasing integer  values of w. The runtime of A is at most O nw(F ⊢∅) . By Theorem 5, the shortest refutation  √ proof of F has width at most O n log S + w(F). Hence, the runtime of the algorithm is

bounded by nO(

1

Fix w = 0

2

Repeat {



n log S )+w(F )

3

If ∅ ∈ F end

4

Else {

.

5

Increment w

6

Derive all resolution consequences of width ≤ w

7

}

8

} Figure 2.2: Resolution automatizability algorithm, [BSW01]



C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

18

2.7.2 DPLL Automatizability Theorem 11 ([CEI96], [BSW01]) DPLL is automatizable in time nO(log s) · O (|F|), where F is the initial DNF formula and s is the size of the shortest DPLL refutation proof of F.

Proof.

Let F be an unsatisfiable CNF formula over at most n variables x1 , . . . , xn , and let

s be the size of the shortest DPLL refutation of F. We show that there exists an algorithm for automatically refuting unsatisfiable formulas in the DPLL proof system, that runs in time nO(log s) . First, consider the Algorithm A′DPLL (F, n, n′ , z, τ ), presented in the Figure 2.3. τ is a partial assignment of variables of the CNF formula F, and n′ is the number of unset variables from τ . The algorithm searches for a decision tree of size at most z and over at most the n′ unset variables from τ , that is a DPLL refutation of F⌈τ . If no such decision tree is found, the algorithm returns Λ. This algorithm is recursive in n′ and z. For the initial case assume n′ = 0 or z < 1. If there exists a clause C from F such that C⌈τ = ∅, then return the clause C. Otherwise, return Λ. The recursion step is as follows: for all indexes i of the unset variables from τ , in sequential order, for both b ∈ {0, 1}, set the variable xi to b, and then recursively call  Υbi ← A′DPLL F, n, n′ − 1, z2 , τ i,b , until for some xi and b, a valid decision tree Υbi 6= Λ is

computed, or until we exhaust all the possibilities for the index i and bit b. Afterward, set the variable xi to b, and recursively compute again   ′ i,b b ′ Υi ← ADPLL F, n, n − 1, z − 1, τ ; when this returns, exit the loop. If Υbi and Υbi are not

both valid decision trees, then return Λ. Otherwise, return the decision tree Υ formed by: the root node labeled with xi ; the left child, corresponding to xi ← b, which is the decision tree Υbi ; and the right child, corresponding to xi ← b, which is the decision tree Υbi . Υ is

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

19

xi xi ← b Υbi

xi ← b Υbi

In the next three claims we formally prove the consistency and completeness, and we estimate the run-time of the algorithm described above. Claim 12 (Consistency) If the decision tree Υ returned by the Algorithm A′DPLL (F, n, n′ , z, τ ) from the Figure 2.3 is not Λ, then Υ is a DPLL refutation of F⌈τ . Proof. (of the “Consistency” claim). We proceed by induction on n′ and z. Base Case. Assume n′ = 0 or z < 1. since Υ is not Λ, then there exists a clause C from F such that C⌈τ = ∅, and the claim follows. Induction Case. Let n′ and z be at least 1. Assume that Υ 6= Λ. Therefore, by the line 20, there exists an unset variable xi from τ , there exists b ∈ {0, 1}, and there exist deci   z ′ i,b b ′ b ′ ′ i,b such that sion trees Υi ← ADPLL F, n, n − 1, 2 , τ and Υi ← ADPLL F, n, n − 1, z, τ x

Υ ←Υb  i  i

Υb i

(the left child of xi corresponds to the restriction xi ← b, and the right child

corresponds to the restriction xi ← b). Since Υ 6= Λ, by the line 18, it follows that for some b ∈ {0, 1}, Υbi 6= Λ and that Υbi 6= Λ. By the induction hypothesis we have that the decision tree Υbi is a DPLL refutation of F⌈τ i,b and that the decision tree Υbi is a DPLL refutation of F⌈τ i,b . Hence, the decision tree Υ is a DPLL refutation of F⌈τ . ✷(of the “Consistency” claim) Claim 13 (Completeness) IF there exists a decision tree ΥTarget such that • ΥTarget has size at most z, • ΥTarget is over at most the unset variables from τ , • ΥTarget is a DPLL refutation of F⌈τ

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

20

THEN the Algorithm A′DPLL (F, n, n′ , z, τ ) from the Figure 2.3 will return a valid decision tree over at most the n′ unset variables from τ .

1

2

3

4

5

6

7

Algorithm A′DPLL (F, n, n′ , z, τ ) if n′ = 0 or z < 1 then if exists C ∈ F initial clause such that C⌈τ = ∅ then return C else return Λ for i = j1 , j2 , . . . , jn′ , where τj1 , τj2 , . . . , τjn′ = ∗, do

8

b←0

9

Υbi ← A′DPLL F, n, n′ − 1, z2 , τ i,b

10

if Υbi 6= Λ then

11

12

  Υbi ← A′DPLL F, n, n′ − 1, z − 1, τ i,b

b←1

14

Υbi ← A′DPLL F, n, n′ − 1, z2 , τ i,b

15

if Υbi 6= Λ then

17

18

{set variable xi to b = 0} {set variable xi to b = 1}

exit the for loop

13

16





  Υbi ← A′DPLL F, n, n′ − 1, z, τ i,b

{set variable xi to b = 1 } {set variable xi to b = 0}

exit the for loop

if Υbi = Λ or Υbi = Λ then

19

return Λ

20

Υ ←Υb  i 

x

i

21

Υb i

return Υ

Figure 2.3: The Algorithm A′DPLL (F, n, n′ , z, τ )

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

21

Proof. (of the “Completeness” claim). By induction on n′ . Base Case. If n′ = 0 or z < 1, then there exists a clause C from F such that C⌈τ = ∅, and the proof follows. Induction Case. Let n′ be at least 1. Let xi′ be the variable that labels the root of the decision tree ΥTarget . Then one of the children of the root must have size at most z2 , and the other has size at most z − 1. Let b′ ∈ {0, 1} such that the child corresponding to xi′ ← b′ is a decision tree of size at most z2 . Let us define ΥTarget =

′ Υb′ i ,Target

x ′

i

′ Υb′ i ,Target

, where the left

child corresponds to the restriction xi′ ← b′ , and the right child corresponds to the restriction xi′ ← b′ . so, ′

• Υbi′ ,Target has size at most z ′ = z2 , ′

• Υbi′ ,Target is over at most the n′ unset variables from τ , ′

• Υbi′ ,Target is a DPLL refutation of F⌈τ i′ ,b′ . Therefore, by the induction hypothesis, (at least) for these values of i′ and b′ , ′ ′ ′ Υbi′ ← A′D.T. F, n, n′ − 1, z2 , τ i ,b is a valid decision tree.

Hence the algorithm will eventually find some unset variable xi from τ , and b ∈ {0, 1}, such  that Υbi ← A′D.T. F, n, n′ − 1, z2 , τ i,b is a valid decision tree. However, these are not necessarily the variable xi′ and b′ from above.   ′ i,b b ′ ′ is also a valid decision tree. By induction on n , Υi ← AD.T. F, n, n − 1, z, τ

In the end, since both Υbi and Υbi are valid decision trees, A′D.T. (F, n, n′ , z, τ ) will return the x

(valid) decision tree Υb  i  i

Υb i

. And the proof follows. ✷(of the “Completeness” claim)

Claim 14 (Run-Time) The Algorithm A′DPLL (F, n, n′ , z, τ ) from the Figure 2.3 runs in time (n′ )O(log z) · O (|F|) + O (n log z). Proof. (of the “Run-Time” claim). Let T (z, n′ ) be the run-time of the algorithm. If n′ = 0 or z < 1, then T (z, n′ ) = O (|F|) .

(2.10)

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

22

Otherwise, ′



T (z, n ) = n + 2n · T

z

2





, n − 1 + T (z, n′ − 1) .

(2.11)

We apply the recursive formula again for T (z, n′ − 1), and we get T (z, n′ ) ≤ n + 2n′ · T

z

 z  , n′ − 1 + 2 (n′ − 1) · T , n′ − 1 + T (z, n′ − 2) . 2 2

(2.12)

Since T (z, n′ ) ≤ T (z, n′ + 1), for all n′ ≥ 1, and any z, we have that i h z , n′ − 1 + T (z, n′ − 2) . T (z, n′ ) ≤ n + 2n′ 2 · T 2

(2.13)

After we develop T (z, n′ − 2) , T (z, n′ − 3) , . . . , T (z, 1) recursively and regroup the terms in the right hand side of the inequality as above, we obtain h z i , n′ − 1 + T (z, 0) . T (z, n′ ) ≤ n + 2n′ n′ · T 2

(2.14)

Using the initial condition from the Equation 2.10, we have 2

T (z, n′ ) ≤ 2 (n′ ) · T

z

2

 , n′ − 1 + O(n).

(2.15)

Applying recursion k times, we obtain T (z, n′ ) ≤ 2 (n′ )

2k

·T

z  ′ , n − k + k · O(n). 2k

(2.16)

For the smallest k such that 2k > z, i.e., k = O(log z), and using the above Equation, we have T (z, n′ ) ≤ 2 (n′ )

O(log z)

· O (|F|) + O(log z) · O(n),

(2.17)

which gives T (z, n′ ) ≤ (n′ ) And the proof follows.

O(log z)

· O (|F|) + O (n log z) .

(2.18)

✷(of the “Run-Time” claim)

Continuing the proof of Theorem 11, let the Algorithm ADPLL (F, n) that finds (if there exists) a decision tree on at most n variables that is a DPLL refutation of F, as presented in the Figure 2.4.

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

1

Algorithm ADPLL (F, n)

2

for s = 1, 2, 3, . . . do

3

τ ← (∗, ∗, ∗, . . . , ∗)

4

Υ ← A′DPLL (F, n, n, s, τ )

5

if Υ 6= Λ then

6

return Υ

23

{all variables are unset in the partial assignment} {the algorithm from the Figure 2.3}

Figure 2.4: The Algorithm ADPLL (F, n) for DPLL automatizability The algorithm searches for a decision tree that is a DPLL refutation F. If ADPLL (F, n) returns a valid decision tree Υ 6= Λ, then Υ was returned by ADPLL (F, n, n, s, τ ) in the line 4. Hence, by the Claim 12 (“Consistency”), Υ is a DPLL refutation of F⌈τ = F. Assume that there exists a decision tree ΥTarget of size s, that is a DPLL refutation of F, and over n variables. Then, by the Claim 13 (“Completeness”), it follows that A′DPLL (F, n, n, s, τ ) will return a valid decision tree, which is a DPLL refutation of F⌈τ = F and over at most n variables. Moreover, since F has a DPLL refutation whose shortest decision tree has size at most s, by the Claim 14 (“Run-Time”), it follows that A′DPLL (F, n, n, s, τ ) will run in time nO(log s) · O (|F|). Hence, the Algorithm ADPLL (F, n) will return a decision tree over at most n variables that is a DPLL refutation of F, in time at most s · nO(log s) · O (|F|) ≤ nO(log s) · O (|F|) , and this concludes the proof.

(2.19) ✷(of Theorem 11)

Corollary 15 Tree-like resolution is automatizable in time nO(log s) · O (|F|), where F is the initial DNF formula and s is the size of the shortest tree-like (DPLL) refutation proof of F. Proof.

By Theorem 11, DPLL is automatizable in time nO(log s) · O (|F|). since tree-like

resolution polynomially simulates DPLL, by Theorem 9, it follows that tree-like resolution is automatizable in time nO(log s) · O (|F|).



C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

24

2.7.3 Polynomial Calculus Automatizability Consider multi-linear polynomials of degree d over F as a vector space whose coordinates are indexed by the multi-linear terms of degree at most d. Let Vd (p1 , . . . , pk ) be the smallest subspace V of this space that includes p1 , . . . , pk and so that if p ∈ V and p has degree at most d − 1, and x is any variable, then m(xp) is in V . Then we have the following Theorem 16 ([CEI96]) For any multi-linear polynomials p1 , . . . , pk , q of degree at most d, p1 , . . . , pk ⊢ q if and only if q ∈ Vd (p1 , . . . , pk ). Therefore, in order to see whether a polynomial q has a d degree polynomial calculus proof, using a set of axioms P , it is enough to check whether q is in the space spanned by a basis of P . In the following is presented a well-known algorithm for finding a basis of a set of polynomials. Fix an order on multi-linear monomials that respects degree, i.e., degree d monomials are larger than degree d − 1 in the sense of order. Define the leading term LT (p) of a multi-linear polynomial p to be the largest monomial with a non-zero coefficient. The algorithm maintains a set B of multi-linear polynomials with the property that no two elements of B have the same leading term, and a set P of polynomials that will eventually be added to B. Every element of P and B will be derivable from p1 , . . . , pk by a degree d polynomial calculus proof, and p1 , . . . , pk will be in the space spanned by B ∪ P . The algorithm halts when P is empty. Each polynomial p′ that causes n polynomials to be added to P adds a new polynomial of degree d−1 or smaller to B. Thus, since no two elements   P  P i=d−1 n n = O n · of B have the same leading term, there are at most k + n · i=d−1 i=0 i=0 i i

polynomials that are ever in P . Assuming that a constant cost hash table of leading terms  P n dimensional vectors, reduccan be maintained and that polynomials are stored as i=d i=0 i     Pi=d n 2 . So the total time for the algorithm is ing a polynomial requires time O n · i=0 i      Pi=d n3 Pi=d−1 n Pi=d n2 . =O d· O n · i=0 i=0 i i=0 i i Lemma 17 ([CEI96]) At the end of this algorithm, B is a basis for Vd (p1 , . . . , pk ).

C HAPTER 2. P ROOF S YSTEMS AND A LGORITHMS

1

Initially, P = {p1 , . . . , pk } and B is empty

2

Repeat the following until P is empty

25

(a) Arbitrarily select and remove a polynomial p from P . (b) Perform a Gaussian reduction of p by B to get a polynomial p as follows: Check to see if LT (p) = LT (q) for some q ∈ B. If not, return p. If so, compute p − aq for a ∈ F which causes the leading terms to cancel. Then recursively reduce p − aq. (c) If the reduced polynomial p′ 6= 0 add it to B. If in addition it has degree ≤ d − 1, add m(xp′ ) to P for each variable x. Figure 2.5: Groebner basis algorithm

Therefore, in order to see whether q ∈ Vd (p1 , . . . , pk ) we run the above algorithm and then reduce q, and see if the result is 0. Hence we have 

Theorem 18 ([CEI96]) There is an algorithm that runs in time O d ·

P 3  i=d i=0

= O n3d



which determines whether q is derivable from p1 , . . . , pk via a degree d polynomial calculus proof.

Chapter 3 Learning Models and Algorithms 3.1

Introduction

Usually a learning model is specified by several notions: 1. Learner represents the entity that does the learning. Usually it is a computer program. The learner can also be restricted. For example, often the learner is required to be efficient (meaning running in polynomial time), or to use finite memory. 2. Domain represents what is being learned. For example, the learner may be trying to learn an unknown concept, such as a chair. One of the most studied domains is the concept learning, where the learner is trying to come up with a “rule” to separate positive examples from negative examples. 3. Information Source represents how the learner is informed about the domain. This can be happen in many ways: (a) Examples. The learner is given positive and negative examples. These examples can be chosen in a variety of ways. They can be chosen at random, from some known or unknown distribution. They can also be chosen arbitrarily, or maliciously, 26

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

27

by some adversary who wants to know the worst-case behavior of a learning algorithm. (b) Queries. The learner may get information about the domain by doing queries. There are two types of queries: i. membership query, where the learner asks for the label of an example. ii. equivalence query, where the learner provides a rule and asks if that rule is correct. 4. Prior Knowledge represents what the learner knows about the domain initially. This generally restricts the learner’s uncertainty and/or biases and expectations about unknown domains. This tells what the learner knows about what is possible or probable in the domain. For example, the learner may know that the unknown concept is representable in a certain way. That is, the unknown concept might be known to be representable as a conjunction of features, or as a graph. An important issue here is how to handle “incorrect” prior knowledge, how to combine or trade-off prior versus new information. 5. Performance Criteria represents how we know whether, or how well, the learner has learned and how the learner demonstrates that it has learned something. Different performance criteria include:

(a) Off-line (batch) vs. on-line (interactive) measures; (b) Descriptive output (e.g., representation of unknown concept) versus predictive output (i.e., the learner need only label new instances as either positive or negative); (c) Accuracy. The learner is evaluated on the basis of its error rate, its correctness of description, or the number or mistakes it made during learning. (d) Efficiency. The learner may be evaluated on the amount of computation it does, and amount of information it needs (i.e., the number of examples it needs). The

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

28

leaner may be required to have asymptotic convergence, or it may be required to take polynomially bounded time. (e) Analysis assumption. Sometimes, for analysis purpose, we assume things about the learning scenario. For example, we may assume the learner is given instances drawn from the uniform distribution. This might not actually be the case, but we sometimes assume this to make analysis possible. This is different from the learner’s prior knowledge. Definition 12 Concept c is a boolean function over {0, 1}n . The domain variables are denoted by x1 , . . . , xn . A labeled example is a pair hα, βi, where α ∈ {0, 1}n and β ∈ {−, +} indicate whether α is a positive or negative example. A concept class C is a set of concepts over x1 , . . . , xn usually with an associated representation. Definition 13 We say that a labeled sample hα, βi over n variables is consistent with an npartial assignment τ , if for all i, 1 ≤ i ≤ n, such that τi 6= ∗ we have that αi = τi . We say that a set of samples S over n variables is consistent with a n-partial assignment τ if all the samples from S are consistent with τ . Notation 2 Let S be a set of samples over n variables, and let τ be an n-partial assignment. We denote by S⌈τ the largest subset of S consistent with τ .

3.2

Probably Approximately Correct (PAC) learning model

The PAC model, introduced for the first time in [Val84], includes a domain of instances X, a class of concepts C containing functions mapping X to {0, 1}, and a target concept c ∈ C. The learner’s goal is to identify the target concept c. The model also includes a probability distribution D on X. The learner has access to a random oracle EX(c, D), which at the learner’s request will return a labeled examples hx, c(x)i chosen

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

29

randomly and independently according to the probability distribution D. The label of x is given by the target concept c. Note that the distribution D is completely arbitrary. For example, some examples may have probability zero under D, in which case the learner will never see them. D is assumed to be stationary, fixed over time, so that the probability of seeing an example during the training period is the same as the probability of seeing it afterward. The learner does not know D. Using its oracle of random examples, the learner is expected to compute a hypothesis h which closely approximates the target concept c, where the accuracy of a hypothesis is defined as follows: Definition 14 The error rate error(h) for hypothesis h with respect to distribution D and target concept c is error(h) = Pr [c(x) 6= h(x)] . x∈D

Note that we couldn’t express the accuracy of a hypothesis without the distribution D, which specifies which areas of disagreement between the hypothesis and the true concept are more important than others. The learner takes as input an accuracy parameter ǫ and a confidence parameter δ, and is expected to output a hypothesis h such that with probability at least 1 − δ, error(h) ≤ ǫ. So, the learned function is probably (within δ) a good approximation (with an error rate of at most ǫ) of the true concept. Definition 15 (PAC-learning) Let C be a concept class (concepts over n variables), and let D be a distribution over the input space. Algorithm A is a randomized algorithm that has access to an oracle which returns labeled examples from some unknown c ∈ C where the examples are taken from distribution D. Algorithm A PAC-learns concept class C by hypothesis class H if for any c ∈ C, for any distribution D over the input space, any ǫ, δ > 0, A runs in time polynomial in n, 1/ǫ, 1/δ, size(c) (and thus also uses at most polynomially many samples), to

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

30

produce a hypothesis h ∈ H such that: with probability at least (1 − δ), the error between h and c over D is at most ǫ. This probability is taken over random coin tosses of A as well as over the random labeled examples seen from distribution D. When H = C (the hypothesis must be some concept from C), then we say that algorithm A proper PAC-learns concept class C. Definition 16 We say that the algorithm A PAC-quasi-learns concept class C by hypothesis class H if the running time of A is quasi-polynomial in n. Previous research has shown that the classes N C1 and AC0 can not be PAC-learned under cryptographic assumptions [KV94a]. Since any boolean function can be represented by a sufficiently large DNF, and DNF formulas are easily understood by humans and seem to be a natural form of knowledge representation, the big open question in the machine learning theory is whether DNF and decision trees are PAC-learnable. The best known algorithm for PAC-learning DNF formulas runs in time 2O(n

1/3 )

[KS01]. Other results exists with better

running times, but they are for particular distributions of the PAC model: [Ver90] gives a quasi-polynomial algorithm for the uniform distribution, [BMOS03] gives a polynomial time algorithm for learning decision trees and DNF formulas in the uniform random walk model on {0, 1}n . Since after more than two decades of research, it has not been possible to advance too much in the question whether the DNF formulas are PAC-learnable or not, it became increasingly clear that we must study the learnability of DNF formulas under simpler models.

3.3

Occam learning model

In Occam learning, an algorithm takes a sample S = {hx1 , c(x1 )i, . . . , hxm , c(xm )i}, and outputs a “succinct” hypothesis h that is consistent with the sample, i.e., h(xi ) = c(xi ) for each i, and size(h) is a slowly growing function of n, size(c), and m

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

31

Definition 17 (Occam learning model) Let α ≥ 0 and 0 ≤ β < 1. L is an (α, β)-Occam algorithm for C using H if on input a sample set S of cardinality m labeled according to c ∈ C, L outputs a hypothesis h ∈ H such that: • h is consistent with S; • size(h) ≤ (n · size(c))α mβ . L is an efficient (α, β)-Occam algorithm if and only if its running time is bounded by a polynomial in n, m, and size(c).

3.4

Relationship between PAC and Occam

Theorem 19 (Occam’s Razor) Let L be an efficient (α, β)-Occam algorithm for C using H. Let D be the target distribution over the instance space X = x1 , . . . , xn and let c ∈ C be the target concept, 0 < ǫ, δ ≤ 1. Then there is a constant a > 0 such that if L is given an input a random sample S of m examples drawn according to D and c, where m satisfies  1 !  1 (n · size(c))α 1−β 1 log + , m≥a ǫ δ ǫ then with probability at least 1 − δ the output h of L has error at most ǫ. Moreover, L runs in time polynomial in n, size(c), Proof.

1 ǫ

and 1δ .

Let Hn,m be the set of possible hypothesis representations that the (α, β)-Occam

algorithm L may output when given as input a labeled sample S, |S| = m. Since L is an (α, β)-Occam algorithm, each hypothesis has bit length at most (n·size(c))α mβ , and therefore |Hn,m | ≤ 2(n·size(c))

α mβ

.

For an (α, β)-Occam algorithm L, we want to determine mǫ,δ such that for any concept concept class C, for any concept c ∈ C and any distribution of samples D, if L uses at least mǫ,δ samples from D, then L learns C in the PAC model (i.e., with probability at least 1 − δ, L outputs a hypothesis that has error less than ǫ).

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

32

Let ~v be a sequence of mǫ,δ samples drawn according to the distribution D. We say that a hypothesis h is ǫ-close, if error(h) < ǫ. For a sequence ~v of samples from D, we say that a hypothesis h is nice if it has the following property: nice(h, ~v ) = “if h is consistent with c on ~v , then h is ǫ−close to c′′ .

(3.1)

Therefore we want to determine mǫ,δ such that for any concept class C, for any concept c ∈ C, for any sample distribution D, for a fraction at least 1 − δ of all sequences of samples ~v of size mǫ,δ drawn according to D, all hypothesis h ∈ Hn,m are nice: ∀C, ∀c ∈ C, ∀D, ∀h ∈ Hn,m , Pr [nice(h, ~v )] > 1 − δ. ~v ∈D

(3.2)

Assume that the relation 3.2 above does not hold. Then ∃C, ∃c ∈ C, ∃D, Pr [∃h ∈ Hn,m such that ¬nice(h, ~v )] ≥ δ ~v ∈D

(3.3)

Since there are no more than |Hn,m | possible hypothesis, for the above C, c, and D, we further have that Pr

h∈Hn,m , ~v ∈D

[¬nice(h, ~v )] ≥ δ ·

1 . |Hn,m |

(3.4)

Let h ∈ Hn,m be a bad hypothesis (i.e., error(h) > ǫ), arbitrarily chosen, but fixed. Then, by Definition 3.1, Pr [h(x) 6= c(x)] > ǫ.

x∈D

(3.5)

The probability that h is consistent with a random sample from D is hence: Pr [h(x) = c(x)] ≤ 1 − ǫ.

x∈D

(3.6)

Since the samples are drawn independently, each according to the distribution D, we further have that for a sequence ~v of mǫ,δ samples drawn according to D, Pr [h(x) = c(x), ∀x ∈ ~v ] ≤ (1 − ǫ)mǫ,δ .

~v ∈D

(3.7)

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

33

By the Definition 3.1, h is not a nice hypothesis. Using the fact that h was chosen arbitrarily, we have that Pr

~v ∈D, h∈Hn,m

[¬nice(h, ~v )] ≤ (1 − ǫ)mǫ,δ .

(3.8)

From Equations 3.4 and 3.8, it follows that δ·

1 ≤ Pr [¬nice(h, ~v )] ≤ (1 − ǫ)mǫ,δ . |Hn,m | h∈Hn,m , ~v∈D

(3.9)

So, for mǫ,δ satisfying the equation δ·

1 ≤ (1 − ǫ)mǫ,δ , |Hn,m |

(3.10)

the equation 3.2 does not hold. Therefore L is a PAC algorithm if mǫ,δ satisfies the equation (1 − ǫ)mǫ,δ ≤ δ ·

1 |Hn,m |

(3.11)

This further gives: log ((1 − ǫ)mǫ,δ ) + log |Hn,m | ≤ log δ,

(3.12)

and finally: mǫ,δ · log

1 1 ≥ log |Hn,m | + log . 1−ǫ δ

(3.13)

1 Let b > 0 such that bǫ = log 1−ǫ . Then

mǫ,δ ≥

1 1 1 log |Hn,m | + log . bǫ bǫ δ

(3.14)

Inequality 3.14 is satisfied if mǫ,δ ≥ 2 · max



1 1 1 log |Hn,m |, log bǫ bǫ δ



(3.15)

Solving the inequality 3.15 for each term in under the max operator, we have the following lower bounds, respectively: mǫ,δ mǫ,δ



2 (n · size(c))α ≥ bǫ 1 2 log ≥ bǫ δ

1  1−β

(3.16) (3.17)

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

34

Hence, the inequality 3.15 is satisfied for mǫ,δ

1  1−β  2 1 2 α ≥ log + . (n · size(c)) bǫ δ bǫ

(3.18)

Taking a = 2b , we further obtain: mǫ,δ ≥

1 i 1−β a 1 ha . log + (n · size(c))α ǫ δ ǫ

(3.19) ✷

3.5

Learning decision trees

Decision tree learning is one of the most widely used and practical methods for inductive inference over supervised data. It approximates discrete-valued functions (as opposed to continuous) and can represent any discrete function on discrete features. The general technique for learning decision trees goes as follows. First we learn the decision tree in the Occam model, and afterward, using the Occam’s Razor theorem it follows that decision trees are learnable in the PAC model. The result below however does not learn decision trees in polynomial time, but rather in quasi-polynomial time. Hence, the final result will be a quasi-polynomial time running algorithm that learns the decision tree concept class in the PAC model. Definition 18 Let Υ be a decision tree, and let S be a set of labeled examples. We say that Υ is consistent with S if and only if for every labeled sample hα, βi ∈ S, the path induced by the assignment α in the tree Υ leads to a leaf labeled with β. Theorem 20 The concept class of decision trees over n variables is quasi-learnable in the Occam model, in time nO(log s) · O(m), where m is the size of the sample set used and s is the size of the shortest decision tree consistent with the sample set. Proof.

Let C be the concept class of decision trees over n variables, and let S be a set of

samples of size m. First, consider the Algorithm A′D.T. (S, n, n′ , z, τ ), from the Figure 3.1,

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

35

which, except the initial case, is almost the same as the Algorithm A′DPLL (F, n, n′ , z, τ ) for automatizability of DPLL from the Figure 2.3. The intuition here is that we are looking for a decision tree of size at most z. τ is a partial assignment over at most n variables, and n′ = |τ |∗ is the number of unset variables from τ . Both algorithms (this one for learning decision trees and the algorithm for automatizability of DPLL from Theorem 11) search for a decision tree of size at most z and over at most n′ variables, that (in this learning case) is consistent with the samples S⌈τ (in the automatizability case, the decision tree must be a DPLL refutation of a particular formula). If no such decision tree is found, the algorithm returns Λ. Both algorithms are recursive on n′ and z. For the initial case of the learning algorithm, (the only case in which they differ), assume n′ = 0 or z < 1. If there exists a label β ∗ ∈ {+, −} such that β ∗ labels all the examples from S⌈τ , then return β ∗ . Otherwise return Λ. In the recursion step of both algorithms we are looking for a decision tree which has either half the size, either it has one less variables. The computation done so far is stored in the truth assignment τ which contains the values of the variables that have been evaluated so far. In the learning algorithm, the new smaller decision tree that we are looking for through recursion must be consistent with the set of examples in which the variables are consistent with the assignments done so far. In the automatizability algorithm, the new decision tree must be a DPLL refutation of the initial formula restricted to the assignments done so far. More precisely, for all the indexes i of the unset variables from τ , in sequential order, for both  b ∈ {0, 1} set the variable xi to b, and then recursively call Υbi ← A′D.T. S, n, n′ − 1, z2 , τ i,b ,

until for some xi and b, a valid decision tree Υbi 6= Λ is computed. Afterward, set the variable xi   to b, and recursively compute again Υbi ← A′D.T. S, n, n′ − 1, z, τ i,b ; when this returns, exit the loop.

x

If Υbi and Υbi are not both valid decision trees, then return Λ. Otherwise, return Υ ←Υb  i  b , i

Υ i

the decision tree formed by: the root node labeled with xi ; the left child corresponding to

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

36

xi ← b, which is the decision tree Υbi ; and the right child corresponding to xi ← b, which is the decision tree Υbi . 1

2

3

4

5

6

7

Algorithm A′D.T. (S, n, n′ , z, τ ) if n′ = 0 or z < 1 then if exists β ∗ ∈ {+, −} such that for all hα, βi ∈ S⌈τ , we have β = β ∗ then return β ∗ else return Λ for i = j1 , j2 , . . . , jn′ , where τj1 , τj2 , . . . , τjn′ = ∗, do

8

b←0

9

Υbi ← A′D.T. S, n, n′ − 1, z2 , τ i,b

10

11

12

if Υbi 6= Λ then   Υbi ← A′D.T. S, n, n′ − 1, z, τ i,b b←1

14

Υbi ← A′D.T. S, n, n′ − 1, z2 , τ i,b

16

17

{set variable xi to b = 0} {set variable xi to b = 1}

exit the for loop

13

15





if Υbi 6= Λ then   Υbi ← A′D.T. S, n, n′ − 1, z, τ i,b

{set variable xi to b = 1 } {set variable xi to b = 0}

exit the for loop

18

if Υbi = Λ or Υbi = Λ then return Λ

19

Υ ←Υb  i 

x

i

20

Υb i

return Υ

Figure 3.1: The Algorithm A′D.T. (S, n, n′ , z, τ ) In the next three claims we formally prove the consistency and completeness, and we estimate the run-time of the algorithm described above.

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

37

Claim 21 (Consistency) If the decision tree Υ returned by the Algorithm A′D.T. (S, n, n′ , z, τ ) from the Figure 3.1 is not Λ, then Υ is a decision tree consistent with the set of samples S⌈τ . Proof. (of the “Consistency” claim). We proceed the proof by induction on n′ . Base Case. Assume n′ = 0 or z < 1. Since Υ is not Λ, there exists β ∗ ∈ {+, −} such that all the examples from S⌈τ are labeled with β ∗ , and the proof follows. Induction Case. Let n′ be at least 1. Let F be fixed. We assume that for any z ≥ 0 and for any partial assignment τ over n variables, if A′D.T. (F, n, n′ , z, τ ) returns a valid decision tree, then it is consistent with F⌈τ . We prove that for any z ≥ 0 and for any partial assignment τ over n variables, if A′D.T. (F, n, n′ + 1, z, τ ) returns a valid decision tree Υ, then Υ is consistent with S⌈τ . Let any z ≥ 0 and let τ be any partial assignment over n variables. Let Υ be a valid decision tree returned by A′D.T. (F, n, n′ + 1, z, τ ). If z < 1, then the proof follows by the base case above. Hence assume z ≥ 1. By line 19, there exists an unset variable xi from τ , there exists b ∈ {0, 1}, and there exist    z ′ i,b b ′ i,b b ′ ′ such and Υi ← AD.T. F, n, n − 1, z, τ decision trees Υi ← AD.T. F, n, n − 1, 2 , τ x

that Υ ←Υb  i  i

Υb i

(the left child of xi corresponds to the restriction xi ← b, and the right

child corresponds to the restriction xi ← b). By the induction hypothesis we have that Υbi is a decision tree consistent with the samples S⌈τ i,b , and that Υbi is a decision tree consistent with the samples S⌈τ i,b . Let hα, βi be a sample from S⌈τ . We have two cases, depending on the value of αi : • if αi = b, then any path induced on Υ by α is a path induced on Υbi . By the induction hypothesis, Υbi is consistent with S⌈τ i,b and hence, it leads to a leaf labeled with β. • if αi = b, then any path induced on Υ by α is a path induced on Υbi . By the induction hypothesis, Υbi is consistent with S⌈τ i,b and hence, it leads to a leaf labeled with β. Therefore, since for any sample hα, βi ∈ S⌈τ , the path induced by α on Υ leads to a leaf labeled with β it follows that Υ is consistent with S⌈τ .

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

38 ✷(of the “Consistency” claim)

Claim 22 (Completeness) IF there exists a decision tree ΥTarget such that • ΥTarget has size at most z, • ΥTarget is over at most the variables not set by the partial assignment τ , • ΥTarget is consistent with the samples S⌈τ , and THEN the Algorithm A′D.T. (S, n, n′ , z, τ ) from the Figure 3.1 will return a decision tree over at most the n′ variables unset by τ . Proof. (of the “Completeness” claim). By induction on n′ . Base Case. If n′ = 0 or z < 1, then there exists β ∗ ∈ {+, −} such that all the examples from S are labeled with β ∗ , and the proof follows. Induction Case. Let n be at least 1. Let xi′ be the variable that labels the root of ΥTarget . Then one of the children of the root must have size at most z2 , and the other has size at most z. Let b′ ∈ {0, 1} such that the child corresponding to xi′ ← b′ is a decision tree of size at most z2 . Let us define ΥTarget =

′ Υb′ i ,Target

x ′

i

′ Υb′ i ,Target

, where the left child corresponds to the restriction

xi′ ← b′ , and the right child corresponds to the restriction xi′ ← b′ . So, ′

• Υbi′ ,Target has size at most z, ′





• Υbi′ ,Target is over at most the unset variables from τ i ,b , ′

• Υbi′ ,Target is consistent with the samples S⌈τ i′ ,b′ , and Therefore, by the induction hypothesis, (at least) for these values of i′ and b′ , ′ ′ ′ Υbi′ ← A′D.T. S, n, n′ − 1, z2 , τ i ,b is a valid decision tree.

Hence the algorithm will eventually find some unset variable xi from τ , and b ∈ {0, 1}, such  that Υbi ← A′D.T. S, n, n′ − 1, z2 , τ i,b is a valid decision tree. However, these are not necessar-

ily the variable xi′ and b′ from above.

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

39

  By induction on n, Υbi ← A′D.T. S, n, n′ − 1, z, τ i,b is also a valid decision tree.

In the end, since both Υbi and Υbi are valid decision trees, A′D.T. (S, n, n′ , z, τ ) will return the x

(valid) decision tree Υb  i  i

Υb i

. And the proof follows. ✷(of the “Completeness” claim)

Claim 23 (Run-Time) The Algorithm A′D.T. (S, n, n′ , z, τ ) from the Figure 3.1 runs in time (n′ )

O(log z)

· O(m · n),

(3.20)

where m is the size of the set of samples S. Proof. (of the “Run-Time” claim). Let Tm,n (z, n′ ) be the run-time of the algorithm. If n′ = 0 or z < 1, then Tm,n (z, n′ ) = O(m · n).

(3.21)

Otherwise, Tm,n (z, n′ ) ≤ n + 2n′ · Tm,n

z

2

 , n′ − 1 + Tm (z, n′ − 1) .

(3.22)

We apply the recursive formula again for Tm,n (z, n′ − 1), and we get Tm,n (z, n′ ) ≤ n + 2n′ · Tm,n

z

  z , n′ − 1 + 2n′ · Tm,n , n′ − 1 + Tm,n (z, n′ − 2) (3.23) 2 2

which gives, since Tm,n (z, n′ ) ≤ Tm,n (z, n′ + 1) for any n at least 1 and any z, h i z Tm,n (z, n′ ) ≤ n + 2n′ 2 · Tm,n , n′ − 1 + Tm,n (z, n′ − 2) . 2

(3.24)

After we develop Tm,n (z, n′ − 2) , Tm,n (z, n′ − 3) , . . . , Tm,n (z, 1) recursively and regroup the terms in the right hand side of the inequality as above, we obtain ′



h



Tm,n (z, n ) ≤ n + 2n n · Tm,n

z

2



,n − 1

i

+ Tm,n (z, 0) .

(3.25)

Using the initial condition from the Equation 3.21, we have 2

Tm,n (z, n′ ) ≤ 2 (n′ ) · Tm,n

z

2

 , n′ − 1 + O (m · n) .

(3.26)

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

40

Applying recursion k times, we obtain Tm,n (z, n′ ) ≤ 2 (n′ ) For the smallest k such that

z 2k

2k

· Tm,n

 z ′ , n − k + k · O (m · n) . 2k

(3.27)

< 1, i.e., k = O(log z), using the above Equation, we have

Tm,n (z, n′ ) ≤ 2 (n′ )

O(log z)

· O(m · n) + O (log z) · O (m · n) ,

(3.28)

which gives Tm,n (z, n′ ) ≤ (n′ )

O(log z)

· O(m · n).

(3.29)

✷(of the “Run-Time” claim) And finally, let the Algorithm AD.T. (S, n) that finds a decision tree over at most n variables consistent with the set of samples S, as presented in the Figure 3.2. 1

Algorithm AD.T. (S, n)

2

m ← |S|

3

for s = 1, 2, 3, . . . m · n do

4

τ ← (∗, ∗, ∗, . . . , ∗)

5

Υ ← A′D.T. (S, n, n, s, τ )

6

if Υ 6= Λ then

7

return Υ

8

return Λ

{all variables are unset in the partial assignment} {the algorithm from the Figure 3.1}

{“no consistent concept” }

Figure 3.2: The Algorithm AD.T. (S, n) Occam learns the sample set S The algorithm searches for a decision tree consistent with S assuming that there is one that has size 1, 2, 3, . . . m · n. If AD.T. (S, n) returns a valid decision tree Υ 6= Λ, then Υ was returned by AD.T. (S, n, n, s, τ ) in the line 5. Hence, by the Claim 21 (“Consistency”), Υ is consistent with S⌈τ = S. If there is a decision tree over at most n variables consistent with S, then the

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

41

shortest one has size at most m · n. Hence if s > m · n then there is no decision tree over at most n variables consistent with the sample set S, and the algorithm returns Λ. Assume that, for some value of s, there exists a decision tree ΥTarget of size s, consistent with S, and over at most n variables. Then, since the size of ΥTarget is at most s, by the Claim 22 (“Completeness”), it follows that A′D.T. (S, n, n, s, τ ) will return a valid decision tree, consistent with S⌈τ = S and over at most n variables. Moreover, since the size of the decision tree is at most s, by Claim 23 (“Run-Time”), it follows that A′D.T. (S, n, n, s, τ ) will run in time nO(log s) · O(m · n)

(3.30)

Hence the Algorithm AD.T. (S, n) will return a decision tree over at most n variables consistent with S, or Λ if no such decision tree exists, in time s · nO(log s) · O(m · n) ≤ m · n · nO(log s) · O(m · n) = nO(log s) · O(m · n),

(3.31)

and this concludes the proof. ✷(of the Theorem 20) Corollary 24 The class of decision trees is PAC-quasi-learnable Proof. By the Theorem 19 it follows that there is a PAC learning algorithm associated with the Occam learning algorithm for the class of decision trees that follows from the theorem 20. Moreover, since the Occam algorithm is quasi-polynomial in s, the PAC algorithm can be obtained by a polynomial algorithm, it follows that the final algorithm is quasi-polynomial. ✷

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

3.6

42

Learning DNF formulas

3.6.1 Monotone disjunctions Lemma 25 The concept class of monotone disjunctions is efficient Occam learnable. Proof. Let C be the concept class of monotone disjunctions over n variables x1 , . . . , xn , and let S be a set of samples, of size m. Then the Algorithm Amd (S, n) finds a monotone disjunction that is consistent with the set S: Amd (S, n) outputs the disjunction of all the variables xi where all the negative samples have a 0 in position i, and then verifies that this disjunction is satisfied by all the positive samples; if it is not satisfied by all the positive samples, then Amd (S, n) outputs Λ (meaning “no consistent concept”). Amd (S, n) runs in O (n · m) time and it outputs a hypothesis of size less than n, therefore it is an efficient (1, 0)-Occam learning algorithm.



Corollary 26 The concept class of monotone disjunctions is PAC learnable. Proof. By Lemma 25, the concept class of monotone disjunctions is efficient Occam learnable. Using the Theorem 19 there exists an algorithm that efficiently learns this concept class in the PAC learning model.



3.6.2 Non-monotone disjunctions Lemma 27 The concept class of non-monotone disjunctions is efficient Occam learnable. Proof. Let C be the concept class of non-monotone disjunctions over n variables {x1 , . . . , xn }, and let S be a set of samples, of size m. Then let the learning Algorithm Anmd (S, n), shown in the Figure 3.3, in which the new variables {y1 , . . . , y2n } corresponds to {x1 , x2 , . . . , xn , xn }, and computes the set of samples S ′ in the new variables {y1 , . . . , y2n } that corresponds to the set of samples S in the variables {x1 , . . . , xn }. Then using the Algorithm Amd (S ′ , 2n) from the Lemma 25 we can efficiently Occam learn a monotone disjunction over the set of variables {y1 , . . . , y2n }. And finally Anmd (S, n) returns a

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

1

43

Algorithm Anmd (S, n)

2

Let {y1 , . . . , y2n } be new variables, corresponding to {x1 , x1 , . . . , xn , xn }

3

Compute S ′ the set samples in variables {y1 , . . . , y2n }, corresponding to S

4

F ← Amd (S ′ , 2n)

5

In F replace the literal xi for the literal y2i , and the literal xi for the literal y2i+1

6

return F

{call the algorithm from the Lemma 25}

Figure 3.3: The Algorithm Anmd (S, n) disjunction in variables {x1 , . . . , xn }, by replacing all the y’s with the appropriate x’s by the correspondence previously defined. If F = Λ, then there is no concept consistent with the set of samples S. Since Amd (S ′ , 2n) is an efficient (1, 0)-Occam learning algorithm, and its runtime is O (n · m), it follows that Anmd (S, n) is also an efficient (1, 0)-Occam learning algorithm, running in time O (n · m).



Corollary 28 The concept class of non-monotone disjunctions is PAC learnable. Proof.

By Lemma 27, the concept class of non-monotone disjunctions is efficient Occam

learnable. Using the Theorem 19, there exists an algorithm that efficiently learns this concept class in the PAC learning model.



3.6.3 k-DNF formulas In the following we show how to learn the concept class of bounded width DNF formulas in the Occam model. Let C be the concept class of k-DNF formulas over a set of n variables {x1 , . . . , xn }, and let S be a set of samples of size m. Let {T1 , T2 , . . . , Tp } be the set of all terms over the variables x1 , . . . , xn , of size at most k. Note that p = nO(k) . Let {y1 , y2 , . . . , yp } be a set of variables,

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

44

yi corresponds to the term Ti , for all 1 ≤ i ≤ p. Let S ′ be a new set of samples in the variables {y1 , y2 , . . . , yp } obtained from the initial set of samples S using the correspondence described above. Obviously S ′ has size m. Let AbDNF (S, k, n) be the following algorithm, 1

Algorithm AbDNF (S, k, n)

2

Let {T1 , . . . , Tp } be all the terms containing at most k literals from {z1 , z2 , . . . , zn }

3

Let {y1 , . . . , yp } be new variables, yi maps Ti

4

Compute S ′ the set samples in variables {y1 , . . . , yp }, corresponding to S

5

F ← Anmd (S ′ , p)

6

In F, for all 1 ≤ i ≤ p, replace the term Ti for the literal yi , and the disjunction of the negated literals from Ti for the literal yi

7

8

{the algorithm from the Figure 3.3}

return F Figure 3.4: The Algorithm AbDNF (S, k, n)

shown in the Figure 3.4, that constructs the set S ′ as described above, and then runs the Algorithm Anmd (S ′ , p) from the Lemma 27 for efficiently learning non-monotone disjunctions in the Occam model in time O (p · m) ≤ nO(k) · O (m). If the Algorithm Anmd (S ′ , p) returns F, then AbDNF (S, k, n) replaces the positive variables of F with the corresponding terms {T1 , T2 , . . . , Tp } , and the negative variables are replaced with the disjunction of the negated literals that appear in the corresponding term. Finally, AbDNF (S, k, n) returns the k-DNF formula F that results ( or Λ if Anmd (S ′ , p) found no consistent concept). Claim 29 (Ω-Consistency) If the Algorithm AbDNF (S, k, n) from the Figure 3.4 returns a formula F that is not Λ, then F is a k-DNF formula consistent with the samples S. Proof.

The proof follows from the consistency of the Algorithm Anmd (S ′ , p), proved in the

Lemma 27. ✷

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

45

Claim 30 (Ω-Completeness) If there exists a k-DNF formula consistent with the sample set S, then the Algorithm AbDNF (S, k, n) from the Figure 3.4 will find and return a k-DNF formula F 6= Λ, such that F is consistent with the set of samples S. Proof.

The proof follows from completeness of the Algorithm Anmd (S ′ , p), proved in the

Lemma 27. ✷ Lemma 31 The concept class of k-DNF formulas is efficient Occam learnable, in time nO(k) · O(m), where m is the size of the sample set. Proof. The proof follows from the Claim 29 (“Ω-Consistency”), the Claim 30 (“ Ω-Completeness”), the observation that p = nO(k) , and the fact that the Algorithm Anmd (S ′ , p) runs in time O(p·m), proved in the Lemma 27. Hence AbDNF (S, k, n) is an efficient (O (k) , 0)-Occam learning algorithm. ✷ Corollary 32 The concept class of k-DNF formulas is PAC learnable. Proof.

By Lemma 31, the concept class of k-DNF formulas is efficient Occam learnable.

Using the Theorem 19, there exists an algorithm that efficiently learns this concept class in the PAC learning model. ✷ Notice that the Lemma 31 does not apply for general DNF formulas in which the size of the terms is not constant bounded, because the number of terms can be of the order of nO(n) and hence, the Algorithm Anmd from the Lemma 27 returns a hypothesis which is not Occam. However, we can use the Lemma 31 in order to give an Occam algorithm for learning DNF; even if it is not efficient, it is still sub-exponential.

3.6.4 General DNF formulas Definition 19 Let F be a DNF formula of size s, over n variables, and let w = term t of F is called large if t contains more than w literals.



2n log s. A

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

46

Fact 33 Let F = t1 ∨ t2 ∨ . . . ∨ tk be a DNF formula over n variables, and let z be the number of large terms of F. Consider the experiment in which a literal ℓ is chosen randomly. Then the expectation of the w . number of large terms of the formula F in which ℓ appears is at least z 2n

Proof. (of fact). Each large term has more than w literals. Hence, for each large term t we have that the literal ℓ is in t with probability Pr

ℓ∈{ℓ1 ,...,ℓ2n }

[ℓ ∈ t] ≥

w . 2n

(3.32)

Since there are z large terms, the expectation of the number of large terms in which the ranw domly chosen literal ℓ appears is at least z 2n .



Corollary 34 Let F be a DNF formula over n variables. Then there is a literal that appears in a fraction at least

w 2n

of large terms.

Proof. (of corollary). By Fact 33, the expectation of the fraction number of large terms in which a randomly chosen literal appears is at least appears in a fraction at least

w 2n

of large terms.

w . 2n

Hence there must be a literal that ✷

The connection between the automatizability of resolution and PAC-learning DNF formulas is not that obvious. The automatizability algorithm for resolution, see Figure 2.2, is very simple. However, it very much relies on the combinatorial property from Lemma 4. The DNF learning is related more to this combinatorial property than to the resolution automatizability algorithm itself. Looking more carefully to the proof of Lemma 4, we can find its structural recursion over the two variables n, the number of unset literals, and z, the number of large clauses, in the DNF learning algorithm from the Figure 3.5.

Theorem 35 The concept class of DNF formulas over n variables can be Occam learned in time nO( set used.



n log s)

· O(m), where s is the size of the shortest DNF and m is the size of the sample

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS Proof.

47

Let C be the concept class of DNF formulas over n variables and let S be a set of

samples of size m. First consider the Algorithm A′DNF (S, w, n, n′ , z, τ ), presented in the Figure 3.5. The intuition here is that we are looking for a DNF formula with at most z large clauses. τ is a partial assignment over at most n variables, and n′ = |τ |∗ is the number of unset variables from τ . The algorithm searches for a DNF formula of size at most z and over at most n′ variables, that is consistent with the samples S⌈τ . If no such DNF formula is found, the algorithm returns Λ. This algorithm is recursive on n′ and z. The initial cases are when n′ = 0 or z < 1. If n′ = 0 then let β be the unique example from S⌈τ . If S⌈τ is empty then return an arbitrary DNF formula between 0 or 1. Otherwise, if β = +, then return 1, and if β = −, then return 0. If z < 1 then return the DNF formula generated by the Algorithm AbDNF (S⌈τ , w, n′ ). The recursion step is as follows: for all the indexes i of the unset variables from τ , in sequential order, for both b ∈ {0, 1}, set the variable xi to b, and then recursively compute  i,b  w , τ , until for some index i, a valid DNF formula Fib ← A′DNF S, w, n, n′ − 1, z 1 − 2n ′ Fib 6= Λ is computed. Afterward set the variable xi to b, and recursively compute again   Fib ← A′DNF S, w, n, n′ − 1, z, τ i,b ; when this returns, exit the loop.

If Fib and Fib are not both valid DNF formulas, then return Λ. Otherwise, return F, the DNF    b b b b representation of the formula xi ∧ Fi ∨ xi ∧ Fi . F is computed as follows: take the

OR between the disjunction of the terms of Fib in which it is inserted the literal xbi , and the disjunction of the terms of Fib in which it is inserted the literal xbi . In the next three claims we formally prove the consistency and completeness, and we estimate the run-time of the algorithm described above.

Claim 36 (Consistency) If the formula F returned by the Algorithm A′DNF (S, w, n, n′ , z, τ ) from the Figure 3.5 is not Λ, then F is a DNF formula consistent with the samples S⌈τ . Proof. (of the “Consistency” claim). We proceed the proof by induction on n′ and z. Base Case. Assume n′ = 0. If S⌈τ is empty (there are no examples), then obviously any

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

1

2

Algorithm A′DNF (S, w, n, n′ , z, τ ) if n′ = 0 then

3

if S⌈τ = ∅ then return 0 or 1 arbitrarily

4

let hα, βi be the (unique) example from S⌈τ

5

if β = + then return 1

6

else return 0

7

48

if z < 1 then

8

F ← AbDNF (S⌈τ , w, n′ )

9

rename in F all variables {z1 , z2 , . . . , zn′ } with the unset variables from τ .

10

return F

11

for i = j1 , j2 , . . . , jn′ , where τj1 , τj2 , . . . , τjn′ = ∗, do

12

b←0

13

Fib ← A′DNF S, w, n, n′ − 1, z 1 −

14

15

16

18

Fib ← A′DNF S, w, n, n′ − 1, z 1 −

21

22

23

24



, τ i,b



{set variable xi to b = 0} {set variable xi to b = 1}

exit the for loop

b←1

20

w 2n′

if Fib 6= Λ then   Fib ← A′DNF S, w, n, n′ − 1, z, τ i,b

17

19

{the algorithm from the Figure 3.4}

w 2n′



if Fib 6= Λ then   Fib ← A′DNF S, w, n, n′ − 1, z, τ i,b

, τ i,b



{set variable xi to b = 1} {set variable xi to b = 0}

exit the for loop

if Fib = Λ or Fib = Λ then return Λ

   let F be the proper DNF form of the formula xbi ∧ Fib ∨ xbi ∧ Fib return F

Figure 3.5: The Algorithm A′DNF (S, w, z, n, τ )

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

49

formula will be consistent with S⌈τ . Otherwise S⌈τ contains a unique example1 labeled with β. It follows immediately from the algorithm that the return value is consistent with S⌈τ . If z < 1 the proof follows by Claim 29 (”Ω-Consistency”). Induction Case. Let n′ be at least 1. Let w and S be fixed. We assume that for any z ≥ 0 and for any partial assignment τ over n variables, if A′DNF (S, w, n, n′ , z, τ ) returns a valid DNF formula, then it is consistent with S⌈τ . We prove that for any z ≥ 0 and for any partial assignment τ over n variables, if A′DNF (S, w, n, n′ + 1, z, τ ) returns a valid DNF formula F, then F is consistent with S⌈τ . Let any z ≥ 0 and let τ be any partial assignment over n variables. Let F be a valid DNF formula returned by A′DNF (S, w, n, n′ + 1, z, τ ). If z < 1, then the proof follows by the Base Case above. Hence assume z ≥ 1. By the line 23, there exists an unset variable xi from τ , there exists  i,b  w ,τ b ∈ {0, 1}, and there exist valid DNF formulas Fib ← A′DNF S, w, n, n′ − 1, z 1 − 2n ′   and Fib ← A′DNF S, w, n, n′ − 1, z, τ i,b such that F is the DNF form of the formula    xbi ∧ Fib ∨ xbi ∧ Fib . By the induction hypothesis we have that Fib is a DNF formula consistent with the samples

S⌈τ i,b , and that Fib is a DNF formula consistent with the samples S⌈τ i,b . We have two cases, depending on the value of xi :    • if xi = b, then xbi = 1 and xbi = 0. Hence xbi ∧ Fib ∨ xbi ∧ Fib = Fib and it is consistent with S⌈τ i,b ;

   • if xi = b, then xbi = 0 and xbi = 1. Hence, xbi ∧ Fib ∨ xbi ∧ Fib = Fib and it is consistent with S⌈τ i,b ;

   Since S⌈τ = S⌈τ i,b ∪S⌈τ i,b it follows that xbi ∧ Fib ∨ xbi ∧ Fib is consistent with S⌈τ . F is    the DNF form of xbi ∧ Fib ∨ xbi ∧ Fib , hence it is consistent with S⌈τ . ✷(of the “Consistency” claim)

Claim 37 (Completeness) IF there exists a DNF formula FTarget , such that 1

We assume that the sample set is consistent with itself and does not contain duplicates.

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

50

• FTarget has at most z large terms , • FTarget is over at most the variables not set by the partial assignment τ , • FTarget is consistent with the samples S⌈τ , THEN the Algorithm A′DNF (S, w, n, n′ , z, τ ) from the Figure 3.5 will find a valid DNF formula over at most the n′ variables unset by τ . Proof. (of the “Completeness” claim). By induction on n′ . Base Case. If z < 1, by Claim 30 (”Ω-Completeness”), the Algorithm AbDNF (S⌈τ , w, n′ ) from the Figure 3.4 computes valid DNF formula. Assume n′ = 0. If S⌈τ is empty, then any of 0 or 1 would be a valid DNF formula. Otherwise, S contains a unique example labeled with β. It follows immediately from the algorithm that the return value is a valid DNF formula. Induction Case. Let n′ be at least 1. By Corollary 34 it follows that there is a literal that occurs in at least a fraction

w 2n′

of large terms in FTarget . Hence, there is a literal such that when it is

restricted to zero, at least a fraction

w 2n′

of large terms from FTarget are nullified. Therefore, ′

there exists some index i′ of an unset variable from τ , and b′ ∈ {0, 1} such that setting xbi′ to zero in FTarget nullifies at least a fraction

w 2n′

of large terms. Let us define the DNF formula



Fib′ ,Target = FTarget ⌈xb′′ ←0 . So, i



• Fib′ ,Target has z 1 − ′

w 2n′



large terms, which is at most z since 1 − ′



• Fib′ ,Target is over at most the unset variables from τ i ,b ,

w 2n′



is less than 1,



• Fib′ ,Target is consistent with the samples S⌈τ i′ ,b′ . Therefore, by the induction hypothesis, (at least) for these values of i′ and b′ ,  i′ ,b′  ′ w is a valid DNF formula. Fib′ ← A′DNF S, w, n, n′ − 1, z 1 − 2n ,τ ′

Hence, the algorithm will eventually find some unset variable xi from τ , and b ∈ {0, 1}, such  i,b  w ,τ is a valid DNF formula. However, these are that Fib ← A′DNF S, w, n, n′ − 1, z 1 − 2n ′

not necessarily the variable xi and b from above.

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

51

  By the induction hypothesis, Fib ← A′DNF S, w, n, n′ − 1, z, τ i,b is also a valid DNF formula.

In the end, since both Fib and Fib are valid DNF formulas, A′DNF (S, w, n, n′ − 1, z, τ ) will return    the DNF form of xbi ∧ Fib ∨ xbi ∧ Fib , which is itself a valid DNF formula. And the proof

follows.

✷(of the “Completeness” claim) Claim 38 (Run-Time) The Algorithm A′DNF (S, w, n, n′ , z, τ ) from the Figure 3.5 runs in time   n′ log z ′ O w+ w

(n )

· O(m) + O (m · n) ,

(3.33)

where m is the size of the set of samples S. Proof. (of the “Run-Time” claim). Let Tm,w,n (z, n′ ) be the run-time of the algorithm. If z < 1, then Tm,w,n (z, n′ ) is O(m · n) plus the run-time of the Algorithm AbDNF (S⌈τ , w, n′ ) (see line 7), and by Lemma 31, Tm,w,n (z, n′ ) = (n′ )

O(w)

· O(m) + O (m · n) .

(3.34)

If n′ = 0, then Tm,w,n (z, n′ ) = O(m · n).

(3.35)

Otherwise,    w  Tm,w,n (z, n′ ) ≤ n + 2n′ · Tm,w,n z 1 − ′ , n′ − 1 + Tm,w,n (z, n′ − 1) . 2n

(3.36)

We apply the recursive formula again for Tm,w,n (z, n′ − 1), and we get    w  ′ Tm,w,n (z, n ) ≤ n + 2n · Tm,w,n z 1 − ′ , n − 1 + 2n    w  ′ ′ 2n · Tm,w,n z 1 − ′ , n − 1 + Tm,w,n (z, n′ − 2) , 2n ′



(3.37)

which gives h i   w  Tm,w,n (z, n′ ) ≤ n + 2n′ 2 · Tm,w,n z 1 − ′ , n′ − 1 + Tm,w,n (z, n′ − 2) . 2n

(3.38)

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

52

After we develop Tm,w (z, n′ − 2) , Tm,w (z, n′ − 3) , . . . , Tm,w (z, 1) recursively and regroup the terms in the right hand side of the inequality as above, we obtain h   i w  Tm,w,n (z, n′ ) ≤ n + 2n′ n′ · Tm,w,n z 1 − ′ , n′ − 1 + Tm,w,n (z, 0) . 2n

(3.39)

Using the initial condition from the Equation 3.35, we have    w  2 Tm,w,n (z, n′ ) ≤ 2 (n′ ) · Tm,w,n z 1 − ′ , n′ − 1 + O(m · n). 2n

(3.40)

Applying recursion k times, we obtain,    w k ′ (3.41) Tm,w,n (z, n ) ≤ 2 (n ) · Tm,w,n z 1 − ′ , n − k + k · O(m · n). 2n  ′ z w k For the smallest k such that z 1 − 2n < 1, i.e., k = O( n log ), using the Equation 3.34, we ′ w ′ 2k



have



′ O

Tm,w,n (z, n ) ≤ 2 (n )



n′ log z w



h

′ O(w)

· (n )

i

· O(m) + O (m · n) + O



n′ log z w



· O(m · n), (3.42)

which gives ′

  n′ log z ′ O w+ w

Tm,w,n (z, n ) ≤ (n )

· O (m) + O (m · n)

(3.43)

✷(of the “Run-Time” claim) And finally, let the Algorithm ADNF (S, n) that finds a DNF formula over at most n variables consistent with the set of samples S, as presented in the Figure 3.6. The algorithm searches for a DNF formula consistent with S assuming that there is one that has size 1, 2, 3, . . . , m · n. If ADNF (S, n) returns a valid DNF formula F 6= Λ, then F was returned by A′DNF (S, w, n, n, s, τ ) in the line 6. Hence, by Claim 36 (“Consistency”), F is consistent with S⌈τ = S. If there is a DNF formula over at most n variables consistent with S, then the shortest one has size at most m · n. Hence, if s > m · n then there is no DNF formula over at most n variables consistent with the sample set S, and the algorithm returns Λ. Assume that, for some value of s, there exists a DNF formula FTarget of size s, consistent with S, and over at most n variables. Then, since the number of large terms of FTarget is at most s,

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

1

2

3

4

Algorithm ADNF (S, n) m ← |S| for s = 1, 2, 3, . . . , m · n do √ w ← 2n log s

5

τ ← (∗, ∗, ∗, . . . , ∗)

6

F ← A′DNF (S, w, n, n, s, τ )

7

if F 6= Λ then

8

return F

9

53

{set width of large terms} {all variables are unset in the partial assignment} {the algorithm from the Figure 3.5}

{“no consistent concept”}

return Λ

Figure 3.6: The Algorithm ADNF (S, n) Occam learns the sample set S by Claim 37 (”Completeness”), it follows that A′DNF (S, w, n, n, s, τ ) will return a valid DNF formula, consistent with S⌈τ = S and over at most n variables. Moreover, since the number of large terms is at most s, by Claim 38 (“Run-Time”), it follows that A′DNF (S, w, n, n, s, τ ) will run in time at most n

  √ O w+ nwlog s

· O(m).

(3.44)

Hence the Algorithm ADNF (S, n) will return a DNF formula over at most n variables consistent with S, or Λ if no such DNF formula exists, in time at most s·n

  √ O w+ nwlog s

· O(m) ≤ m · n · n

which is minimum for w =



  √ O w+ nwlog s

· O(m) = n

  √ O w+ nwlog s

· O(m),

(3.45)

n log s. So, the returned DNF formula has size at most nO (



n log s)

· O(m),

(3.46)

and this concludes the proof. ✷(of the Theorem 35)

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

54

Corollary 39 The concept class of DNF formulas of size at most s can be PAC learned in time nO (



n log s)

· O(m), where m is the size of the sample set and s is the size of the shortest DNF

consistent with the samples. Proof.

By Theorem 35, the Algorithm ADNF (S, n) is an Occam learning algorithm for the

concept class of DNF formulas of size at most s that runs in time nO(



n log s)

· O(m). By

Theorem 19 there is a PAC learning algorithm for the same concept class that runs in time polynomial in the running time of the corresponding Occam learning algorithm. Hence the corollary follows. ✷

3.7

Learning degree d polynomials

Consider the concept class C of degree d polynomials and let p (~x) be a degree d polynomial from C. Given S, a size m set of samples of q (~x), the problem is to find a polynomial in C which is consistent with S. The learning algorithm presented bellow is related to the polynomial calculus automatizability algorithm from section 2.7.3. Theorem 40 The concept class of degree d polynomials is efficiently Occam learnable. Proof. Let P be the set of all degree d terms in variables x1 , . . . , xn . P contains

Pi=d i=0

n i



terms. The polynomial h (~x) that is consistent with all the samples from S is a linear combination of terms from P . Assume that h (~x) = α1 t1 (~x) + . . . + αk tk (~x) , where k ≤

Pi=d i=0

n i



(3.47)

.

Let S = {h~x1 , +i , h~x2 , +i , . . . , h~xp , +i , h~xp+1 , −i , . . . , h~xm , −i}, where p is the number of positive samples in S, p ≤ m. A sample ~xi is positive if the polynomial q (~x) evaluated in ~xi is zero, and negative otherwise.

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

55

Let βp+1 , . . . , βm be m − p variables such that βp+1 6= 0, . . . , βm 6= 0. The coefficients α1 , . . . , αk are the solutions of the following system of m equations in m + k − p unknown variables.

  α1 t1 ~x1 + . . . + αk tk ~x1 = 0   α1 t1 ~x2 + . . . + αk tk ~x2 = 0

(3.48)

...

α1 t1 (~xp ) + . . . + αk tk (~xp ) = 0   α1 t1 ~xp+1 + . . . + αk tk ~xp+1 = βp+1 ...

α1 t1 (~xm ) + . . . + αk tk (~xm ) = βm We are looking for a solution (α1 , . . . , αk , βp+1 , . . . , βm ) of this system, such that no β variable is zero:

βp+1 · βp+2 · . . . · βm 6= 0

(3.49)

The system 3.48 can be solved by the standard Gaussian elimination method. If the system has no solution, then the set of samples S are inconsistent. If the solution is unique, then check the condition 3.49, and if it is satisfied, then h (x) can be determined by the correspondent values of α1 , . . . , αk . If the solution is not unique, then the number of independent equations is greater than k. The system should be solved as follows: first eliminate by the Gaussian method the variables α1 , . . . , αk , and then the variables βp+1 , . . . , βm−i . Let the variables βm−i+1 , . . . , βm that can be chosen arbitrarily, where 1 ≤ i ≤ m − p − 1. Choose them such that no variable of the βp+1 , . . . , βm is zero. This can be simply done by using the equations p + 1, . . . , m − i, of the final system. If no such choice is possible, then the set of samples is inconsistent.

C HAPTER 3. L EARNING M ODELS AND A LGORITHMS

56

In both cases, weather the solution of the system is unique or not, the system can be solved in time    O (max {k, m})3 = O max n3d , m3 .

 For nd ≥ m we obtain a solution (or the inconsistency conclusion) in time O n3d .

(3.50) ✷

Corollary 41 The concept class of degree d polynomials is efficiently PAC learnable. Proof. The above theorem finds an efficient Occam learning algorithm for the class of degree d polynomials. By the theorem 19 there is a PAC learning algorithm for the same concept class that runs in time polynomial in the running time of the corresponding Occam learning algorithm.



Chapter 4 Resolution is not automatizable, unless ǫ n SAT ⊆ DT IM E(2 )

4.1

Introduction

The minimum hitting set problem takes as input a collection C of subsets of a finite set S and outputs a subset S ′ of S of minimal size, such that S ′ contains at least one element from each subset in C. The minimum hitting set problem and minimum set cover problem are equivalent. Therefore, approximation algorithms and nonapproximability results for minimum hitting set will carry over to minimum set cover, and vice-versa. Recently it has been shown that a constant factor approximation algorithm for the minimum set cover problem in polynomial time ǫ

implies N T IM E(t(n)) ⊆ DT IM E(2t(n) ), for some ǫ < 1, ([AKP04]). An obvious question that arises from these results is how hard is to at least approximate the minimum set cover problem (minimum hitting set). The parameterized version of the minimum hitting set problem takes an additional parameter k, and it outputs a subset S ′ of S of size k, such that S ′ contains at least one element from each subset in C, if such a subset S ′ exists. We know that the parameterized version of the minimum hitting set problem is complete for the class W [2] of parameterized hierarchy, ([DF98]). 57

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

58

In this section, using the technique from [AR01], we show that if we can automatize resolution, then we can approximate the minimum hitting set problem up to a constant factor; in light of the ǫ

above mentioned result, it implies that N T IM E(t(n)) ⊆ DT IM E(2t(n) ), for some ǫ < 1, and this is believed to be unlikely.

4.2

Fixed parameterized problems and hardness

Parameterized complexity theory provides a frame-work for a fine-grain complexity analysis of algorithmic problems that are intractable in general. Central to the theory is the notion of fixed-parameter tractability, which relaxes the classical notion of tractability, i.e., polynomial time computability, by admitting algorithms whose running time is exponential, but only in term of some parameter of the problem instance that can be expected to be small in the typical applications. A parameterized problem is a set P ⊆ Σ∗ × N, where Σ is a finite alphabet. If hx, ki ∈ Σ∗ ×N is an instance of a parameterized problem, we refer to x as the input and to k as the parameter. A parameterized problem is fixed-parameter tractable if there is a computable function f and a constant c such that the problem can be solved in time f (k) · nc , where n is the input size and k is the parameter value. Definition 20 The class FPT (fixed parameter tractable) of parameterized problems consists of all languages L ⊆ Σ∗ ×N for which there exists an algorithm Φ, a constant c and a recursive function f : N −→ N such that: 1. the running time of Φ (hx, ki) is at most f (k) · |x|c ; 2. hx, ki ∈ L if and only if Φ (hx, ki) = 1. The class FPR is the same as FPT, except that if hx, ki ∈ L, we only need that Φ(hx, ki) = 1 with probability at least half. The W-hierarchy is the inclusion FPT ⊆ W [1] ⊆ W [2] ⊆ . . . ⊆ W [P],

(4.1)

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

59

where P is any polynomial. In this hierarchy, it is believed to be very unlikely that W [1] is fixed parameter tractable (or, the first inclusion is believed to be strict).

4.3

Reduction from minimum hitting set to automatizability of resolution

In this section we formally define the minimum hitting set problem and we give a reduction from the approximability of the minimum hitting set problem to the automatizability of resolution.

Definition 21 (Minimum hitting set) The minimum hitting set problem is an optimization problem which takes as input a collection of q subsets of some set S, U = {S1 , S2 , . . . , Sq }. The output is a hitting set for U of minimal cardinality, i.e., a set S ′ ⊆ S s.t. |S ′ | is minimal and S ′ ∩ Sj 6= ∅, for all j ∈ [q]. We denote by γ(U) the optimal size of the hitting set of U. Note that an instance of the minimum hitting set problem is completely determined by only the collection U; S can be obtained by the union of all the subsets from U. Definition 22 (Minimum hitting set approximability) Let U be an instance of the minimum hitting set problem, and let h be a real number, at least equal to 1. Let S ′ be a subset of S that is a hitting set for U, and let |S ′ | = k. We say that S ′ is an h-approximation of the minimum hitting set problem if and only if k ≤ h · γ(U). Notation 3 Let ~a ∈ {0, 1}n and C be a monotone circuit. We denote by k(~a) the weight of ~a, and by k(C) the optimal (minimal) value of a satisfying truth assignment for C. Definition 23 We say that an onto function g : {0, 1}k → D is r-surjective if for any restriction ρ with |ρ| ≤ r the function g⌈ρ is still onto.

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

60

In the following we show that if resolution is automatizable, then we can approximate the minimum hitting set problem up to a constant factor. So, from now on, we fix an instance of the minimum hitting set problem, let this be U = {S1 , . . . , Sq }, and let S = {e1 , . . . , en } be the set of all elements in U. We will also fix CU , the monotone circuit of fan-in two that takes p1 , . . . , pn as input variables, and computes the boolean function   ^ _  pi  j∈[q]

(4.2)

ei ∈Sj

The size of CU is O(n). Let d > 0 such that |CU | = dn.

For any hitting set S ′ of U we put in correspondence a truth assignment τ for CU as follows. τ is the characteristic vector of S ′ : τi = 1 if and only if ei ∈ S ′ . And vice-versa, a satisfying assignment for CU corresponds to a hitting set of U. Moreover, a minimum satisfying assignment for CU corresponds to a minimum hitting set of U, and hence γ(U) = k(CU ). The following definition introduces a CNF formula that we will use in the automatizability algorithm for resolution. Definition 24 Let h and k be two parameters, h ∈ R, k ∈ N. Let m be defined as follows:  √   2hk if k ≥ log dn, def (4.3) m =   2h· logkdn , otherwise

Let P be the (m × m) 0-1 Paley matrix given by aij = 1 if and only if j 6= i and (j − i) is a quadratic residue mod m. We will use a property of the Paley graphs, result of [Alo95], which says that any node in P has

log m 4

neighbors and

log m 4

anti-neighbors. In the following

we will denote by A ⊆ {0, 1}m all columns of P . Let F1 , F2 , . . . , Fn : {0, 1}h log m → A, f1 , f2 , . . . , fm : {0, 1}h log m → [log m] be (log m)-surjective functions. We introduce the following new variables • xνj , for every j ∈ [n] and ν ∈ [log m]; • yiν , for every i ∈ [m], and ν ∈ [log m];

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

61

c • ziv , for every i ∈ [m], every “control” c ∈ [log m] and every gate v of the circuit CU .

For ease of description, let us introduce the predicates • [Columnj = ~a] denote Fj



x1j , x2j , . . . , xhj log m



= ~a, for j ∈ [n] and ~a ∈ A;

  • [Controli = c] denote fi yi1 , yi2 , . . . , yih log m = c, for i ∈ [m] and c ∈ [log m]. Let the CNF τ (U, h, k, F~ , f~) which consists of all the clauses that result from the expansion of the following Boolean predicates as CNFs:

  

c ([Columnj = ~a] ∧ [Controli = c]) ⊃ zi,p j

  for all ~a ∈ A, i ∈ [m] such that ai = 1 and all j ∈ [n] , c ∈ [log m] ;    c c c  ⊃ ziv [Control = c] ∧ z ∗ z ′ ′′  i i,v i,v   for all i ∈ [m] , c ∈ [log m] and all internal gates v      corresponding to the instruction v ← v ′ ∗ v ′′ , ∗ ∈ {∧, ∨};    [Controli = c] ⊃ z ci,vfin ,

(4.4)

(4.5)

(4.6)

  where vfin is the output gate of CU .

Informally, every assignment to x-variables encodes a matrix M . The axioms (4.4) and (4.5) ci is greater or equal than the result of computation of the gate v on the inductively state that ziv

i-th row of the matrix M . Thus, τ (U, h, k, F~ , f~) is unsatisfiable (for arbitrary surjective F~ , f~) if and only if there exists a row i ∈ [m] such that1 CU (ai1 , ai2 , . . . , ain ) = 1. In the following lemma we analyze the upper and lower bounds for a resolution refutation of the CNF formula defined above. Lemma 42 In the context of Definition 24, the following bounds hold. 1. if γ(U) ≤ 1

log m 4

  then ST τ (U, h, k, F~ , f~) ≤ O (n) · exp (O (γ(U)2 ));

The order of the vectors a’s from M does not matter. It is enough that an ordering exists such that this relation holds.

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )   2. w τ (U, h, k, F~ , f~) ⊢ ∅ ≥

log m 2

 ~ ~ 3. S τ (U, h, k, F , f ) ≥ exp Ω

Proof.



62

 · min γ(U), log4m ;

log m h

  . · min γ(U), log4m

1. Let S ′ ⊆ S be a minimum hitting set of U. We can assume without loss of generality that S ′ = {e1 , . . . , eγ(U ) }. Choose any γ(U) vectors from A: 

 ai,1 def  .. ~a1 =   .  am,1





 ai,γ(U )    ..  , . . . , ~aγ(U ) def =  .     am,γ(U )



  .  

(4.7)

Since P is a Paley graph of size m, and M is the Paley Matrix associated with P , there exists a row i ∈ [m] such that ai,1 = ai,2 = . . . = ai,γ(U ) = 1. Keep this i fixed now. In the following we show how to infer from τ (U, h, k, F~ , f~) all clauses in the CNF expansion of   [Column1 6= ~a1 ] ∨ . . . ∨ Columnγ(U ) 6= ~aγ(U ) ∨ [Controli 6= c] ,

(4.8)

for all c ∈ [log m], using resolution rules, such that the final proof is a tree-like resolution proof. To do so, let V be the set of gates of CU , that are evaluated to 1 by the assignment  1γ(U ) , 0n−γ(U ) that corresponds to S ′ . Fix an ordering on V which is consistent with

the topology of CU , such that all wires between gates in the enumeration go from left to right: V = hv1 = p1 , v2 = p2 , . . . , vγ(U ) = pγ(U ) , vγ(U )+1 , vγ(U )+2 , . . . , vt = vfin i. By induction on µ = t, t − 1, . . . , γ(U) we infer all clauses in the CNF expansion of   [Controli = c] ⊃ z ci,v1 ∨ z ci,v2 ∨ . . . ∨ z ci,vµ , for c ∈ [log m].

(4.9)

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

63

For µ = t the formula (4.9) is a weakening of (4.6), for c ∈ [log m]. For the inductive step, assume we can infer from τ (U, h, k, F~ , f~) all clauses in (4.9) for µ and we prove that we can also infer the clauses from (4.9) for µ − 1.

Let vµ′ and vµ′′ be the gates

whose outputs are the inputs of vµ . By the order we have imposed on the gates from V , both indexes µ′ and µ′′ are at most µ − 1. We will consider two cases: when the gate vµ is ∨ and when vµ is ∧. (a) Assume vµ is ∨. Hence, for controls c ∈ [log m] the initial formulas from (4.5) are 

  c c c [Controli = c] ∧ zi,vµ′ ∨ zi,vµ′′ ⊃ ziv µ

which can be decomposed into two formulas:   c c [Controli = c] ∧ zi,v ⊃ ziv µ µ′   c c [Controli = c] ∧ zi,v ⊃ ziv µ µ′′

(4.10)

(4.11) (4.12)

Since vµ is in V and vµ is ∨, it follows that at least one of vµ′ and vµ′′ are in V as well. Hence, we resolve the formula from (4.9), obtained by the induction hypothesis, with (4.11) or with (4.12), depending on which of vµ′ and vµ′′ are in V , or with any formula if both gates are in V . After these refutations we will have c and obtained gotten rid of zi,v µ

for c ∈ [log m].

  c c c [Controli = c] ⊃ z i,v1 ∨ z i,v2 ∨ . . . ∨ z i,vµ−1 ,

(4.13)

(b) Assume vµ is ∧. Hence, for c ∈ [log m], the initial formulas from (4.5) are   c c c ∧ z [Controli = c] ∧ zi,v ⊃ ziv i,vµ′′ µ µ′

(4.14)

Since vµ is in V and vµ is an ∧ gate, it follows that both vµ′ and vµ′′ are in V as well. Hence we resolve the formula from (4.9), obtained by the induction hypothesis, c . Finally we will have obtained with (4.14) and get rid of zi,v µ   [Controli = c] ⊃ z ci,v1 ∨ z ci,v2 ∨ . . . ∨ z ci,vµ−1 ,

(4.15)

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

64

for c ∈ [log m]. Hence, for µ = γ(U) we will have derived [Controli = c] ⊃



z ci,p1



z ci,p2

∨ ... ∨

z ci,pγ(U )



.

(4.16)

Now, for µ = γ(U), γ(U) − 1, . . . , 1 we consecutively resolve (4.16) with the corresponding equation c , ([Columnµ = ~aµ ] ∧ [Controli = c]) ⊃ zi,p µ

(4.17)

c from (4.4), for c ∈ [log m]. This will remove one by one zi,p from the right hand side µ

of (4.16), while adding [Columnµ = ~aµ ] in the left hand side, for µ = γ(U), γ(U) − 1, . . . , 1. In the end we will have obtained   [Column1 6= ~a1 ] ∨ . . . ∨ Columnγ(U ) 6= ~aγ(U ) ∨ [Controli 6= c] ,

(4.18)

for c ∈ [log m]. Notice that all this resolution inference is tree-like and has size O (n). Finally, for every i ∈ [m], every clause in the variables {yiν |ν ∈ [h log m]} appears in one of the CNFs resulting from the predicates {[Controli 6= c] |c ∈ [log m]} and every clause in the variables {xνj |ν ∈ [h log m]} appears in one of [Columnj 6= ~aj ]. This gives us a  tree-like refutation of the set of clauses (4.18) that has size O 2(h log m)(γ(U )+1) . Com-

bining this refutation with previously constructed inferences of (4.18) from τ (U, h, k, F~ , f~), we get the desired upper bound.

2. Let Rowi be the set of axioms in τ (U, h, k, F~ , f~) that corresponds to the row i. For a clause D, define the measure µ such that µ (D) is the smallest cardinality of I ⊆ [m] such that {Rowi |i ∈ I} semantically implies D. We show that µ is sub-additive. Assume we obtain D from D1 and D2 by one resolution step. Let I1 be the set of cardinality µ (D1 ) and I2 be the set of cardinality µ (D2 ) from the definition of µ for clauses D1 and D2 . Hence {Rowi |i ∈ I1 } ∪ {Rowi |i ∈ I2 } ⊃ D1 ∧ D2

(4.19)

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

65

By the soundness property of resolution, D1 ∧ D2 ⊃ D. Hence {Rowi |i ∈ I1 } ∪ {Rowi |i ∈ I2 } ⊃ D,

(4.20)

and I1 ∪ I2 is a witness that µ (D) ≤ µ (D1 ) + µ (D2 ). Directly from the definition of µ, it follows immediately that if A is an axiom in τ (U, h, k, F~ , f~), then its measure is 1. Moreover, the empty set has the measure at least

log m , 4

i.e., for any up to

log m 4

“Row’s”

in τ (U, h, k, F~ , f~), there exists a satisfying assignment for all of them. Let I ⊆ [m] with |I| ≤

log m . 4

We construct an assignment that satisfies all axioms in {Rowi |i ∈ I}.

Since P is a Paley graph, there exists a vector ~v ∈ A such that vi = 0 for all i ∈ I. By the surjectivity of Fj , for each j ∈ [n], there exists αj1 , αj2 , . . . , αjh log m such that   Fj αj1 , αj2 , . . . , αjh log m = ~v , for every j ∈ [n]. We assign • every xνj to αjν , for all j ∈ [n] and ν ∈ [h log m]; c • every zi,v to zero, for all i ∈ [m], c ∈ [log m] and all gates v from CU ;

This assignment will satisfy all the axioms from {Rowi |i ∈ I} because • for any vector ~a from A, other than ~v , [Columnj = ~a] is unsatisfied, hence the axioms from the Equation (4.4) are satisfied for every ~a ∈ A, ~a 6= ~v , and for every i ∈ [m], j ∈ [n], c ∈ [log m]; for the vector ~a, since ai = 0 for all i ∈ I, no axiom from the Equation (4.4), for any i ∈ [m], c ∈ [log m], appears in {Rowi |i ∈ I}. • the axioms from the Equation (4.5) are trivially satisfied since all z variables are set to 0; • and finally, the axioms from the Equation (4.6) are also satisfied since all z variables are set to 0. Hence µ (∅) >

log m . 4

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

66

Thus, any resolution refutation of τ (U, h, k, F~ , f~) must contain a clause D0 with log m log m ≤ µ (D0 ) ≤ , 8 4

(4.21)

  log m log m · min γ(U), w (D0 ) ≥ 2 4

(4.22)

log m log m ≤ |I0 | ≤ and {Rowi |i ∈ I0 } ⊃ D0 , 8 4

(4.23)

and we will show that

Fix I0 ⊆ [m] such that

and I0 is one of the minimal sets with this property. If for every i ∈ I0 the clause D0 contains at least log m variables among {yiν |ν ∈ c [h log m]} ∪ {ziv |c ∈ [log m] , v is a node}, we are done. Thus, suppose that this is

not the case for some i0 ∈ I0 . Fix an arbitrary assignment α ~ that satisfies all axioms in {Rowi |i ∈ I0 \ {i0 }} and falsifies D0 (such assignment exists due to the minimality of I0 ). Let J0 consist of those j ∈ [n], for which the clause D0 contains at least log m variables from {xνj |ν ∈ [h log m]}. If |J0 | ≥ γ(U), we are also done. If this is not the case, we will show how to construct another assignment β~ from α ~ , so that β~ will satisfy all axioms in {Rowi |i ∈ I0 } (including Rowi0 ) but still falsify D0 , and this will give us the contradiction. Using the fact that P is a Paley graph and |I0 | ≤

log m , 4

there exists a vector ~v from A such

that vi = 0, for all i ∈ I0 . The index of this vector in the matrix M corresponds to the node which is the anti-neighbor of all the nodes whose indexes are in I0 . We construct β~ as follows. Initially assign β~ ← α ~. (a) For every j 6∈ J0 (i.e., for each j such that D0 contains less than log m variables from {xνj |ν ∈ [h log m]}), using the fact that Fj is log m-surjective, we change in

ǫ

C HAPTER 4. R ESOLUTION IS NOT AUTOMATIZABLE , UNLESS SAT ⊆ DT IM E(2n )

67

β~ the values of the variables {xνj |ν ∈ [h log m]} not appearing in D0 , such that   h log m 1 2 = ~v . F j xj , xj , . . . , xj

(b) Pick any control c0 ∈ [log m] such that no variable zic00,v appears in D0 . By the

choice of i0 , there exists a c0 ∈ [log m] such that no variable zic00,v appears in D0 , for any gate v from CU . Moreover, D0 contains less than h log m variables from {yiν0 |ν ∈ [h log m]}, and, using the fact that fi0 is log m-surjective, we change in β~ the values of the variables {yiν0 |ν ∈ [h log m]} that do not appear in D0 such that   h log m 1 = c0 . f i0 y i0 , . . . , y i0

(c) Change in β~ the value for every variable zic00,v to the value computed by the gate v

on the characteristic vector of the set J0 , (i.e., the value computed by the gates of   1 if j ∈ J0 , for all j ∈ [n]). CU (p1 , p2 , . . . , pn ) when the inputs are pj =   0 if j 6∈ J0

It is easy to see that ~β falsifies D_0, since ~α falsifies D_0 and we did not change the values of the variables that appear in D_0. We argue now that ~β satisfies all axioms from {Row_i | i ∈ I_0}.

First of all, notice that we did not change in ~β the value of any variable in {y_i^ν | ν ∈ [h log m], i ∈ I_0 \ {i_0}} or {z_{i,v}^c | (i, c) ∈ (I_0 × [log m]) \ {(i_0, c_0)}, v a gate of C_U}. Hence, since ~α satisfies {Row_i | i ∈ I_0 \ {i_0}}, it follows that ~β satisfies the axioms from Equations (4.5) and (4.6), for i ∈ I_0 \ {i_0}, c ∈ [log m], and v a gate of C_U.

We next analyze the case i = i_0. If i = i_0, then f_{i_0}(y_{i_0}^1, y_{i_0}^2, ..., y_{i_0}^{h log m}) = c_0, and hence [Control_{i_0} = c] is true if and only if c = c_0.

(a) if c ≠ c_0, then [Control_{i_0} = c] is false under ~β, and hence the axioms from Equations (4.4), (4.5) and (4.6) are all satisfied by ~β, for any vector ~a ∈ A such that a_{i_0} = 1, for all j ∈ [n], and for all gates v, v′, v″ of C_U such that v ← v′ ∗ v″, where ∗ ∈ {∧, ∨} is an instruction of C_U.

(b) if c = c_0, then z_{i_0,v}^{c_0} is set to 1 under ~β if and only if v evaluates to 1 on the input 1_{J_0}.


But v evaluates to 1 if and only if v′ ∗ v″ is 1, which is equivalent to z_{i_0,v′}^{c_0} ∗ z_{i_0,v″}^{c_0} being 1. Hence, the axioms from Equation (4.5) are satisfied by the assignment ~β, for all internal gates v, v′, v″ such that v = v′ ∗ v″, where ∗ ∈ {∧, ∨} is an instruction of C_U.

Since |J_0| < γ(U), it follows that v_fin evaluates to 0 when C_U is given the input 1_{J_0}, hence z_{i_0,v_fin}^{c_0} is set to 0 under ~β, and therefore the axioms from Equation (4.6) are satisfied by ~β.

We analyze in the following the axioms from Equation (4.4).

i. if ~a = ~v, then since ~v was chosen such that v_i = 0 for all i ∈ I_0, it follows that no axiom from Equation (4.4) appears in {Row_i | i ∈ I_0}.

ii. if ~a ≠ ~v, then

A. if j ∈ J_0, then the value in ~β corresponding to z_{i_0,p_j}^{c_0} is 1, and hence the axioms from Equation (4.4) are satisfied by ~β.

B. if j ∉ J_0, then F_j(x_j^1, x_j^2, ..., x_j^{h log m}) = ~v. Since ~a ≠ ~v, it follows that the predicate [Column_j = ~a] is false under ~β, and hence the axioms from Equation (4.4) are satisfied by ~β.

This ends the argument that ~β satisfies all axioms in {Row_i | i ∈ I_0}. Since ~β falsifies D_0, this contradicts the fact that {Row_i | i ∈ I_0} ⊃ D_0. Hence the proof of this part is finished.

3. We apply the standard argument with width-reducing restrictions. For this we observe that CNFs of this form behave well with respect to certain restrictions. Namely, let p ≤ log m and let R ⊆ [log m] be an arbitrary set of controls. Denote by R_{p,R} the set of all restrictions that arbitrarily assign Boolean values to p variables in every one of the groups {x_j^ν | ν ∈ [h log m]}, {y_i^ν | ν ∈ [h log m]}, with j ∈ [n], i ∈ [m], as well as to all variables z_{i,v}^c with c ∉ R. Then, for ρ ∈ R_{p,R}, every non-trivial clause in τ(U, h, k, F~, f~)⌈ρ, after a suitable re-enumeration of variables and controls, contains a subclause from τ(U, h, k, F~⌈ρ, (f~⌈ρ)⌈R) (the function (f_i⌈ρ)⌈R is obtained from f_i⌈ρ by restricting its domain to {y_i | f_i⌈ρ(y_i) ∈ R} and its range to R).

Pick ρ uniformly at random from R_{(log m)/2, R}, where R is picked uniformly at random among the subsets of [log m] of size (log m)/2. Then F_j⌈ρ and (f_i⌈ρ)⌈R will be (log m)/2-surjective. Therefore, by the proof of the previous part, for every refutation P of τ(U, h, k, F~, f~), ρ(P) will contain a clause of width Ω(log m · min{γ(U), (log m)/4}) with probability 1. It is easy to see that every clause of this width is killed by ρ with probability at least 1 − exp(−Ω(log m · min{γ(U), (log m)/4})). Therefore, the size of P must be at least exp(Ω(log m · min{γ(U), (log m)/4})), since otherwise a random restriction ρ would have killed all such clauses simultaneously with non-zero probability, which is impossible. ✷

We are now able to show the reduction from approximating the minimum hitting set problem to automatizing resolution.

Lemma 43 If either resolution or tree-like resolution is automatizable then there exist a constant h′ and an algorithm Φ working on pairs ⟨U, k⟩, where U is an instance of the minimum hitting set problem and k is an integer, such that:

1. the running time of Φ(⟨U, k⟩) is at most exp(O(k²)) · n^{O(1)};

2. if γ(U) ≤ k then Φ(⟨U, k⟩) = 1;

3. if γ(U) ≥ h′k then Φ(⟨U, k⟩) = 0.

Proof. In the context of Definition 24 for parameters h and k, let G : {0,1}^{h log m} → {0,1}^{log m} be a poly-time computable linear function which is log m-surjective. Consider arbitrary poly-time computable surjective functions Π : {0,1}^{log m} → A and π : {0,1}^{log m} → [log m], and let the CNF formula τ(U, h, k) := τ(U, h, k, F~, f~), where we take F_j := Π ∘ G and f_i := π ∘ G, for all i ∈ [m] and j ∈ [n].


Assume that resolution or tree-like resolution is automatizable. Then there exists an algorithm A that computes a refutation of τ(U, h, k) in time polynomial in S(τ(U, h, k)), the size of the shortest refutation of τ(U, h, k). Hence, the size of the refutation that algorithm A computes is also polynomial in S(τ(U, h, k)). Let s_A(U, k) be the size of the refutation of τ(U, h, k) computed by the automatizing algorithm. Then s_A(U, k) is an integer-valued function, and there exist absolute constants ǫ, h_0, h_1 > 0 such that:

• s_A(U, k) is computable in time (n · exp(γ(U)²))^{h_0};

• and

    m^{ǫ·min{γ(U), log m}} ≤ s_A(U, k) ≤ (n · m^{min{γ(U), log m}})^{h_1}.    (4.24)

Set the constant h in τ(U, h, k) to be such that

    h² > h_1(h + 1)/ǫ.    (4.25)

The algorithm Φ works as follows. Φ simulates (dn · m^k)^{h_0} steps in the computation of s_A(U, k), and it outputs:

• 1 if the computation halts within this time and its result s_A(U, k) satisfies the inequality s_A(U, k) ≤ (dn · m^k)^{h_1}, and

• 0 in all other cases.
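As an illustration only, the following Python sketch mirrors this description of Φ. The helper simulate_automatizer (a step-bounded simulator of the automatizing algorithm A on τ(U, h, k)) and the way the constants d, h_0, h_1 are supplied are assumptions of the sketch, not part of the thesis.

    # Sketch of the decision algorithm Phi (hypothetical helpers and constants).
    def phi(U, k, n, m, d, h0, h1, simulate_automatizer):
        """Output 1 when a small refutation of tau(U, h, k) is found quickly,
        and 0 otherwise. simulate_automatizer(instance, max_steps) is assumed
        to run A for at most max_steps steps and return the proof size
        s_A(U, k), or None if A has not halted within the budget."""
        budget = (d * n * m ** k) ** h0      # simulate (dn * m^k)^{h0} steps
        s = simulate_automatizer((U, k), budget)
        if s is not None and s <= (d * n * m ** k) ** h1:
            return 1                         # computation halted, proof small
        return 0                             # timeout, or proof too large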

In the following we show that Φ satisfies all the requirements of the lemma, where we choose h′ = h.

1. Φ runs in time at most (dn · m^k)^{h_0}. By our definition of m (Equation 4.3), m^k ≤ max{2^{k²}, dn}^h. Hence the running time is at most n^{O(1)} · exp(O(k²)).

2. If γ(U) ≤ k then, since the simulation of Φ lasts (dn · m^k)^{h_0} steps, it follows that Φ will have computed the value of s_A(U, k), and since this value is at most (dn · m^k)^{h_1}, it follows that Φ(⟨U, k⟩) will output 1.


3. If γ(U) ≥ hk then, by Equation (4.24), it follows that s_A(U, k) ≥ m^{ǫ·min{γ(U), log m}}. Since log m ≥ hk, it follows that s_A(U, k) ≥ m^{ǫhk}.

• If k ≥ √(log dn), by Equation (4.25) it follows that

    s_A(U, k) > m^{kh_1} · m^{kh_1/h} = m^{kh_1} · 2^{k²h_1} ≥ m^{kh_1} · 2^{h_1 log dn} = (dn · m^k)^{h_1},    (4.26)

(the middle equality using Equation (4.3)), and hence Φ will output 0.

• If k < √(log dn), then since s_A(U, k) ≥ m^{ǫhk}, using Equation (4.25), it follows that

    s_A(U, k) > m^{kh_1} · m^{kh_1/h} = m^{kh_1} · 2^{h_1·(log dn/k)·k} = (dn · m^k)^{h_1},    (4.27)

(again using Equation (4.3)), and hence Φ outputs the value 0. ✷
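For completeness, the step from Equation (4.25) to the strict inequality s_A(U, k) > m^{kh_1} · m^{kh_1/h} used in both (4.26) and (4.27) is the following one-line computation with exponents:

\[
h^2 > \frac{h_1(h+1)}{\epsilon}
\;\Longleftrightarrow\;
\epsilon h k > \frac{h_1(h+1)}{h}\,k = k h_1 + \frac{k h_1}{h},
\]

and therefore s_A(U, k) ≥ m^{ǫhk} > m^{kh_1} · m^{kh_1/h}.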

Theorem 44 [AKP04] For all c ≥ 0 there exists θ < 1 such that Set Cover, for k = O(log n), cannot be approximated to within a factor of c in polynomial time unless NTIME(t(n)) ⊆ DTIME(2^{t(n)^θ}).

Corollary 45 If resolution is automatizable then there exists θ < 1 such that NTIME(t(n)) ⊆ DTIME(2^{t(n)^θ}).

Proof. Let U = {S_1, ..., S_q} be an instance of the hitting set problem such that the optimal solution is γ(U) = O(log n), where n is the number of elements in the collection U. By Lemma 43, it follows that if resolution is automatizable, then we can approximate the hitting set problem up to a constant factor: an automatizability algorithm for resolution gives us a constant factor approximation for the hitting set problem. A constant factor approximation for the hitting set problem gives a constant factor approximation for the set cover problem, since the two problems are equivalent. By Theorem 44, this implies that NTIME(t(n)) ⊆ DTIME(2^{t(n)^θ}), and the proof follows. ✷
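The equivalence invoked here is the standard duality between hitting set and set cover, which just transposes the element–set incidence relation; the following minimal Python sketch (illustrative, with hypothetical names) makes the translation concrete.

    # Hitting set <-> set cover by transposing the incidence relation:
    # point i of the new universe stands for the set S_i, and each ground
    # element e becomes the set {i : e in S_i}. A hitting set H corresponds
    # exactly to a cover {dual[e] : e in H} of the same size, and conversely.
    def hitting_set_to_set_cover(sets):
        """sets: list of collections over a ground set.
        Returns (universe, dual) of the equivalent set cover instance."""
        universe = set(range(len(sets)))
        elements = set().union(*(set(s) for s in sets))
        dual = {e: {i for i, s in enumerate(sets) if e in s} for e in elements}
        return universe, dual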

Chapter 5

On the automatizability of resolution refinements

Although resolution is the most widely studied approach to propositional theorem proving, many variants and refinements of resolution have been studied. The most common of these refinements are ordered resolution (also known as Davis-Putnam resolution), regular resolution, and linear resolution.

A regular resolution refutation of F is a resolution refutation such that on any path from Λ to a clause in F, no variable appears more than once as an arc-label. We call a regular resolution refutation ordered if every sequence of variables labeling a path from Λ to a clause in F respects the same ordering on the variables. A linear resolution refutation of F is a resolution refutation with the additional restriction that the underlying DAG must be linear. That is, the proof consists of a sequence of clauses C_1, C_2, ..., C_m such that C_m is the empty clause, and for every 1 ≤ i ≤ m, either C_i is an initial clause, or C_i is derived from C_{i−1} and an initial clause, or C_i is derived from C_{i−1} and C_j, for some j < i − 1. A more extensive presentation of resolution refinements can be found in [BEGJ00] or [BOP03].

Note that all these resolution refinements are weaker than resolution. Hence, any lower bound for resolution also holds for its refinements.


In particular, we can extend the result from Chapter 4 to those refinements of resolution for which the upper bound from Lemma 42 holds. Note that in the upper bound of Lemma 42, in the efficient part of the refutation, each derivation step uses an initial axiom, and no literal is resolved on more than once. Therefore, the refutation described in Lemma 42, part 1 is also a linear refutation, a regular refutation and an ordered refutation. It follows that neither linear resolution, nor regular resolution, nor ordered resolution is automatizable, unless SAT ⊆ DTIME(2^{n^ǫ}).

Lemma 46 The resolution proof given for the upper bound in the proof of Lemma 42 is also a regular, an ordered and a linear resolution proof.

Proof. Note that it is enough to prove that the derivation of the formula from Equation 4.18 is a regular, ordered and linear resolution derivation.

• The derivation is linear, since in each induction step in which we infer the CNF expansion of the formula from Equation 4.9, we resolve the previously obtained CNF formula from Equation 4.9, for µ = t, t − 1, ..., γ(U), with the CNF formula from Equation 4.11, 4.12, or 4.14. All of these are initial axioms, and hence the derivation of the CNF formula from Equation 4.16 is linear. This formula is further resolved with the formula from Equation 4.17, for µ = γ(U), γ(U) − 1, ..., 1. Note that the formula from Equation 4.17 is an initial axiom, and hence the whole derivation of 4.18 is linear.

• The derivation is regular, since every derivation step resolves on a different variable. In the first part of the derivation we resolve on the variables z_{i,v_µ}^c, for µ = t, t − 1, ..., γ(U), and afterward we resolve on the variables z_{i,p_µ}^c, for µ = γ(U), γ(U) − 1, ..., 1. Hence, every step resolves on a different variable, and the whole derivation of the formula from Equation 4.18 is regular.

• The derivation is ordered, since it is regular and linear.

✷
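Since regularity is a purely syntactic condition on the refutation DAG, it can be checked mechanically. The following Python sketch assumes a hypothetical representation in which children[v] gives the two premises of a derived clause v and resolved_var[v] gives the variable eliminated at v; it is illustrative only (and may take exponential time on adversarial DAGs, since it enumerates paths).

    # Check regularity: on every path from the empty clause (the root) down
    # to an initial clause, no variable is resolved on more than once.
    def is_regular(root, children, resolved_var):
        def dfs(node, seen):
            if node not in children:          # initial clause: path ends here
                return True
            var = resolved_var[node]
            if var in seen:                   # variable repeats on this path
                return False
            return all(dfs(child, seen | {var}) for child in children[node])
        return dfs(root, frozenset())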

Corollary 47 If either regular, ordered, or linear resolution is automatizable, then SAT ⊆ DTIME(2^{n^ǫ}).

Proof. By Lemma 46, it follows that the upper bound from Lemma 42 also holds for the refinements of resolution mentioned above. Since these refinements are weaker than resolution, every lower bound for resolution also holds for regular, ordered and linear resolution. Hence, Lemma 42 also holds for these refinements. It further follows that Lemma 43 also holds if regular, ordered or linear resolution is automatizable:

Fact 48 If either regular, ordered, or linear resolution is automatizable then there exist a constant h′ and an algorithm Φ working on pairs ⟨U, k⟩, where U is an instance of the minimum hitting set problem and k is an integer, such that:

1. the running time of Φ(⟨U, k⟩) is at most exp(O(k²)) · n^{O(1)};

2. if γ(U) ≤ k then Φ(⟨U, k⟩) = 1;

3. if γ(U) ≥ h′k then Φ(⟨U, k⟩) = 0.

Now, let U = {S_1, ..., S_q} be an instance of the hitting set problem such that the optimal solution is γ(U) = O(log n), where n is the number of elements in the collection U. By Fact 48, it follows that if either regular, ordered, or linear resolution is automatizable, then we can approximate the hitting set problem up to a constant factor: an automatizability algorithm for regular, ordered or linear resolution gives us a constant factor approximation for the hitting set problem. A constant factor approximation for the hitting set problem gives a constant factor approximation for the set cover problem, since the two problems are equivalent. By Theorem 44, this implies that NTIME(t(n)) ⊆ DTIME(2^{t(n)^θ}), and the proof follows. ✷

Chapter 6

Conclusion, Open Questions

In this chapter we present what we consider to be the most interesting open questions regarding automatizability results for proof systems and their connections to PAC-learning.

Cutting Planes Automatizability

Cutting planes is a refutation proof system that works with inequalities of the form

    a_1 x_1 + a_2 x_2 + ... + a_n x_n ≥ A,    (6.1)

where a_i, A ∈ Z. It has four sound rules of inference: i) basic algebraic simplification of sums and products of integers; ii) addition of two inequalities; iii) multiplication of an inequality by a nonnegative integer; and iv) division: if c divides each a_i, then we can derive Σ_i (a_i/c) x_i ≥ ⌈A/c⌉ from Σ_i a_i x_i ≥ A.

In order to use cutting planes as a refutation proof system for CNF formulas, we must first translate CNF formulas into a system of linear inequalities. Define R(x) = x and R(¬x) = 1 − x. A clause (ℓ_1 ∨ ... ∨ ℓ_k) is translated into the linear inequality Σ_{i=1}^{k} R(ℓ_i) ≥ 1. The system of linear inequalities E(f) corresponding to a CNF formula f is the set of inequalities we obtain by translating each clause in f, together with the inequalities x ≥ 0 and −x ≥ −1 for all variables x in f.


A cutting planes refutation of a CNF formula f is defined to be a directed acyclic graph where each node is labeled with a particular linear inequality, and such that: i) each leaf formula is a linear inequality in E(f); ii) intermediate formulas follow from two previous inequalities by one of the cutting planes rules; and iii) the final (root) inequality is 0 ≥ 1. A tree-like cutting planes proof is a cutting planes proof where the underlying directed acyclic graph is a tree. It is not known whether tree-like cutting planes proofs are strictly weaker than general cutting planes proofs. From the point of view of the proof system hierarchy, we know that tree-like cutting planes is strictly stronger than DPLL and that cutting planes is strictly stronger than resolution.

It is still open whether the (tree-like) cutting planes system is automatizable or not. The interpolation argument does not work, since cutting planes has feasible interpolation [Kra97]. Using the same technique we used in Chapter 4 for resolution, we believe the result could be extended to something similar for the automatizability of tree-like cutting planes. The upper bound for the translation of the CNF formula τ(U, h, k, F~, f~) follows easily from Definition 24; however, the lower bound is more problematic, since the existence of an inequality with many terms does not directly imply a large proof.
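As a concrete illustration of the translation E(f), here is a minimal Python sketch; the encoding of a literal as a signed integer (+v for x_v, −v for ¬x_v) is an assumption of the sketch.

    # Translate a CNF into the linear system E(f) for cutting planes.
    # A clause is a list of nonzero ints: +v for x_v, -v for (not x_v).
    # An inequality is (coeffs, A) meaning sum_v coeffs[v]*x_v >= A.
    def cnf_to_inequalities(clauses, num_vars):
        system = []
        for clause in clauses:
            coeffs = {v: 0 for v in range(1, num_vars + 1)}
            threshold = 1
            for lit in clause:
                v = abs(lit)
                if lit > 0:
                    coeffs[v] += 1           # R(x) = x
                else:
                    coeffs[v] -= 1           # R(not x) = 1 - x
                    threshold -= 1           # move the constant 1 across
            system.append((coeffs, threshold))
        for v in range(1, num_vars + 1):     # 0 <= x_v <= 1
            system.append(({v: 1}, 0))       # x_v >= 0
            system.append(({v: -1}, -1))     # -x_v >= -1
        return system

For example, the clause (x_1 ∨ ¬x_2) is translated to x_1 − x_2 ≥ 0.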

Automatizability of Resolution Refinements

We have seen some refinements of resolution in Chapter 5. More refinements exist, however. A negative resolution refutation of F is a resolution refutation with the additional restriction that all resolution steps must be negative. A resolution step deriving C ∨ D from C ∨ x and D ∨ ¬x is negative whenever D contains only negative literals. Positive resolution is the dual of negative resolution, where one of the two premises in each resolution step must contain only positive literals. More generally, given a formula F over n variables and an assignment α ∈ {0, 1}^n, an α-refutation of F is a resolution refutation such that, when two clauses are resolved, at least one of them must be falsified by α. A refutation of F is called semantic if it is an α-refutation for some α ∈ {0, 1}^n. An open question is whether these resolution refinements are automatizable, or at least to find a complexity argument for their non-automatizability, similar to the one for resolution and its refinements from Chapters 4 and 5.
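The α-restriction is easy to state operationally; the following small Python sketch (with clauses as sets of signed integers, an assumed encoding) checks the side condition on a single resolution step. Note that negative resolution is the special case α = 1^n: a clause is falsified by the all-ones assignment exactly when all of its literals are negative.

    # Check the alpha-restriction on one resolution step: at least one of the
    # two premises must be falsified by the fixed assignment alpha.
    # A clause is a set of signed ints (+v / -v); alpha maps v to 0 or 1.
    def falsified(clause, alpha):
        return all(alpha[abs(lit)] == (0 if lit > 0 else 1) for lit in clause)

    def is_alpha_step(premise1, premise2, alpha):
        return falsified(premise1, alpha) or falsified(premise2, alpha)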


Resolution Automatizability

Resolution is automatizable in time exponential in √(n log S), where S is the size of the shortest refutation proof [BSW01]. On the other hand, it is very unlikely that resolution is automatizable in time polynomial in S. However, no lower bound between polynomial in S and exponential in √(n log S) is known so far. An open question is to prove a lower bound for resolution automatizability somewhere in this interval.
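The upper bound of [BSW01] rests on the width–size relationship: a small refutation implies a refutation of width O(√(n log S)), so it suffices to search through clauses of bounded width. A minimal Python sketch of bounded-width saturation (illustrative; a practical implementation would index clauses by variable rather than scan all pairs):

    from itertools import combinations

    # Derive every clause of width <= w obtainable by resolution from
    # `clauses` (each clause a collection of signed ints). If the input has
    # a refutation of width <= w, the empty clause is found; trying
    # w = 1, 2, ... gives the n^{O(w)} search behind the [BSW01] bound.
    def saturate(clauses, w):
        derived = {frozenset(c) for c in clauses}
        changed = True
        while changed:
            changed = False
            for c1, c2 in combinations(list(derived), 2):
                for lit in c1:
                    if -lit in c2:
                        resolvent = (c1 - {lit}) | (c2 - {-lit})
                        if len(resolvent) <= w and resolvent not in derived:
                            derived.add(resolvent)
                            changed = True
        return frozenset() in derived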

Automatizability vs. PAC-learning

One open question is to find more examples of similarities between learning algorithms and proof system automatizability, with the purpose of establishing connections between learning and automatizability in general. The search problem associated with learning a concept class is connected with the search problem associated with the automatizability of some proof system. An open question is to find a general correspondence between a proof system and a concept class, such that results for the automatizability of the proof system can be translated into results for PAC-learning of the corresponding concept class, and vice versa.

Bibliography

[ABMP98] M. Alekhnovich, S. Buss, S. Moran, and T. Pitassi. Minimum propositional proof length is NP-hard to linearly approximate. In Luboš Brim, Jozef Gruska, and Jiří Zlatuška, editors, Proceedings of the 23rd International Symposium on the Mathematical Foundations of Computer Science, volume 1450 of Lecture Notes in Computer Science, pages 176–184. Springer-Verlag, 1998.

[AKP04] M. Alekhnovich, S. Khot, and T. Pitassi. Inapproximability of fixed parameter problems. Manuscript, 2004.

[Alo95] N. Alon. Tools from higher algebra. In Handbook of Combinatorics, volume 2, chapter 32. Elsevier, 1995.

[AR01] M. Alekhnovich and A. Razborov. Resolution is not automatizable unless W[P] is tractable. In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science, pages 210–219, 2001.

[BDG+99] M. Bonet, C. Domingo, R. Gavaldà, A. Maciel, and T. Pitassi. Non-automatizability of bounded-depth Frege proofs. In Proceedings of the IEEE Conference on Computational Complexity, pages 15–23, Atlanta, GA, 1999.

[BEGJ00] M. L. Bonet, J. L. Esteban, N. Galesi, and J. Johannsen. On the relative complexity of Resolution refinements and Cutting Planes proof systems. SIAM Journal on Computing, 30(5):1462–1484, 2000.

[BMOS03] N. H. Bshouty, E. Mossel, R. O'Donnell, and R. A. Servedio. Learning DNF from random walks. In Proceedings of the 44th Annual Symposium on Foundations of Computer Science, pages 189–198, 2003.

[BOP03] J. Buresh-Oppenheim and T. Pitassi. The complexity of Resolution refinements. In Proceedings of the 18th IEEE Symposium on Logic in Computer Science, page 138, 2003.

[BPR00] M. L. Bonet, T. Pitassi, and R. Raz. On interpolation and automatization for Frege systems. SIAM Journal on Computing, 29(6):1939–1967, 2000.

[BSW01] E. Ben-Sasson and A. Wigderson. Short proofs are narrow – Resolution made simple. Journal of the ACM, 48(2):149–169, 2001.

[CEI96] M. Clegg, J. Edmonds, and R. Impagliazzo. Using the Groebner basis algorithm to find proofs of unsatisfiability. In Proceedings of the 28th ACM Symposium on Theory of Computing, pages 174–183, 1996.

[CR79] S. A. Cook and R. A. Reckhow. The relative efficiency of propositional proof systems. Journal of Symbolic Logic, 44:36–50, 1979.

[DF98] R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer-Verlag, 1998.

[DLL62] M. Davis, G. Logemann, and D. Loveland. A machine program for theorem proving. Communications of the ACM, 5(7):394–397, 1962.

[EH88] A. Ehrenfeucht and D. Haussler. Learning decision trees from random examples. In Proceedings of the 1st Workshop on Computational Learning Theory, pages 182–194. Morgan Kaufmann, San Mateo, CA, 1988.

[Hak85] A. Haken. Intractability of Resolution. Theoretical Computer Science, 39:297–308, 1985.

[Kha93] M. Kharitonov. Cryptographic hardness of distribution-specific learning. In Proceedings of the 25th ACM Symposium on the Theory of Computing, pages 372–381. ACM Press, New York, NY, 1993.

[KP95] J. Krajíček and P. Pudlák. Some consequences of cryptographical conjectures for S^1_2 and EF. In Logic and Computational Complexity, volume 960 of Lecture Notes in Computer Science, pages 210–220. Springer-Verlag, New York, 1995.

[Kra97] J. Krajíček. Interpolation theorems, lower bounds for proof systems and independence results for bounded arithmetic. Journal of Symbolic Logic, 62:457–486, 1997.

[KS01] A. Klivans and R. Servedio. Learning DNF in time 2^{Õ(n^{1/3})}. In Proceedings of the Thirty-Third Annual Symposium on Theory of Computing, pages 258–265, 2001.

[KV94a] M. J. Kearns and L. G. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.

[KV94b] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.

[Mun82] D. Mundici. Complexity of Craig's interpolation. Fundamenta Informaticae, 5:261–278, 1982.

[Mun83] D. Mundici. A lower bound for the complexity of Craig's interpolants in sentential logic. Archiv für Mathematische Logik und Grundlagenforschung, 23:27–36, 1983.

[Mun84] D. Mundici. Tautologies with a unique Craig interpolant, uniform vs. non-uniform complexity. Annals of Pure and Applied Logic, 27:265–273, 1984.

[PS98] P. Pudlák and J. Sgall. Algebraic models of computation and interpolation for algebraic proof systems. In S. Buss, editor, Proof Complexity and Feasible Arithmetic, volume 39 of Lecture Notes in Computer Science, pages 279–295. Springer-Verlag, New York, 1998.

[Pud97] P. Pudlák. Lower bounds for Resolution and Cutting Planes proofs and monotone computations. Journal of Symbolic Logic, 62:981–998, 1997.

[PV88] L. Pitt and L. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35:965–984, 1988.

[Raz95] A. Razborov. Unprovability of lower bounds on the circuit size in certain fragments of bounded arithmetic. Izvestiya Mathematics, 59(1):205–227, 1995.

[Urq95] A. Urquhart. The complexity of propositional proofs. Bulletin of Symbolic Logic, 1:425–467, 1995.

[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[Ver90] K. Verbeurgt. Learning DNF under the uniform distribution in quasi-polynomial time. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 314–326, 1990.
