Conditional ML Estimation Using Rational Function Growth Transform


Ciprian Chelba and Alex Acero, Microsoft Research, Redmond, U.S.A.

Abstract

Conditional Maximum Likelihood (CML) estimation of probability models by means of the rational function growth transform (RFGT) [GKNN91]. Model parameter values (probabilities) properly normalized at each iteration. Discriminatively train a Naïve Bayes (NB) classifier; same procedure is at the basis of discriminative training of HMMs in speech recognition [Nor91].

✔ Simple modification of the reestimation eqns. increases convergence speed significantly over a straightforward implementation.

✔ Reduces class error rate over ML classifier by 40% relative.

Experiments: Convergence Speed

[Figure: three panels plotted against iteration number (0 to 300): "Conditional Likelihood: Training Data" (LogL), "Conditional Likelihood: Held-Out Data" (LogL) and "Class Error Rate: Held-Out Data" (CER). Legend entries: total (min=0.1), ML count (min=0.1), ML count (min=0.075).]

Smoothing and Initialization

Model parameters θ are initialized with maximum likelihood estimates smoothed using MAP:

$$\theta_{ky} = \lambda_y \cdot \frac{\#(k,y)}{\#(y)} + \bar{\lambda}_y \cdot \frac{1}{2}, \qquad \lambda_y = \frac{\#(y)}{\#(y) + \alpha}, \qquad \bar{\lambda}_y = 1 - \lambda_y \tag{3}$$

Model in Eq. (1) is re-parameterized:

$$P(y|x;\theta) = Z(x;\theta)^{-1} \cdot \theta_y \prod_{k=1}^{F} \left(\lambda_y \cdot \theta_{ky} + \bar{\lambda}_y \cdot \tfrac{1}{2}\right)^{f_k(x)} \left(\lambda_y \cdot \bar{\theta}_{ky} + \bar{\lambda}_y \cdot \tfrac{1}{2}\right)^{\bar{f}_k(x)} \tag{4}$$

✔ conditional likelihood $H(\mathcal{T};\theta) = \prod_{j=1}^{T} P(y_j|x_j)$ is still a ratio of polynomials
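The MAP-smoothed initialization of Eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `map_smoothed_init`, the array layout and the default `alpha` are hypothetical. It assumes #(y) counts the training samples of class y and #(k, y) counts those among them with f_k(x) = 1. Initializing θ_y as the relative class frequency is also an assumption, since the poster gives the formula only for θ_ky.

```python
import numpy as np

def map_smoothed_init(features, labels, num_classes, alpha=1.0):
    """Sketch of Eq. (3). features: (T, F) binary array; labels: (T,) class ids."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # #(y): number of training samples per class (assumed meaning of the count)
    class_count = np.bincount(labels, minlength=num_classes).astype(float)
    # #(k, y): number of class-y samples in which feature k fires (assumed meaning)
    joint_count = np.zeros((features.shape[1], num_classes))
    for y in range(num_classes):
        joint_count[:, y] = features[labels == y].sum(axis=0)
    lam = class_count / (class_count + alpha)            # lambda_y
    # ML estimate #(k, y) / #(y); empty classes get 0, their weight lambda_y is 0 anyway
    ml = np.divide(joint_count, class_count,
                   out=np.zeros_like(joint_count), where=class_count > 0)
    theta_ky = lam * ml + (1.0 - lam) * 0.5              # Eq. (3)
    theta_y = class_count / class_count.sum()            # assumption: relative frequency
    return theta_y, theta_ky, lam
```

With a positive `alpha`, every θ_ky stays strictly between 0 and 1, so no feature probability starts at exactly 0 or 1.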

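Model Parameter Updates

The re-estimation equations take the form:

$$\hat{\theta}_{ky} = N_{ky}^{-1}\, \theta_{ky} \left( \frac{\partial \log H(\mathcal{T};\theta)}{\partial \theta_{ky}} + C_\theta \right) \tag{5}$$

where $C_\theta > 0$ satisfies:

$$\frac{\partial \log H(\mathcal{T};\theta)}{\partial \theta_{ky}} + C_\theta > \epsilon, \quad \forall k \text{ and } y$$

Conditional Naïve Bayes Models

• Conditional probability $P(y|x)$, $y \in \mathcal{Y}$, $x \in \mathcal{X}$

• Feature set $\mathcal{F} = \{f_k(x) : \mathcal{X} \to \{0, 1\},\ \bar{f}_k(x) = 1 - f_k(x),\ k = 1 \ldots F\}$

Naïve Bayes model for the feature vector and the predicted variable $(f(x), y)$:

$$P(y|x;\theta) = Z(x;\theta)^{-1} \cdot \theta_y \prod_{k=1}^{F} \theta_{ky}^{f_k(x)}\, \bar{\theta}_{ky}^{\bar{f}_k(x)} \tag{1}$$

where: $\bar{f}_k(x) = 1 - f_k(x)$; $\theta_y \geq 0,\ \forall y \in \mathcal{Y}$, $\sum_{y \in \mathcal{Y}} \theta_y = 1$; $\theta_{ky} \geq 0$, $\bar{\theta}_{ky} \geq 0$, $\theta_{ky} + \bar{\theta}_{ky} = 1$, $\forall k = 1 \ldots F,\ y \in \mathcal{Y}$; and $Z(x;\theta) = \sum_y P(f(x), y)$ is a normalization term.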
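As a rough illustration of Eq. (1), the sketch below evaluates P(y|x; θ) for one binary feature vector. The function name `nb_posterior` and the (F, classes) layout of `theta_ky` are hypothetical, and all parameters are assumed to lie strictly between 0 and 1 so that the logarithms are finite. Working in the log domain is a numerical convenience of this sketch, not something the poster specifies.

```python
import numpy as np

def nb_posterior(f, theta_y, theta_ky):
    """Sketch of Eq. (1). f: (F,) binary vector; theta_y: (C,); theta_ky: (F, C)."""
    f = np.asarray(f, dtype=float)
    theta_y = np.asarray(theta_y, dtype=float)
    theta_ky = np.asarray(theta_ky, dtype=float)
    # log of theta_y * prod_k theta_ky^{f_k} * (1 - theta_ky)^{1 - f_k}
    log_joint = (np.log(theta_y)
                 + f @ np.log(theta_ky)
                 + (1.0 - f) @ np.log(1.0 - theta_ky))
    log_joint -= log_joint.max()          # stabilize before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()            # division by Z(x; theta)

# Usage: one feature, two classes
print(nb_posterior([1], [0.5, 0.5], [[0.8, 0.2]]))
```

Because θ̄_ky = 1 − θ_ky, only θ_ky needs to be stored; the complementary parameter is recomputed on the fly.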

Equivalence with Exponential Models

Setting: $f_k(x, y) = f_k(x) \cdot \delta(y)$; $\lambda_{ky} = \log\left(\theta_{ky} / \bar{\theta}_{ky}\right)$; $\lambda_{0y} = \log\left(\theta_y \cdot \prod_{k=1}^{F} \bar{\theta}_{ky}\right)$ and $f_0(x, y) = f_0(y)$

• δ() Kronecker delta operator, $\delta(y, y_i) = 1$ for $y = y_i$, 0 otherwise

☞ log-linear model (maximum entropy probability estimation [BPP96]):

$$P(y|x;\lambda) = Z(x;\lambda)^{-1} \cdot \exp\left( \sum_{k=0}^{F} \lambda_{ky} f_k(x, y) \right) \tag{2}$$

where $\lambda_{ky}$ are free real-valued parameters and $Z(x;\lambda)$ is the normalization term.

Experiments: Classification Performance

Training Objective Function   Smoothing      Class Error Rate (%)
ML                            smoothed       11.3
CML                           not smoothed   8.5
CML                           smoothed       6.7
MaxEnt                        smoothed       4.9

Rational Function Growth Transform for CML Estimation of Naïve Bayes Models

Estimate $\theta = \{\theta_y, \theta_{ky}, \bar{\theta}_{ky} : y \in \mathcal{Y},\ k = 1 \ldots F\} \in \Theta$ such that the conditional likelihood of training data $\mathcal{T} = \{(x_1, y_1) \ldots (x_T, y_T)\}$ is maximized:

$$\theta^* = \arg\max_{\theta} H(\mathcal{T};\theta), \qquad H(\mathcal{T};\theta) = \prod_{j=1}^{T} P(y_j|x_j)$$

Algorithmic Refinements

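Normalize by the training data size $T$:

$$\beta_\theta = \frac{\gamma_\theta}{T}$$

☞ Dependency on Class Count: make Eqn. (6) dimensionally correct:

$$\beta_\theta = \frac{\zeta_\theta(y)}{\#(y)}$$

– Considerable convergence speedup
– Setting $\zeta_\theta(y) = \frac{\#(y)}{T} \cdot \gamma_\theta$ results in identical updates under either γ or ζ.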
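The two normalizations above differ only in what β_θ is divided by, which a short sketch makes concrete. The function names `beta_total` and `beta_class_count` are hypothetical and the numbers in the usage lines are made up for illustration; nothing here comes from the authors' code.

```python
import numpy as np

def beta_total(gamma, num_samples):
    """beta = gamma / T: one value shared by all classes."""
    return gamma / float(num_samples)

def beta_class_count(zeta, class_count):
    """beta(y) = zeta(y) / #(y): one value per class."""
    return np.asarray(zeta, dtype=float) / np.asarray(class_count, dtype=float)

# Usage with illustrative numbers: two classes with 2 and 6 samples
class_count = np.array([2.0, 6.0])
T = class_count.sum()
gamma = 0.4
zeta = class_count / T * gamma            # the setting that makes both schemes agree
print(beta_total(gamma, T), beta_class_count(zeta, class_count))
```

If ζ is instead held constant across classes, classes with fewer training samples receive a larger β, which is where the two schemes start to differ.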

✔ CML trained classifier reduces CER by 40% relative over ML counterpart
✔ although equivalent with MaxEnt, does not result in same performance
  – different smoothing (Gaussian prior for MaxEnt [CR00])
  – objective function (conditional log-likelihood) is not convex in parameter values for Naïve Bayes (NB) model whereas it is convex for MaxEnt model
  – one extra free parameter in NB versus MaxEnt

Acknowledgements

Thanks to Milind Mahajan and Asela Gunawardana for useful comments and discussions.

References

[BPP96] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–72, March 1996.

[CA04] Ciprian Chelba and Alex Acero. Conditional maximum likelihood estimation of Naïve Bayes probability models. Technical Report to appear, Microsoft Research, Redmond, WA, 2004.

[CR00] Stanley F. Chen and Ronald Rosenfeld. A survey of smoothing techniques for maximum entropy models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50, 2000.

$H(\mathcal{T};\theta)$ is a ratio of two polynomials with real coefficients, each defined over a set Θ of probability distributions:

$$\Theta = \{\theta : \theta_y \geq 0,\ \forall y \in \mathcal{Y} \text{ and } \sum_y \theta_y = 1;\ \theta_{ky} \geq 0,\ \bar{\theta}_{ky} \geq 0 \text{ and } \theta_{ky} + \bar{\theta}_{ky} = 1,\ \forall y \in \mathcal{Y},\ \forall k\}$$

✔ iteratively estimate model parameters using growth transform for rational functions on the domain Θ [GKNN91].

Experiments: Setup

✔ ATIS II + III type A utterances (can be interpreted w/o context)
• type A utterances: TRAIN/DEV/TEST: 5,822/410/914 utterances (74,442/5,326/10,673 words)
• class label assigned from SQL query (14 classes)
• word-vocabulary: 780, OOV 0.24%

✔ about 10-fold speed-up as measured by increase in log-likelihood and decrease in class error rate (CER) on held-out data

In the re-estimation equations (5), $\epsilon > 0$ is suitably chosen; see [GKNN91] and [CA04] for details. Calculating the partial derivatives in Eq. (5) we obtain:

$$\hat{\theta}_{ky} = N_{ky}^{-1} \cdot \theta_{ky} \cdot \left\{ 1 + \beta_\theta\, \frac{\lambda_y}{\lambda_y \cdot \theta_{ky} + \bar{\lambda}_y \cdot \frac{1}{2}} \sum_{i=1}^{T} f_k(x_i) \left[ \delta(y, y_i) - p(y|x_i;\theta) \right] \right\} \tag{6}$$

• $N_{ky}^{-1}$: normalization constant that ensures $\hat{\theta}_{ky} + \hat{\bar{\theta}}_{ky} = 1$
• $\beta_\theta = 1/C_\theta$

☞ Dependency on Training Data Size: the sum in Eq. (6) runs over all $T$ training samples, so its magnitude grows with $T$.
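One iteration of the update in Eq. (6) might look as follows. This is a sketch under stated assumptions rather than the authors' code: the names `smoothed_probs`, `posterior` and `rfgt_step` are hypothetical, `theta_ky` is stored as an (F, classes) array with θ̄_ky = 1 − θ_ky, the update for θ̄_ky is taken to be the analogue of Eq. (6) with f̄_k and θ̄_ky, and θ_y is held fixed because the poster shows no update for it. The value of `beta` is left to the caller; the check on the growth factors stands in for the condition on C_θ.

```python
import numpy as np

def smoothed_probs(theta_ky, lam):
    """Smoothed probabilities of f_k = 1 and f_k = 0 from Eq. (4); shape (F, C)."""
    p1 = lam * theta_ky + (1.0 - lam) * 0.5
    p0 = lam * (1.0 - theta_ky) + (1.0 - lam) * 0.5
    return p1, p0

def posterior(X, theta_y, theta_ky, lam):
    """P(y | x_i; theta) for every row of the (T, F) binary matrix X."""
    p1, p0 = smoothed_probs(theta_ky, lam)
    log_joint = np.log(theta_y) + X @ np.log(p1) + (1.0 - X) @ np.log(p0)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)

def rfgt_step(X, labels, theta_y, theta_ky, lam, beta):
    """One re-estimation of theta_ky following Eq. (6); beta may be scalar or per class."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    theta_y = np.asarray(theta_y, dtype=float)
    theta_ky = np.asarray(theta_ky, dtype=float)
    lam = np.asarray(lam, dtype=float)
    p1, p0 = smoothed_probs(theta_ky, lam)
    post = posterior(X, theta_y, theta_ky, lam)
    # delta(y, y_i) - p(y | x_i; theta), shape (T, C)
    resid = np.eye(theta_y.shape[0])[labels] - post
    grow1 = theta_ky * (1.0 + beta * lam / p1 * (X.T @ resid))
    grow0 = (1.0 - theta_ky) * (1.0 + beta * lam / p0 * ((1.0 - X).T @ resid))
    if np.any(grow1 <= 0.0) or np.any(grow0 <= 0.0):
        raise ValueError("beta is too large: growth factors must stay positive")
    # Division by N_ky = grow1 + grow0 makes theta_ky and its complement sum to 1
    return grow1 / (grow1 + grow0)
```

Passing a per-class array for `beta` gives the class-count normalization, while a scalar gives the normalization by training data size.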

[GKNN91] P. S. Gopalakrishnan, Dimitri Kanevsky, Arthur Nadas, and David Nahamoo. An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory, 37(1):107–113, January 1991.

[Nor91] Y. Normandin. Hidden Markov Models, Maximum Mutual Information Estimation and the Speech Recognition Problem. PhD thesis, McGill University, Montreal, 1991.
