Conditional ML Estimation Using Rational Function Growth Transform

Ciprian Chelba and Alex Acero
Microsoft Research, Redmond, U.S.A.

Abstract

Conditional Maximum Likelihood (CML) estimation of probability models by means of the rational function growth transform (RFGT) [GKNN91]. Model parameter values (probabilities) are properly normalized at each iteration. Discriminatively train a Naïve Bayes (NB) classifier; the same procedure is at the basis of discriminative training of HMMs in speech recognition [Nor91].

✔ Simple modification of the re-estimation equations increases convergence speed significantly over a straightforward implementation.

✔ Reduces class error rate over ML classifier by 40% relative.

Experiments: Convergence Speed

[Figure: three panels plotted against iteration number (0 to 300): "Conditional Likelihood: Training Data" (y-axis: LogL), "Conditional Likelihood: Held-Out Data" (y-axis: LogL) and "Class Error Rate: Held-Out Data". Curves: total (min=0.1), ML count (min=0.1), ML count (min=0.075).]

Smoothing and Initialization

Model parameters θ are initialized with maximum likelihood estimates smoothed using MAP:

$$\lambda_y = \frac{\#(y)}{\#(y) + \alpha}, \qquad \bar\lambda_y = 1 - \lambda_y, \qquad \theta_{ky} = \lambda_y \cdot \frac{\#(k,y)}{\#(y)} + \bar\lambda_y \cdot \frac{1}{2} \tag{3}$$

Model in Eq. (1) is re-parameterized:

$$P(y|x;\theta) = Z(x;\theta)^{-1} \cdot \theta_y \prod_{k=1}^{F} \left(\lambda_y \cdot \theta_{ky} + \bar\lambda_y \cdot \frac{1}{2}\right)^{f_k(x)} \left(\lambda_y \cdot \bar\theta_{ky} + \bar\lambda_y \cdot \frac{1}{2}\right)^{\bar f_k(x)} \tag{4}$$

✔ conditional likelihood $H(T;\theta) = \prod_{j=1}^{T} P(y_j|x_j)$ is still a ratio of polynomials
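The MAP-smoothed initialization of Eq. (3) can be sketched as follows. This is a minimal illustration assuming NumPy, binary features stored as a T x F matrix, and that #(y) and #(k, y) are training-set counts of class y and of feature k firing together with class y. The function name `map_smoothed_init`, the toy data and the value of α are hypothetical.

```python
import numpy as np

def map_smoothed_init(X, y, num_classes, alpha):
    """Return (lambda_y, theta_ky) following Eq. (3).

    X: (T, F) binary feature matrix, y: (T,) integer class labels.
    alpha: MAP smoothing weight, must be > 0 (value not given in the source).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    count_y = np.bincount(y, minlength=num_classes).astype(float)  # #(y)
    count_ky = X.T @ np.eye(num_classes)[y]                        # #(k, y), shape (F, Y)
    lam = count_y / (count_y + alpha)                              # lambda_y
    # Relative frequency #(k, y) / #(y); left at 0 for unseen classes,
    # where lambda_y = 0 makes the estimate fall back to 1/2 anyway.
    rel_freq = np.divide(count_ky, count_y,
                         out=np.zeros_like(count_ky), where=count_y > 0)
    theta_ky = lam * rel_freq + (1.0 - lam) * 0.5
    return lam, theta_ky

# Toy data (hypothetical): 4 samples, 2 binary features, 2 classes.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [0, 0, 1, 1]
lam, theta_ky = map_smoothed_init(X, y, num_classes=2, alpha=2.0)
print(lam)
print(theta_ky)
```

Interpolating with 1/2 keeps every $\theta_{ky}$ strictly inside (0, 1) even when a feature never fires for a class, so the logarithms of the model parameters stay finite.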

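Conditional Naïve Bayes Models

• Conditional probability $P(y|x)$, $y \in Y$, $x \in X$

• Feature set $F = \{f_k(x) : X \to \{0, 1\},\ \bar f_k(x) = 1 - f_k(x),\ k = 1 \ldots F\}$

Naïve Bayes model for the feature vector and the predicted variable $(f(x), y)$:

$$P(y|x;\theta) = Z(x;\theta)^{-1} \cdot \theta_y \prod_{k=1}^{F} \theta_{ky}^{f_k(x)}\, \bar\theta_{ky}^{\bar f_k(x)} \tag{1}$$

where: $\bar f_k(x) = 1 - f_k(x)$; $\theta_y \ge 0, \forall y \in Y$, $\sum_{y \in Y} \theta_y = 1$; $\theta_{ky} \ge 0$, $\bar\theta_{ky} \ge 0$, $\theta_{ky} + \bar\theta_{ky} = 1$, $\forall k = 1 \ldots F$, $y \in Y$; and $Z(x;\theta) = \sum_y P(f(x), y)$ is a normalization term.

Model Parameter Updates

The re-estimation equations take the form:

$$\hat\theta_{ky} = N_{ky}^{-1}\, \theta_{ky} \left( \frac{\partial \log H(T;\theta)}{\partial \theta_{ky}} + C_\theta \right) \tag{5}$$

where $C_\theta > 0$ satisfies:

$$\frac{\partial \log H(T;\theta)}{\partial \theta_{ky}} + C_\theta > \epsilon, \quad \forall k \text{ and } y$$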
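A minimal sketch of the posterior of Eq. (1) and of one re-estimation step in the form of Eq. (5) follows. It assumes NumPy, the unsmoothed parameterization of Eq. (1), class priors $\theta_y$ held fixed, and that $\bar\theta_{ky}$ is updated in the same way as $\theta_{ky}$ with a shared normalizer; none of these choices is confirmed by the source. For the model of Eq. (1) the derivative is $\partial \log H / \partial \theta_{ky} = \theta_{ky}^{-1} \sum_i f_k(x_i)[\delta(y, y_i) - P(y|x_i;\theta)]$. The function names, the toy data and the value of C are hypothetical.

```python
import numpy as np

def nb_posterior(X, theta_y, theta_ky):
    """P(y|x; theta) of Eq. (1) for each row of binary X (T, F); theta_ky is (F, Y)."""
    X = np.asarray(X, dtype=float)
    theta_y = np.asarray(theta_y, dtype=float)
    theta_ky = np.asarray(theta_ky, dtype=float)
    log_joint = (np.log(theta_y)
                 + X @ np.log(theta_ky)
                 + (1.0 - X) @ np.log(1.0 - theta_ky))       # (T, Y)
    log_joint -= log_joint.max(axis=1, keepdims=True)        # numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)          # division by Z(x; theta)

def cond_log_likelihood(X, y, theta_y, theta_ky):
    """log H(T; theta) = sum_j log P(y_j | x_j; theta)."""
    y = np.asarray(y, dtype=int)
    post = nb_posterior(X, theta_y, theta_ky)
    return float(np.log(post[np.arange(len(y)), y]).sum())

def growth_transform_step(X, y, theta_y, theta_ky, C, eps=1e-12):
    """One update of theta_ky in the form of Eq. (5); returns the new (F, Y) array."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    theta_ky = np.asarray(theta_ky, dtype=float)
    post = nb_posterior(X, theta_y, theta_ky)
    resid = np.eye(post.shape[1])[y] - post                  # delta(y, y_i) - P(y|x_i)
    g = (X.T @ resid) / theta_ky                             # d log H / d theta_ky
    g_bar = ((1.0 - X).T @ resid) / (1.0 - theta_ky)         # d log H / d theta_bar_ky
    if min(g.min(), g_bar.min()) + C <= eps:
        raise ValueError("C is too small: derivative + C must exceed eps")
    num = theta_ky * (g + C)
    num_bar = (1.0 - theta_ky) * (g_bar + C)
    return num / (num + num_bar)                             # N_ky enforces theta + theta_bar = 1

# Toy data (hypothetical): 6 samples, 3 binary features, 2 classes.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
y = [0, 0, 1, 1, 0, 1]
theta_y = [0.5, 0.5]
theta_ky = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
new_theta_ky = growth_transform_step(X, y, theta_y, theta_ky, C=1000.0)
print(cond_log_likelihood(X, y, theta_y, theta_ky))
print(cond_log_likelihood(X, y, theta_y, new_theta_ky))
```

A large C gives small, safe steps; smaller values move faster but must still keep every derivative plus C positive.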

Algorithmic Refinements

☞ Dependency on Training Data Size: normalize

$$\beta_\theta = \frac{\gamma_\theta}{T}$$

☞ Dependency on Class Count: make Eqn. (6) dimensionally correct

$$\beta_\theta = \frac{\zeta_\theta(y)}{\#(y)}$$

– Considerable convergence speedup
– Setting $\zeta_\theta(y) = \frac{\#(y)}{T} \cdot \gamma_\theta$ results in identical updates under either γ or ζ.

Equivalence with Exponential Models

Setting: $f_k(x, y) = f_k(x) \cdot \delta(y)$; $\lambda_{ky} = \log\left(\frac{\theta_{ky}}{\bar\theta_{ky}}\right)$; $\lambda_{0y} = \log\left(\theta_y \cdot \prod_{k=1}^{F} \bar\theta_{ky}\right)$ and $f_0(x, y) = f_0(y)$

☞ log-linear model (maximum entropy probability estimation [BPP96]):

$$P(y|x;\lambda) = Z(x;\lambda)^{-1} \cdot \exp\left( \sum_{k=0}^{F} \lambda_{ky} f_k(x, y) \right) \tag{2}$$

where $\lambda_{ky}$ are free real-valued parameters and $Z(x;\lambda)$ the normalization term.

• δ() Kronecker delta operator, $\delta(y, y_i) = 1$ for $y = y_i$, 0 otherwise

Experiments: Classification Performance

Training Objective Function   Smoothing      Class Error Rate (%)
ML                            smoothed       11.3
CML                           not smoothed   8.5
CML                           smoothed       6.7
MaxEnt                        smoothed       4.9

✔ CML trained classifier reduces CER by 40% relative over ML counterpart
✔ although equivalent with MaxEnt, does not result in same performance
– different smoothing (Gaussian prior for MaxEnt [CR00])
– objective function (conditional log-likelihood) is not convex in parameter values for Naïve Bayes (NB) model whereas it is convex for MaxEnt model
– one extra free parameter in NB versus MaxEnt

Rational Function Growth Transform for CML Estimation of Naïve Bayes Models

Estimate $\theta = \{\theta_y, \theta_{ky}, \bar\theta_{ky} : y \in Y, k = 1 \ldots F\} \in \Theta$ such that the conditional likelihood of training data $T = \{(x_1, y_1) \ldots (x_T, y_T)\}$ is maximized:

$$\theta^* = \arg\max_{\theta} H(T;\theta), \qquad H(T;\theta) = \prod_{j=1}^{T} P(y_j|x_j)$$
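As a worked step, assuming the model of Eq. (1), taking the logarithm of the conditional likelihood $H(T;\theta)$ separates the per-sample numerator terms from the normalization term:

```latex
\begin{align*}
\log H(T;\theta) &= \sum_{j=1}^{T} \log P(y_j|x_j;\theta) \\
&= \sum_{j=1}^{T} \Big[ \log\theta_{y_j}
   + \sum_{k=1}^{F} \big( f_k(x_j)\log\theta_{k y_j} + \bar f_k(x_j)\log\bar\theta_{k y_j} \big)
   - \log Z(x_j;\theta) \Big], \\
Z(x;\theta) &= \sum_{y\in Y} \theta_y \prod_{k=1}^{F} \theta_{ky}^{f_k(x)}\,\bar\theta_{ky}^{\bar f_k(x)} .
\end{align*}
```

The bracketed terms before $\log Z$ are those of the joint log-likelihood; the $-\log Z(x_j;\theta)$ term couples the parameters of all classes, which is what distinguishes the conditional objective from plain ML estimation.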

$H(T;\theta)$ is a ratio of two polynomials with real coefficients, each defined over a set Θ of probability distributions:

$$\Theta = \{\theta : \theta_y \ge 0, \forall y \in Y \text{ and } \sum_y \theta_y = 1;\ \theta_{ky} \ge 0,\ \bar\theta_{ky} \ge 0 \text{ and } \theta_{ky} + \bar\theta_{ky} = 1,\ \forall y \in Y, \forall k\}$$

✔ iteratively estimate model parameters using growth transform for rational functions on the domain Θ [GKNN91].

In the condition on $C_\theta$ for Eq. (5), $\epsilon > 0$ is suitably chosen; see [GKNN91] and [CA04] for details. Calculating the partial derivatives in Eq. (5) we obtain:

$$\hat\theta_{ky} = N_{ky}^{-1} \cdot \theta_{ky} \cdot \left\{ 1 + \beta_\theta\, \frac{\lambda_y}{\lambda_y \cdot \theta_{ky} + \bar\lambda_y \cdot \frac{1}{2}} \sum_{i=1}^{T} f_k(x_i)\,[\delta(y, y_i) - p(y|x_i;\theta)] \right\} \tag{6}$$

where:

• $N_{ky}^{-1}$ normalization constant that ensures $\hat\theta_{ky} + \hat{\bar\theta}_{ky} = 1$
• $\beta_\theta = 1/C_\theta$

Experiments: Setup

✔ ATIS II + III type A utterances (can be interpreted without context)
• type A utterances: TRAIN/DEV/TEST: 5,822/410/914 (74,442/5,326/10,673 words);
• class label assigned from SQL query (14 classes);
• word-vocabulary: 780, OOV 0.24%

✔ Convergence speed: about 10-fold speed-up as measured by increase in log-likelihood and decrease in class error rate (CER) on held-out data

Acknowledgements

Thanks to Milind Mahajan and Asela Gunawardana for useful comments and discussions.

References

[BPP96] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–72, March 1996.

[CA04] Ciprian Chelba and Alex Acero. Conditional maximum likelihood estimation of Naïve Bayes probability models. Technical Report to appear, Microsoft Research, Redmond, WA, 2004.

[CR00] Stanley F. Chen and Ronald Rosenfeld. A survey of smoothing techniques for maximum entropy models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50, 2000.

[GKNN91] P. S. Gopalakrishnan, Dimitri Kanevski, Arthur Nadas, and David Nahamoo. An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory, 37(1):107–113, January 1991.

[Nor91] Y. Normandin. Hidden Markov Models, Maximum Mutual Information Estimation and the Speech Recognition Problem. PhD thesis, McGill University, Montreal, 1991.
