Vowpal Wabbit 5.1

http://hunch.net/~vw/
John Langford, Yahoo! Research
(With help from Nikos Karampatziakis & Daniel Hsu)
git clone git://github.com/JohnLangford/vowpal_wabbit.git

Why VW?
1. There should exist an open source online learning system.
2. Online learning is online optimization, which matches or competes with best practice for many learning algorithms.
3. VW is a multitrick pony, all useful, many orthogonally composable. [hashing, caching, parallelizing, feature crossing, feature splitting, feature combining, etc...]
4. It's simple. No strange dependencies, currently only 6338 lines of code.

On RCV1, training time = ~3s [caching, pipelining]
On large scale learning challenge datasets: ≤ 10 minutes [caching]
[ICML 2009] 10^5-way personalized spam filter. [-q, hashing]
[UAI 2009] 10^6-way conditional probability estimation. [library, hashing]
[Rutgers grad] Gexample/day data feed. [daemon]
[Matt Hoffman] LDA-100 on 2.5M Wikipedia in 1 hour.
[Paul Mineiro] True Love @ eHarmony
[Stock Investors] Unknown

The Tutorial Plan
1. Baseline online linear algorithm
2. Common Questions
3. Importance Aware Updates
4. Adaptive updates
5. Conjugate Gradient
6. Active Learning
Missing: Online LDA: See Matt's slides
Ask Questions!

The basic learning algorithm (classic)
Start with ∀i: w_i = 0. Repeatedly:
1. Get example x ∈ (−∞, ∞)*.
2. Make prediction ŷ = Σ_i w_i x_i, clipped to the interval [0, 1].
3. Learn truth y ∈ [0, 1] with importance I, or go to (1).
4. Update w_i ← w_i + η 2(y − ŷ) I x_i and go to (1).
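A minimal sketch of this loop in C++ (not VW's actual implementation; the sparse example representation, learning rate, and function name are placeholder assumptions):

#include <cstddef>
#include <utility>
#include <vector>

// One online step of the classic rule above: predict, clip to [0,1], update.
// 'example' is a sparse vector of (index, value) pairs, 'y' is the label in [0,1],
// 'importance' is I, 'eta' is the learning rate.
void classic_update(std::vector<float>& w,
                    const std::vector<std::pair<size_t, float>>& example,
                    float y, float importance, float eta) {
    float yhat = 0.f;
    for (const auto& f : example) yhat += w[f.first] * f.second;  // ŷ = Σ w_i x_i
    if (yhat < 0.f) yhat = 0.f;                                   // clip to [0, 1]
    if (yhat > 1.f) yhat = 1.f;
    for (const auto& f : example)
        w[f.first] += eta * 2.f * (y - yhat) * importance * f.second;  // w_i += η·2(y−ŷ)·I·x_i
}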

Input Format
Label [Importance] [Tag]|Namespace Feature ... |Namespace Feature ... ... \n
Namespace = String[:Float]
Feature = String[:Float]
Feature and Label are what you expect. Importance is a multiplier on the learning rate. Tag is an identifier for an example, echoed on example output. Namespace is a mechanism for feature manipulation and grouping.

Valid input examples

1 | 13:3.96e-02 24:3.47e-02 69:4.62e-02
example_39|excuses the dog ate my homework
1 0.500000 example_39|excuses:0.1 the:0.01 dog ate my homework |teacher male white Bagnell AI ate breakfast

Example Input Options
-d [ --data ] <f>: Read examples from f. Multiple ⇒ use all.
cat <f> | vw: read from stdin.
--daemon: read from port 39524.
--port <p>: read from port p.
--passes <n>: Number of passes over examples. Can't multipass a noncached stream.
-c [ --cache ]: Use a cache (or create one if it doesn't exist).
--cache_file <fc>: Use the fc cache file. Multiple ⇒ use all. Missing ⇒ create. Multiple + missing ⇒ concatenate.
--compressed: Read a gzip compressed file.
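For example, to read a gzipped file, cache it, and make several passes (file name hypothetical):
vw -d train.dat.gz --compressed -c --passes 10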



Example Output Options
Default diagnostic information: Progressive Validation, Example Count, Label, Prediction, Feature Count
-p [ --predictions ] <f>: File to dump predictions into.
-r [ --raw_predictions ] <f>: File to output unnormalized predictions into.
--sendto <host:port>: Send examples to host:port.
--audit: Detailed information about feature_name:feature_index:feature_value:weight_value.
--quiet: No default diagnostics.
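For example, scoring a held-out file with an existing model and saving predictions (file names hypothetical):
vw -t -d test.dat -i model.reg -p predictions.txt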

Example Manipulation Options
-t [ --testonly ]: Don't train, even if the label is there.
-q [ --quadratic ] ab: Cross every feature in namespace a* with every feature in namespace b*. Example: -q et (= extra feature for every excuse feature and teacher feature)
--ignore <n>: Remove a namespace and all features in it.
--sort_features: Sort features for smaller cache files.
--ngram <N>: Generate N-grams on features. Incompatible with --sort_features.
--skips <S>: ...with S skips.
--hash all: hash even integer features.
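For example, crossing the excuse and teacher namespaces from the earlier input example while dropping a hypothetical namespace u (file name hypothetical):
vw -d train.dat -q et --ignore u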

Update Rule Options
--decay_learning_rate <d> [= 1]
--initial_t <i> [= 1]
--power_t <p> [= 0.5]
-l [ --learning_rate ] <l> [= 10]

η_e = l · d^{n−1} · ( i / (i + Σ_{e' < e} i_{e'}) )^p

(n = pass number, i_{e'} = importance weight of example e'.)
Basic observation: there exists no one learning rate satisfying all uses. Example: state tracking vs. online optimization.
--loss_function {squared, logistic, hinge, quantile}: Switch loss function.
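For example, a logistic-loss run with a gentler rate schedule (file name and particular values hypothetical):
vw -d train.dat -c --passes 5 -l 0.5 --initial_t 1 --power_t 0.5 --decay_learning_rate 0.9 --loss_function logistic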

Weight Options
-b [ --bit_precision ] <b> [= 18]: Number of weights (2^b). Too many features in the example set ⇒ collisions occur.
-i [ --initial_regressor ] <f>: Initial weight values. Multiple ⇒ average.
-f [ --final_regressor ] <f>: File to store final weight values in.
--random_weights: make initial weights random. Particularly useful with LDA.
--initial_weight <w>: Initial weight value.
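For example, training a 2^24-weight model to a file and then continuing from it on more data (file names hypothetical):
vw -d day1.dat -b 24 -f model.reg
vw -d day2.dat -b 24 -i model.reg -f model.reg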

Useful Parallelization Options
--thread_bits <b>: Use 2^b threads for multicore. Introduces some nondeterminism (floating point add order). Only useful with -q.
(There are other experimental cluster parallel options.)

The Tutorial Plan
1. Baseline online linear algorithm
2. Common Questions
3. Importance Aware Updates
4. Adaptive updates
5. Conjugate Gradient
6. Active Learning
Missing: Online LDA: See Matt's slides
Ask Questions!

How do I choose good features?
Think like a physicist: Everything has units.
Let x_i be the base unit. Output ⟨w · x⟩ has units of probability, median, etc...
So the predictor is a unit transformation machine.
The ideal w_i has units of 1/x_i, since doubling the feature value halves the weight.
The update ∂L_w(x)/∂w ≃ ΔL_w(x)/Δw has units of x_i.
Thus the update is 1/x_i + x_i unitwise, which doesn't make sense.
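A concrete illustration of the units mismatch (example values are hypothetical): suppose x_i is a distance and you switch from meters to kilometers, shrinking x_i by 1000x. The ideal w_i must grow by 1000x to keep ⟨w · x⟩ unchanged, yet each gradient step w_i ← w_i + η 2(y − ŷ) x_i now moves w_i 1000x less per update, so reaching the (now 1000x larger) ideal weight takes on the order of 10^6 more updates along that coordinate.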

Implications
1. Choose x_i near 1, so units are less of an issue.
2. Choose x_i on a similar scale to x_j so unit mismatch across features doesn't kill you.
3. Use other updates which fix the units problem (later).
General advice:
1. Many people are happy with TFIDF = weighting sparse features inverse to their occurrence rate.
2. Choose features for which a weight vector is easy to reach as a combination of feature vectors.

How do I choose a Loss function?
Understand loss function semantics.
1. Minimizer of squared loss = conditional expectation. f(x) = E[y|x] (default).
2. Minimizer of quantile = conditional quantile. Pr(y > f(x)|x) = τ.
3. Hinge loss = tight upper bound on 0/1 loss.
4. Minimizer of logistic = conditional probability: Pr(y = 1|x) = f(x). Particularly useful when probabilities are small.
Hinge and logistic require labels in {−1, 1}.
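For example, with {−1, 1} labels (file name hypothetical):
vw -d clicks.dat --loss_function logistic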

How do I choose a learning rate?
1. Are you trying to track a changing system? --power_t 0 (forget past quickly).
2. If the world is adversarial: --power_t 0.5 (default).
3. If the world is iid: --power_t 1 (very aggressive).
4. If the error rate is small: larger -l.
5. If the error rate is large: smaller -l (for integration).
6. If power_t is too aggressive, setting --initial_t softens initial decay.
7. For multiple passes, --decay_learning_rate in [0.5, 1] is sensible. Values < 1 protect against overfitting.

How do I order examples?
There are two choices:
1. Time order, if the world is nonstationary.
2. Permuted order, if not.
A bad choice: all label 0 examples before all label 1 examples.

How do I debug?
1. Is your progressive validation loss going down as you train? (no => malordered examples or bad choice of learning rate)
2. If you test on the train set, does it work? (no => something crazy)
3. Are the predictions sensible?
4. Do you see the right number of features coming up?

How do I figure out which features are important?
1. Save state.
2. Create a super-example with all features.
3. Start with the --audit option.
4. Save the printout.
(Seems whacky: but this works with hashing.)
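A sketch of that workflow (file and feature names hypothetical):
vw -d train.dat -f model.reg
echo "| height weight age" | vw -t -i model.reg --audit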

How do I efficiently move/store data?
1. Use --noop and -c to create cache files.
2. Use --cache_file multiple times to use multiple caches and/or create a supercache.
3. Use --port and --sendto to ship data over the network.
4. --compressed generally saves space at the cost of time.
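For example, building a cache from a gzipped file without training (file name hypothetical):
vw -d big.dat.gz --compressed -c --noop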

How do I avoid recreating cache files as I experiment?
1. Create the cache with a large -b, then experiment with smaller -b values.
2. Partition features intelligently across namespaces and use --ignore.

The Tutorial Plan
1. Baseline online linear algorithm
2. Common Questions
3. Importance Aware Updates
4. Adaptive updates
5. Conjugate Gradient
6. Active Learning
Missing: Online LDA: See Matt's slides
Ask Questions!

Examples with importance weights
The preceding is not correct (use --loss_function classic if you want it). The update rule is actually importance invariant, which helps substantially.

Principle
Having an example with importance weight h should be equivalent to having the example h times in the dataset.
(Karampatziakis & Langford, http://arxiv.org/abs/1011.1576 for details.)
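In the input format above, the importance weight is the optional second field. For instance (hypothetical example), the line
1 3.0 |f word_a word_b
asks for training as if this labeled example appeared 3 times.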

Learning with importance weights
[Sequence of figures: starting from the current prediction w_t^T x and the label y, a single gradient step moves the prediction by −η(∇ℓ)^T x to w_{t+1}^T x. Simply multiplying the step by an importance weight of 6, i.e. −6η(∇ℓ)^T x, can overshoot the label, leaving w_{t+1}^T x past y. The importance-aware update instead moves the prediction by s(h)·||x||², stopping at (or before) the label.]

What is s(·)?
Take the limit as the update size goes to 0 but the number of updates goes to ∞.
Surprise: it simplifies to a closed form.

Loss         ℓ(p, y)                                       Update s(h)
Squared      (y − p)²                                      ((p − y)/x^T x) · (1 − e^{−hη x^T x})
Logistic     log(1 + e^{−yp})                              (W(e^{hη x^T x + yp + e^{yp}}) − hη x^T x − e^{yp}) / (y x^T x),   y ∈ {−1, 1}
Hinge        max(0, 1 − yp)                                −y · min(hη, (1 − yp)/x^T x),   y ∈ {−1, 1}
τ-Quantile   τ(y − p) if y > p;  (1 − τ)(p − y) if y ≤ p   −τ · min(hη, (y − p)/(τ x^T x)) if y > p;  (1 − τ) · min(hη, (p − y)/((1 − τ) x^T x)) if y ≤ p

(W = Lambert W function.)
+ many others worked out. Similar in effect to an implicit gradient, but closed form.
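A quick sanity check of the squared-loss row as reconstructed above: as hη x^T x → ∞, the factor (1 − e^{−hη x^T x}) → 1, so |s(h)| · x^T x → |p − y|. In other words, even an arbitrarily large importance weight moves the prediction at most onto the label y, never past it, whereas the naive step −hη(∇ℓ)^T x grows without bound with h.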

Robust results for unweighted problems
[Four scatter plots, one per dataset/loss: astro - logistic loss, spam - quantile loss, rcv1 - squared loss, webspam - hinge loss. Each plots the standard update (y-axis) against the importance aware update (x-axis), with both axes roughly in the 0.9 to 1.0 range.]

The Tutorial Plan
1. Baseline online linear algorithm
2. Common Questions
3. Importance Aware Updates
4. Adaptive updates
5. Conjugate Gradient
6. Active Learning
Missing: Online LDA: See Matt's slides
Ask Questions!

Adaptive Updates
- Adaptive, individual learning rates in VW.
- It's really gradient descent separately on each coordinate i with

  η_{t,i} = 1 / sqrt( Σ_{s=1}^{t} ( ∂ℓ(w_s^T x_s, y_s) / ∂w_{s,i} )² )

- Coordinate-wise scaling of the data is less of an issue (units issue addressed); see Duchi, Hazan, and Singer / McMahan and Streeter, COLT 2010.
- Requires 2x RAM at learning time, but the learned regressor is compatible.
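A minimal per-coordinate sketch of this rule in C++ (not VW's implementation; the sparse example representation, the base rate eta, and the function name are assumptions):

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Adaptive per-coordinate step: each weight i keeps a running sum of its
// squared gradients and is updated with learning rate eta / sqrt(that sum).
void adaptive_update(std::vector<float>& w, std::vector<float>& sum_sq_grad,
                     const std::vector<std::pair<size_t, float>>& example,
                     float loss_gradient_wrt_prediction, float eta) {
    for (const auto& f : example) {
        float g = loss_gradient_wrt_prediction * f.second;        // ∂ℓ/∂w_i = ℓ'(ŷ) · x_i
        if (g == 0.f) continue;                                   // nothing to do, avoid 0/0
        sum_sq_grad[f.first] += g * g;                            // accumulate Σ_s g_{s,i}²
        w[f.first] -= eta * g / std::sqrt(sum_sq_grad[f.first]);  // step with η / sqrt(Σ g²)
    }
}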

Some tricks involved
- Store the sum of squared gradients w.r.t. w_i near w_i.
- Fast approximate inverse square root:

  float InvSqrt(float x){
    float xhalf = 0.5f * x;
    int i = *(int*)&x;
    i = 0x5f3759d5 - (i >> 1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
  }

- The special SSE rsqrt instruction is a little better.
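For reference, a sketch of the SSE alternative mentioned above (assuming an x86 target with SSE; not VW's code):

  #include <xmmintrin.h>

  // Approximate 1/sqrt(x) using the SSE rsqrtss instruction.
  inline float rsqrt_sse(float x) {
      return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
  }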

Experiments
- Raw Data
  ./vw --adaptive -b 24 --compressed -d tmp/spam_train.gz
    average loss = 0.02878
  ./vw -b 24 --compressed -d tmp/spam_train.gz -l 100
    average loss = 0.03267
- TFIDF scaled data
  ./vw --adaptive --compressed -d tmp/rcv1_train.gz -l 1
    average loss = 0.04079
  ./vw --compressed -d tmp/rcv1_train.gz -l 256
    average loss = 0.04465

The Tutorial Plan
1. Baseline online linear algorithm
2. Common Questions
3. Importance Aware Updates
4. Adaptive updates
5. Conjugate Gradient
6. Active Learning
Missing: Online LDA: See Matt's slides
Ask Questions!

Preconditioned Conjugate Gradient Options
--conjugate_gradient: Use batch mode preconditioned conjugate gradient learning. 2 passes/update. Output predictor compatible with the base algorithm. Requires 5x RAM. Uses cool trick:

  d^T H d = (∂²ℓ(z) / ∂z²) · ⟨x, d⟩²

--regularization <r>: Add r times the weight magnitude to the optimization. Reasonable choice = 0.001. Works well with logistic or squared loss.
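For example (file name hypothetical; flags as named on this slide):
vw -d train.dat -c --passes 20 --conjugate_gradient --regularization 0.001 --loss_function logistic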

What is Conjugate Gradient?
1. Compute average gradient (one pass).
2. Mix gradient with previous step direction to get new step direction.
3. Compute step size using Newton's method (one pass).
4. Update weights.
Step 2 is the distinctive part. Precondition = reweight dimensions.

Why Conjugate Gradient?
Addresses the units problem.
A decent batch algorithm: 10s of passes suffice.
Learned regressor is compatible.
See Jonathan Shewchuk's tutorial for more details.

The Tutorial Plan
1. Baseline online linear algorithm
2. Common Questions
3. Importance Aware Updates
4. Adaptive updates
5. Conjugate Gradient
6. Active Learning
Missing: Online LDA: See Matt's slides
Ask Questions!

Importance Weighted Active Learning (IWAL) [BDL'09]
S = ∅
For t = 1, 2, . . . until no more unlabeled data:
1. Receive unlabeled example x_t.
2. Choose a probability of labeling p_t.
3. With probability p_t get label y_t, and add (x_t, y_t, 1/p_t) to S.
4. Let h_t = Learn(S).

New instantiation of IWAL [BHLZ'10]: strong consistency / label efficiency guarantees by using

  p_t = min{ 1, (1/Δ_t²) · (log t)/(t − 1) }

where Δ_t = increase in training error rate if the learner is forced to change its prediction on the new unlabeled point x_t.


Using VW as the base learner, estimate t · Δ_t as the importance weight required for the prediction to switch. For the square-loss update:

  Δ_t := (1/(t · η_t)) · log( max{h(x_t), 1 − h(x_t)} / 0.5 )

Active learning in Vowpal Wabbit

Simulating active learning (tuning parameter C > 0):
vw --active_simulation --active_mellowness C
(increasing C → ∞ = supervised learning)


Deploying active learning:
vw --active_learning --active_mellowness C --daemon
- vw interacts with an active_interactor (ai)
- for each unlabeled data point, vw sends back a query decision (+ importance weight)
- ai sends labeled importance-weighted examples as requested
- vw trains using the labeled weighted examples

Active learning in Vowpal Wabbit
[Protocol diagram: active_interactor sends x_1 to vw; vw replies (query, 1/p_1); active_interactor sends back (x_1, y_1, 1/p_1); vw performs a gradient update. For x_2, vw replies (no query). ...]
active_interactor.cc (in the git repository) demonstrates how to implement this protocol.


Active learning simulation results
RCV1 (text binary classification task):

training:
vw --active_simulation --active_mellowness 0.000001 -d rcv1-train -f active.reg -l 10 --initial_t 10
number of examples = 781265
total queries = 98074 (i.e., < 13% of the examples)
(caveat: progressive validation loss not reflective of test loss)

testing:
vw -t -d rcv1-test -i active.reg
average loss = 0.04872 (better than supervised)

Active learning simulation results
[Four plots, one per dataset: astrophysics, spam, rcv1, webspam. Each plots error (y-axis) against fraction of labels queried (x-axis) for three methods: importance aware, gradient multiplication, and passive.]

Goals for Future Development
1. Finish scaling up. I want a kilonode program.
2. Native learning reductions. Just like more complicated losses.
3. Other learning algorithms, as interest dictates.
4. Persistent Daemonization.
