Predicting Response in Mobile Advertising with Hierarchical Importance-Aware Factorization Machine Richard Oentaryo, Ee-Peng Lim, Jia-Wei Low, Mike Finegold WSDM 2014
PART 1: PROBLEM
Mobile Advertising

[Diagram, adapted from Agarwal & Chen, 2011: advertisers submit ads and bids to an ad network, which chooses the best ads, i.e. arg max (bid × response rate), to place on a publisher's page shown to mobile users; responses (clicks, conversions) occur on the landing page. Our focus: the ad network.]
Response Prediction Task

  pCTR = (# clicks) / (# exposes)

• Goal: Predict click-through rate (pCTR) for a given (page, ad) pair at a particular time
• Challenges:
  – Temporal dynamics
  – Cost-varying responses
  – Sparse data / cold-start
• Entities known to us:
  – Webpage (= page ID)
  – Ad (= campaign ID)
  – Time (= day of week)
• Desiderata:
  – Accurate pCTR estimate → crucial for ad price auction
  – Good pCTR ranking → effective placement of ads in a page
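The pCTR definition above can be sketched as a simple aggregation over an impression log. This is a minimal illustration, not the paper's code; the log layout and field names are assumptions.

```python
# Hedged sketch: empirical pCTR = #clicks / #exposes per (page, ad, day) triplet.
from collections import defaultdict

def empirical_pctr(log):
    """log: iterable of (page_id, ad_id, day, clicked) records."""
    clicks = defaultdict(int)
    exposes = defaultdict(int)
    for page, ad, day, clicked in log:
        key = (page, ad, day)
        exposes[key] += 1          # every record is one expose
        clicks[key] += int(clicked)
    return {k: clicks[k] / exposes[k] for k in exposes}

log = [("p1", "a1", "Mon", 1), ("p1", "a1", "Mon", 0),
       ("p1", "a1", "Mon", 0), ("p2", "a1", "Tue", 1)]
print(empirical_pctr(log))  # pCTR ≈ 0.333 for the first triplet, 1.0 for the second
```

With sparse data many triplets have few exposes, which is exactly why these raw ratios are unreliable and a model is needed.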
Related Works: A Rough Taxonomy

Response prediction
• Explicit feature-based
  – Richardson et al., WWW 2007; Dave & Varma, SIGIR 2010
  – (Log-)linear models: Craswell et al., WSDM 2008; Agarwal et al., WWW 2009; Agarwal et al., KDD 2010
• Latent feature-based
  – Matrix factorization: Menon et al., KDD 2011; Shen et al., WSDM 2012
  – Tensor factorization: Yin et al., WSDM 2014; this work
PART 2: DATASET
BuzzCity Ad Network

Dataset: 05-31 October 2012

Entity                 emin = 1     emin = 10    emin = 100   emin = 1000
                       (emin = minimum no. of ad exposes)
No. of records         24,172,134   10,535,658   3,587,160    931,032
Page hierarchy:
  No. of webpages      244,341      138,351      55,260       16,374
  No. of publishers    3,945        3,539        2,654        1,643
  No. of countries     243          239          226          199
  No. of channels      8            8            8            8
Ad hierarchy:
  No. of ads           23,500       18,365       15,600       10,877
  No. of advertisers   1,989        1,406        1,245        1,124
  No. of banner types  5            4            3            3
PART 3: APPROACH
Response Prediction Framework

Goal: Predict unknown pCTR
Challenge: Sparse data, cold-start cases

[Diagram: the (page p, ad a, day d) response tensor, with unknown pCTR entries marked "?", is factorized into K latent factors per entity via tensor factorization, coupling the pairwise slices page × ad, page × day, and ad × day.]
Unifying Model: Factorization Machine

A factorization machine is a generic bilinear model (Rendle, 2012):

  ŷ(x) = w0 + Σ_i w_i x_i  +  Σ_i Σ_{j>i} ( Σ_{k=1}^{K} v_{k,i} v_{k,j} ) x_i x_j

  (first two terms: linear regression; last term: two-way interaction)

In this work, we use the following feature representation:

  x = (0,...,0,1,0,...,0, 0,...,0,1,0,...,0, 0,...,0,1,0,...,0)
      with one 1 in each of the #pages, #ads, and #days blocks

Result: pairwise tensor factorization
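The concatenated one-hot representation can be sketched as follows; index offsets and function names are illustrative, not taken from the authors' code.

```python
# Hedged sketch: concatenated one-hot blocks for page, ad, and day-of-week.
import numpy as np

def encode(page_idx, ad_idx, day_idx, n_pages, n_ads, n_days=7):
    x = np.zeros(n_pages + n_ads + n_days)
    x[page_idx] = 1.0                   # one 1 among #pages
    x[n_pages + ad_idx] = 1.0           # one 1 among #ads
    x[n_pages + n_ads + day_idx] = 1.0  # one 1 among #days
    return x

# Toy sizes: 4 pages, 3 ads, 7 days -> a length-14 feature vector.
x = encode(page_idx=2, ad_idx=0, day_idx=5, n_pages=4, n_ads=3)
print(x)  # 1s at positions 2, 4, and 12; 0s elsewhere
```

With exactly three active features, all FM interaction terms except the three cross-block products vanish, which is what yields the pairwise tensor factorization.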
Model Interpretation

For a one-hot x with active page p, ad a, and day d, the model reduces to:

  ŷ = w0 + w_p + w_a + w_d
     + Σ_{k=1}^{K} v_{k,p} v_{k,a} + Σ_{k=1}^{K} v_{k,p} v_{k,d} + Σ_{k=1}^{K} v_{k,a} v_{k,d}

  w0: global bias;  w_p, w_a, w_d: local biases;  the three sums: pairwise interactions
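The reduced prediction above is cheap to compute directly. A minimal NumPy sketch, assuming parameter names (w0, w, V) that are illustrative rather than the authors':

```python
# Hedged sketch: global bias + three local biases + three pairwise factor products.
import numpy as np

def predict(w0, w, V, p, a, d):
    """w: per-feature biases; V: K x n_features latent factors;
    p, a, d: column indices of the active page, ad, and day features."""
    pairwise = (V[:, p] @ V[:, a]) + (V[:, p] @ V[:, d]) + (V[:, a] @ V[:, d])
    return w0 + w[p] + w[a] + w[d] + pairwise

rng = np.random.default_rng(0)
K, n_feat = 4, 14
w0, w, V = 0.01, rng.normal(0, 0.01, n_feat), rng.normal(0, 0.1, (K, n_feat))
print(predict(w0, w, V, p=2, a=4, d=12))
```

Each of the three dot products is one Σ_k term of the equation, so the cost per prediction is O(K).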
Importance-Aware Learning: Weighted Cost Functions

• Individual = (page, ad, day) triplet
• Share (weight) = no. of exposures of that individual
• Each individual's loss term is weighted by its share in the cost function
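A minimal sketch of such a weighted cost, assuming a squared loss and normalized exposure shares (the exact weighting scheme here is illustrative):

```python
# Hedged sketch: each (page, ad, day) individual's squared error is weighted
# by its share, i.e. its number of exposures.
import numpy as np

def weighted_squared_cost(y_true, y_pred, exposes):
    w = np.asarray(exposes, dtype=float)
    w = w / w.sum()  # normalize shares into weights summing to 1
    return float(np.sum(w * (np.asarray(y_true) - np.asarray(y_pred)) ** 2))

cost = weighted_squared_cost([0.1, 0.0], [0.2, 0.1], exposes=[900, 100])
print(cost)  # ≈ 0.01; the 900-expose triplet contributes 9x the other's loss
```

Heavily-exposed triplets thus dominate the fit, matching their economic importance.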
Handling Cold Start: Hierarchy Structure

• Page hierarchy: Channel, Publisher, Country, Page
• Ad hierarchy: Advertiser, Banner Type, Ad

Hierarchical structures serve as prior probability and a means for back-off
Hierarchical Learning

1. Regularization
   Intuition: each latent factor should have a prior that makes it more similar to its parents'

2. Fitting
   Idea: a reasonable prior for the parents can be obtained by aggregating click/expose data
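The fitting idea can be sketched as a simple roll-up of child click/expose counts to their parents (e.g. pages to publishers); data layout and names below are assumptions for illustration.

```python
# Hedged sketch: a parent's prior CTR is obtained by aggregating the
# click/expose counts of its children.
from collections import defaultdict

def parent_priors(child_stats, parent_of):
    """child_stats: {child: (clicks, exposes)}; parent_of: {child: parent}."""
    clicks = defaultdict(int)
    exposes = defaultdict(int)
    for child, (c, e) in child_stats.items():
        clicks[parent_of[child]] += c
        exposes[parent_of[child]] += e
    return {p: clicks[p] / exposes[p] for p in exposes}

pages = {"page1": (3, 100), "page2": (1, 100), "page3": (0, 50)}
pubs = {"page1": "pub1", "page2": "pub1", "page3": "pub2"}
print(parent_priors(pages, pubs))  # {'pub1': 0.02, 'pub2': 0.0}
```

A cold-start page with no data can then back off to its publisher's (or channel's) aggregated prior.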
Hierarchical + Importance-Aware Learning

• Stochastic gradient descent (square and logistic loss): weighted update + hierarchical regularization
• Coordinate descent (cyclic and stochastic)
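One SGD step combining the two ideas might look as follows. This is a sketch under assumed hyperparameter names (eta, lam) and a generic loss gradient, not the paper's exact update rule:

```python
# Hedged sketch: the loss gradient is scaled by the record's importance
# weight, and the latent factor is additionally pulled toward its parent's
# factor (hierarchical regularization).
import numpy as np

def sgd_step(v_child, v_parent, grad_loss, weight, eta=0.01, lam=0.1):
    """One update of a child's latent factor vector."""
    reg_grad = lam * (v_child - v_parent)  # pull child toward its parent
    return v_child - eta * (weight * grad_loss + reg_grad)

v_page, v_pub = np.array([0.5, -0.2]), np.array([0.4, 0.0])
v_new = sgd_step(v_page, v_pub, grad_loss=np.array([0.1, 0.1]), weight=2.0)
print(v_new)
```

Setting lam to zero recovers plain weighted SGD; setting weight to 1 recovers unweighted hierarchical learning.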
PART 4: EXPERIMENTS
Experiment Setup

Validation: 10 sliding-window trials, each with a 7-day training set and a 2-day test set

Trial   Training set   Test set
1       05-11 Oct      12-13 Oct
2       07-13 Oct      14-15 Oct
3       09-15 Oct      16-17 Oct
4       11-17 Oct      18-19 Oct
5       13-19 Oct      20-21 Oct
6       15-21 Oct      22-23 Oct
7       17-23 Oct      24-25 Oct
8       19-25 Oct      26-27 Oct
9       21-27 Oct      28-29 Oct
10      23-29 Oct      30-31 Oct

Evaluation metrics:
1. Prediction: root mean square error, negative log-likelihood
2. Ranking: area under ROC curve
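The two prediction metrics can be sketched as follows; the paper's weighted variants (wRMSE) would additionally weight each term by exposures, and the Bernoulli form of the NLL here is an assumption:

```python
# Hedged sketch: root mean square error and per-expose negative
# log-likelihood of observed clicks given predicted pCTR.
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def neg_log_likelihood(clicks, exposes, pctr, eps=1e-12):
    c, e, p = map(np.asarray, (clicks, exposes, pctr))
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return float(-np.sum(c * np.log(p) + (e - c) * np.log(1 - p)) / e.sum())

print(rmse([0.1, 0.0], [0.1, 0.1]))  # ≈ 0.0707
```

Lower is better for both; AUC (the ranking metric) is computed over the same predictions but only uses their ordering.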
Performance Benchmark (min. exposure emin = 1000)

[Charts: pCTR prediction error (wRMSE, regression) and ad ranking quality (wAUC, ranking) for the baseline vs. the proposed methods]
Performances in Cold Start Situations (min. exposure emin = 1000)

Observations:
1) Results improve when using hierarchical regularization and fitting
2) Results for importance-aware and unweighted learning are not directly comparable
PART 5: CONCLUSION
Main Contributions
• A unified tensor factorization model catering for temporal dynamics, importance-aware and hierarchical learning
• Simple yet effective extensions of SGD and CD algorithms for handling importance weights and hierarchical regularization
• Practical usage and result improvements showcased on real mobile advertising data
FUTURE WORKS
• Scale up with parallelization or a more efficient data representation
• Enhance interpretability of the model, e.g., non-negativity constraints on the latent factors
• Explore model applications in other tasks, e.g., item adoption
Variations at Different Hierarchy Levels

[Plots: daily CTR for page × advertiser, publisher × ad, channel × advertiser, and channel × ad]

Observations:
1) CTR values vary across different days
2) CTR variations increase as we go up in the hierarchy → possibility for "back-off"
Performance Benchmark (min. exposure emin = 10) [charts]

Performance Benchmark (min. exposure emin = 100) [charts]

Performance Benchmark (min. exposure emin = 1000) [charts]

Performances in Cold Start Situations (min. exposure emin = 100) [charts]

Performances in Cold Start Situations (min. exposure emin = 10) [charts]