Overview of the TREC 2016 Contextual Suggestion Track
Seyyed Hadi Hashemi, Charles Clarke, Jaap Kamps, Julia Kiseleva, and Ellen Voorhees
University of Amsterdam, University of Waterloo, NIST
TREC 2016
Overview
1. Contextual Suggestion Track Tasks and Setup
2. Test Collection
3. Results
4. Analysis
5. Summary
Part 1: Track Tasks and Setup
Data Collection, Context, Profile, Tasks
What Is Contextual Suggestion?
Track Setup
(diagram)
Context: 1) City, 2) Trip Type, 3) Trip Duration, 4) Group Type, 5) Season
Profile: 1) Ratings, 2) Endorsements, 3) Age, 4) Gender
Context + Profile → Contextual Suggestion System → Ranked list of attractions
Track’s Aim The Contextual Suggestion Track investigates search techniques for complex information needs that are highly dependent on context and user interests.
5th Year
How did the TREC 2016 Contextual Suggestion Track change based on what we learned in previous years?
Data Collection
• 2012: Open Web
• 2013: Open Web and ClueWeb12
• 2014: Open Web and ClueWeb12
• 2015: TREC CS Collection as a fixed collection
• 2016: TREC CS Web Corpus as an additional data source
TREC CS Web Corpus
• Web crawl
• WARC format
• Easy to recreate
• ~1M pages
• Copyright form
Overview: Context
Context
• 2012: Location and time
• 2013 and 2014: Location
• 2015 and 2016: Location, plus several pieces of data about the trip:
  • A city (e.g., Gaithersburg)
  • A trip type (e.g., Business)
  • A trip duration (e.g., Weekend trip)
  • The type of group the person is travelling with (e.g., Friends)
  • The season the trip will occur in (e.g., Autumn)
Overview: Profile
Profile
• 2012, 2013, and 2014: Ratings
• 2015 and 2016: Ratings, age, and gender
• 2016: Endorsements collected by NIST assessors
  • Family friendly, outdoor activity, nature walks, parks, museums, art galleries, cocktails, and …
Overview: Tasks
TREC 2016 CS Had Two Phases
• Phase 1: a collection-based task
  • Participants had to retrieve 50 suggestions from the TREC CS Collection
• Phase 2: a reranking task
  • A candidate set of suggestions is provided for each request
Participants’ Statistics over the Last 5 Years
• 2012: 14 teams - 27 runs
• 2013: 19 teams - 34 runs
• 2014: 17 teams - 21 runs
• 2015: Live (6 teams - 9 runs) and Batch (17 teams - 30 runs); 39 runs in total
• 2016: Phase 1 (8 teams - 15 runs) and Phase 2 (13 teams - 30 runs); 45 runs in total
Part 2: Test Collection
Qrels, Test Collection Statistics
Test Collection
• # Requests: 424
• # Persons: 241
• # Cities: 215
• TREC 2016 official test collection:
  • # Requests: 61 (Phase 1) and 58 (Phase 2)
  • # Persons: 27
  • # Cities: 48 + 2 (seed cities)
  • # Unique judged venues: 5,162
  • Avg # venues / request: 95
  • # Judgments: 5,782 unique judgments
Test Collection Statistics
(figures)
Endorsements
• How many judged tags/categories? 133
• Relevance probability of each category in the qrels (figure)
Part 3: Results
Metrics, Phase 1 Results, Phase 2 Results
Metrics
• NDCG@5
• P@5
• MRR
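The three measures can be sketched over one request's ranked list of graded judgments (track scores average these over requests). The example grades below are hypothetical, not track data:

```python
import math

def precision_at_k(rels, k=5):
    """P@k: fraction of the top-k results judged relevant (grade > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def mrr(rels):
    """Reciprocal rank of the first relevant result; 0 if none retrieved."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(rels, k=5):
    """NDCG@k with the standard log2 discount over graded relevance."""
    dcg = lambda gs: sum(g / math.log2(i + 1) for i, g in enumerate(gs, start=1))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for one request's ranked suggestions.
rels = [2, 0, 1, 0, 3, 0]
```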
Phase 1 Results
(results table)
Top-3 Phase 1 Submissions
• #1: USI (Università della Svizzera italiana)
  • They crawled 600K venues from Foursquare to create positive and negative category profiles
  • They created venue taste keyword profiles
  • Final score is a linear combination of the venue category and taste keyword scores
• #2: IAPLab (Nanjing University)
• #3: ADAPT_TCD (ADAPT Centre, Trinity College Dublin)
  • Ontology-based approach
  • They created the ontology using the Foursquare Category Hierarchy
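USI's final scoring step, a linear combination of category and taste-keyword scores, can be sketched as below. The weight alpha and the profile contents are illustrative assumptions, not the team's actual parameters:

```python
def combined_score(venue, cat_profile, kw_profile, alpha=0.6):
    """Score a venue as alpha * category score + (1 - alpha) * keyword score.
    alpha = 0.6 is an assumed weight, not USI's tuned value."""
    cat_score = sum(cat_profile.get(c, 0.0) for c in venue["categories"])
    kw_score = sum(kw_profile.get(k, 0.0) for k in venue["keywords"])
    return alpha * cat_score + (1 - alpha) * kw_score

# Hypothetical profiles built from a user's liked/disliked venues.
cat_profile = {"museum": 0.8, "nightclub": -0.4}
kw_profile = {"art": 0.5, "quiet": 0.2}
venue = {"categories": ["museum"], "keywords": ["art", "quiet"]}
```

Venues are then ranked by this combined score per request.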
Phase 2 Results
(results table)
Top-2 Phase 2 Submissions
• #1: DUTH (Democritus University of Thrace)
  • They used a Rocchio-like classifier with the user's rated venues as the training set
  • They built a custom query for the user using a modified Rocchio relevance feedback method
• #2: LavalLakehead (Laval University & Lakehead University)
  • Global interests model: a regressor trained on the TREC 2015 data
  • Contextual individual preference model, using word embeddings
  • Final ranking is based on the combination of the above models
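DUTH's Rocchio-style step can be sketched as follows. The beta/gamma weights are the textbook Rocchio defaults and the three-term venue vectors are toy features, not the team's actual settings:

```python
import math

def rocchio_query(pos, neg, beta=0.75, gamma=0.15):
    """Build a user query vector from liked (pos) and disliked (neg)
    venue vectors: beta * centroid(pos) - gamma * centroid(neg)."""
    dim = len(pos[0])
    def centroid(docs):
        return [sum(d[i] for d in docs) / len(docs) if docs else 0.0
                for i in range(dim)]
    p, n = centroid(pos), centroid(neg)
    return [beta * p[i] - gamma * n[i] for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b) + 1e-12)

# Toy term space [outdoor, museum, nightlife] (hypothetical features).
liked = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
disliked = [[0.0, 0.1, 1.0]]
q = rocchio_query(liked, disliked)

# Rerank candidate venues by similarity to the custom query.
candidates = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.2]]
order = sorted(range(len(candidates)), key=lambda i: -cosine(candidates[i], q))
```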
Part 4: Analysis
Multi-depth Pooling, Reusability, Overlap@N
Is the TREC CS Test Collection Reusable?
How Difficult Is the Reusability Problem?
• Just 22% of suggestions are mutual across runs
Reusability Based on P@5
• Leave One Team Out (LOTO)
(figures for 2014, 2015, and 2016)
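The LOTO procedure can be sketched as: rebuild the qrels without one team's pooled contributions, then re-score that team's runs against the reduced judgments. The data layout (one request, qrels keyed by contributing team) and the toy P@5 evaluator below are simplifying assumptions:

```python
def p_at_5(run, qrels):
    """Toy evaluator: precision of the top 5 against the given qrels."""
    return sum(1 for d in run[:5] if qrels.get(d, 0) > 0) / 5

def loto_scores(runs_by_team, qrels_by_team, evaluate=p_at_5):
    """Leave One Team Out: score each team's runs using judgments
    pooled only from the OTHER teams' contributions."""
    scores = {}
    for team, runs in runs_by_team.items():
        reduced = {}
        for other, qrels in qrels_by_team.items():
            if other != team:
                reduced.update(qrels)
        scores[team] = {rid: evaluate(run, reduced) for rid, run in runs.items()}
    return scores

# Toy data: two teams, each contributing judgments for its own pooled docs.
runs_by_team = {"A": {"runA": ["d3", "d1"]}, "B": {"runB": ["d1", "d4", "d2"]}}
qrels_by_team = {"A": {"d1": 1, "d4": 1}, "B": {"d3": 1, "d2": 0}}
scores = loto_scores(runs_by_team, qrels_by_team)
```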
Multi-depth Pooling
• Hard pool cut-off = 5, soft pool cut-off = 25, and very soft pool cut-off = 50
(diagram: a ranked list with cut-offs at ranks 5, 25, and 50)
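The nesting idea can be sketched as below: each pool is the union of every run's top-c documents, so the hard depth-5 pool is contained in the softer 25 and 50 pools. The demo uses toy depths, and whether the softer pools are judged exhaustively or sampled is not shown here:

```python
def multi_depth_pool(runs, cutoffs=(5, 25, 50)):
    """Build nested pools at several depths: pools[c] is the union of
    every run's top-c documents, so shallower pools are subsets."""
    pools, seen = {}, set()
    for c in sorted(cutoffs):
        for run in runs:
            seen.update(run[:c])
        pools[c] = set(seen)
    return pools

# Toy ranked runs and toy depths in place of the track's 5/25/50.
runs = [["a", "b", "c", "d", "e", "f"], ["c", "x", "y", "z", "w", "v"]]
pools = multi_depth_pool(runs, cutoffs=(2, 5))
```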
Multi-depth Pooling Cost
• We get more stable metrics at deep ranks without spending much extra effort.
• It costs even less than traditional pooling with a pool depth of 10.
Reusability of TREC 2016 CS Test Collection
Overlap@N
• Overlap@N is the mean fraction of judged documents at rank N of the runs, averaged over requests
• Previously, overlap@N dropped dramatically after the pool cut-off
(figures for 2015 and 2016)
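The definition can be sketched directly: for each request and run, take the fraction of the top-N documents that appear in the judged set, then average. The data below is a toy example:

```python
def overlap_at_n(runs_per_request, judged_per_request, n):
    """Mean fraction of judged documents in the top n, averaged
    over all (request, run) pairs."""
    fracs = []
    for req, runs in runs_per_request.items():
        judged = judged_per_request[req]
        for run in runs:
            top = run[:n]
            fracs.append(sum(1 for d in top if d in judged) / len(top))
    return sum(fracs) / len(fracs)

# Toy example: one request, two runs, three judged documents.
runs = {"req1": [["a", "b", "c"], ["a", "x", "y"]]}
judged = {"req1": {"a", "b", "c"}}
```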
Reusability of TREC 2016 CS Test Collection
• Phase 1 experiments are reusable for a new research group if they use either bpref or MAP and apply a statistical significance test
• bpref: 67 / 105 (64%) of differences are significant in 2016
(figures for 2015 and 2016)
Reusability of TREC 2016 CS Test Collection (cont.)
• MAP: 57 / 105 (54%) of differences are significant in 2016
(figures for 2015 and 2016)
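bpref's robustness to incomplete judgments, the property that makes it suitable for evaluating non-pooled runs, comes from simply skipping unjudged documents. A minimal sketch following the standard TREC definition:

```python
def bpref(ranked, qrels):
    """bpref: each relevant retrieved doc is penalized by the fraction of
    judged nonrelevant docs ranked above it; unjudged docs are ignored."""
    R = sum(1 for g in qrels.values() if g > 0)   # judged relevant
    N = sum(1 for g in qrels.values() if g == 0)  # judged nonrelevant
    if R == 0:
        return 0.0
    nonrel_above, score = 0, 0.0
    for doc in ranked:
        if doc not in qrels:
            continue  # unjudged: skipped, the key property of bpref
        if qrels[doc] > 0:
            denom = min(R, N)
            score += 1.0 if denom == 0 else 1 - min(nonrel_above, denom) / denom
        else:
            nonrel_above += 1
    return score / R

# Hypothetical judgments: two relevant, two nonrelevant docs.
qrels = {"a": 1, "b": 0, "c": 1, "d": 0}
```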
Phase 2 vs. Pooling Bias
• There is no pooling bias in Phase 2.
• None of the runs are pooled.
• It is fair to be used by research groups that have not participated in the track.
• All the measures are practical for evaluation and system ranking.
Part 5: Summary
Summary, Discussion
Summary
• An overview of the Contextual Suggestion Track (Phase 1 and Phase 2).
• The TREC CS Test Collection.
• TREC CS Phase 1 and Phase 2 results.
• The TREC CS Web Corpus and a set of endorsements released.
• Multi-depth pooling yields a reusable test collection whose metrics are also more stable at ranks deeper than the traditional pool cut-off.
• The Phase 1 test collection is reusable with the more stable metrics bpref and MAP.
• The Phase 2 test collection has no pooling bias and is fair for evaluating non-pooled runs.
Do You Want to Continue Working on Contextual Suggestion?
Come to the Task Track Planning Session
Thanks for Your Attention!