Encoding Video and Label Priors for Multi-label Video Classification on the YouTube-8M Dataset
Team SNUVL X SKT (8th Ranked)

Seil Na1, Youngjae Yu1, Sangho Lee1, Jisung Kim2, Gunhee Kim1

Code : https://github.com/seilna/youtube8m

Contents
• YouTube-8M Video Multi-label Classification
• Our approach
  • Video Pooling Layer
  • Classification Layer
  • Label Processing Layer
  • Loss Function
• Results

YouTube-8M Video Multi-label Classification • Input: videos (with audio) up to 300 seconds long • Video and audio are given in feature form, extracted with an Inception network (video) and a VGG network (audio)

[Figure: per-frame video features from Inception and audio features from VGG]

YouTube-8M Video Multi-label Classification • Output: given test video and audio features, the model produces multi-label prediction scores over 4,716 classes

[Figure: video + audio features → model → labels such as Car, Racing, Race Track, Vehicle]

YouTube-8M Video Multi-label Classification • Evaluation: among the scores for all classes, only the top 20 scores per video are considered • Global Average Precision (GAP) is used to evaluate the performance of the model

GAP = Σᵢ p(i) Δr(i)

where p(i) is the precision and Δr(i) the change in recall at rank i of the pooled prediction list.
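As a sanity check, the metric can be computed in a few lines. Below is a minimal numpy sketch (pooling every video's top-20 predictions into one global ranked list follows the challenge definition; the function name is ours):

```python
import numpy as np

def gap(scores, labels, top_k=20):
    """Global Average Precision: pool the top-k predictions of every
    video into one ranked list, then compute average precision on it."""
    ranked = []          # (score, is_correct) pairs pooled over all videos
    total_positives = 0
    for s, y in zip(scores, labels):
        total_positives += int(y.sum())
        top = np.argsort(s)[::-1][:top_k]
        ranked.extend((s[i], y[i]) for i in top)
    ranked.sort(key=lambda t: -t[0])
    hits, ap = 0, 0.0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            hits += 1
            ap += hits / rank / total_positives  # p(i) * Δr(i)
        # Δr(i) = 0 for incorrect predictions, so they add nothing
    return ap
```

A perfectly ranked video yields GAP = 1.0, and every misranked positive pulls the score down by its precision at the rank where it finally appears.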

Three Key Issues • Our approach tackles THREE issues i) Video pooling method (representation) ii) Label imbalance problem iii) Correlation between labels

Three Key Issues • Our approach tackles THREE issues
i) Video pooling method (representation)
• Encode T frame features into a compact vector
• The encoder should capture both the content distribution of the frames and the temporal information of the sequence
ii) Label imbalance problem
iii) Correlation between labels

Three Key Issues • Our approach tackles THREE issues
i) Video pooling method
ii) Label imbalance problem
• In the YouTube-8M dataset, the number of instances per class varies widely
• How can we generalize well to the rare classes in the validation/test sets?


Three Key Issues • Our approach tackles THREE issues
i) Video pooling method
ii) Label imbalance problem
iii) Correlation between labels
• Some labels are semantically interrelated
• Related labels tend to appear in the same video
• How can we use this prior to improve classification performance?

Our approach • Our model consists of FOUR components
I. Video pooling layer (→ issues 1, 2)
II. Classification layer
III. Label processing layer (→ issue 3)
IV. Loss function (→ issue 2)

1. Video pooling method  2. Label imbalance problem  3. Correlation between labels

Video Pooling Layer • The video pooling layer g_θ : ℝ^(T×1,152) → ℝ^d encodes the T frame vectors into a compact vector • We experiment with the following 5 methods


[Figure (a) Video Pooling Layer: the five candidate encoders: LSTM, CNN, Position Encoding, Indirect Clustering, Adaptive Noise]

Video Pooling Layer 1. LSTM • Each frame vector is an input to the LSTM • All state vectors and the average of the input vectors are used

[Figure: video + audio frame features → LSTM → pooling feature from the states]
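A minimal numpy sketch of this pooling. The slides leave open exactly which state statistics are kept, so concatenating the mean of the hidden states with the mean of the inputs is our assumption, and the single-layer cell and weight shapes are our own:

```python
import numpy as np

def lstm_pool(frames, W, U, b, hidden=64):
    """Run a single-layer LSTM over frame features (shape (T, d)) and
    concatenate a summary of the states with the mean of the inputs.
    W: (4*hidden, d), U: (4*hidden, hidden), b: (4*hidden,)."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    T, d = frames.shape
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for t in range(T):
        z = W @ frames[t] + U @ h + b          # all four gates at once
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    # mean of state vectors + mean of input vectors (our chosen summary)
    return np.concatenate([np.mean(states, axis=0), frames.mean(axis=0)])
```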


Video Pooling Layer 2. CNN • Use a convolution operation as in [Kim 2014] • Adjacent frame vectors are considered together

[Figure: convolution over adjacent frames, then max-pooling over time]

Kim, Yoon. "Convolutional neural networks for sentence classification."arXiv:1408.5882, 2014
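A minimal numpy sketch of this Kim-2014-style pooling over the frame axis (the filter widths, filter counts, and the dict layout of `filters` are illustrative assumptions):

```python
import numpy as np

def cnn_pool(frames, filters):
    """Convolve filters of several widths over time, then max-pool each
    feature map over time. frames: (T, d);
    filters: {width: weight tensor of shape (n_filters, width, d)}."""
    T, d = frames.shape
    feats = []
    for width, W in filters.items():
        n = W.shape[0]
        # valid 1-D convolution along the time axis
        maps = np.array([[np.sum(W[k] * frames[t:t + width])
                          for t in range(T - width + 1)]
                         for k in range(n)])
        feats.append(maps.max(axis=1))   # max over time, one value per filter
    return np.concatenate(feats)
```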

Video Pooling Layer 3. Position Encoding • Use the position encoding (PE) matrix of [E2EMN] to represent the sequence order

[Figure: PE matrix applied element-wise, then mean pooling]

An improved sequence representation over BOW, obtained by considering element order

Sukhbaatar et al. "End-to-end memory networks." NIPS 2015.
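This pooling can be sketched directly from the E2EMN weighting l_kj = (1 − j/T) − (k/d)(1 − 2j/T), applied here to frames instead of words (the function name is ours):

```python
import numpy as np

def position_encoding_pool(frames):
    """E2EMN-style position encoding: weight each frame element-wise by
    a position-dependent matrix, then mean-pool, so the result depends
    on frame order (unlike plain mean pooling)."""
    T, d = frames.shape
    j = np.arange(1, T + 1)[:, None]   # time index 1..T
    k = np.arange(1, d + 1)[None, :]   # feature index 1..d
    L = (1 - j / T) - (k / d) * (1 - 2 * j / T)
    return (L * frames).mean(axis=0)
```

Reversing the frame order changes the output, which is exactly the property plain mean pooling lacks.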

Video Pooling Layer 4. Indirect Clustering • We implicitly cluster frames via self-attention mechanism

[Figure: self-attention weights over frames → weighted sum]
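A minimal numpy sketch; the slides do not specify the attention query, so using the mean frame as the query is our assumption:

```python
import numpy as np

def indirect_cluster_pool(frames):
    """Implicit clustering via self-attention: score each frame against
    the mean frame (assumed query), softmax the scores, and return the
    attention-weighted sum of frames. frames: (T, d)."""
    query = frames.mean(axis=0)
    scores = frames @ query                 # similarity of each frame to query
    scores -= scores.max()                  # numerical stability
    att = np.exp(scores) / np.exp(scores).sum()
    return att @ frames                     # weighted sum over time
```

Frames near the dominant content cluster get high weights, so the pooled vector leans toward the cluster rather than a uniform average.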

Video Pooling Layer 5. Adaptive Noise • To deal with label imbalance, inject more noise into the features of videos with rare labels and less noise into videos with common labels

[Figure: mean-pooled features; Gaussian noise is small for common labels (Car, Game, Football) and large for rare ones (DJ Hero 2, Slipper, Audi Q5)]
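A minimal numpy sketch; the exact noise schedule is not given in the slides, so the 1/frequency scaling below is an illustrative assumption:

```python
import numpy as np

def adaptive_noise(feature, label_counts, video_labels, sigma=0.1, rng=None):
    """Inject Gaussian noise scaled inversely with label frequency:
    videos whose labels are rare get stronger perturbations, acting as
    augmentation against overfitting the small classes."""
    if rng is None:
        rng = np.random.default_rng()
    # average inverse frequency of this video's labels (our assumption)
    rarity = np.mean([1.0 / label_counts[l] for l in video_labels])
    return feature + rng.normal(scale=sigma * rarity, size=feature.shape)
```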

Classification Layer • Given the pooled video feature, the classification layer h_θ : ℝ^d → ℝ^4,716 outputs a class score • We experiment with the following 3 methods

Classification Layer 1. Multi-layer Mixture of Experts • Simply expand the existing MoE model to multiple layers

[Figure: Multi-layer MoE: the pooled feature feeds gated softmax expert branches whose outputs are summed]
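A single MoE layer of this kind can be sketched in numpy (per-class softmax gating over sigmoid experts follows the standard YouTube-8M MoE baseline; the shapes and names are ours, and the multi-layer variant stacks this block):

```python
import numpy as np

def moe_scores(x, W_gate, W_expert):
    """Mixture-of-Experts head: for each class, a softmax over experts
    gates per-expert sigmoid scores. x: (d,);
    W_gate, W_expert: (n_classes, n_experts, d)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    gate_logits = W_gate @ x                          # (n_classes, n_experts)
    gate_logits -= gate_logits.max(axis=1, keepdims=True)   # stability
    gates = np.exp(gate_logits)
    gates /= gates.sum(axis=1, keepdims=True)         # softmax per class
    experts = sigmoid(W_expert @ x)                   # per-expert class scores
    return (gates * experts).sum(axis=1)              # (n_classes,) in (0, 1)
```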

Classification Layer 2. N-Layer MLP • A stack of fully connected layers • Empirically, three layers with layer normalization work best

[Figure: pooled feature → stacked FC layers → softmax]
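A numpy sketch of the three-layer variant (the FC → LayerNorm → ReLU ordering and the omission of the LN gain/bias parameters are our assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean / unit variance (gain, bias omitted)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def mlp_classify(x, weights, biases):
    """Stacked fully connected layers with layer normalization and ReLU
    between them; the final layer outputs raw class logits."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, layer_norm(W @ h + b))   # FC -> LN -> ReLU
    W, b = weights[-1], biases[-1]
    return W @ h + b                                 # logits for the softmax
```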

Classification Layer 3. Many-to-Many • Each frame vector is an input to the LSTM • The output is the average of the class scores over all time steps

[Figure: video + audio frame features → LSTM → per-step scores, averaged]



Label Processing Layer • The label processing layer C_θ updates the class scores using a prior on the correlation between labels • We experiment with the following method

Label Processing Layer 1. Encoding Label Correlation • Construct a correlation matrix by counting the labels that appear in the same videos

[Figure: co-occurring labels such as Car, Racing, Sports Car, Car Wash]

Label Processing Layer 1. Encoding Label Correlation • Update the score using the correlation matrix

O_o = α · O_i + β · M O_i + γ · M′ O_i

where O_i is the input score vector, M is the fixed co-occurrence matrix (used in the forward pass only) and M′ is a learned correlation matrix (updated by backprop)
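Both the counting step and the score update O_o = α·O_i + β·M O_i + γ·M′O_i can be sketched in numpy (the row normalization of M and the α, β, γ values are illustrative assumptions; M′ would be a trained matrix in the real model):

```python
import numpy as np

def build_correlation(label_sets, n_classes):
    """Count-based label correlation: M[i, j] += 1 whenever labels i and
    j co-occur in the same video; rows are then normalized."""
    M = np.zeros((n_classes, n_classes))
    for labels in label_sets:
        for i in labels:
            for j in labels:
                if i != j:
                    M[i, j] += 1
    row = M.sum(axis=1, keepdims=True)
    return np.divide(M, row, out=np.zeros_like(M), where=row > 0)

def refine_scores(o, M, M_learned, alpha=1.0, beta=0.5, gamma=0.5):
    """Score update O_o = alpha*O_i + beta*M O_i + gamma*M' O_i
    (the mixing weights here are placeholders)."""
    return alpha * o + beta * (M @ o) + gamma * (M_learned @ o)
```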


Loss Function 1. Center Loss • Penalize the spread of embeddings of videos that belong to the same label, pulling them toward a shared per-class center • Add the center loss term to the cross-entropy label loss at a predefined rate

Wen et al. "A discriminative feature learning approach for deep face recognition." ECCV 2016.
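The penalty can be sketched as follows (after Wen et al.; the per-batch averaging is our assumption, and in training the class centers are updated alongside the network weights):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center loss: mean squared distance between each video embedding
    and the center of its class. features: (N, d); labels: (N,) ints;
    centers: (n_classes, d). Added to the CE loss with a fixed weight."""
    diffs = features - centers[labels]
    return 0.5 * np.sum(diffs ** 2) / len(features)
```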

Loss Function 2. Huber Loss • A combination of L1 and L2 loss, more robust against noisy labels • We use the pseudo-Huber loss of the cross-entropy for a fully differentiable form

ℒ = δ² (√(1 + (ℒ_CE / δ)²) − 1)
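A direct transcription of the formula (δ is the threshold hyperparameter; applying it to the scalar cross-entropy loss rather than per class is our reading of the slide):

```python
import numpy as np

def pseudo_huber(ce_loss, delta=1.0):
    """Pseudo-Huber applied to the cross-entropy loss: quadratic for
    small losses, linear (slope ~delta) for large ones, smooth everywhere."""
    return delta ** 2 * (np.sqrt(1.0 + (ce_loss / delta) ** 2) - 1.0)
```

For small losses this behaves like ℒ_CE²/2 (L2), while very large losses are down-weighted to roughly δ·ℒ_CE (L1), which is what limits the influence of noisy labels.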


Results – Video Pooling Layer

• The LSTM family showed the best accuracy • The more of the content distribution information the LSTM state carries, the better the performance

Results – Classification Layer

• The multi-layer MLP showed the best performance • Layer normalization (LN) brought an improvement here, unlike for the LSTM in the video pooling layer

Results – Label Processing Layer

• In all combinations, the label processing layer had little impact on performance • This implies that a more sophisticated model is needed to exploit the correlation between labels

Results – Loss Function

• The Huber loss is helpful for handling noisy labels and the label imbalance problem

Conclusion Video Pooling Layer • Even for "video" classification, the content distribution information of the frame vectors had the greatest impact on performance • Future work 1. How to incorporate temporal information well? 2. A pooling method that captures both distribution and temporal information (e.g., RNN-FV)?

Lev et al. "RNN Fisher Vectors for Action Recognition and Image Annotation." ECCV 2016.

Conclusion Label Processing Layer • Correlation between labels was treated too naively in our work • Future work 1. A more sophisticated approach to exploiting it

Loss function • Since the label distribution is the same across the current train/val/test splits, the label imbalance issue may not need to be addressed for final accuracy
