Tera-scale deep learning
Quoc V. Le
Stanford University and Google

Joint work with:
Kai Chen, Greg Corrado, Rajat Monga, Andrew Ng

Additional thanks:
Jeff Dean, Matthieu Devin, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang

Samy Bengio, Zhenghao Chen, Tom Dean, Pangwei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou
Machine Learning successes
- Face recognition
- OCR
- Recommendation systems
- Autonomous cars
- Email classification
- Web page ranking
Feature Extraction → Classifier

Feature extraction (mostly hand-crafted features)

Hand-Crafted Features
Computer vision: SIFT/HOG, SURF, …
Speech recognition: MFCC, Spectrogram, ZCR, …
New feature-designing paradigm:
Unsupervised Feature Learning / Deep Learning, e.g., Reconstruction ICA
Expensive and typically applied to small problems
The Trend of Big Data
Outline
- Reconstruction ICA
- Applications to videos, cancer images
- Ideas for scaling up
- Scaling up: results
Topographic Independent Component Analysis (TICA)

1. Feature computation (two layers):
   p_i(x) = sqrt( (W_1^T x)^2 + … + (W_9^T x)^2 )
   Each pooled unit takes the square root of the summed squares of the filter responses in its pool.

2. Learning:
   minimize_W  sum over examples x and pools i of p_i(x),  subject to W W^T = I
   where x is the input data and W = [W_1; W_2; …; W_10000].
Invariance explained

Example: the same edge appears at two locations.
- Image 1 (edge at Loc1): F1 = 1, F2 = 0 → pooled feature sqrt(1^2 + 0^2) = 1
- Image 2 (edge at Loc2): F1 = 0, F2 = 1 → pooled feature sqrt(0^2 + 1^2) = 1

Same value regardless of the location of the edge.
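A minimal NumPy sketch of this pooled computation; the two response vectors and the sqrt-of-sum-of-squares pooling are exactly the example above.

```python
import numpy as np

# First-layer responses of pooled filters F1, F2 to the same edge
# appearing at two locations, as in the example above.
f_image1 = np.array([1.0, 0.0])   # edge at Loc1: F1 fires
f_image2 = np.array([0.0, 1.0])   # edge at Loc2: F2 fires

def pooled(f):
    """Second layer: square root of the sum of squared responses."""
    return np.sqrt(np.sum(f ** 2))

print(pooled(f_image1), pooled(f_image2))   # 1.0 1.0 -- identical either way
```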
TICA:
   minimize_W  sum_i sum_k sqrt( H_k (W x^(i))^2 + eps ),  subject to W W^T = I

Reconstruction ICA: replace the hard orthonormality constraint with a soft reconstruction cost:
   minimize_W  (lambda/m) sum_i || W^T W x^(i) - x^(i) ||^2  +  sum_i sum_k sqrt( H_k (W x^(i))^2 + eps )

With data whitening, the reconstruction cost plays the role of the orthonormality constraint, so the two problems match.

Equivalence between Sparse Coding, Autoencoders, RBMs and ICA. Build deep architectures by treating the output of one layer as input to another layer.

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011
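To make the objective concrete, here is a minimal NumPy sketch of the RICA cost in its non-topographic form; the weighting lambda, the smoothing eps, and the assumption of whitened input are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def rica_cost(W, X, lam=0.1, eps=1e-6):
    """RICA objective: reconstruction cost + smooth L1 sparsity.

    W: (k, n) filter matrix; X: (n, m) whitened data, one example per column.
    The soft reconstruction term replaces TICA's hard constraint W W^T = I.
    """
    m = X.shape[1]
    Z = W @ X                                      # first-layer filter responses
    recon = W.T @ Z - X                            # W^T W x - x for every example
    reconstruction = lam * np.sum(recon ** 2) / m
    sparsity = np.sum(np.sqrt(Z ** 2 + eps)) / m   # smooth L1 penalty
    return reconstruction + sparsity
```

Because the constraint is now a differentiable penalty, W can be overcomplete and trained with off-the-shelf gradient methods, and layers can be stacked by feeding one layer's features to the next.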
Why RICA?

[Table: algorithms compared on speed, ease of training, and invariant features. Sparse Coding, RBMs/Autoencoders, and TICA each fall short on at least one criterion; Reconstruction ICA scores well on all three.]

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011
Summary of RICA
- Two-layered network
- Reconstruction cost instead of orthogonality constraints
- Learns invariant features
Applications of RICA

Action recognition (example classes): Sit up, Eat, Run, Drive car, Answer phone, Stand up, Get out of car, Kiss, Shake hands

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011
[Bar charts: classification accuracy on four action-recognition benchmarks. Learned features outperform hand-engineered features (Hessian/SURF, pLSA, HOF, HOG, HOG/HOF, HOG3D, GRBMs, 3DCNN, HMAX, and combined engineered features) on KTH (~94%), Hollywood2 (~53%), UCF (~86%), and YouTube (~76%).]

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011
Cancer classification

[Bar chart: classification accuracy (axis 84%-92%) on tumor signatures (apoptotic, viable tumor region, necrosis). RICA features outperform hand-engineered features.]

Le, et al., Learning Invariant Features of Tumor Signatures. ISBI 2012
Scaling up deep RICA networks
Scaling up Deep Learning
[Plot: deep learning performance on real data vs. number of learned features.]

It's better to have more features! No matter the algorithm, more features are always more successful.
Coates, et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning. AISTATS 2011

Most of the learned features are local.
Local receptive field networks

[Diagram: RICA features computed over local patches of the image, with the filters partitioned across Machines #1-#4.]

Le, et al., Tiled Convolutional Neural Networks. NIPS 2010
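A hedged sketch of the partitioning idea, using NumPy and illustrative sizes (18x18 receptive fields, four machines, 32 filters each): each machine applies its own filters only to patches from its quadrant, so no machine needs the full parameter set.

```python
import numpy as np

IMAGE = np.random.rand(200, 200)    # stand-in input image
PATCH = 18                          # receptive-field size (illustrative)

def machine_block(image, row_slice, col_slice, filters):
    """One machine: extract its local patches, apply its own filter bank."""
    block = image[row_slice, col_slice]
    patches = [block[i:i+PATCH, j:j+PATCH].ravel()
               for i in range(0, block.shape[0] - PATCH + 1, PATCH)
               for j in range(0, block.shape[1] - PATCH + 1, PATCH)]
    return np.asarray(patches) @ filters.T      # local feature responses

# Four machines, each owning one quadrant of the image and its own filters.
quadrants = [(slice(0, 100), slice(0, 100)),  (slice(0, 100), slice(100, 200)),
             (slice(100, 200), slice(0, 100)), (slice(100, 200), slice(100, 200))]
outputs = [machine_block(IMAGE, r, c, np.random.rand(32, PATCH * PATCH))
           for r, c in quadrants]
```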
Challenges with 1000s of machines

Asynchronous Parallel SGDs with a parameter server

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
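A toy sketch of the scheme, with threads standing in for machines and a linear model standing in for the network; the point is the protocol (pull parameters, compute a gradient, push an update with no barrier), not the model.

```python
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()            # workers read possibly stale weights

    def push(self, grad, lr=0.01):
        with self.lock:
            self.w -= lr * grad             # apply each update as it arrives

def worker(server, data_shard):
    for x, y in data_shard:
        w = server.pull()                   # fetch current parameters
        grad = 2 * (w @ x - y) * x          # squared-error gradient (toy model)
        server.push(grad)                   # send update; no synchronization barrier

rng = np.random.default_rng(0)
server = ParameterServer(dim=10)
shards = [[(rng.standard_normal(10), 1.0) for _ in range(100)] for _ in range(4)]
threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
```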
Summary of scaling up
- Local connectivity
- Asynchronous SGDs

… And more
- RPC vs. MapReduce
- Prefetching
- Single vs. double precision
- Removing slow machines
- Optimized softmax
- …
Training RICA at scale

Dataset: 10 million 200x200 unlabeled images from YouTube/the web
Trained on 2,000 machines (16,000 cores) for 1 week
1.15 billion parameters
- 100x larger than previously reported
- Small compared to the visual cortex

[Diagram: deep RICA network stacked on the raw image input.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
The face neuron

Top stimuli from the test set; optimal stimulus by numerical optimization.

[Histogram: frequency vs. feature value for faces and random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
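The "optimal stimulus by numerical optimization" can be sketched as projected gradient ascent on the input: maximize the neuron's activation while keeping the input on the unit sphere. Here `neuron_grad` is a stand-in for the gradient of the trained unit's activation with respect to its input.

```python
import numpy as np

def optimal_stimulus(neuron_grad, dim, steps=500, lr=0.1):
    """Find an input x maximizing a unit's activation subject to ||x|| = 1."""
    x = np.random.randn(dim)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        x += lr * neuron_grad(x)        # ascend the activation
        x /= np.linalg.norm(x)          # project back onto the unit sphere
    return x

# Toy check: for a purely linear "neuron" f(x) = w^T x, the gradient is w,
# and the optimal stimulus recovered is w / ||w||.
w = np.random.randn(64)
x_star = optimal_stimulus(lambda x: w, dim=64)
```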
Invariance properties

[Plots: the face neuron's feature response under horizontal shifts (0-20 pixels), vertical shifts (0-20 pixels), 3D rotation angle (0° to 90°), and scale factor (0.4x to 1.6x).]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
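Curves like these can be produced by sweeping a transformation parameter and recording the unit's response; a sketch for the horizontal-shift case, with `network_response` a stand-in for the trained model:

```python
import numpy as np

def shift_image(img, dx, dy):
    """Translate by (dx, dy) pixels, filling vacated pixels with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def response_curve(img, network_response, max_shift=20):
    """Unit response at each horizontal offset from 0 to max_shift pixels."""
    return [network_response(shift_image(img, dx, 0))
            for dx in range(max_shift + 1)]
```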
The pedestrian neuron: top stimuli from the test set; optimal stimulus by numerical optimization.

[Histogram: frequency vs. feature value for pedestrians and random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

The cat face neuron: top stimuli from the test set; optimal stimulus by numerical optimization.

[Histogram: frequency vs. feature value for cat faces and random distractors.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet classification: 22,000 categories, 14,000,000 images
Baselines: hand-engineered features (SIFT, HOG, LBP), spatial pyramids, sparse coding/compression

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

22,000 is a lot of categories…
… smoothhound, smoothhound shark, Mustelus mustelus; American smooth dogfish, Mustelus canis; Florida smoothhound, Mustelus norrisi; whitetip shark, reef whitetip shark, Triaenodon obesus; Atlantic spiny dogfish, Squalus acanthias; Pacific spiny dogfish, Squalus suckleyi; hammerhead, hammerhead shark; smooth hammerhead, Sphyrna zygaena; smalleye hammerhead, Sphyrna tudes; shovelhead, bonnethead, bonnet shark, Sphyrna tiburo; angel shark, angelfish, Squatina squatina, monkfish; electric ray, crampfish, numbfish, torpedo; smalltooth sawfish, Pristis pectinatus; guitarfish; roughtail stingray, Dasyatis centroura; butterfly ray; eagle ray; spotted eagle ray, spotted ray, Aetobatus narinari; cownose ray, cow-nosed ray, Rhinoptera bonasus; manta, manta ray, devilfish; Atlantic manta, Manta birostris; devil ray, Mobula hypostoma; grey skate, gray skate, Raja batis; little skate, Raja erinacea; …
Stingray
Manta ray
Best stimuli

[Image grids: the top activating test-set images for Features 1-13.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet (22,000 categories) classification accuracy:
- Random guess: 0.005%
- State-of-the-art (Weston, Bengio '11): 9.5%
- Feature learning from raw pixels: 15.8%

ImageNet 2009 (10k categories): best published result 17% (Sanchez & Perronnin '11); our method: 20%. Using only 1,000 categories, our method achieves > 50%.

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
Other results
- We also have great features for:
  - Speech recognition
  - Word-vector embeddings for NLP
Conclusions
• RICA learns invariant features
• A face neuron emerges from totally unlabeled data, given enough training and data
• State-of-the-art performance on:
  – Action recognition
  – Cancer image classification
  – ImageNet
[Recap thumbnails: ImageNet (random guess 0.005%, best published result 9.5%, our method 15.8%), cancer classification, feature visualization, action recognition benchmarks, face neuron.]
References
• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.
• Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.
• Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.
• Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.
• Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.
• Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features of Tumor Signatures. ISBI, 2012.
• I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.
http://ai.stanford.edu/~quocle