Statistical Parametric Speech Synthesis: From HMM to LSTM-RNN
Heiga Zen, Google
July 9th, 2015
Outline
Basics of HMM-based speech synthesis
  - Background
  - HMM-based speech synthesis
Advanced topics in HMM-based speech synthesis
  - Flexibility
  - Improve naturalness
Neural network-based speech synthesis
  - Feed-forward neural network (DNN & DMDN)
  - Recurrent neural network (RNN & LSTM-RNN)
  - Results
Lecturer
• Heiga Zen
• PhD from Nagoya Institute of Technology, Japan (2006)
• Intern, IBM T.J. Watson Research, New York (2004–2005)
• Research engineer, Toshiba Research Europe, Cambridge (2009–2011)
• Research scientist, Google, London (2011–present)
Text-to-speech as sequence-to-sequence mapping
Automatic speech recognition (ASR):
  Speech (real-valued time series) → Text (discrete symbol sequence)
Statistical machine translation (SMT):
  Text (discrete symbol sequence) → Text (discrete symbol sequence)
Text-to-speech synthesis (TTS):
  Text (discrete symbol sequence) → Speech (real-valued time series)
Speech production process
[Figure: speech production as modulation of a carrier wave (air flow) by speech information: the fundamental frequency and voiced/unvoiced decision control the sound source (voiced: pulse, unvoiced: noise), and frequency transfer characteristics shape the spectrum, taking text (concept) to speech.]
Typical flow of TTS system
TEXT
  → Text analysis (NLP frontend, discrete → discrete):
      sentence segmentation, word segmentation, text normalization,
      part-of-speech tagging, pronunciation
  → Speech synthesis (backend, discrete → continuous):
      prosody prediction, waveform generation
  → SYNTHESIZED SPEECH

This presentation mainly talks about the backend.
Concatenative, unit selection speech synthesis

[Figure: candidate segments from the database ("all segments") are selected by minimizing target and concatenation costs.]

• Concatenate actual instances of speech from a database
• Large data + automatic learning → high-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]
Statistical parametric speech synthesis (SPSS) [3]

[Figure: training part — speech → speech analysis → y; text → text analysis → x; model training → λ̂. Synthesis part — text analysis → x; parameter generation → ŷ; speech synthesis → speech.]

Training
• Extract linguistic features x & acoustic features y
• Train acoustic model λ given (x, y):

    λ̂ = arg max_λ p(y | x, λ)

Synthesis
• Extract x from text to be synthesized
• Generate the most probable y from λ̂, then reconstruct the waveform:

    ŷ = arg max_y p(y | x, λ̂)
Statistical parametric speech synthesis (SPSS) [3]

• Vocoded speech (buzzy or muffled)
• Small footprint

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]
HMM-based speech synthesis [4]

[Figure: training part — excitation and spectral parameters are extracted from the SPEECH DATABASE and, together with labels, used to train context-dependent HMMs & state duration models. Synthesis part — TEXT is analyzed into labels; excitation and spectral parameters are generated from the HMMs and passed to excitation generation and a synthesis filter to produce SYNTHESIZED SPEECH.]
Source-filter model

Source excitation part: pulse train e(n) (voiced) or white noise (unvoiced)
Vocal tract resonance part: linear time-invariant system h(n)

    x(n) = h(n) ∗ e(n)
    ↓ Fourier transform
    X(e^jω) = H(e^jω) E(e^jω)

H(e^jω) is defined by the HMM state-output vectors,
e.g., mel-cepstrum, line spectral pairs.
Parametric models of speech signal

Autoregressive (AR) model:

    H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:

    H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

    ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
Examples of speech spectra

[Figure: log magnitude (dB) vs. frequency (kHz) for (a) ML-based cepstral analysis and (b) linear prediction.]
Structure of state-output (observation) vectors

o_t consists of two parts:
• Spectrum part: mel-cepstral coefficients c_t, Δc_t, Δ²c_t
• Excitation part: log F0 p_t, δp_t, δ²p_t
Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1, o_2, ..., o_T; state sequence Q = 1, 1, 1, 1, 2, ..., 2, 3, ..., 3.]
Multi-stream HMM structure

    b_j(o_t) = Π_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}

Streams:
• o_t^1 = [c_t, Δc_t, Δ²c_t] (spectrum), scored by b_j^1(o_t^1)
• o_t^2 = p_t, o_t^3 = δp_t, o_t^4 = δ²p_t (excitation), scored by b_j^2, b_j^3, b_j^4
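In log domain the stream-weighted product above becomes a weighted sum of per-stream log-likelihoods. A minimal sketch with diagonal Gaussian stream densities (the feature values, means, and variances below are invented for illustration):

```python
import numpy as np

def log_gauss(x, mean, var):
    # Diagonal-covariance Gaussian log-density.
    x, mean, var = map(np.atleast_1d, (x, mean, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def multi_stream_log_prob(streams, weights):
    # log b_j(o_t) = sum_s w_s * log b_j^s(o_t^s)
    return sum(w * log_gauss(x, m, v) for (x, m, v), w in zip(streams, weights))

# Hypothetical example: a 3-dim spectrum stream and a 1-dim log-F0 stream.
streams = [
    (np.array([0.1, -0.2, 0.05]), np.zeros(3), np.ones(3)),
    (np.array([4.8]), np.array([5.0]), np.array([0.1])),
]
lp = multi_stream_log_prob(streams, weights=[1.0, 1.0])
```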
Training process

data & labels
→ Compute variance floor (HCompV)
→ Initialize CI-HMMs by segmental k-means (HInit)
→ Reestimate CI-HMMs by EM algorithm (HRest & HERest)
→ Copy CI-HMMs to CD-HMMs (HHEd CL)
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Untie parameter tying structure (HHEd UT)
→ Estimate CD-dur. models from FB stats (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Estimated HMMs & estimated dur. models

(CI: context-independent, monophone; CD: context-dependent, fullcontext)
Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes in {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase
• ...

Impossible to have models for all possible context combinations
Decision tree-based state clustering [7]

[Figure: context-dependent states (e.g., k-a+b, t-a+n, w-a+t, w-a+sil, w-a+sh, g-a+sil, gy-a+sil, gy-a+pau) are clustered by traversing yes/no context questions such as L=voice?, R=silence?, L="w"?, L="gy"?; each leaf node holds the shared (synthesized) state distribution.]
Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0.]

Spectrum & excitation can have different context dependency
→ Build decision trees individually
State duration models [8]

[Figure: occupancy of state i from frame t_0 through t_1 within an utterance of T = 8 frames.]

Probability to enter state i at t_0 then leave at t_1 + 1:

    χ_{t_0,t_1}(i) ∝ Σ_{j≠i} α_{t_0−1}(j) a_{ji} a_{ii}^{t_1−t_0} Π_{t=t_0}^{t_1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t_1+1}) β_{t_1+1}(k)

→ estimate state duration models from these occupancy statistics
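Why model durations explicitly at all? Under a plain HMM the implicit state-duration distribution is geometric, P(d) = a_ii^(d−1)(1 − a_ii), which always peaks at d = 1. A small numerical sketch (the self-transition probability and Gaussian parameters are illustrative):

```python
import numpy as np

a_ii = 0.9              # HMM self-transition probability (illustrative)
d = np.arange(1, 41)    # durations of 1..40 frames

# Implicit HMM duration distribution: geometric, monotonically decreasing.
p_geom = a_ii ** (d - 1) * (1 - a_ii)

# Explicit state-duration model (as in an HSMM): e.g. a Gaussian over d,
# which can place its mode away from d = 1.
mu, sigma = 10.0, 3.0
p_gauss = np.exp(-0.5 * ((d - mu) / sigma) ** 2)
p_gauss /= p_gauss.sum()
```

The geometric distribution is maximal at d = 1 frame, while the explicit model peaks near its mean, which matches real phone durations far better.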
Stream-dependent tree-based clustering

[Figure: an HMM with separate decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models.]
Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

    ô = arg max_o p(o | w, λ̂)
      = arg max_o Σ_{∀q} p(o, q | w, λ̂)
      ≈ arg max_o max_q p(o, q | w, λ̂)
      = arg max_o max_q p(o | q, λ̂) P(q | w, λ̂)

Determine the best state sequence and outputs sequentially:

    q̂ = arg max_q P(q | w, λ̂)
    ô = arg max_o p(o | q̂, λ̂)
Best state sequence

[Figure: the left-to-right HMM as before, with observation sequence O = o_1, ..., o_T, best state sequence Q = 1, 1, 1, 1, 2, ..., 2, 3, ..., 3, and state durations D = 4, 10, 5.]
Best state outputs w/o dynamic features

[Figure: per-state means and variances over time.]

ô becomes a step-wise mean vector sequence.
Using dynamic features

State output vectors include static & dynamic features:

    o_t = [c_t^⊤, Δc_t^⊤]^⊤,  Δc_t = c_t − c_{t−1}

The relationship between the static feature sequence c and the observation sequence o can be arranged as a linear mapping

    o = W c

where W is a sparse band matrix: each static row copies c_t (an I block) and each delta row computes c_t − c_{t−1} (paired I and −I blocks).
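The band matrix W can be built mechanically. A minimal sketch for 1-dim static features with the Δc_t = c_t − c_{t−1} window above (Δc_1 is computed against an assumed c_0 = 0 boundary):

```python
import numpy as np

def delta_window_matrix(T):
    """Build W mapping static features c = [c_1..c_T] to
    o = [c_1, dc_1, ..., c_T, dc_T] with dc_t = c_t - c_{t-1}."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row: c_t
        W[2 * t + 1, t] = 1.0        # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([0.0, 1.0, 3.0, 6.0])
o = delta_window_matrix(4) @ c       # interleaved statics and deltas
```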
Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

    ô = arg max_o p(o | q̂, λ̂)  subject to  o = W c

If the state-output distribution is a single Gaussian:

    p(o | q̂, λ̂) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(W c; μ_q̂, Σ_q̂)/∂c = 0:

    W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂
Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂: W stacks identity rows and (−1, 1) difference rows, Σ_q̂^{−1} is block-diagonal, so the system matrix is banded and can be solved efficiently.]
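A minimal numpy sketch of solving the generation equation above (1-dim features, static + delta rows as earlier; the state means and variances are invented toy values, and a dense solver stands in for the banded solver used in practice):

```python
import numpy as np

def mlpg(means, variances, W):
    """ML parameter generation: solve W' S^-1 W c = W' S^-1 mu
    for the static trajectory c (diagonal covariance)."""
    prec = 1.0 / variances                 # diagonal of Sigma^{-1}
    A = W.T @ (prec[:, None] * W)          # W' Sigma^{-1} W
    b = W.T @ (prec * means)               # W' Sigma^{-1} mu
    return np.linalg.solve(A, b)

# Toy example: T = 4 frames, interleaved [static, delta] rows.
T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.array([0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 1.0, 0.0])  # step-wise state means
var = np.ones(2 * T)
c = mlpg(mu, var, W)
```

Because the delta means couple neighboring frames, `c` comes out as a smooth trajectory rather than the step-wise static means (0, 0, 1, 1).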
Generated speech parameter trajectory

[Figure: static and dynamic means and variances per state, and the smooth generated trajectory c that satisfies the dynamic feature constraints.]
Waveform reconstruction

Generated excitation parameters (log F0 with V/UV) drive the pulse train e(n) / white noise excitation; generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech is x(n) = h(n) ∗ e(n).
Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
Any questions?
Advantages

• Flexibility to change voice characteristics
• Small footprint
• More data
Adaptation (mimicking voice) [10]

[Figure: an average-voice model is trained from multiple training speakers and adapted to target speakers.]

• Train average voice model (AVM) from training speakers using speaker-adaptive training (SAT)
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices
Adaptation demo

• Speaker adaptation
  - VIP voice: GWB
  - Child voice: BHO
• Style adaptation (in Japanese)
  - Joyful
  - Sad
  - Rough

From http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html
Interpolation (mixing voice) [11, 12, 13, 14]

[Figure: an HMM set λ_0 interpolated with surrounding HMM sets λ_1, ..., λ_4 with interpolation ratios I(λ_0, λ_k).]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice / CAT / multiple regression
  → estimate representative HMM sets from data
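One simple interpolation scheme combines the state-output Gaussian parameters of K HMM sets with ratios that sum to one. This is a sketch of that idea only; several different interpolation definitions exist in the cited literature, and the model statistics below are invented:

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    """Linearly interpolate K state-output Gaussians with ratios a_k
    (one of several schemes used for HMM-set interpolation)."""
    a = np.asarray(ratios)[:, None]
    assert np.isclose(a.sum(), 1.0), "interpolation ratios must sum to 1"
    mu = (a * np.asarray(means)).sum(axis=0)
    var = (a * np.asarray(variances)).sum(axis=0)
    return mu, var

# Hypothetical pair of voices: a 50/50 mix yields intermediate statistics.
mu, var = interpolate_gaussians(
    means=[np.array([4.5, 0.2]), np.array([5.3, 0.4])],
    variances=[np.array([0.04, 0.01]), np.array([0.09, 0.02])],
    ratios=[0.5, 0.5],
)
```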
Interpolation demo (1)

• Speaker interpolation (in Japanese): male ↔ female
• Style interpolation: neutral → angry, neutral → happy

From http://www.sp.nitech.ac.jp/ & http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html
Interpolation demo (2)

Speaker characteristics modification

[Figure: four settings of weights (+30 to −30) for the 1st–5th eigenvectors.]

From http://www.sp.nitech.ac.jp/~demo/synthesis_demo_2001/
Interpolation demo (3)

Style control: rough, sad, joyful

From http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html
Drawbacks

• Quality: buzzy, muffled synthetic speech
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)
Vocoding issues

• Simple pulse/noise excitation
  Difficult to model mixes of voiced & unvoiced sounds (e.g., voiced fricatives)
• Spectral envelope extraction
  Harmonic structure often causes problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple.]
• Phase
  Important but usually ignored
Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation
MELP-style mixed excitation [15]

[Figure: pulse excitation and noise excitation spectra (log magnitude vs. frequency, 0–4 kHz) are bandpass-filtered and mixed to form the mixed excitation spectrum.]
MELP-style mixed excitation [15]

[Figure: waveform (upper) and mixed excitation (lower) for the phoneme sequence z-u-s-U-k-o-sh-I-ts-u.]
STRAIGHT [16]

[Figure: analysis — F0 extraction by fixed-point analysis and F0-adaptive spectral smoothing in the time-frequency region yield F0, a smoothed spectrum, and aperiodic factors; synthesis — mixed excitation with phase manipulation reconstructs the waveform.]
STRAIGHT [16]

[Figure: power spectra (dB, 0–8 kHz) comparing the FFT power spectrum, FFT + mel-cepstral analysis, and STRAIGHT + mel-cepstral analysis.]
Trainable excitation model [17]

[Figure: a sentence HMM emits mel-cepstral coefficients c_t and log F0 values p_t; a pulse train generator t(n) is filtered by Hv(z) into voiced excitation v(n), white noise w(n) is filtered by Hu(z) into unvoiced excitation u(n); their sum e(n) (mixed excitation) is passed through H(z) to give synthesized speech.]
Trainable excitation model [17]

[Figure: waveforms (upper) and excitation/residual signals (lower) for natural speech, pulse/noise, STRAIGHT, and ML excitation.]
Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech
Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM, autoregressive HMM (ARHMM)
  − Polynomial segment model, hidden trajectory model (HTM)
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model, ARHMM, linear dynamical model (LDM)
  − HTM, Gaussian process (GP)
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model
Trajectory HMM [18]

• Derived from the HMM by imposing dynamic feature constraints
• Underlying generative model in HMM-based speech synthesis:

    p(c | λ) = Σ_{∀q} p(c | q, λ) P(q | λ)
    p(c | q, λ) = N(c; c̄_q, P_q)

where

    P_q^{−1} = R_q = W^⊤ Σ_q^{−1} W
    r_q = W^⊤ Σ_q^{−1} μ_q
    c̄_q = P_q r_q
Trajectory HMM [18]

[Figure: mean trajectory c̄_q and temporal covariance matrix P_q over ~55 frames of the phoneme sequence sil-a-i-d-a-sil.]
Relation to HMM-based speech synthesis

• Mean vector of the trajectory HMM:

    W^⊤ Σ_q^{−1} W c̄_q = W^⊤ Σ_q^{−1} μ_q

• Speech parameter trajectory used in HMM-based speech synthesis:

    W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q

ML estimation of the trajectory HMM
→ makes training & synthesis consistent
Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: spectrograms (0–8 kHz) of generated vs. natural speech.]
• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough
Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)
Global variance [19]

[Figure: trajectories of the 2nd mel-cepstral coefficient over ~3 seconds, natural vs. generated, with the intra-utterance variance v(m) marked.]

GVs of synthesized speech are typically narrower than those of natural speech
Speech parameter generation with GV [19]

• Speech parameter generation:

    ĉ = arg max_c log N(W c; μ_q, Σ_q)

• Speech parameter generation w/ GV:

    ĉ = arg max_c [ log N(W c; μ_q, Σ_q) + ω log N(v(c); μ_v, Σ_v) ]

The 2nd term works as a penalty for oversmoothing
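The effect of the GV term can be checked numerically. A sketch with W = I (static features only, purely to expose the shape of the objective; all statistics below are invented):

```python
import numpy as np

def gv_objective(c, mu, var, mu_v, var_v, omega=1.0):
    """log N(c; mu, var) + omega * log N(v(c); mu_v, var_v)
    for a 1-dim static trajectory c (W = I in this sketch)."""
    ll_hmm = -0.5 * np.sum(np.log(2 * np.pi * var) + (c - mu) ** 2 / var)
    v = np.var(c)                       # global (intra-utterance) variance v(c)
    ll_gv = -0.5 * (np.log(2 * np.pi * var_v) + (v - mu_v) ** 2 / var_v)
    return ll_hmm + omega * ll_gv

mu, var = np.zeros(4), np.ones(4)       # state statistics (illustrative)
mu_v, var_v = 1.0, 0.1                  # GV statistics (illustrative)

flat = gv_objective(np.zeros(4), mu, var, mu_v, var_v)          # v(c) = 0
lively = gv_objective(np.array([1.0, -1.0, 1.0, -1.0]),         # v(c) = mu_v
                      mu, var, mu_v, var_v)
```

The flat (oversmoothed) trajectory maximizes the HMM term alone, but the GV term penalizes its zero variance, so the livelier trajectory scores higher overall.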
Effect of GV

[Figure: spectrograms (0–8 kHz) of speech generated by the standard algorithm, generated w/ GV, and natural speech.]
Any questions?
Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    ◦ Adaptation
    ◦ Interpolation / eigenvoice / CAT / multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)
Linguistic → acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential
HMM-based acoustic modeling for SPSS [4]

[Figure: the acoustic space is partitioned by yes/no context questions in a decision tree; each leaf holds an HMM with GMM state-output distributions.]
NN-based acoustic modeling for SPSS [20]

[Figure: linguistic features x are mapped through hidden layers h_1, h_2, h_3 to acoustic features y.]

NN output → E[y_t | x_t] → replaces decision trees & GMMs
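The mapping above can be sketched as a plain feed-forward pass with sigmoid hidden units and a linear output layer. All sizes and weights below are illustrative stand-ins, not the experimental configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Small random weights, zero biases (toy initialization).
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """Sigmoid hidden layers + linear output layer mapping a linguistic
    feature vector x to acoustic-feature statistics y."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden units
    W, b = layers[-1]
    return h @ W + b                              # linear output

# Toy sizes: 300 linguistic inputs -> 3 hidden layers -> 127 acoustic outputs.
sizes = [300, 512, 512, 512, 127]
layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
y = forward(rng.normal(size=300), layers)
```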
Advantages of NN-based acoustic modeling for SPSS

• Integrating feature extraction
  − Efficiently model high-dimensional, highly correlated features
  − Layered architecture w/ non-linear operations
  → Integrates linguistic feature extraction into acoustic modeling
• Distributed representation
  More efficient than a localist one if data has componential structure
  → Better modeling / fewer parameters
• Layered hierarchical structure in speech production
  concept → linguistic → articulatory → vocal tract → waveform
Framework

[Figure: TEXT → text analysis → input feature extraction produces, for each frame 1..T, input features (binary & numeric, plus duration and frame-position features); these pass through the input layer, hidden layers, and output layer to statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); parameter generation and waveform synthesis then produce SPEECH.]
Framework

Is this new? ... no
• NN [21]
• RNN [22]

What's the difference?
• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques
Experimental setup

Database:            US English female speaker
Training / test data: 33000 & 173 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width / 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology:        5-state, left-to-right HSMM [23], MSD F0 [24], MDL [25]
DNN architecture:    1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [26]
Postprocessing:      Postfiltering in cepstrum domain [15]
Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5th mel-cepstrum over ~500 frames for natural speech, HMM (α=1), and DNN (4×512).]
Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):

HMM (α)    | DNN (#layers × #units) | Neutral | p value | z value
15.8 (16)  | 38.5 (4 × 256)         | 45.7    | < 10⁻⁶  | −9.9
16.1 (4)   | 27.2 (4 × 512)         | 56.8    | < 10⁻⁶  | −5.1
12.7 (1)   | 36.6 (4 × 1024)        | 50.7    | < 10⁻⁶  | −11.5
Limitations of DNN-based acoustic modeling

[Figure: scatter plot of data samples in the (y1, y2) plane with two modes; the NN prediction falls between them.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [27]
Mixture density network [27]

[Figure: 1-dim, 2-mix MDN — the network outputs w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x), which parameterize a 2-component Gaussian mixture over y.]

Inputs of the activation function:

z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:

µ1(x) = z3,  µ2(x) = z4

Variances → exponential activation function:

σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means, & variances
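As an illustration (not part of the original slides), the output-layer mapping above can be sketched in NumPy for the 1-dim, 2-mix case; the function names `mdn_params` and `mdn_density` are hypothetical:

```python
import numpy as np

def mdn_params(z):
    """Map the 6 pre-activations z of a 1-dim, 2-mix MDN output layer to
    mixture weights (softmax), means (linear), and std-devs (exponential)."""
    logits, mu, log_sigma = z[:2], z[2:4], z[4:6]
    w = np.exp(logits - logits.max())
    w = w / w.sum()               # softmax: weights sum to 1
    sigma = np.exp(log_sigma)     # exponential activation keeps sigma > 0
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """p(y | x) as a 2-component Gaussian mixture."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.dot(w, comp))
```

The exponential on the variance outputs is what guarantees positivity without constrained optimization.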
DMDN-based SPSS [28]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1 . . . xT fed to the DMDN, which outputs per-frame GMM parameters w_m(x_t), µ_m(x_t), σ_m(x_t) → parameter generation → waveform synthesis → SPEECH.]
Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU [29] (hidden) / linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [29] (hidden) / mixture density (output), 1–16 mix
Optimization: AdaDec [30] (variant of AdaGrad [31]) on GPU
Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Model                    5-scale MOS
HMM, 1 mix               3.537 ± 0.113
HMM, 2 mix               3.397 ± 0.115
DNN, 4×1024              3.635 ± 0.127
DNN, 5×1024              3.681 ± 0.109
DNN, 6×1024              3.652 ± 0.108
DNN, 7×1024              3.637 ± 0.129
DMDN (4×1024), 1 mix     3.654 ± 0.117
DMDN (4×1024), 2 mix     3.796 ± 0.107
DMDN (4×1024), 4 mix     3.766 ± 0.113
DMDN (4×1024), 8 mix     3.805 ± 0.113
DMDN (4×1024), 16 mix    3.791 ± 0.102
Limitations of DNN/MDN-based acoustic modeling

Fixed time span for input features
• Fixed number of preceding / succeeding contexts
• Difficult to incorporate long-time-span contextual effects

Frame-by-frame mapping
• Each frame is mapped independently
• Smoothing is still essential

Preference scores (%):
DNN w/ dyn   DNN w/o dyn   No pref
67.8         12.0          20.0

Recurrent connections → recurrent NN (RNN) [32]
Simple Recurrent Network (SRN)

[Figure: unrolled RNN — inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}.]

SRN-based acoustic modeling:

h_t = f(W_hx x_t + W_hh h_{t−1} + b_h),  y_t = φ(W_yh h_t + b_y)

With squared loss. . .
• DNN output (prediction) ŷ_t → E[y_t | x_t]
• RNN output (prediction) ŷ_t → E[y_t | x_1, . . . , x_t]

• Only able to use previous contexts → bidirectional RNN [32]
• Trouble accessing long-range contexts
  − Information in the hidden layers loops through the recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [34]
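The SRN recursion above can be sketched in a few lines of NumPy (illustrative only; `srn_forward` is a hypothetical name, with f = tanh and a linear output φ):

```python
import numpy as np

def srn_forward(X, Whx, Whh, bh, Wyh, by):
    """Run a simple recurrent network over an input sequence X (T x D_in):
    h_t = tanh(Whx x_t + Whh h_{t-1} + b_h),  y_t = Wyh h_t + b_y."""
    h = np.zeros(Whh.shape[0])
    Y = []
    for x_t in X:
        h = np.tanh(Whx @ x_t + Whh @ h + bh)  # recurrent state update
        Y.append(Wyh @ h + by)                 # linear output layer
    return np.stack(Y)
```

Because h_t depends on h_{t−1}, the prediction at frame t is conditioned on the whole input history x_1 . . . x_t, unlike the frame-by-frame DNN.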
Long short-term memory (LSTM) [34]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — x_t and h_{t−1} feed a tanh input transform and three sigmoid gates with biases b_i, b_f, b_o, b_c: the input gate i_t (write), the forget gate (reset), and the output gate (read) control the linear memory cell c_t, whose tanh-squashed, gated output is h_t.]
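A minimal sketch of one LSTM step matching the block diagram above (not from the slides; `lstm_step` is a hypothetical name, and the four gate weight blocks are stacked into one matrix W for brevity):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: sigmoid gates guard a linear memory cell.
    W maps [x; h_prev] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.shape[0]
    i = sigm(z[:H])            # input gate: write
    f = sigm(z[H:2 * H])       # forget gate: reset
    o = sigm(z[2 * H:3 * H])   # output gate: read
    g = np.tanh(z[3 * H:])     # candidate cell input
    c = f * c_prev + i * g     # linear memory cell update
    h = o * np.tanh(c)
    return h, c
```

The additive cell update c = f·c_prev + i·g is what lets information survive many time steps instead of decaying through repeated squashing, addressing the SRN's long-range-context problem.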
Advantages of RNN-based acoustic modeling for SPSS

• Model dependency between frames
  − HMM: discontinuous (step-wise) → smoothing
  − DNN: discontinuous (frame-by-frame mapping) [35] → smoothing
  − RNN: smooth [36, 35]
• Low latency
  − Unidirectional structure allows fully frame-level streaming [35]
• More efficient representation
  − RNN offers a more efficient representation than DNN for time series
Synthesis pipeline

TEXT → text analysis → linguistic feature extraction → duration prediction → acoustic feature prediction → vocoder synthesis → SPEECH

The duration & acoustic feature prediction blocks involve NNs
Duration modeling

[Figure: the linguistic structure of “hello” (phonemes h-e-l-ou, syllables “h e2” / “l ou1”, word “hello”) is converted by feature functions into phoneme-level linguistic features; a duration prediction LSTM maps these to per-phoneme durations (targets, e.g. 9, 12, 10, 10 frames) derived from acoustic-feature alignments.]

Feature function examples: phoneme == ’h’ ?, syllable stress == ’2’ ?, # of syllables in word?
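A toy sketch of feature functions of the kind listed above (illustrative only; the function name and the particular three questions are just the slide's examples, not the real front-end):

```python
def linguistic_features(phoneme, syllable_stress, n_syllables_in_word):
    """Answers to binary/numeric questions about the linguistic structure
    become one phoneme's input vector for the duration prediction LSTM."""
    return [
        1.0 if phoneme == 'h' else 0.0,          # phoneme == 'h' ?
        1.0 if syllable_stress == '2' else 0.0,  # syllable stress == '2' ?
        float(n_syllables_in_word),              # # of syllables in word
    ]
```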
Acoustic modeling

[Figure: the same linguistic structure of “hello” is converted into phoneme-level linguistic features by feature functions; frame-level features (e.g. the relative position of the frame within the phoneme) are appended, and an acoustic feature prediction LSTM maps the resulting frame-level inputs to the acoustic feature targets.]
Streaming synthesis

[Figure, animated over several slides: feature functions convert the linguistic structure of “hello” into phoneme-level linguistic features; the duration prediction LSTM emits a duration for each phoneme (targets, e.g. 9, 12, 10, 10 frames); the phoneme-level features are expanded into frame-level linguistic features; the acoustic feature prediction LSTM generates the acoustic feature targets frame by frame; and the waveform is synthesized incrementally — phoneme by phoneme, frame by frame.]
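The expansion step in the pipeline above — turning phoneme-level features plus predicted durations into frame-level inputs — can be sketched as follows (illustrative only; `expand_to_frames` is a hypothetical name, and the exact frame-position encoding in the real system may differ):

```python
def expand_to_frames(phoneme_feats, durations):
    """Repeat each phoneme's linguistic features for its predicted number
    of frames, appending the relative position of the frame within the
    phoneme, so the acoustic LSTM can run frame by frame."""
    frames = []
    for feats, dur in zip(phoneme_feats, durations):
        for i in range(dur):
            frames.append(feats + [(i + 1) / dur])  # relative position in (0, 1]
    return frames
```

Because each phoneme can be expanded as soon as its duration is predicted, acoustic generation and waveform synthesis can start before the whole sentence has been processed, which is what enables streaming.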
Data & speech analysis

Database: US English female speaker, 34,632 utterances
Speech analysis: 16 kHz sampling, 25-ms width / 5-ms shift
Synthesis: Vocaine [?], postfiltering-based enhancement
Input: DNN: 442 linguistic features; ULSTM: 291 linguistic features
Target: 0–39 mel-cepstrum features, continuous log F0 [26], 5-band aperiodicity, optionally ∆, ∆²
Training

Preprocessing: acoustic: removed 80% of silence; duration: removed first/last silence
Normalization: input: mean / standard deviations; output: 0.01 – 0.99
Architecture: DNN: 4 × 1024 units, ReLU [29]; ULSTM: 1 × 256 cells
Output layer: acoustic: feed-forward or recurrent; duration: feed-forward
Initialization: DNN: random + layer-wise BP [?]; ULSTM: random
Optimization: common: squared loss, SGD; DNN: GPU, AdaDec [?]; ULSTM: distributed CPU [?]
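The normalization row above can be sketched as follows (illustrative only; `normalize` is a hypothetical name, and per-dimension min/max scaling to 0.01–0.99 is an assumption about how the output range was achieved):

```python
import numpy as np

def normalize(inputs, targets):
    """Standardize inputs by mean / standard deviation; scale each target
    dimension linearly into the range 0.01 - 0.99."""
    x = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)
    lo, hi = targets.min(axis=0), targets.max(axis=0)
    y = 0.01 + 0.98 * (targets - lo) / (hi - lo)
    return x, y
```

Keeping the targets strictly inside (0, 1) plays well with saturating output nonlinearities and keeps the squared-loss gradients well scaled.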
Subjective tests

Common: 100 sentences, crowd-sourcing, using headphones
MOS: 7 evaluations per sample, up to 30 stimuli per subject, 5-scale score in naturalness (1: Bad – 5: Excellent)
Preference: 5 evaluations per pair, up to 30 pairs per subject, chose the preferred one or “neutral”
# of future contexts

# of future contexts   5-scale MOS
0                      3.571 ± 0.121
1                      3.751 ± 0.119
2                      3.812 ± 0.115
3                      3.779 ± 0.118
4                      3.753 ± 0.115
Preference scores (%)

                DNN                  ULSTM
           Feed-forward     Feed-forward      Recurrent
            w/     w/o       w/     w/o      w/     w/o    Neutral
           67.8   12.0       –      –        –      –       20.0
           18.4    –        34.9    –        –      –       47.6
            –      –        21.0   12.2      –      –       66.8
            –      –        21.8    –       21.0    –       57.2
            –      –         –     16.6      –     29.2     54.2
MOS

• DNN: w/ dynamic features
• ULSTM: w/o dynamic features, w/ recurrent output layer

Model    # params     5-scale MOS
DNN      3,747,979    3.370 ± 0.114
ULSTM      476,435    3.723 ± 0.105
Latency

• Nexus 7 (2013)
• Advanced SIMD (NEON), single thread
• Audio buffer size: 1024
• The HMM system used the time-recursive version w/ L = 15
• HMM & ULSTM used the same text analysis front-end

Average latency (ms)
          HMM    ULSTM
chars      26      25
short     123      55
long      311     115
Summary

Statistical parametric speech synthesis
• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)
Google academic program

• Award programs
  − Google Faculty Research Awards: provide unrestricted gifts to support full-time faculty members
  − Google Focused Research Awards: fund specific key research areas
  − Visiting Faculty Program: supports full-time faculty in research areas of mutual interest
• Student support programs
  − Graduate Fellowships: recognize outstanding graduate students
  − Internships: work on real-world problems with Google’s data & infrastructure
References

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.
[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.
[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.
[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.
[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.
[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.
[10] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.
[11] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.
[12] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.
[13] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.
[14] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.
[15] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.
[16] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun., 27:187–207, 1999.
[17] R. Maia, T. Toda, H. Zen, Y. Nankaku, and K. Tokuda. An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. ISCA SSW6, pages 131–136, 2007.
[18] H. Zen, K. Tokuda, and T. Kitamura. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features. Comput. Speech Lang., 21(1):153–173, 2007.
[19] T. Toda and K. Tokuda. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E90-D(5):816–824, 2007.
[20] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[21] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.
[22] C. Tuerk and T. Robinson. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.
[23] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.
[24] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.
[25] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.
[26] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.
[27] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
[28] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.
[29] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.
[30] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.
[31] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.
[32] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.
[33] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. Kremer and J. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[34] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[35] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.
[36] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014. http://research.microsoft.com/en-us/projects/dnntts/.