CROSS-VALIDATION BASED DECISION TREE CLUSTERING FOR HMM-BASED TTS
Yu Zhang 1,2, Zhi-Jie Yan 1 and Frank K. Soong 1
1 Microsoft Research Asia, Beijing, China
2 Shanghai Jiao Tong University, Shanghai, China

Introduction
◮ Conventional HMM-based speech synthesis
  ⊲ spectrum, excitation, and duration features are modeled and generated in a unified HMM-based framework
  ⊲ a decision tree, together with the ML and MDL criteria, is used for parameter tying
◮ Conventional decision tree based context clustering
  ⊲ ML-based greedy tree growing algorithm
  ⊲ MDL-based stopping criterion
◮ Cross-validation based decision tree
  ⊲ improves the conventional greedy splitting criterion
  ⊲ proposes a new stopping criterion for node splitting

Cross-validation based decision tree clustering
◮ Divide the training data $D^m$ at each node $m$ into $K$ subsets $D_1^m, \dots, D_K^m$
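The partition step can be pictured with the sketch below. The poster does not say how the K subsets are formed, so the seeded random shuffle, the function name `split_into_folds`, and the toy data are assumptions made only for illustration.

```python
import random


def split_into_folds(samples, k, seed=0):
    """Partition the samples of one node into k disjoint, near-equal subsets.

    Assumption: folds are formed by a seeded random shuffle; the source does
    not specify the partitioning scheme.
    """
    if k < 2:
        raise ValueError("cross-validation needs at least 2 folds")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples round-robin so fold sizes differ by at most 1.
    return [shuffled[i::k] for i in range(k)]


# Toy node with 10 samples and K = 4 folds.
folds = split_into_folds(range(10), k=4)
```

Every sample lands in exactly one fold, so each fold can serve once as the validation set while the remaining folds are used for estimation.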
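[Figure: at each node $m$, the data $D^m$ is divided into subsets $D_1^m, D_2^m, \dots, D_K^m$; each fold $k$ has its own model $\Lambda_k^m$, and each candidate split produces the children $S_{mqy}$ and $S_{mqn}$]

Experiments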
◮ Determining the number of cross-validation folds K
Table: Log spectral distortion (LSD) for different K on the development set
K          4     6     8     10    14
LSD (dB)   5.32  5.33  5.32  5.32  5.31
◮ Objective Test Results
[Figure panels: log spectral distance; root mean square error of F0; root mean square error of durations]
◮ MDL criterion for stopping
    $\delta(D^m)_q^{\mathrm{MDL}} = \delta(D^m)_q^{\mathrm{ML}} - \alpha L \log G$
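The role of the weight α in the MDL criterion can be seen in a small sketch. The poster does not define L and G, so they are treated here as positive scalars (a model-size term and a data-size term, in the spirit of the usual MDL penalty). The numeric values and the rule "stop when the penalized gain is not positive" are assumptions for illustration.

```python
import math


def mdl_gain(ml_gain, alpha, L, G):
    """Penalized gain: delta_MDL = delta_ML - alpha * L * log(G).

    Assumption: L and G are positive scalars (hypothetical values below);
    the source does not define them.
    """
    return ml_gain - alpha * L * math.log(G)


# Hypothetical split with an ML gain of 600 log-likelihood units.
for alpha in (1.0, 0.8, 0.5):
    gain = mdl_gain(600.0, alpha, L=80, G=50000)
    print(alpha, "split" if gain > 0 else "stop")
```

As long as L log G is positive, a smaller α shrinks the penalty, so more splits are accepted and the tree grows larger.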
  ⊲ likelihood increase from node splitting
    $\delta^{\mathrm{CV}}(D_k^m)_q = L_k^{\mathrm{CV}}(D_k^{mqy}) + L_k^{\mathrm{CV}}(D_k^{mqn}) - L_k^{\mathrm{CV}}(D_k^m)$
  ⊲ select the best question over all validation sets
    $q_m = \arg\max_q \sum_k \delta^{\mathrm{CV}}(D_k^m)_q$
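The selection rule above can be sketched as follows. The held-out scores are invented numbers treated as log-likelihoods, and the question name "L-vowel?" and the function names are hypothetical; only "R-voiced?" appears on the poster.

```python
def cv_gain(l_yes, l_no, l_parent):
    """Held-out likelihood change on one fold for one candidate question."""
    return l_yes + l_no - l_parent


def select_question(fold_gains):
    """Return (question, summed gain) with the largest gain summed over folds.

    fold_gains maps a question name to its list of per-fold CV gains.
    """
    totals = {q: sum(gains) for q, gains in fold_gains.items()}
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical held-out log-likelihoods (yes child, no child, parent) per fold.
fold_gains = {
    "R-voiced?": [cv_gain(-40.0, -55.0, -100.0), cv_gain(-42.0, -60.0, -101.0)],
    "L-vowel?": [cv_gain(-48.0, -51.0, -100.0), cv_gain(-50.0, -52.0, -101.0)],
}
best_question, best_total = select_question(fold_gains)
```

Unlike the ML gain, a per-fold gain can be negative, because the children are scored on data that was not used to estimate them.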
[Figure: objective test results plotted against state number. Panels: log spectral distance (dB), RMSE of F0 (Hz/frame), RMSE of duration (ms/phone). Curves: MDL, with points labeled α = 1.0, 0.8 and 0.5, and CV, with points marked "Stop automatically".]
◮ Subjective Test Results
◮ Node stopping criteria
  ⊲ node splitting intuitively stops when
    $\sum_k \delta^{\mathrm{CV}}(D_k^m)_{q_m} < 0$
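A minimal sketch of this intuitive stopping rule, assuming the per-fold gains of the best question have already been computed (the function name and the numbers are hypothetical):

```python
def should_stop(best_fold_gains):
    """True when the best question's CV gains sum to less than zero."""
    return sum(best_fold_gains) < 0


# Splitting still helps on held-out data: the gains sum to 6.0.
keep_splitting = not should_stop([5.0, -1.0, 2.0])

# The best available split hurts on held-out data: the gains sum to about -1.6.
stop_here = should_stop([0.5, -2.0, -0.1])
```

The rule has no weight to tune: a node stops being split as soon as even its best question no longer improves the held-out likelihood.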
Main problems
◮ Splitting criterion: the greedy search is sensitive to a biased training set
◮ Stopping criterion: not effective when the training data is not asymptotically large, and it relies on a manually tuned threshold
◮ Node splitting criteria
  ⊲ evaluate likelihood on K validation sets
    $L_k^{\mathrm{CV}}(D_k^m) = \sum_{x \in D_k^m} P(x \mid \Lambda_k^m)$
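The evaluation on K validation sets can be sketched as below. Several things here are assumptions rather than details from the poster: a 1-D Gaussian stands in for the model $\Lambda_k^m$, that model is estimated on all folds except fold k, and log densities are summed instead of raw probabilities.

```python
import math


def fit_gaussian(values, var_floor=1e-3):
    """ML estimate of a 1-D Gaussian (mean, variance) with a variance floor."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, var_floor)


def log_density(x, mean, var):
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def cv_loglik(folds):
    """For each fold k, fit on the other folds and score fold k."""
    scores = []
    for k, held_out in enumerate(folds):
        train = [v for j, fold in enumerate(folds) if j != k for v in fold]
        mean, var = fit_gaussian(train)
        scores.append(sum(log_density(x, mean, var) for x in held_out))
    return scores


# Toy node data already divided into K = 3 folds.
folds = [[0.1, 0.3], [0.2, 0.4], [0.0, 0.5]]
scores = cv_loglik(folds)
```

Applying the same scoring to the yes and no children of a candidate question gives the three terms needed for the per-fold likelihood increase.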
Experimental Setup
◮ Training database
  ⊲ Mandarin corpus, 16 kHz, 1,000 sentences, female speaker
  ⊲ 40th-order LSP + gain, F0
  ⊲ first- and second-order dynamic features
  ⊲ 25,761 different rich context phone models
◮ MDL-based decision tree
  ⊲ standard MDL method: α = 1.0
  ⊲ tuning α on the development set
◮ Cross-validation based decision tree
  ⊲ stop node splitting using the intuitive criterion
  ⊲ MDL-based criterion (Eq. (1))

MDL-based decision tree clustering
[Figure: a node $S_m$ is split by a question (for example "R-voiced?") into a yes child $S_{mqy}$ and a no child $S_{mqn}$, giving a likelihood increase; in the cross-validation tree the data $D^m$ is divided into K subsets at each node]
◮ ML criterion for node splitting
    $\delta(D^m)_q^{\mathrm{ML}} = L(S_{mqy}) + L(S_{mqn}) - L(S_m)$
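The ML splitting gain $L(S_{mqy}) + L(S_{mqn}) - L(S_m)$ can be illustrated with a small sketch. It assumes a single 1-D Gaussian per node fitted by maximum likelihood, which is a simplification of the HMM state distributions used in practice; the function names and data are hypothetical.

```python
import math


def gaussian_loglik(values, var_floor=1e-6):
    """Log-likelihood of values under their own ML-fitted 1-D Gaussian."""
    n = len(values)
    mean = sum(values) / n
    sq_sum = sum((v - mean) ** 2 for v in values)
    var = max(sq_sum / n, var_floor)
    return -0.5 * (n * math.log(2.0 * math.pi * var) + sq_sum / var)


def ml_gain(yes_values, no_values):
    """delta_ML = L(S_mqy) + L(S_mqn) - L(S_m), where S_m pools both children."""
    parent = list(yes_values) + list(no_values)
    return (gaussian_loglik(yes_values) + gaussian_loglik(no_values)
            - gaussian_loglik(parent))


# Two well-separated toy clusters: the split yields a large positive gain.
voiced = [5.1, 5.3, 4.9, 5.2]
unvoiced = [1.0, 1.2, 0.8, 1.1]
gain = ml_gain(voiced, unvoiced)
```

Because each child is fitted to the same data it is scored on, this gain is never negative (apart from the effect of the variance floor), which is why a separate stopping rule is needed.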
◮ Node stopping criteria (continued)
  ⊲ the MDL criterion can also be used
    $\sum_k \delta^{\mathrm{CV}}(D_k^m)_{q_m} + \alpha L \log G < 0$    (1)
  ⊲ in our experiments, the first criterion (stopping when the summed cross-validation gain is negative) gives good results

Conclusions
◮ Use cross-validation in decision tree clustering for HMM-based TTS
◮ Propose a splitting criterion and a stopping criterion for tree building
◮ Compared with the conventional method, cross-validation yields better performance at a similar model size