Supplemental materials for "Decomposed Normalized Maximum Likelihood Codelength Criterion for Selecting Hierarchical Latent Variable Models"

Tianyi Wu∗, Shinya Sugawara† and Kenji Yamanishi‡
1 Proofs

1.1 Proof Sketch of Theorem 3.3
We begin by deriving $L_{NML}(x^n \mid z^n; M)$. The maximum of the likelihood $P(x^n \mid z^n; \hat{\Phi}(x^n, z^n), M)$ can be written as

$$P(x^n \mid z^n; \hat{\Phi}(x^n, z^n), M) = \prod_{k=1}^{K} \prod_{v=1}^{V} \left( \frac{n_{kv}}{n_k} \right)^{n_{kv}}.$$

Following a computation similar to that for NB, we obtain the first two terms in the main equation of Theorem 3.3. Next, we consider $L_{NML}(z^n; M)$. Because each document has its own mixture of topics in LDA, $P(z^n; \Theta)$ can be decomposed into $\prod_d P(z_d^n; \theta_d)$, where $z_d^n$ allocates data to document $d$. Under this decomposition, $P(z_d^n; \theta_d)$ for each $d$ constitutes a finite mixture model. The NML codelength can then be obtained as $\sum_d L_{NML}(z_d^n; M)$, which gives the last two terms in the main equation of Theorem 3.3.

∗ Corresponding author. University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo 113-0033, Japan. Email: [email protected]
† University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo 113-0033, Japan. Email: [email protected]
‡ University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo 113-0033, Japan. Email: [email protected]
1.2 Proof Sketch of Theorem 3.4
We begin by deriving the expression for $L_{NML}(x^n \mid z^n; M)$. Notice that when the latent variables $z$ are given, the conditional distribution $P(x \mid z; \theta)$ is the same as that of SBM; thus the conditional maximum likelihood $P(x^n \mid z^n; \hat{\eta})$ is the same as in Theorem 3.2, which yields

$$\sum_{k_1}\sum_{k_2} \left( n_{k_1 k_2} \log n_{k_1 k_2} - n^1_{k_1 k_2} \log n^1_{k_1 k_2} - n^0_{k_1 k_2} \log n^0_{k_1 k_2} \right) + \sum_{k_1}\sum_{k_2} \log C_{MN}(n_{k_1 k_2}, 2).$$
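The multinomial normalizing term $C_{MN}$ appearing in these codelengths can be computed exactly with the linear-time recursion of Kontkanen and Myllymäki [4]. A minimal sketch (the function name is ours):

```python
import math

def c_mn(n, K):
    """Normalizing term C_MN(n, K) of the multinomial NML distribution,
    via the linear-time recursion of Kontkanen & Myllymaki [4]:
        C_MN(n, 1) = 1
        C_MN(n, 2) = sum_{h=0}^{n} C(n,h) (h/n)^h ((n-h)/n)^{n-h}
        C_MN(n, k) = C_MN(n, k-1) + n/(k-2) * C_MN(n, k-2),  k >= 3
    """
    if n == 0 or K == 1:
        return 1.0
    # Base case C_MN(n, 2); the h = 0 and h = n terms each contribute 1
    # (0^0 is treated as 1).
    c2 = 2.0
    for h in range(1, n):
        c2 += math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
    if K == 2:
        return c2
    prev2, prev1 = 1.0, c2  # C_MN(n, 1), C_MN(n, 2)
    for k in range(3, K + 1):
        prev2, prev1 = prev1, prev1 + n / (k - 2) * prev2
    return prev1
```

In practice one works with $\log C_{MN}$; for large $n$ the recursion should be carried out in the log domain to avoid overflow.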
Next, we consider $L_{NML}(z^n; M)$. MMSBM is a variant of LDA: documents in LDA correspond to vertices, and the words in document $d$ correspond to the links and non-links originating from vertex $i$. Therefore, we can substitute $n_i$ for $n_d$ in $L_{NML}(z^n; M)$ for LDA. Using the result of Theorem 3.3, we obtain $L_{NML}(z^n; M)$ for MMSBM as follows:

$$\sum_{i}\sum_{k} n_{ik} (\log n_i - \log n_{ik}) + \sum_{i} \log C_{MN}(n_i, K).$$

2 Detailed Designs of Experiments

2.1 Experiment using the NB Model
For the experiments using the NB model, to guarantee the generality of the experiments, we generated multiple datasets using different hyper-parameters. NB has hyper-parameters α, β, and M, where π ∼ Dir(α), φ_k ∼ Dir(β), and M = (M_1, ..., M_D). We generated eight datasets for each combination of hyper-parameters from the following candidates: α ∈ {2, 3, 4, 5}, β ∈ {0.05, 0.15, 0.3, 0.5}, M ∈ {(16, 16, 16), (4, 4, 4, 4), (12, 12, 12, 12), (8, 8, 8, 8, 8), (6, 6, 6, 6, 6), (4, 4, 4, 4, 4, 4)}, and n ∈ {10, 37, 138, 268, 1000}. As a result, we generated 4 × 4 × 6 × 5 × 8 = 3840 simulation datasets in total.
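The generation procedure can be sketched as follows, assuming M = (M_1, ..., M_D) gives the number of categories of each of the D attributes; the number of components K, the default seed, and all function names are illustrative choices of ours:

```python
import random

def dirichlet(alpha, dim, rng):
    """Symmetric Dirichlet(alpha) draw via normalized Gamma variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def categorical(p, rng):
    """Draw an index from a discrete distribution p."""
    u, acc = rng.random(), 0.0
    for i, p_i in enumerate(p):
        acc += p_i
        if u < acc:
            return i
    return len(p) - 1

def generate_nb_dataset(alpha, beta, M, n, K=3, seed=0):
    """One synthetic NB dataset: z_i ~ Cat(pi) with pi ~ Dir(alpha);
    attribute d of item i ~ Cat(phi_{z_i, d}) with phi_{k, d} ~ Dir(beta)
    over M[d] categories."""
    rng = random.Random(seed)
    pi = dirichlet(alpha, K, rng)
    phi = [[dirichlet(beta, m, rng) for m in M] for _ in range(K)]
    z = [categorical(pi, rng) for _ in range(n)]
    x = [[categorical(phi[k][d], rng) for d in range(len(M))] for k in z]
    return x, z
```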
2.2 Experiment using SBM
The hyper-parameters for SBM are α, β, and ρ, where π ∼ Dir(α) and η_{k₁k₂} ∼ Beta(β, β). We generated eight synthetic datasets using each combination of hyper-parameters from the following candidates: α ∈ {1, 4}, β ∈ {0.1, 0.3, 0.6, 1, 3}, ρ ∈ {1.0, 0.75}, and n ∈ {7, 12, 19, 29, 45, 70, 108, 167, 258, 400}. As a result, we obtained 2 × 5 × 2 × 10 × 8 = 1600 datasets in total. The best model was selected from candidates with (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) latent components.
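A sketch of the SBM generation step, assuming a symmetric Beta(β, β) prior on the block connection probabilities; the role of ρ is not restated in this section, so it is omitted here, and K, the seed, and the function name are illustrative:

```python
import random

def generate_sbm(alpha, beta, n, K=3, seed=0):
    """One synthetic SBM graph: block assignments z_i ~ Cat(pi) with
    pi ~ Dir(alpha); connection probabilities eta[k1][k2] ~ Beta(beta, beta);
    adjacency entries x_ij ~ Ber(eta[z_i][z_j])."""
    rng = random.Random(seed)
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    pi = [v / sum(g) for v in g]  # Dirichlet draw via normalized Gammas
    eta = [[rng.betavariate(beta, beta) for _ in range(K)] for _ in range(K)]

    def cat(p):
        u, acc = rng.random(), 0.0
        for i, p_i in enumerate(p):
            acc += p_i
            if u < acc:
                return i
        return len(p) - 1

    z = [cat(pi) for _ in range(n)]
    x = [[1 if rng.random() < eta[z[i]][z[j]] else 0 for j in range(n)]
         for i in range(n)]
    return x, z
```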
2.3 Experiment using LDA
The hyper-parameters in LDA are α, β, and V, where θ_d ∼ Dir(α), φ_k ∼ Dir(β), and V is the number of unique words. We generated eight synthetic datasets using each combination of hyper-parameters from the following candidates: α ∈ {0.1, 0.2, 0.25, 0.3, 0.35, 0.4}, β ∈ {0.1, 0.2, 0.25, 0.3, 0.35, 0.4}, V ∈ {200, 400, 600}, and n ∈ {5, 7, 12, 19, 30, 48, 76, 120, 190, 300}. As a result, we obtained 6 × 6 × 3 × 10 × 8 = 8640 datasets in total. The best model was selected from candidates with (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) latent components.
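The standard LDA generative process used for this kind of synthetic corpus can be sketched as follows; the document length, K, the seed, and the function names are illustrative assumptions:

```python
import random

def generate_lda_corpus(alpha, beta, V, n_docs, doc_len=50, K=3, seed=0):
    """One synthetic LDA corpus: phi_k ~ Dir(beta) over V words and, per
    document, theta_d ~ Dir(alpha) over K topics; each word position first
    draws a topic from theta_d, then a word id from phi of that topic."""
    rng = random.Random(seed)

    def dirichlet(a, dim):
        g = [rng.gammavariate(a, 1.0) for _ in range(dim)]
        s = sum(g)
        return [x / s for x in g]

    def cat(p):
        u, acc = rng.random(), 0.0
        for i, p_i in enumerate(p):
            acc += p_i
            if u < acc:
                return i
        return len(p) - 1

    phi = [dirichlet(beta, V) for _ in range(K)]
    docs = []
    for _ in range(n_docs):
        theta = dirichlet(alpha, K)
        docs.append([cat(phi[cat(theta)]) for _ in range(doc_len)])
    return docs
```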
2.4 Real data experiments
For preprocessing on the two datasets, following previous studies, we omitted terms with a term-frequency inverse document frequency (TF-IDF) score lower than 0.1 [3], and kept only terms that appeared in five or more documents [2]. Since a single label was assigned to each document, all words in the document shared this label. For the 20 Newsgroups data, the categories for each dataset are listed in Table 1.

Table 1: 20 Newsgroups: Categories for the 5 datasets

Labels  Categories
2       atheism, space
3       atheism, graphics, baseball
4       atheism, graphics, baseball, space
5       graphics, baseball, space, christian, guns
6       graphics, forsale, baseball, space, christian, guns
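The vocabulary filtering described above can be sketched as follows. The exact TF-IDF variant used in the paper is not specified here, so the corpus-level score below (maximum per-document count times inverse document frequency) is an assumption made for illustration:

```python
import math
from collections import Counter

def filter_vocabulary(docs, tfidf_min=0.1, min_df=5):
    """docs: list of token lists. Keep terms that (i) appear in at least
    min_df documents and (ii) have TF-IDF score >= tfidf_min, where the
    score of term w is max_d tf(w, d) * log(D / df(w)) -- one of several
    common TF-IDF variants, used here only as a placeholder."""
    n_docs = len(docs)
    df = Counter()      # document frequency of each term
    max_tf = Counter()  # maximum within-document count of each term
    for doc in docs:
        tf = Counter(doc)
        for w, c in tf.items():
            df[w] += 1
            max_tf[w] = max(max_tf[w], c)
    keep = set()
    for w in df:
        score = max_tf[w] * math.log(n_docs / df[w])
        if df[w] >= min_df and score >= tfidf_min:
            keep.add(w)
    return [[w for w in doc if w in keep] for doc in docs]
```

Note that a term occurring in every document gets an IDF of zero and is dropped by the TF-IDF threshold, while rare terms are dropped by the document-frequency threshold.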
References

[1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

[2] D. M. Blei and J. D. Lafferty. Topic models. Text Mining: Classification, Clustering, and Applications, 10(71):34, 2009.
[3] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.

[4] P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, 103(6):227–233, 2007.