Outline
Introduction
Regularized K-means
Tuning
Asymptotic
Simulation
Real Data
Regularized k-means clustering of high-dimensional data and its asymptotic consistency
Wei Sun
Department of Statistics, Purdue University
Jan 25, 2012
Joint with Junhui Wang (UIC) and Yixin Fang (NYU)
Outline
1. Cluster analysis and K-means clustering
2. Regularized K-means and its implementation
3. Tuning via clustering stability
4. Estimation and selection consistency
5. Simulation study
6. Applications to gene microarray data
Cluster Analysis
Goal: Assign observations into a number of clusters such that observations in the same cluster are similar.
K-means Cluster Analysis–STAT 598G
$n$ observations: $X_1, \dots, X_n$ with $X_i \in \mathbb{R}^p$.
Number of clusters: $K$.
The $K$ clusters: $A_1, \dots, A_K$.
Centers: $C_1, \dots, C_K$ with $C_k \in \mathbb{R}^p$.
K-means clustering solves
$$\min_{A_k, C_k} \sum_{k=1}^{K} \sum_{X_i \in A_k} \|X_i - C_k\|^2. \qquad (1)$$
Given $C_k^{(t)}$, assign $A_k^{(t)}$.
Fix $A_k^{(t)}$, update $C_k^{(t+1)}$.
An illustrative example–n = 12, p = 2, K = 3 (Wiki)
Step 1. Randomly select 3 initial centers $C_1, C_2, C_3$.
Step 2. Create 3 clusters $A_1$, $A_2$, and $A_3$ by assigning observations to the nearest center.
Step 3. The mean of each cluster becomes the new center.
Step 4. Repeat steps 2 and 3 until convergence.
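The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the slides: the function name `kmeans`, the toy data, and the random seeds are all assumptions made for the example, and a single random start can stop at a local optimum.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign to nearest center, then recompute means."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct observations as initial centers.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Step 4: stop once the assignment no longer changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: the mean of each non-empty cluster becomes its new center.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Toy data with n = 12, p = 2, K = 3 (made-up values, three tight groups).
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + 0.1 * rng.standard_normal((4, 2)) for m in means])
labels, centers = kmeans(X, K=3)
```

Because the result depends on the initial centers, running the function from several seeds and keeping the solution with the smallest objective (1) is the usual safeguard.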
When p ≫ n
Euclidean distance becomes less sensitive (Hall et al., 2005).
Many variables are redundant and contain no information about the clustering structure, but k-means clustering tends to include all the variables.
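One way to see the loss of sensitivity is to look at pure noise: as p grows, the nearest and farthest neighbours of a point end up at almost the same Euclidean distance. The sketch below is only an illustration under assumed settings (standard normal data, n = 50, the helper name `relative_contrast`); it is not taken from Hall et al. (2005).

```python
import numpy as np

def relative_contrast(p, n=50, seed=0):
    """(max - min) / min over distances from the first point to the others,
    for n points of pure N(0, 1) noise in p dimensions."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast between the farthest and nearest neighbour shrinks with p.
for p in (2, 20, 200, 2000):
    print(p, round(relative_contrast(p), 3))
```

When most of the p variables are noise of this kind, they dominate the distance and mask the few variables that carry the clustering structure.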
Lymphoma Dataset (Alizadeh et al. 2000)
n = 62 samples, p = 4026 genes, 3 types of lymphomas.
Diffuse large B-cell lymphoma (DLBCL): 42 samples
Follicular lymphoma (FL): 9 samples
B-cell chronic lymphocytic leukemia (CLL): 11 samples
Goals: Cluster the samples and identify the significant genes for detecting DLBCL, FL, and CLL, respectively.
Heatmap of Lymphoma Dataset
Select informative variables
Multiple testing: Donoho and Jin (2008)
Regularization:
LASSO (Tibshirani 1996)
Adaptive LASSO (Zou 2006)
Group LASSO (Yuan and Lin 2006)
Adaptive group LASSO (Wang and Leng 2008)
Regularized K-means
Our regularized k-means clustering:
$$\min_{A_k, C_k} \sum_{k=1}^{K} \sum_{X_i \in A_k} \|X_i - C_k\|^2 + \sum_{j=1}^{p} J(C_{(j)}).$$
$C_{(j)}$ is the $j$th variable across all the centers.
$J(C_{(j)}) = \lambda_j \|C_{(j)}\|$, where $\lambda_j = \lambda \|\hat{C}_{(j)}\|^{-1}$, and $\hat{C}_{(j)}$ is the unpenalized estimator from standard K-means.
Small $\hat{C}_{(j)}$ → large $\lambda \|\hat{C}_{(j)}\|^{-1}$ → smaller $C_{(j)}$.
Implementation
An iterative scheme:
When $C_k$ is fixed, $A_k$ is updated by assigning $X_i$ to the closest cluster.
When $A_k$ is fixed, the following lemma suggests that $C_k$ can be solved in a componentwise fashion.
Lemma 1
$$\sum_{k=1}^{K} \sum_{X_i \in A_k} \|X_i - C_k\|^2 + \sum_{j=1}^{p} J(C_{(j)}) = \sum_{j=1}^{p} \left\{ (X_{(j)} - L C_{(j)})^T (X_{(j)} - L C_{(j)}) + J(C_{(j)}) \right\},$$
where $X_{(j)}$ is the $j$th variable and $L$ is the cluster assignment matrix, $L_{ik} = I(X_i \in A_k)$.
When $L$ is fixed, solving the regularized k-means can be simplified to
$$\min_{C_{(j)}} \; (X_{(j)} - L C_{(j)})^T (X_{(j)} - L C_{(j)}) + J(C_{(j)}), \quad \forall j.$$
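The identity in Lemma 1 is easy to check numerically. The sketch below compares only the squared-error parts of the two sides, since the penalty term is the same on both. The data, the assignment, and the centers are arbitrary made-up values chosen for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 12, 5, 3
X = rng.standard_normal((n, p))          # rows are observations X_i
labels = np.arange(n) % K                # an arbitrary fixed assignment
C = rng.standard_normal((K, p))          # rows are centers C_k

# Cluster assignment matrix L with L[i, k] = I(X_i in A_k).
L = np.zeros((n, K))
L[np.arange(n), labels] = 1.0

# Left-hand side: sum over clusters of squared distances to the center.
lhs = sum(((X[labels == k] - C[k]) ** 2).sum() for k in range(K))

# Right-hand side: sum over variables j of (X_(j) - L C_(j))^T (X_(j) - L C_(j)).
rhs = sum((X[:, j] - L @ C[:, j]) @ (X[:, j] - L @ C[:, j]) for j in range(p))

print(lhs, rhs)
```

The column `C[:, j]` plays the role of $C_{(j)}$, which is why the minimization separates into p independent problems once L is held fixed.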
Algorithm of Regularized K-means
Algorithm
Step 1. Initialize centers $C_1^{(0)}, \dots, C_K^{(0)}$ by standard K-means.
Step 2. Until the termination condition is met, repeat
Given $C_1^{(t-1)}, \dots, C_K^{(t-1)}$, find $L^{(t)}$.
Given $L^{(t)}$, update $C^{(t)}$ by minimization on each $j$.
Initialization by multiple starts.
The iteration stops when $L^{(t)}$ no longer changes.
The number of iterations is often ≤ 5.
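A minimal sketch of the two-step loop, under stated assumptions. The slides do not say how each per-variable problem is solved; here it is solved through its subgradient conditions with a bisection on the norm of $C_{(j)}$, which is one possible choice and not necessarily the authors' solver. The function names, the centering of the columns, the use of true labels as a stand-in for the standard K-means initializer, and the hand-picked `lam` are all hypothetical.

```python
import numpy as np

def update_variable(m, counts, lam_j, tol=1e-10):
    """Minimize sum_k counts[k] * (c[k] - m[k])**2 + lam_j * ||c|| over c.
    m holds the cluster means of one variable, so this equals
    (X_(j) - L c)^T (X_(j) - L c) + lam_j * ||c|| up to a constant."""
    a = 2.0 * counts * m
    if np.linalg.norm(a) <= lam_j:            # subgradient condition: c = 0
        return np.zeros_like(m)
    # Otherwise c[k] = a[k] * r / (2 * counts[k] * r + lam_j) with r = ||c||;
    # locate r by bisection on (0, ||m||).
    lo, hi = 0.0, np.linalg.norm(m)
    while hi - lo > tol * hi:
        r = 0.5 * (lo + hi)
        ratio = np.linalg.norm(a / (2.0 * counts * r + lam_j))  # ||c(r)|| / r
        lo, hi = (r, hi) if ratio > 1.0 else (lo, r)
    r = 0.5 * (lo + hi)
    return a * r / (2.0 * counts * r + lam_j)

def assign(X, C):
    """Nearest-center assignment, i.e. the rows of L."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def cluster_means(X, labels, K):
    counts = np.bincount(labels, minlength=K).astype(float)
    sums = np.zeros((K, X.shape[1]))
    np.add.at(sums, labels, X)
    return sums / np.maximum(counts, 1.0)[:, None], counts

def regularized_kmeans(X, C0, lam, max_iter=50):
    """C0: K x p centers from standard K-means (the unpenalized estimator)."""
    K = C0.shape[0]
    weights = lam / np.linalg.norm(C0, axis=0)     # lambda_j = lam / ||C0_(j)||
    C, labels = np.array(C0, dtype=float), None
    for _ in range(max_iter):
        new_labels = assign(X, C)                  # given C^(t-1), find L^(t)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # L^(t) no longer changes
        labels = new_labels
        M, counts = cluster_means(X, labels, K)
        for j in range(X.shape[1]):                # given L^(t), update C^(t)
            C[:, j] = update_variable(M[:, j], counts, weights[j])
    return labels, C

# Toy data: two clusters, only variable 0 is informative (made-up values).
rng = np.random.default_rng(0)
labels_true = np.repeat([0, 1], 20)
X = rng.standard_normal((40, 6))
X[:, 0] += np.where(labels_true == 0, 5.0, -5.0)
X -= X.mean(axis=0)                                # center columns (assumption)
C0, _ = cluster_means(X, labels_true, 2)           # stand-in for a K-means fit
labels, C = regularized_kmeans(X, C0, lam=60.0)
```

In this toy run the columns of `C` for the noise variables are shrunk exactly to zero while variable 0 keeps nonzero centers, which is the selection behaviour the adaptive weights are meant to produce.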
Model tuning–K and λ
Difficulty of tuning a clustering algorithm: absence of an objective judgement (Wang, 2010).
Clustering stability. Key idea: If we repeatedly draw samples from the same population and apply the clustering algorithm with the given K and λ to each sample, a good choice of K and λ should produce stable clusterings.
Clustering Stability (Wang, 2010)
Assume $Z = (X_1, \dots, X_n) \sim F(x)$ with $x \in \mathbb{R}^p$.
Clustering assignment $\psi(x)$: $\mathbb{R}^p \to \{1, \dots, K\}$.
Clustering algorithm $\Psi(Z; K, \lambda)$ yields a clustering assignment $\psi(Z)$ when applied to a sample $Z$.
Clustering distance between $\psi_1(x)$ and $\psi_2(x)$:
$$d(\psi_1, \psi_2) = \Pr\left[ I\{\psi_1(X) = \psi_1(Y)\} + I\{\psi_2(X) = \psi_2(Y)\} = 1 \right],$$
where $I(\cdot)$ is an indicator function, and $X$ and $Y$ are random samples from $F(x)$.
$d(\psi_1, \psi_2)$ measures the probability of their disagreement.
Example: $\psi_1 = (1, 1, 1)$, $\psi_2 = (1, 2, 2)$, $d(\psi_1, \psi_2) = \frac{2}{3}$.
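An empirical version of the clustering distance counts, over all pairs of distinct observations, how often exactly one of the two clusterings puts the pair in the same cluster. The sketch below reproduces the example above. The function name `clustering_distance` and the choice to average over distinct pairs are assumptions made for illustration.

```python
from itertools import combinations

def clustering_distance(labels1, labels2):
    """Fraction of pairs (i, j), i < j, on which exactly one of the two
    clusterings puts the pair in the same cluster."""
    pairs = list(combinations(range(len(labels1)), 2))
    disagree = sum(
        (labels1[i] == labels1[j]) != (labels2[i] == labels2[j])
        for i, j in pairs
    )
    return disagree / len(pairs)

# psi_1 = (1, 1, 1) and psi_2 = (1, 2, 2): pairs (1,2) and (1,3) disagree,
# pair (2,3) agrees, so the distance is 2/3.
d = clustering_distance([1, 1, 1], [1, 2, 2])
```

The distance depends only on which observations are grouped together, so relabeling the clusters of either assignment leaves it unchanged.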
Clustering Stability (Wang, 2010)
Clustering instability of $\Psi(\cdot; K, \lambda)$:
$$S(\Psi, K, \lambda, n) = E\big(d\{\Psi(Z_1; K, \lambda), \Psi(Z_2; K, \lambda)\}\big), \qquad (2)$$
where $\Psi(Z_1; K, \lambda)$ and $\Psi(Z_2; K, \lambda)$ are clusterings obtained by applying $\Psi(\cdot; K, \lambda)$ to two samples $Z_1$ and $Z_2$.
Distribution $F(x)$ is unknown.
The sample size $n$ is relatively small compared with $p$.
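Estimation based on Bootstrap
Generate bootstrap samples of the same size $n$: $Z_1^{*b}$, $Z_2^{*b}$, $Z_3^{*b}$.
Construct clusterings $\Psi(Z_1^{*b}; K, \lambda)$ and $\Psi(Z_2^{*b}; K, \lambda)$.
Estimate $S(\Psi, K, \lambda, n)$ on $Z_3^{*b}$ as the distance between $\Psi(Z_1^{*b}; K, \lambda)$ and $\Psi(Z_2^{*b}; K, \lambda)$.
$\hat{K} = \mathrm{mode}\{\hat{K}_\lambda, \lambda > 0\}$, where $\hat{K}_\lambda = \mathrm{mode}\{\hat{K}_\lambda^{*1}, \dots, \hat{K}_\lambda^{*B}\}$ and $\hat{K}_\lambda^{*b} = \mathrm{argmin}_{2 \le K \le K.max} \hat{S}^{*b}(\Psi, K, \lambda, n)$.
Given $\hat{K}$, $\hat{\lambda} = \mathrm{mode}\{\hat{\lambda}^{*1}, \dots, \hat{\lambda}^{*B}\}$, where $\hat{\lambda}^{*b} = \mathrm{argmin}_{\lambda} \hat{S}^{*b}(\Psi, \hat{K}, \lambda, n)$.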
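The bootstrap selection of K can be sketched as follows. To keep the example short, plain K-means stands in for $\Psi(\cdot; K, \lambda)$, so there is no λ and only the inner rule (argmin of the estimated instability per replicate, then the mode over replicates) is shown. All function names, the toy data, and the settings `K_max=4` and `B=20` are assumptions for illustration.

```python
import numpy as np

def predict(Z, C):
    """Assign each row of Z to its nearest center in C."""
    return ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def fit_centers(Z, K, rng, n_iter=50):
    """Stand-in for Psi(Z; K, lambda): plain K-means, no penalty."""
    uniq = np.unique(Z, axis=0)          # bootstrap samples contain repeated rows
    C = uniq[rng.choice(len(uniq), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        lab = predict(Z, C)
        for k in range(K):
            if np.any(lab == k):
                C[k] = Z[lab == k].mean(axis=0)
    return C

def distance(l1, l2):
    """Share of pairs on which exactly one clustering groups the pair together."""
    iu = np.triu_indices(len(l1), k=1)
    same1 = (l1[:, None] == l1[None, :])[iu]
    same2 = (l2[:, None] == l2[None, :])[iu]
    return float(np.mean(same1 != same2))

def select_K(X, K_max=4, B=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    picks = []
    for _ in range(B):
        # Three bootstrap samples of size n: two to fit on, one to evaluate on.
        Z1, Z2, Z3 = (X[rng.integers(0, n, size=n)] for _ in range(3))
        s = []
        for K in range(2, K_max + 1):
            C1, C2 = fit_centers(Z1, K, rng), fit_centers(Z2, K, rng)
            s.append(distance(predict(Z3, C1), predict(Z3, C2)))
        picks.append(2 + int(np.argmin(s)))      # least unstable K in replicate b
    return int(np.bincount(picks).argmax())      # mode over the B replicates

# Toy data: two far-apart groups on the real line (made-up values).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 20), rng.normal(100, 1, 20)])[:, None]
K_hat = select_K(X)
```

With the regularized algorithm in place of `fit_centers`, the same loop would be run over a grid of λ values as well, followed by the outer mode over λ and the second pass that selects λ given the chosen K.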
Asymptotic Consistency–Fixed p
Theorem 1. Estimation Consistency
Under regularity assumptions, if $\sqrt{n}\lambda \to 0$ and $n\lambda \to \infty$, then $\hat{C} \to \bar{C}$ a.s. and $\|\hat{C} - \bar{C}\| = O_p(n^{1/2}\lambda)$.
Theorem 2. Selection Consistency
Under regularity assumptions, if $\sqrt{n}\lambda \to 0$ and $n\lambda \to \infty$, then $P(\|\hat{C}_{(j)}\| = 0) \to 1$ for any $j \in A^c$, where $A^c$ is the non-informative variable set.
Asymptotic Consistency–Diverging p: $p < O(n^{1/3})$
Theorem 3. Estimation Consistency
Under regularity assumptions, if $n^{1/2}\lambda p \to 0$ and $n^{-2}\lambda^{-2} p \to 0$ as $n \to \infty$, then $\hat{C} \to \bar{C}$ almost surely and $\|\hat{C} - \bar{C}\| = O_p(n^{1/2}\lambda p^{-1})$.
Theorem 4. Selection Consistency
Under regularity assumptions, if $n^{1/2}\lambda p \to 0$ and $n^{-2}\lambda^{-2} p \to 0$ as $n \to \infty$, then $P(\|\hat{C}_{(j)}\| = 0) \to 1$ for any $j \in A^c$.
Simulation 1: K is known
n = 80, K = 4, p = 50, 200, 500, 1000, µ = 0.4, 0.6, 0.8.
First 50 informative variables ∼ N(µ_kj, 1):

Variable     1-25    26-50
Cluster 1    µ       −µ
Cluster 2    −µ      −µ
Cluster 3    −µ      µ
Cluster 4    µ       µ

The remaining p − 50 uninformative variables ∼ N(0, 1).
Compare with K-means and sparse k-means (Witten and Tibshirani, 2010).
All algorithms are randomly started 100 times.
Grid search for λ.
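The simulated design above can be generated as follows. The slides do not give the cluster sizes, so equal sizes of n / 4 = 20 are an assumption here, as are the function name `simulate` and the seed.

```python
import numpy as np

def simulate(p, mu, n=80, seed=0):
    """n = 80, K = 4; the first 50 variables are N(mu_kj, 1), the rest N(0, 1).
    Assumption: the four clusters have equal size n / 4."""
    rng = np.random.default_rng(seed)
    signs = np.array([[+1, -1],     # cluster 1: variables 1-25, 26-50
                      [-1, -1],     # cluster 2
                      [-1, +1],     # cluster 3
                      [+1, +1]])    # cluster 4
    labels = np.repeat(np.arange(4), n // 4)
    means = np.zeros((4, p))
    means[:, :25] = mu * signs[:, [0]]
    means[:, 25:50] = mu * signs[:, [1]]
    X = means[labels] + rng.standard_normal((n, p))
    return X, labels

X, labels = simulate(p=200, mu=0.8)
```

Only the first 50 columns of `X` differ in mean across clusters, so a perfect variable selector would keep exactly those 50 and drop the remaining p − 50.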
Case I: µ = 0.4
Case II: µ = 0.6
Case III: µ = 0.8
Performance of variable selection

µ     Methods       p=50          p=200          p=500           p=1000
0.4   K-means       50(0)         200(0)         500(0)          1000(0)
0.4   Sparse        33.3(3.05)    84.6(15.80)    127.0(39.30)    362.6(87.80)
0.4   Regularized   36.6(1.68)    35.9(4.20)     45.1(9.20)      60.3(11.90)
0.6   K-means       50(0)         200(0)         500(0)          1000(0)
0.6   Sparse        45.2(0.99)    128.3(9.57)    182.8(41.46)    43.6(6.04)
0.6   Regularized   49.8(0.12)    52.1(1.77)     47.3(2.30)      64.8(9.80)
0.8   K-means       50(0)         200(0)         500(0)          1000(0)
0.8   Sparse        46.4(1.09)    157.1(7.53)    126.8(30.40)    44.9(4.41)
0.8   Regularized   50(0)         65.5(1.08)     53.2(1.85)      65.3(7.03)
Simulation 2: K is unknown
The same setup as in simulation 1 with p = 200, µ = 0.8.
Grid search for K and λ.

Methods               K=2   K=4   Number         Error
Standard k-means      0     20    200(0)         0.001(0.001)
Sparse k-means        18    2     138.0(4.11)    0.228(0.017)
Regularized k-means   0     20    50.0(0.05)     0(0)
Performance of tuning
Two gene microarray examples
Lymphoma: n = 62 and number of genes p = 4026. 42 samples of DLBCL, 9 samples of FL, and 11 samples of CLL.
Leukemia: n = 72 and number of genes p = 6817. Two types of human acute leukemias: 25 patients with AML and 47 patients with ALL.
Clustering errors are estimated by comparing the clustering assignments to the available cancer types.
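The slides do not spell out how the comparison with the cancer types is carried out. One common convention, assumed here purely for illustration, is to label each cluster by its majority cancer type and count the samples that disagree with that label. The function name `misclassified` and the toy labels are hypothetical.

```python
import numpy as np

def misclassified(cluster_labels, true_types):
    """Count samples whose type differs from the majority type of their cluster.
    Assumption: majority-vote matching; the slides do not state the rule used."""
    cluster_labels = np.asarray(cluster_labels)
    true_types = np.asarray(true_types)
    errors = 0
    for c in np.unique(cluster_labels):
        members = true_types[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        errors += len(members) - counts.max()
    return int(errors)

# Toy check: cluster 0 holds two "a" and one "b", cluster 1 holds three "b".
err = misclassified([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```

This rule still yields a count when the number of clusters differs from the number of cancer types, although splitting one type across several pure clusters is then not penalized.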
Performance

Data       Methods               K    Genes   Error
Leukemia   k-means               2    3571    2/72
Leukemia   Sparse k-means        4    2577    2/72
Leukemia   Regularized k-means   2    211     2/72
Lymphoma   k-means               2    4206    4/62
Lymphoma   Sparse k-means        3    3025    2/62
Lymphoma   Regularized k-means   3    66      1/62
Heatmap of Lymphoma on selected genes