
Regularized k-means clustering of high-dimensional data and its asymptotic consistency

Wei Sun
Department of Statistics, Purdue University
Jan 25, 2012

Joint work with Junhui Wang (UIC) and Yixin Fang (NYU)


Outline

1. Cluster analysis and K-means clustering
2. Regularized K-means and its implementation
3. Tuning via clustering stability
4. Estimation and selection consistency
5. Simulation study
6. Applications to gene microarray data


Cluster Analysis

Goal: assign observations to clusters so that observations in the same cluster are similar.



K-means Cluster Analysis (STAT 598G)

n observations: X_1, ..., X_n with X_i ∈ R^p.
Number of clusters: K.
The K clusters: A_1, ..., A_K.
Centers: C_1, ..., C_K with C_k ∈ R^p.

K-means clustering solves

    min_{A_k, C_k}  Σ_{k=1}^{K} Σ_{X_i ∈ A_k} ‖X_i − C_k‖²,        (1)

by iterating: given C_k^(t), assign A_k^(t); with A_k^(t) fixed, update C_k^(t+1).


An illustrative example – n = 12, p = 2, K = 3 (Wikipedia)

Step 1. Randomly select 3 initial centers C_1, C_2, C_3.
Step 2. Create 3 clusters A_1, A_2, A_3 by assigning each observation to the nearest center.
Step 3. The mean of each cluster becomes the new center.
Step 4. Repeat Steps 2 and 3 until convergence.
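The four steps can be sketched as a minimal Lloyd iteration (plain NumPy; the function and variable names are illustrative, not from the talk):

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Lloyd's algorithm: alternate nearest-center assignment (Step 2)
    and mean updates (Step 3) until the assignment stabilizes (Step 4)."""
    rng = np.random.default_rng(rng)
    # Step 1: randomly pick K observations as the initial centers.
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 4: the assignment no longer changes.
        labels = new_labels
        # Step 3: the mean of each cluster becomes the new center.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```

On toy data with well-separated groups this typically recovers the grouping in a handful of iterations.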


When p ≫ n

Euclidean distances concentrate and become less informative (Hall et al., 2005).
Many variables are redundant and carry no information about the clustering structure, yet K-means clustering tends to use all of them.


Lymphoma Dataset (Alizadeh et al., 2000)

n = 62 samples, p = 4026 genes, 3 types of lymphoma:
Diffuse large B-cell lymphoma (DLBCL): 42 samples.
Follicular lymphoma (FL): 9 samples.
B-cell chronic lymphocytic leukemia (CLL): 11 samples.

Goal: cluster the samples and identify the genes that are informative for distinguishing DLBCL, FL, and CLL.


Heatmap of Lymphoma Dataset [figure omitted]



Select informative variables

Multiple testing: Donoho and Jin (2008).
Regularization: LASSO (Tibshirani, 1996), adaptive LASSO (Zou, 2006), group LASSO (Yuan and Lin, 2006), adaptive group LASSO (Wang and Leng, 2008).


Regularized K-means

Our regularized K-means clustering:

    min_{A_k, C_k}  Σ_{k=1}^{K} Σ_{X_i ∈ A_k} ‖X_i − C_k‖² + Σ_{j=1}^{p} J(C_(j)).

C_(j) is the jth variable across all the centers.
J(C_(j)) = λ_j ‖C_(j)‖, where λ_j = λ ‖Ĉ_(j)‖⁻¹ and Ĉ_(j) is the unpenalized estimate from standard K-means.
A small ‖Ĉ_(j)‖ gives a large λ ‖Ĉ_(j)‖⁻¹, hence a more heavily shrunken C_(j).


Implementation

An iterative scheme:
With C_k fixed, A_k is updated by assigning each X_i to the nearest center.
With A_k fixed, the following lemma shows that C_k can be solved componentwise.


Lemma 1

    Σ_{k=1}^{K} Σ_{X_i ∈ A_k} ‖X_i − C_k‖² + Σ_{j=1}^{p} J(C_(j))
        = Σ_{j=1}^{p} [ (X_(j) − L C_(j))ᵀ (X_(j) − L C_(j)) + J(C_(j)) ],

where X_(j) is the jth variable and L is the cluster-assignment matrix with L_ik = I(X_i ∈ A_k).

When L is fixed, the regularized K-means problem therefore separates across variables:

    min_{C_(j)}  (X_(j) − L C_(j))ᵀ (X_(j) − L C_(j)) + J(C_(j)),   for each j.
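For intuition: under the extra simplifying assumption that every cluster has the same size w, each per-variable problem has a closed-form group soft-threshold solution in terms of the within-cluster means. The sketch below is illustrative (the name `center_update` is mine, not from the talk); with unequal cluster sizes a one-dimensional numerical solve is needed instead.

```python
import numpy as np

def center_update(means, w, lam):
    """Solve min_c sum_k w * (c_k - m_k)^2 + lam * ||c||_2 for one
    variable j, assuming every cluster has the same size w.
    means: length-K vector of within-cluster means of variable j.
    lam:   the adaptive penalty level lambda_j."""
    norm = np.linalg.norm(means)
    if 2 * w * norm <= lam:
        # Subgradient condition at zero holds: the variable is
        # screened out of all K centers at once.
        return np.zeros_like(means)
    return (1.0 - lam / (2 * w * norm)) * means
```

With lam = 0 this reproduces the unpenalized cluster means; a large lam zeroes the jth coordinate across all centers simultaneously, which is what makes the penalty a variable selector.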


Algorithm of Regularized K-means

Step 1. Initialize centers C_1^(0), ..., C_K^(0) by standard K-means.
Step 2. Until the termination condition is met, repeat:
    Given C_1^(t−1), ..., C_K^(t−1), find L^(t).
    Given L^(t), update C^(t) by minimizing over each j.

Initialization uses multiple random starts. The iteration stops when L^(t) no longer changes; the number of iterations is typically ≤ 5.
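Putting the pieces together, an end-to-end toy sketch of the two steps (my own simplification: equal-size clusters are assumed in the penalized update so a closed-form group soft-threshold applies; all names are illustrative):

```python
import numpy as np

def regularized_kmeans(X, K, lam, n_iter=50, rng=None):
    """Sketch of the algorithm: Step 1 initializes with plain K-means;
    Step 2 alternates the assignment L^(t) and componentwise penalized
    center updates with adaptive weights lambda_j = lam / ||Chat_(j)||."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Step 1: a few unpenalized Lloyd iterations.
    centers = X[rng.choice(n, K, replace=False)].astype(float)
    for _ in range(10):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    # Adaptive penalty levels from the unpenalized centers.
    lam_j = lam / np.maximum(np.linalg.norm(centers, axis=0), 1e-12)
    # Step 2: alternate penalized center updates and reassignment.
    for _ in range(n_iter):
        means = np.zeros((K, p))
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
        w = n / K  # equal-cluster-size simplification
        norms = np.maximum(np.linalg.norm(means, axis=0), 1e-12)
        centers = means * np.maximum(1.0 - lam_j / (2 * w * norms), 0.0)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # L^(t) no longer changes
        labels = new_labels
    return labels, centers
```

Columns of `centers` shrunk exactly to zero correspond to variables screened out of the clustering.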


Model tuning – K and λ

Tuning a clustering algorithm is difficult because there is no objective criterion to judge against (Wang, 2010).

Clustering stability. Key idea: if we repeatedly draw samples from the same population and apply the clustering algorithm with given K and λ, a good choice of (K, λ) should produce stable clusterings.


Clustering Stability (Wang, 2010)

Assume Z = (X_1, ..., X_n) ~ F(x) with x ∈ R^p.
A clustering assignment is a map ψ: R^p → {1, ..., K}.
A clustering algorithm Ψ(Z; K, λ) yields a clustering assignment ψ when applied to a sample Z.

Clustering distance between ψ_1 and ψ_2:

    d(ψ_1, ψ_2) = Pr[ I{ψ_1(X) = ψ_1(Y)} + I{ψ_2(X) = ψ_2(Y)} = 1 ],

where I(·) is the indicator function and X, Y are independent draws from F(x).

d(ψ_1, ψ_2) is the probability that the two clusterings disagree about whether a pair of points belongs together.
Example: ψ_1 = (1, 1, 1) and ψ_2 = (1, 2, 2) give d(ψ_1, ψ_2) = 2/3.
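On a finite sample, d can be estimated by checking every pair of points; the helper below (illustrative name) reproduces the slide's example:

```python
from itertools import combinations

def clustering_distance(labels1, labels2):
    """Empirical clustering distance: the fraction of pairs on which
    exactly one of the two clusterings puts the pair in the same cluster."""
    pairs = list(combinations(range(len(labels1)), 2))
    disagreements = sum(
        (labels1[i] == labels1[j]) != (labels2[i] == labels2[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)
```

For ψ_1 = (1, 1, 1) and ψ_2 = (1, 2, 2) the pairs are (1,2), (1,3), (2,3): the first two disagree and the last agrees, giving 2/3 as on the slide. Note the distance only compares pair co-membership, so it is invariant to relabeling the clusters.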


Clustering Stability (Wang, 2010)

Clustering instability of Ψ(·; K, λ):

    S(Ψ, K, λ, n) = E[ d(Ψ(Z_1; K, λ), Ψ(Z_2; K, λ)) ],        (2)

where Ψ(Z_1; K, λ) and Ψ(Z_2; K, λ) are clusterings obtained by applying Ψ(·; K, λ) to two samples Z_1 and Z_2.

In practice the distribution F(x) is unknown, and the sample size n is relatively small compared with p.


Estimation based on Bootstrap

Generate bootstrap samples of the same size n: Z_1^{*b}, Z_2^{*b}, Z_3^{*b}.
Construct clusterings Ψ(Z_1^{*b}; K, λ) and Ψ(Z_2^{*b}; K, λ).
Estimate S(Ψ, K, λ, n) on Z_3^{*b} as the distance between Ψ(Z_1^{*b}; K, λ) and Ψ(Z_2^{*b}; K, λ), giving Ŝ^{*b}(Ψ, K, λ, n).

K̂ = mode{K̂_λ : λ > 0}, where K̂_λ = mode{K̂_λ^{*1}, ..., K̂_λ^{*B}} and K̂_λ^{*b} = argmin_{2 ≤ K ≤ K_max} Ŝ^{*b}(Ψ, K, λ, n).
Given K̂, λ̂ = mode{λ̂^{*1}, ..., λ̂^{*B}}, where λ̂^{*b} = argmin_λ Ŝ^{*b}(Ψ, K̂, λ, n).
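A schematic of the bootstrap estimate (the `cluster_fn` interface is hypothetical: it fits a clustering on one bootstrap sample and returns a rule that labels new points, e.g. by nearest fitted center):

```python
import numpy as np
from itertools import combinations

def pair_distance(a, b):
    """Empirical clustering distance between two label vectors."""
    pairs = list(combinations(range(len(a)), 2))
    return sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs) / len(pairs)

def bootstrap_instability(X, cluster_fn, B=20, rng=None):
    """Estimate S(Psi, K, lam, n): for each b, fit on Z1*b and Z2*b,
    then measure the disagreement of the two fitted rules on Z3*b."""
    rng = np.random.default_rng(rng)
    n = len(X)
    estimates = []
    for _ in range(B):
        # Three bootstrap samples of size n.
        Z1, Z2, Z3 = (X[rng.integers(0, n, n)] for _ in range(3))
        psi1, psi2 = cluster_fn(Z1), cluster_fn(Z2)
        estimates.append(pair_distance(psi1(Z3), psi2(Z3)))
    return float(np.mean(estimates))
```

Selecting K̂ and λ̂ then amounts to minimizing this estimate over the grid and taking modes across bootstrap replicates, as on the slide.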


Asymptotic Consistency – Fixed p

Theorem 1 (Estimation Consistency). Under regularity assumptions, if √n λ → 0 and nλ → ∞, then Ĉ → C̄ a.s. and ‖Ĉ − C̄‖ = O_p(n^{1/2} λ).

Theorem 2 (Selection Consistency). Under regularity assumptions, if √n λ → 0 and nλ → ∞, then P(‖Ĉ_(j)‖ = 0) → 1 for any j ∈ A^c, where A^c is the set of non-informative variables.


Asymptotic Consistency – Diverging p: p = O(n^{1/3})

Theorem 3 (Estimation Consistency). Under regularity assumptions, if √n λ p → 0 and n^{−2} λ^{−2} p → 0 as n → ∞, then Ĉ → C̄ almost surely and ‖Ĉ − C̄‖ = O_p(n^{1/2} λ p^{−1}).

Theorem 4 (Selection Consistency). Under regularity assumptions, if n^{1/2} λ p → 0 and n^{−2} λ^{−2} p → 0 as n → ∞, then P(‖Ĉ_(j)‖ = 0) → 1 for any j ∈ A^c.


Simulation 1: K is known

n = 80, K = 4, p = 50, 200, 500, 1000, µ = 0.4, 0.6, 0.8.
The first 50 informative variables ~ N(µ_kj, 1):

Variables    Cluster 1    Cluster 2    Cluster 3    Cluster 4
1–25         µ            −µ           −µ           µ
26–50        −µ           −µ           µ            µ

The remaining p − 50 uninformative variables ~ N(0, 1).

Compared with standard K-means and sparse K-means (Witten and Tibshirani, 2010). All algorithms are randomly started 100 times; λ is chosen by grid search.
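The design is easy to reproduce; a sketch of the data generator (function name illustrative):

```python
import numpy as np

def simulate(n=80, p=200, mu=0.6, rng=None):
    """Simulation 1 design: K = 4 equal clusters; variables 1-25 and
    26-50 carry the +/- mu mean pattern from the table above, and the
    remaining p - 50 variables are pure N(0, 1) noise."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))
    labels = np.repeat([1, 2, 3, 4], n // 4)
    # Sign pattern (variables 1-25, variables 26-50) per cluster.
    signs = {1: (+1, -1), 2: (-1, -1), 3: (-1, +1), 4: (+1, +1)}
    for k, (s1, s2) in signs.items():
        X[labels == k, :25] += s1 * mu
        X[labels == k, 25:50] += s2 * mu
    return X, labels
```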


Case I: µ = 0.4 [figure omitted]

Case II: µ = 0.6 [figure omitted]

Case III: µ = 0.8 [figure omitted]


Performance of variable selection
Number of selected variables, with standard errors in parentheses:

µ      Method        p=50          p=200          p=500           p=1000
0.4    K-means       50(0)         200(0)         500(0)          1000(0)
       Sparse        33.3(3.05)    84.6(15.80)    127.0(39.30)    362.6(87.80)
       Regularized   36.6(1.68)    35.9(4.20)     45.1(9.20)      60.3(11.90)
0.6    K-means       50(0)         200(0)         500(0)          1000(0)
       Sparse        45.2(0.99)    128.3(9.57)    182.8(41.46)    43.6(6.04)
       Regularized   49.8(0.12)    52.1(1.77)     47.3(2.30)      64.8(9.80)
0.8    K-means       50(0)         200(0)         500(0)          1000(0)
       Sparse        46.4(1.09)    157.1(7.53)    126.8(30.40)    44.9(4.41)
       Regularized   50(0)         65.5(1.08)     53.2(1.85)      65.3(7.03)


Simulation 2: K is unknown

The same setup as in Simulation 1 with p = 200, µ = 0.8; grid search over K and λ.

Method                 K=2    K=4    Number         Error
Standard K-means       0      20     200(0)         0.001(0.001)
Sparse K-means         18     2      138.0(4.11)    0.228(0.017)
Regularized K-means    0      20     50.0(0.05)     0(0)


Performance of tuning [figure omitted]


Two gene microarray examples

Lymphoma: n = 62, p = 4026 genes; 42 DLBCL, 9 FL, and 11 CLL samples.
Leukemia: n = 72, p = 6817 genes; two types of human acute leukemia: 25 AML and 47 ALL patients.

Clustering errors are estimated by comparing the clustering assignments with the known cancer types.


Performance

Data       Method                K    Genes    Error
Leukemia   K-means               2    3571     2/72
           Sparse K-means        4    2577     2/72
           Regularized K-means   2    211      2/72
Lymphoma   K-means               2    4206     4/62
           Sparse K-means        3    3025     2/62
           Regularized K-means   3    66       1/62


Heatmap of Lymphoma on selected genes [figure omitted]
