Community Recovery in Hypergraphs

Kwangjun Ahn, Dept. of Mathematical Sciences, KAIST, [email protected]
Kangwook Lee, Dept. of Electrical Engineering, KAIST, [email protected]
Changho Suh, Dept. of Electrical Engineering, KAIST, [email protected]

Abstract—Data clustering is a core problem in many fields of science and engineering. Community recovery in graphs is one popular approach to data clustering, and it has received significant attention due to its wide applicability to social network applications, protein complex detection, shape matching, image segmentation, etc. While community recovery in graphs has been extensively studied in the literature, the problem of community recovery in hypergraphs has received much less attention. In this paper, we study the generalized Censored Block Model (CBM), where observations consist of randomly chosen hyperedges of size d, each of which is associated with the modulo-2 sum of the values of the nodes in the hyperedge, corrupted by Bernoulli noise. We characterize the information-theoretic limit of community recovery in hypergraphs. Our results hold for general d, which may scale arbitrarily with the number of nodes n.

I. INTRODUCTION

Clustering of data is one of the central problems that arise in many fields of science and engineering. Among many related setups, community recovery [1], [2] has received significant attention due to its wide applicability to social network applications, protein complex detection [3], shape matching [4], image segmentation [5], etc. The goal of the community recovery problem is to cluster data points into different communities based on whether two data points belong to the same community or not. There has been a proliferation of works on various models of the community recovery problem, and the Censored Block Model (CBM) is one of the most popular models in the literature [6]–[8]. We illustrate the generalized CBM for the problem of community recovery in hypergraphs. In this model, the n individuals, each of which belongs to either group 0 or group 1, are modeled as nodes of a hypergraph. The goal is to cluster the n nodes (or find the hidden communities) using noisy parity measurements obtained from a random d-uniform hypergraph [9]. More specifically, a random hypergraph (of hyperedge size d) with the n nodes is given as the observation, in which each hyperedge exists with probability p. Each observed hyperedge is associated with the modulo-2 sum of the values of the nodes in the hyperedge, i.e., the parity of the hyperedge. Further, these parity measurements are noisy in that each hyperedge value is flipped with probability θ. For instance, when d = 3, if the three nodes of a hyperedge belong to groups 1, 0, and 0, respectively, the value of the hyperedge is 1 with probability 1 − θ, and 0 with probability θ. See Fig. 1 for an illustration. For the special case of d = 2, the information-theoretic limits as well as matching computational limits are characterized in [7], [8]. The prior works reveal that the minimum number

Fig. 1: Community recovery in hypergraphs. We illustrate the community recovery problem in hypergraphs under the generalized Censored Block Model (CBM). Shown on the left-hand side is the group of n nodes belonging to group 0 (dotted circles) or group 1 (solid circles). The goal of the problem is to cluster the nodes into the two groups from an observed hypergraph. Under the generalized CBM, an observed hypergraph consists of randomly chosen hyperedges of size d. Each observed hyperedge is associated with the modulo-2 sum of the values of the nodes in the hyperedge, corrupted by Bernoulli noise. For instance, the second hyperedge $E_2$ connects two group-0 nodes ($X_2$, $X_3$) and one group-1 node ($X_4$), and hence the corresponding parity is 1, but it is corrupted by noise, making the actual observed value 0. Our goal is to characterize when exact community recovery in hypergraphs is feasible in terms of p, θ, n, and d. Further, as illustrated in the middle of the figure, the community recovery problem can be seen as a certain channel coding problem with a fixed encoding scheme.
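To make the caption concrete, the noisy parity of the hyperedge $E_2 = \{2,3,4\}$ in Fig. 1 can be checked in a few lines (the node indices and values follow the caption; the snippet itself is our own illustration):

```python
# Node values from the caption: X2 = X3 = 0 (group 0), X4 = 1 (group 1).
X = {2: 0, 3: 0, 4: 1}
parity = (X[2] + X[3] + X[4]) % 2   # modulo-2 sum over the hyperedge
assert parity == 1                  # the clean parity is 1
Z = 1                               # Bernoulli(theta) noise realization: a flip occurs
observed = parity ^ Z
assert observed == 0                # the corrupted value that is actually observed
```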

of samples required for reliable community recovery is about $p\binom{n}{2} \simeq \frac{n\log n}{2(\sqrt{1-\theta}-\sqrt{\theta})^2}$. However, for the case of general d, such a characterization has been unknown in the literature, and this precisely sets the goal of our paper. More precisely, we seek to characterize the information-theoretic limit on the minimum number of samples (hyperedges) required for reliable community recovery. A summary of the important findings in this paper is as follows.
• If d grows asymptotically slower than log n, the minimum number of samples required for reliable community recovery is $p\binom{n}{d} \simeq \beta\,\frac{n\log n}{d}$; and
• if d grows asymptotically faster than or equal to log n, the minimum number of samples required for reliable community recovery is $p\binom{n}{d} \simeq \gamma n$,
where β and γ are some unknown constants, which can be bounded from below and above. Thus, to make reliable community recovery with a linear number of samples possible, the size of the hyperedges needs to scale at least as fast as log n.

The rest of the paper is organized as follows. In the remainder of this section, we relate our problem to a d-right-degree linear code, and then discuss related works; Sec. II introduces the problem formulation; in Sec. III, our main theorem is presented, along with some implications and remarks; in Sec. IV, we prove the achievability statement as well as the converse statement, deferring the proofs of some key lemmas to the Appendices; Sec. V presents numerical simulation results that corroborate our theoretical findings; and in Sec. VI, we conclude the paper.

A. Connection with channel coding

The community recovery problem has an inherent connection with channel coding problems [7]. In order to see this connection, consider the following point-to-point communication setup. The encoder employs the following random linear code. It first draws a random d-uniform hypergraph with n nodes, where each hyperedge exists with probability p. Given the input sequence of n bits, the parity bits corresponding to all the existing hyperedges are concatenated, forming a codeword. Note that the expected rate of this random code is $\frac{n}{p\binom{n}{d}}$. A codeword chosen from this code is transmitted through a Binary Symmetric Channel (BSC) with crossover probability θ, whose capacity is $1 - H(\theta)$ [10]. Given the received symbols, the decoder wishes to infer the n input bits. By associating the 0/1 symbols with the labels of the observed hyperedges, one can see that recovering communities in hypergraphs is equivalent to the above channel coding problem. See Fig. 1. A natural channel coding question arises: “How far is the rate of this random code from the capacity of the BSC?” Due to the equivalence, the information-theoretic limits of community recovery in hypergraphs immediately answer this question.
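The encoding step just described can be sketched directly; the helper names below are ours, and the brute-force enumeration of all $\binom{n}{d}$ candidate hyperedges is only meant to illustrate the model for tiny n. The expected codeword length is $p\binom{n}{d}$, so the expected rate is $n/(p\binom{n}{d})$:

```python
import itertools
import math
import random

def encode(bits, p, d, rng):
    """Random d-right-degree linear code: one parity bit per sampled hyperedge."""
    n = len(bits)
    # Each size-d hyperedge exists independently with probability p.
    edges = [e for e in itertools.combinations(range(n), d) if rng.random() < p]
    # The codeword concatenates the modulo-2 parities of the existing hyperedges.
    codeword = [sum(bits[i] for i in e) % 2 for e in edges]
    return edges, codeword

rng = random.Random(0)
n, d, p = 10, 3, 0.2
edges, codeword = encode([1, 0] * 5, p, d, rng)
expected_rate = n / (p * math.comb(n, d))  # n message bits per p*C(n,d) expected parity bits
```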
For instance, when d = 2, as mentioned earlier in this section, it is shown in [7], [8] that exact community recovery is possible if $p\binom{n}{2} \gtrsim \frac{n\log n}{2(\sqrt{1-\theta}-\sqrt{\theta})^2}$, implying that the following expected code rate can be achieved:
$$\frac{n}{p\binom{n}{2}} \lesssim \frac{2(\sqrt{1-\theta}-\sqrt{\theta})^2}{\log n}.$$
That is, while the capacity of a BSC is a fixed constant $1 - H(\theta)$, the rate of the hypergraph-based random code vanishes as $n \to \infty$. One natural question is whether one can achieve a non-vanishing code rate by increasing d. Our results on the information-theoretic limits of community recovery in hypergraphs answer this coding-theoretic question: when d scales as fast as log n, the random-hypergraph-based linear code can achieve a constant rate.

B. Related Work

Community recovery in standard graphs has been extensively studied in the literature. Under the Stochastic Block Model (SBM) [11], [12], the probability of an edge appearing in the observed graph depends on whether the edge connects two nodes in the same group or not. For instance, the SBM can capture the case where random graphs have statistically more edges between nodes within the same

community than between nodes across two different communities. The information-theoretic limits as well as matching computational limits have been characterized for the SBM [13]–[16]. Under the Censored Block Model (CBM) [6], [7], each edge is associated with a random label whose distribution depends on whether the edge connects two nodes in the same group or not. The information-theoretic limits as well as matching computational limits are characterized in [7], [8]. Abbe et al. [7] show that exact community recovery is impossible if $p\binom{n}{2} < \frac{(1-\epsilon)\,n\log n}{2(\sqrt{1-\theta}-\sqrt{\theta})^2}$ for $\epsilon > 0$. Hajek et al. [8] show that exact community recovery via an efficient algorithm is possible if $p\binom{n}{2} > \frac{(1+\epsilon)\,n\log n}{2(\sqrt{1-\theta}-\sqrt{\theta})^2}$. In [17], the labeled stochastic block model is proposed as a general observation model that includes both the SBM and the CBM as special cases. We focus on the generalized CBM for community recovery in hypergraphs, and to the best of our knowledge, we are the first to characterize the information-theoretic limits for such a generalized model. The community recovery problem in graphs or hypergraphs is closely related to MLS-dLIN problems [18], [19], the goal of which is to find a binary vector x that is maximally consistent with a given set of parities of d variables. In this context, the case d = 3 has been well studied: it is shown that the maximum likelihood decoder succeeds if $p \ge \frac{12\log n}{(0.5-\theta)^2 n^2}$ [19]. Unlike this prior result, our upper bounds on p, to be formally stated in Corollary 1, hold for an arbitrary constant d. Further, we provide matching lower bounds as well. Among the few efficient algorithms for the MLS-3LIN problem, the one proposed in [20], based on an efficient low-rank tensor factorization algorithm, is shown to find the optimal solution if $p = \Omega\big(\frac{\log^4 n}{n^{1.5}}\big)$.

C.
Notations

For any two sequences f(n) and g(n): $f(n) = \Omega(g(n))$ if there exists a positive constant c such that $f(n) \ge c\,g(n)$; $f(n) = O(g(n))$ if there exists a positive constant c such that $f(n) \le c\,g(n)$; $f(n) = \omega(g(n))$ if $\lim_{n\to\infty}\frac{f(n)}{g(n)} = \infty$; $f(n) = o(g(n))$ if $\lim_{n\to\infty}\frac{f(n)}{g(n)} = 0$; and $f(n) \asymp g(n)$ or $f(n) = \Theta(g(n))$ if there exist positive constants $c_1$ and $c_2$ such that $c_1 g(n) \le f(n) \le c_2 g(n)$. For a set A and an integer $m \le |A|$, we denote $\binom{A}{m} \overset{\text{def}}{=} \{B \subset A \mid |B| = m\}$. Let [n] denote $\{1, \cdots, n\}$. Let $e_i$ be the i-th standard unit vector. Let 0 be the all-zero vector and 1 be the all-one vector. Let $D(0.5\|\theta)$ be the Kullback-Leibler (KL) divergence between Bernoulli(0.5) and Bernoulli(θ), i.e., $D(0.5\|\theta) \overset{\text{def}}{=} 0.5\log\big(\frac{0.5}{\theta}\big) + 0.5\log\big(\frac{0.5}{1-\theta}\big)$. We use $\log(\cdot)$ to denote the natural logarithm, and $H(\cdot)$ to denote the binary entropy function.
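These two information measures drive all of the thresholds in the paper. The short sketch below (our own illustration) computes them and numerically confirms the identity $1 - e^{-D(0.5\|\theta)} = (\sqrt{1-\theta}-\sqrt{\theta})^2$, which is used implicitly throughout the proofs to link the KL term to the threshold constant:

```python
import math

def kl_half_theta(theta):
    """D(0.5||theta) in nats, as defined in the Notations section."""
    return 0.5 * math.log(0.5 / theta) + 0.5 * math.log(0.5 / (1 - theta))

def binary_entropy(theta):
    """H(theta) in bits, so that the BSC capacity is 1 - H(theta)."""
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)

for theta in [0.05, 0.1, 0.25, 0.4]:
    d = kl_half_theta(theta)
    gap = (math.sqrt(1 - theta) - math.sqrt(theta)) ** 2
    # The threshold constant (sqrt(1-theta)-sqrt(theta))^2 equals 1 - e^{-D(0.5||theta)}.
    assert abs((1 - math.exp(-d)) - gap) < 1e-12
```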

II. PROBLEM FORMULATION

A. Sampling Model: the Generalized CBM

Consider a collection of n vertices $V = [n]$, each represented by a binary variable $X_i \in \{0,1\}$, $1 \le i \le n$. Let $X \overset{\text{def}}{=} [X_i]_{1\le i\le n}$ be the ground truth vector. Suppose that d is given as a monotone function of n satisfying $2 \le d \le n/2$; for instance, we may choose d to be a constant or a function

that scales with n, such as $d = \lceil\sqrt{n} + 1\rceil$. Samples are obtained according to a measurement hypergraph $H = (V, \mathcal{E})$, where $\mathcal{E} \subset \binom{[n]}{d}$. We assume $H \sim \mathcal{H}^d_{n,p}$, i.e., each element of $\binom{[n]}{d}$ belongs to $\mathcal{E}$ with probability $p \in [0,1]$, independently. Fix a hypergraph $H = (V, \mathcal{E})$ and a ground truth vector X. For each hyperedge $E \in \mathcal{E}$, a noisy binary observation is given by
$$Y_E = \Big[\bigoplus_{i\in E} X_i\Big] \oplus Z_E,$$
where $\oplus$ denotes modulo-2 addition, and $Z_E \sim \text{Bernoulli}(\theta)$ for $0 \le \theta < \frac{1}{2}$. We assume that $\{Z_E\}_{E\in\mathcal{E}}$ is a collection of mutually independent random variables. We define the observation vector Y as $Y \overset{\text{def}}{=} [Y_E]_{E\in\mathcal{E}}$.

Note that when d = 2, this setup reduces to the well-known community detection problem. Note also that when d is even and the measurement hypergraph is fixed, the conditional distribution of $Y|X$ is equal to that of $Y|X\oplus 1$, implying that decoding X is possible only up to a global shift. In the rest of the paper, we assume that d is odd for ease of presentation; the even case can be readily dealt with by allowing for the global shift error.

B. Our Goals

This paper concerns exact recovery, that is, reconstructing the ground truth X given the observation vector Y.

Definition 1 (Error probability and admissibility). For any recovery procedure, or decoder, $\psi$, the error probability is defined as
$$P_e(\psi) = \max_{X\in\{0,1\}^n}\Pr[\psi(Y) \ne X].$$
Moreover, we say that p is admissible if $\lim_{n\to\infty}\min_\psi P_e(\psi) = 0$.

The goal is to characterize necessary or sufficient conditions on $(n, p, d, \theta)$ for reliable recovery. In particular, we will often rewrite the conditions using the sample complexity $p\binom{n}{d}$, which represents the expected number of hyperedges in the measurement random hypergraph.
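The sampling model above can be written down directly. The following sketch (our own illustration; names are ours) draws $H \sim \mathcal{H}^d_{n,p}$ and the noisy observations $Y_E$ for a tiny instance. Enumerating all $\binom{n}{d}$ candidate hyperedges is only feasible for small n; this is didactic, not an efficient sampler:

```python
import itertools
import random

def sample_cbm(n, d, p, theta, x, rng):
    """Sample the generalized CBM: hyperedges of H^d_{n,p} with noisy parities."""
    y = {}
    for edge in itertools.combinations(range(n), d):
        if rng.random() < p:                       # each hyperedge exists w.p. p
            parity = sum(x[i] for i in edge) % 2   # modulo-2 sum of node values
            noise = 1 if rng.random() < theta else 0  # Z_E ~ Bernoulli(theta)
            y[edge] = parity ^ noise
    return y

rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(8)]   # hidden communities, n = 8
y = sample_cbm(8, 3, 0.5, 0.1, x, rng)      # d = 3, p = 0.5, theta = 0.1
```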

III. MAIN RESULTS

We begin with the main theorem of the paper, which characterizes sufficient and necessary conditions for reliable recovery.

Theorem 1. Suppose that $H \sim \mathcal{H}^d_{n,p}$ for $2 \le d \le \frac{n}{2}$. For a fixed $\epsilon > 0$,
$$P_e = O(n^{-\epsilon}) \quad \text{if } p \ge \frac{5(1+\epsilon)/2}{(\sqrt{1-\theta}-\sqrt{\theta})^2}\cdot\max\Big\{1, \frac{2d\log 2}{\log n}\Big\}\cdot\frac{n\log n}{d\binom{n}{d}};$$
$$P_e \not\to 0 \quad \text{if } p \le (1-\epsilon)\max\Big\{\frac{1}{(\sqrt{1-\theta}-\sqrt{\theta})^2}\cdot\frac{n\log n}{d\binom{n}{d}},\ \frac{1}{1-H(\theta)}\cdot\frac{n}{\binom{n}{d}}\Big\}.$$

We refer the readers to Sec. IV for the proof. Let us interpret our main theorem for the following two cases: $d = \Omega(\log n)$ and $d = o(\log n)$. Note that when $d = \Omega(\log n)$, $P_e \to 0$ if $p > \beta_1\frac{n}{\binom{n}{d}}$ and $P_e \not\to 0$ if $p < \beta_2\frac{n}{\binom{n}{d}}$, where $\beta_1 = \max\big\{\frac{5\log n}{2d(\sqrt{1-\theta}-\sqrt{\theta})^2}, \frac{5\log 2}{(\sqrt{1-\theta}-\sqrt{\theta})^2}\big\} \asymp 1$ and $\beta_2 = \max\big\{\frac{\log n}{d(\sqrt{1-\theta}-\sqrt{\theta})^2}, \frac{1}{1-H(\theta)}\big\} \asymp 1$ for a fixed θ. Hence, Theorem 1 provides an order-wise tight characterization of the admissible region. Next, when $d = o(\log n)$, $P_e \to 0$ if $p > \frac{5/2}{(\sqrt{1-\theta}-\sqrt{\theta})^2}\cdot\frac{n\log n}{d\binom{n}{d}}$, and $P_e \not\to 0$ if $p < \frac{1}{(\sqrt{1-\theta}-\sqrt{\theta})^2}\cdot\frac{n\log n}{d\binom{n}{d}}$. Thus, the theorem offers a tighter characterization relative to the case $d = \Omega(\log n)$: the multiplicative gap between the lower and upper bounds is the small constant 5/2 regardless of θ. When d is of constant order, the result is sharpened even further, as formally stated in the following corollary.

Corollary 1. Suppose that $d \asymp 1$ and $H \sim \mathcal{H}^d_{n,p}$. For a fixed $\epsilon > 0$,
$$P_e \to 0 \quad \text{if } p \ge \frac{1+\epsilon}{(\sqrt{1-\theta}-\sqrt{\theta})^2}\cdot\frac{n\log n}{d\binom{n}{d}};$$
$$P_e \not\to 0 \quad \text{if } p \le \frac{1-\epsilon}{(\sqrt{1-\theta}-\sqrt{\theta})^2}\cdot\frac{n\log n}{d\binom{n}{d}}.$$

The proof can be readily deduced from that of Theorem 1, and hence we omit it. Corollary 1 characterizes the sharp threshold on the admissible region for constant d. Note that this result recovers that in [8] as a special case. Finally, we reinterpret our main theorem using sample complexity. When $d = o(\log n)$, reliable recovery is possible only if the sample complexity is superlinear in n. However, when $d = \Omega(\log n)$, reliable recovery can be done with linear sample complexity $\Theta(n)$. Hence, we can answer how large d needs to be to make reliable recovery with linear sample complexity possible.

Corollary 2. For $d = o(\log n)$, reliable recovery is impossible with linear sample complexity, while it is possible for $d = \Omega(\log n)$: if $d = k\log n$ for some constant $k > 0$, there exists a constant $c_k > 0$ such that reliable recovery is feasible whenever the sample complexity is larger than $c_k n$; if $d = \omega(\log n)$, there exists an absolute constant $c > 0$ such that reliable recovery is feasible whenever the sample complexity is larger than $cn$.

IV. PROOF OF THE MAIN THEOREM

We consider the noisy case ($\theta \ne 0$) only for conciseness. We remark that the proof of the noiseless case can be done analogously, and hence we omit it.

For both the achievability and converse proofs, we use the optimal maximum likelihood (ML) decoder, where ties are randomly broken. One can easily verify that the ML decoder reduces to
$$\hat{X}_{\text{ML}} = \arg\min_{A = [A_i]_{i\in[n]}} d(A),$$
where $d(A) \overset{\text{def}}{=} d_H\Big(Y, \Big[\bigoplus_{i\in E}A_i\Big]_{E\in\mathcal{E}}\Big)$ and $d_H(\cdot,\cdot)$ is the Hamming distance between two binary vectors.
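For tiny instances, the ML decoder above can be run by brute force over all $2^n$ candidates; the sketch below (our own naming; real instances require the efficient algorithms cited in Sec. I) recovers a planted vector in the noiseless case:

```python
import itertools

def hamming(u, v):
    """Hamming distance d_H between two binary sequences."""
    return sum(a != b for a, b in zip(u, v))

def ml_decode(edges, y, n):
    """Exhaustive ML decoding: minimize d(A) = d_H(Y, [parity of A on each edge])."""
    best, best_dist = None, None
    for a in itertools.product((0, 1), repeat=n):
        parities = [sum(a[i] for i in e) % 2 for e in edges]
        dist = hamming(parities, y)
        if best_dist is None or dist < best_dist:
            best, best_dist = a, dist
    return best

# Noiseless toy check: with all C(4,3) hyperedges observed and theta = 0,
# the decoder recovers the planted vector exactly (d odd, so no global-shift ambiguity).
edges = list(itertools.combinations(range(4), 3))
x = (1, 0, 1, 1)
y = [sum(x[i] for i in e) % 2 for e in edges]
assert ml_decode(edges, y, 4) == x
```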

A. Achievability

We first give an upper bound on the error probability:
\begin{align}
P_e &= \max_{X\in\{0,1\}^n}\Pr[\hat{X}_{\text{ML}} \ne X] \overset{(1)}{=} \Pr(\hat{X}_{\text{ML}} \ne 0 \mid X = 0)\\
&\overset{(2)}{\le} \Pr\Big(\bigcup_{A\ne 0}[d(A) \le d(0)] \;\Big|\; X = 0\Big) = \Pr\Big(\bigcup_{k=1}^{n}\bigcup_{\|A\|_1 = k}[d(A)\le d(0)] \;\Big|\; X = 0\Big)\\
&\le \sum_{k=1}^{n}\sum_{\|A\|_1 = k}\Pr\big(d(A) \le d(0) \mid X = 0\big) \overset{(3)}{=} \sum_{k=1}^{n}\binom{n}{k}\Pr\Big(d\Big(\sum_{i=1}^{k}e_i\Big) \le d(0) \;\Big|\; X = 0\Big),
\end{align}
where (1) follows by symmetry; (2) follows from the fact that the ML decoder fails only if there exists at least one nonzero vector whose likelihood is greater than or equal to that of the zero vector; and (3) follows by symmetry.

We now find an upper bound on $\Pr(d(\sum_{i=1}^{k}e_i) \le d(0) \mid X = 0)$ for $1 \le k \le n$. Let $O_k \subset \mathcal{E}$ be the collection of hyperedges that contain an odd number of nodes among $[k]$. Observe that every $E \notin O_k$ contributes equally to $d(\sum_{i=1}^{k}e_i)$ and $d(0)$. Hence, in order for $\sum_{i=1}^{k}e_i$ to have likelihood no smaller than that of the zero vector, the number of bit flips among $O_k$ has to be at least $|O_k|/2$. We define by $N_k$ the size of the subset of $\binom{[n]}{d}$ whose elements contain an odd number of elements of $[k]$:
$$N_k = \sum_{\substack{i \le d \\ i\ \text{odd}}}\binom{k}{i}\binom{n-k}{d-i}.$$
Using the notations $O_k$ and $N_k$, we can then get
\begin{align}
\Pr\Big(d\Big(\sum_{i=1}^{k}e_i\Big) \le d(0) \;\Big|\; X = 0\Big) &= \sum_{m=1}^{N_k}\Pr(|O_k| = m)\cdot\Pr\Big(d\Big(\sum_{i=1}^{k}e_i\Big) \le d(0) \;\Big|\; X = 0,\ |O_k| = m\Big)\\
&\le \sum_{m=0}^{N_k}\binom{N_k}{m}p^m(1-p)^{N_k-m}\,\Pr\Big(\sum_{E\in O_k}Z_E \ge \frac{|O_k|}{2} \;\Big|\; |O_k| = m\Big)\\
&\le \sum_{m=0}^{N_k}\binom{N_k}{m}p^m(1-p)^{N_k-m}\,e^{-mD(0.5\|\theta)}\\
&= \big(1 - p(1 - e^{-D(0.5\|\theta)})\big)^{N_k} \overset{\text{def}}{=} (1 - p')^{N_k},
\end{align}
where the last inequality is due to the Chernoff bound [21]. This together with (3) gives
$$P_e \le \sum_{k=1}^{n-1}\binom{n}{k}(1-p')^{N_k} + (1-p')^{\binom{n}{d}},$$
where the last term corresponds to $k = n$: since d is odd, every hyperedge contains an odd number of elements of $[n]$, i.e., $N_n = \binom{n}{d}$.

Now, we are left with estimating $N_k$ to complete the proof. For $d \asymp 1$, $N_k$ can be easily estimated using the fact that $\binom{n}{d} \approx \frac{n^d}{d!}$. In the general case where d can scale with n, however, this estimate is no longer valid. Instead, the following lemma, which is one of the key technical contributions of this paper, provides a lower bound:

Lemma 1. Let $\beta = \big\lceil\max\big\{\frac{n-d+1}{2d+1}, \frac{d+1}{2(n-d)+1}\big\}\big\rceil$ and $\alpha = \max\big\{\frac{n-d+1}{d}, \frac{d+1}{n-d}\big\}$. Then
$$\sum_{\substack{i\le d\\ i\ \text{odd}}}\binom{k}{i}\binom{n-k}{d-i} \;\ge\; \begin{cases}\dfrac{2k}{5\alpha}\dbinom{n}{d}, & k < \beta,\\[4pt] \dfrac{1}{5}\dbinom{n}{d}, & \beta \le k \le n-\beta,\\[4pt] \dfrac{2(n-k)}{5\alpha}\dbinom{n}{d}, & n-\beta < k.\end{cases}$$

We defer the proof to Appendix A. Employing Lemma 1, together with $\binom{n}{k} = \binom{n}{n-k}$ to merge the first and third ranges, we get:
\begin{align}
P_e &\le \sum_{k=1}^{\beta-1}\binom{n}{k}(1-p')^{N_k} + \sum_{k=\beta}^{n-\beta}\binom{n}{k}(1-p')^{N_k} + \sum_{k=n-\beta+1}^{n-1}\binom{n}{k}(1-p')^{N_k} + (1-p')^{\binom{n}{d}}\\
&\le 2\sum_{k=1}^{\beta-1}\binom{n}{k}(1-p')^{\frac{2k}{5\alpha}\binom{n}{d}} + \sum_{k=\beta}^{n-\beta}\binom{n}{k}(1-p')^{\frac{1}{5}\binom{n}{d}} + (1-p')^{\binom{n}{d}}\\
&\le 2\sum_{k=1}^{\beta-1}n^k e^{-p'\frac{2k}{5\alpha}\binom{n}{d}} + 2^n e^{-\frac{1}{5}p'\binom{n}{d}} + e^{-p'\binom{n}{d}}\\
&\le 2\sum_{k=1}^{\beta-1}\exp\Big\{k\Big(\log n - \frac{2p'}{5\alpha}\binom{n}{d}\Big)\Big\} \tag{4}\\
&\quad + \exp\Big\{n\log 2 - \frac{1}{5}p'\binom{n}{d}\Big\} + e^{-p'\binom{n}{d}}. \tag{5}
\end{align}
Note that (5) vanishes since $p' = p(1 - e^{-D(0.5\|\theta)}) \ge (1+\epsilon)\frac{5n\log 2}{\binom{n}{d}}$. We now show that (4) also vanishes by considering two cases: $d = o(n)$ and $d \asymp n$. First, consider the case $d = o(n)$. Note that $\alpha = \max\big\{\frac{n-d+1}{d}, \frac{d+1}{n-d}\big\} \le \frac{n}{d}$, that $\beta = \big\lceil\max\big\{\frac{n-d+1}{2d+1}, \frac{d+1}{2(n-d)+1}\big\}\big\rceil = \big\lceil\frac{n-d+1}{2d+1}\big\rceil \to \infty$, and that $\gamma \overset{\text{def}}{=} \log n - \frac{2dp'}{5n}\binom{n}{d} \to -\infty$. Thus,
$$(4) \le 2\sum_{k=1}^{\beta-1}\exp\{k\gamma\} = 2\,\frac{\exp\{\gamma\} - \exp\{\beta\gamma\}}{1 - \exp\{\gamma\}} \to 0.$$
When $d \asymp n$, we have $\beta = \big\lceil\max\big\{\frac{n-d+1}{2d+1}, \frac{d+1}{2(n-d)+1}\big\}\big\rceil \asymp 1$ and $\alpha = \max\big\{\frac{n-d+1}{d}, \frac{d+1}{n-d}\big\} \asymp 1$. Also, $\delta \overset{\text{def}}{=} \log n - \frac{2p'}{5\alpha}\binom{n}{d} \to -\infty$. Thus,
$$(4) = 2\,\frac{\exp\{\delta\} - \exp\{\beta\delta\}}{1 - \exp\{\delta\}} \to 0.$$
Therefore, $P_e \to 0$. Carefully following the arguments above, we can further see that $P_e = O(n^{-\epsilon})$.

B. Converse

First, it is obvious that $P_e \not\to 0$ for $p \le \frac{n}{(1-H(\theta))\binom{n}{d}}$: by the channel coding connection of Sec. I-A, the expected code rate would then exceed the BSC capacity. Hence, it suffices to consider the case $d = O(\log n)$ and $p\binom{n}{d} \asymp \frac{n\log n}{d}$. Let S be the event that the ground truth vector is the unique candidate for $\arg\min_A d(A)$. As $\Pr(S) \to 0$ implies $\liminf_{n\to\infty}P_e \ge 1/2$, we will show that $\Pr(S) \to 0$ when $p \le (1-\epsilon)C_\theta\frac{n\log n}{d\binom{n}{d}}$, where $C_\theta \overset{\text{def}}{=} (\sqrt{1-\theta}-\sqrt{\theta})^{-2}$. First, observe
$$\Pr(S) = \Pr(S \mid X = 0) \le \Pr\Big(\bigcap_{A\ne 0}[d(A) > d(0)] \;\Big|\; X = 0\Big) \le \Pr\Big(\bigcap_{i=1}^{n}[d(e_i) > d(0)] \;\Big|\; X = 0\Big).$$
We find an upper bound on the above quantity by finding a large enough subset of $[n]$ such that the corresponding collection of events of the form $[d(e_\cdot) > d(0)]$ is mutually independent. To this end, we first propose a crude construction of a subset of $[n]$ whose nodes do not share hyperedges. First, choose a big subset $R_{\text{big}} = \big\{1, 2, \cdots, \big\lceil 2c\frac{n}{\log^6 n}\big\rceil\big\}$ for some absolute constant $c > 0$. Then erase every node in $R_{\text{big}}$ that shares a hyperedge with another node in $R_{\text{big}}$ to obtain $R_{\text{res}}$. Formally,
$$R_{\text{res}} = R_{\text{big}} - \bigcup_{k=2}^{d}R^{(k)}_{\text{share}}, \qquad\text{where}\qquad R^{(k)}_{\text{share}} = \bigcup_{E\in\mathcal{F}^{(k)}_{\text{share}}}E \cap R_{\text{big}},$$
and $\mathcal{F}^{(k)}_{\text{share}}$ is the collection of hyperedges that meet $R_{\text{big}}$ at exactly k nodes. The following lemma guarantees that $R_{\text{res}}$ has size comparable to $R_{\text{big}}$.

Lemma 2. Suppose $d = O(\log n)$. Then, with probability approaching 1,
$$|R_{\text{res}}| \ge c\,\frac{n}{\log^6 n}$$
for some absolute constant $c > 0$.

The proof is deferred to Appendix B. Let $E_{\text{typ}} = \big[|R_{\text{res}}| \ge c\frac{n}{\log^6 n}\big]$. Conditioned on $E_{\text{typ}}$, the collection of events $[d(e_i) > d(0)]_{i\in R_{\text{res}}}$ is statistically independent. Hence,
$$\Pr(S) \lesssim \Pr\Big(\bigcap_{i\in R_{\text{res}}}[d(e_i) > d(0)] \;\Big|\; X = 0,\ E_{\text{typ}}\Big) = \prod_{i\in R_{\text{res}}}\Pr\big(d(e_i) > d(0) \mid X = 0,\ E_{\text{typ}}\big),$$
which leaves us to seek an upper bound on $\Pr(d(e_i) > d(0) \mid X = 0, E_{\text{typ}})$, or equivalently a lower bound on $\Pr(d(e_i) \le d(0) \mid X = 0, E_{\text{typ}})$, for $i \in R_{\text{res}}$.

We denote by $\mathcal{F}_i \subset \mathcal{E}$ the collection of hyperedges that contain i, for $i \in R_{\text{res}}$. Due to the construction, hyperedges in $\mathcal{F}_i$ must meet $R_{\text{big}}$ only at i. Hence, $|\mathcal{F}_i| \le \binom{n-|R_{\text{big}}|}{d-1} =: N$. Observe that $d(e_i) \le d(0)$ when $\sum_{E\in\mathcal{F}_i}Z_E \ge |\mathcal{F}_i|/2$. Thus,
\begin{align}
\Pr\big(d(e_i) \le d(0) \mid X = 0,\ E_{\text{typ}}\big) &= \sum_{m=1}^{N}\Pr(|\mathcal{F}_i| = m)\,\Pr\big(d(e_i) \le d(0) \mid X = 0,\ |\mathcal{F}_i| = m\big)\\
&\ge \sum_{m=1}^{N}\binom{N}{m}p^m(1-p)^{N-m}\,\Pr\Big(\sum_{E\in\mathcal{F}_i}Z_E \ge \frac{|\mathcal{F}_i|}{2} \;\Big|\; |\mathcal{F}_i| = m\Big).
\end{align}
Applying the reverse Chernoff bound [21] with a fixed $\delta > 0$, there exists $n_\delta > 0$ such that
$$\Pr\Big(\sum_{E\in\mathcal{F}_i}Z_E \ge \frac{|\mathcal{F}_i|}{2} \;\Big|\; |\mathcal{F}_i| = m\Big) \ge e^{-(1+\delta)mD(0.5\|\theta)}$$
for all $m \ge n_\delta$. Let $g_n$ be a sequence (to be determined) that diverges to $\infty$ as $n \to \infty$. Then, for sufficiently large n,
\begin{align}
\sum_{m=1}^{N}\binom{N}{m}p^m(1-p)^{N-m}\,\Pr\Big(\sum_{E\in\mathcal{F}_i}Z_E \ge \frac{|\mathcal{F}_i|}{2} \;\Big|\; |\mathcal{F}_i| = m\Big) \ge&\ \sum_{m=1}^{N}\binom{N}{m}p^m(1-p)^{N-m}e^{-(1+\delta)mD(0.5\|\theta)} \tag{6}\\
&- \sum_{m=1}^{g_n-1}\binom{N}{m}p^m(1-p)^{N-m}e^{-(1+\delta)mD(0.5\|\theta)}. \tag{7}
\end{align}
We shall show that (7) is negligible compared to (6). Note that
$$\frac{(7)}{(6)} \le \frac{(1-p)^N\sum_{m=1}^{g_n-1}N^m\Big(\frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p}\Big)^m}{(1-p)^N\sum_{m=1}^{N}\binom{N}{m}\Big(\frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p}\Big)^m} \le \frac{\sum_{m=1}^{g_n-1}\Big(N\frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p}\Big)^m}{\Big(1 + \frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p}\Big)^N} \approx \frac{\sum_{m=1}^{g_n-1}\Big(N\frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p}\Big)^m}{\exp\Big\{N\frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p}\Big\}}. \tag{8}$$
As $|R_{\text{big}}| = \Theta\big(\frac{n}{\log^6 n}\big)$ and $d = O(\log n)$, simple algebra yields
$$(n-j)\Big(1 - \frac{1}{\log^2 n}\Big) \le \big(n - j - |R_{\text{big}}|\big)$$

Fig. 2: We run Monte Carlo simulations to estimate the probability of success for n = 1000, varying d, and θ = 0. For each d, we normalize the number of samples by max(n, n log n/d). Observe that the probability of success quickly approaches 1 as the normalized sample complexity crosses 1.

for $0 \le j \le d-2$, which in turn gives
$$\frac{\binom{n-|R_{\text{big}}|}{d-1}}{\binom{n-1}{d-1}} \ge \Big(1 - \frac{1}{\log^2 n}\Big)^{d-1} \approx \exp\Big\{-\frac{d-1}{\log^2 n}\Big\} \to 1.$$
Thus, we obtain $N\frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p} \asymp \binom{n-1}{d-1}p \asymp \log n$, and by taking $g_n = \Big\lceil\log\Big(N\frac{pe^{-(1+\delta)D(0.5\|\theta)}}{1-p}\Big)\Big\rceil$, we get $(8) \to 0$. Hence,
\begin{align}
\Pr\big(d(e_i) > d(0) \mid X = 0,\ E_{\text{typ}}\big) &\lesssim 1 - \sum_{m=1}^{N}\binom{N}{m}p^m(1-p)^{N-m}e^{-(1+\delta)mD(0.5\|\theta)}\\
&= 1 - \big(1 - p(1 - e^{-(1+\delta)D(0.5\|\theta)})\big)^N\\
&\approx \exp\Big[-\exp\big\{-Np(1 - e^{-(1+\delta)D(0.5\|\theta)})\big\}\Big]\\
&\le \exp\Big[-\exp\Big\{-\binom{n-1}{d-1}p(1 - e^{-(1+\delta)D(0.5\|\theta)})\Big\}\Big]\\
&\le \exp\Big[-\exp\Big\{-(1-\epsilon)\,\frac{1 - e^{-(1+\delta)D(0.5\|\theta)}}{1 - e^{-D(0.5\|\theta)}}\,\log n\Big\}\Big]\\
&\le \exp\Big[-\exp\Big\{-\Big(1 - \frac{\epsilon}{2}\Big)\log n\Big\}\Big],
\end{align}
since $\delta > 0$ can be chosen arbitrarily small. Thus, we conclude
$$\prod_{i\in R_{\text{res}}}\Pr\big(d(e_i) > d(0) \mid X = 0,\ E_{\text{typ}}\big) \le \exp\Big[-\exp\Big\{-\Big(1-\frac{\epsilon}{2}\Big)\log n\Big\}\Big]^{|R_{\text{res}}|} \le \exp\Big\{-c\,\frac{n}{\log^6 n}\exp\Big\{-\Big(1-\frac{\epsilon}{2}\Big)\log n\Big\}\Big\} \to 0.$$

V. EXPERIMENTAL RESULTS

In this section, we provide Monte Carlo simulation results that corroborate our theoretical findings. Each point plotted in Fig. 2 and Fig. 3 is an empirical success rate obtained over 50 Monte Carlo trials. In Fig. 2, we plot the probability of successful recovery for n = 1000, varying d, and θ = 0. For each d, we normalize the number of samples


Fig. 3: We run Monte Carlo simulations to estimate the probability of success for varying n, varying d, θ = 0, and $p = 1.1n/\binom{n}{d}$. Note that when n increases by a multiplicative factor of 4, the curve shifts rightward by about the same amount, supporting our result in Corollary 2.

by max(n, n log n/d). One can observe that the probability of success quickly approaches 1 as the normalized sample complexity crosses 1. Plotted in Fig. 3 are the simulation results for varying n, varying d, θ = 0, and $p = 1.1n/\binom{n}{d}$. We note that when n increases by a multiplicative factor of 4, the curve shifts rightward by about the same amount, supporting our result in Corollary 2.

VI. CONCLUSION

In this paper, we study the problem of community recovery in hypergraphs under the generalized Censored Block Model (CBM), and characterize the information-theoretic limits of the problem as a function of the number of nodes n, the size of hyperedges d, the noise probability θ, and the hyperedge observation probability p. We also corroborate our theoretical results via Monte Carlo simulations. Our characterization implies that community recovery in hypergraphs with a linear number of measurements becomes possible when d is on the order of log n.

REFERENCES

[1] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3, pp. 75–174, 2010.
[2] M. A. Porter, J.-P. Onnela, and P. J. Mucha, “Communities in networks,” Notices of the AMS, vol. 56, no. 9, pp. 1082–1097, 2009.
[3] J. Chen and B. Yuan, “Detecting functional modules in the yeast protein–protein interaction network,” Bioinformatics, vol. 22, no. 18, pp. 2283–2290, 2006.
[4] Q.-X. Huang and L. Guibas, “Consistent shape maps via semidefinite programming,” in Computer Graphics Forum, vol. 32, no. 5. Wiley Online Library, 2013, pp. 177–186.
[5] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[6] E. Abbe, A. S. Bandeira, A. Bracher, and A. Singer, “Linear inverse problems on Erdős–Rényi graphs: Information-theoretic limits and efficient recovery,” in 2014 IEEE International Symposium on Information Theory. IEEE, 2014, pp. 1251–1255.
[7] ——, “Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery,” IEEE Transactions on Network Science and Engineering, vol. 1, no. 1, pp. 10–22, 2014.
[8] B. Hajek, Y. Wu, and J. Xu, “Exact recovery threshold in the binary censored block model,” in 2015 IEEE Information Theory Workshop - Fall (ITW). IEEE, 2015, pp. 99–103.

[9] R. Durrett, Random Graph Dynamics. Cambridge University Press, 2007.
[10] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[11] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[12] A. Condon and R. M. Karp, “Algorithms for graph partitioning on the planted partition model,” Random Structures and Algorithms, vol. 18, no. 2, pp. 116–140, 2001.
[13] A. Coja-Oghlan, “Graph partitioning via adaptive spectral techniques,” Combinatorics, Probability and Computing, vol. 19, no. 2, pp. 227–284, 2010.
[14] K. Chaudhuri, F. C. Graham, and A. Tsiatas, “Spectral clustering of graphs with general degrees in the extended planted partition model,” in COLT, vol. 23, 2012.
[15] E. Abbe and C. Sandon, “Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms,” arXiv preprint arXiv:1503.00609, 2015.
[16] E. Mossel and J. Xu, “Density evolution in the degree-correlated stochastic block model,” arXiv preprint arXiv:1509.03281, 2015.
[17] S. Heimlicher, M. Lelarge, and L. Massoulié, “Community detection in the labelled stochastic block model,” arXiv preprint arXiv:1209.2910, 2012.
[18] O. Watanabe and M. Yamamoto, “Average-case analysis for the MAX-2SAT problem,” Theoretical Computer Science, vol. 411, no. 16, pp. 1685–1697, 2010.
[19] O. Watanabe, “Message passing algorithms for MLS-3LIN problem,” Algorithmica, vol. 66, no. 4, pp. 848–868, 2013.
[20] P. Jain and S. Oh, “Provable tensor factorization with missing data,” in Advances in Neural Information Processing Systems, 2014, pp. 1431–1439.
[21] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.

APPENDIX A
PROOF OF LEMMA 1

Throughout this section, we focus on deriving the lower bound of
$$\sum_{\substack{i\le d\\ i\ \text{odd}}}\binom{k}{i}\binom{n-k}{d-i} = \sum_{\substack{i=\max\{0,\,k+d-n\}\\ i\ \text{odd}}}^{\min\{k,\,d\}}\binom{k}{i}\binom{n-k}{d-i}. \tag{9}$$

Here are some notations:
• $C = \max\{0, k+d-n\}$;
• $D = \min\{k, d\}$;
• A: the smallest odd number satisfying $A \ge C$;
• B: the largest odd number satisfying $B \le D$.

Thus, we have
$$\binom{n}{d} = \sum_{0\le i\le d}\binom{k}{i}\binom{n-k}{d-i} = \sum_{C\le i\le D}\binom{k}{i}\binom{n-k}{d-i} = \underbrace{\mathbb{1}[C\ \text{even}]\binom{k}{C}\binom{n-k}{d-C} + \mathbb{1}[D\ \text{even}]\binom{k}{D}\binom{n-k}{d-D}}_{(10)} + \underbrace{\sum_{\substack{A\le i\le B\\ i\ \text{even}}}\binom{k}{i}\binom{n-k}{d-i}}_{(11)} + \sum_{\substack{A\le i\le B\\ i\ \text{odd}}}\binom{k}{i}\binom{n-k}{d-i}.$$

Our overall scheme is to give upper bounds on (10) and (11) in terms of (9), so that we get a lower bound of (9) in terms of $\binom{n}{d}$:
$$\binom{n}{d} \le (\text{an expression})\cdot\sum_{\substack{i\le d\\ i\ \text{odd}}}\binom{k}{i}\binom{n-k}{d-i} \iff \frac{1}{(\text{an expression})}\cdot\binom{n}{d} \le \sum_{\substack{i\le d\\ i\ \text{odd}}}\binom{k}{i}\binom{n-k}{d-i}.$$
Thus, we will concentrate on finding upper bounds of (10) and (11) in terms of (9). We will call (10) the tail part and (11) the body part.

Conquering the Body Part

Fortunately, the body part can be reasonably bounded regardless of k; the bound is developed through the following lemma.

Lemma 3. For $C+1 \le i \le D-1$,
$$\frac{1}{2}\binom{k}{i}\binom{n-k}{d-i} \le \binom{k}{i+1}\binom{n-k}{d-(i+1)} + \binom{k}{i-1}\binom{n-k}{d-(i-1)}.$$
In particular,
$$\sum_{\substack{A\le i\le B\\ i\ \text{even}}}\binom{k}{i}\binom{n-k}{d-i} \le 2\binom{k}{A}\binom{n-k}{d-A} + 4\sum_{\substack{A+2\le i\le B-2\\ i\ \text{odd}}}\binom{k}{i}\binom{n-k}{d-i} + 2\binom{k}{B}\binom{n-k}{d-B}. \tag{12}$$

Proof. Because $C+1 \le i \le D-1$,
\begin{align}
\frac{\binom{k}{i+1}\binom{n-k}{d-(i+1)} + \binom{k}{i-1}\binom{n-k}{d-(i-1)}}{\binom{k}{i}\binom{n-k}{d-i}} &= \frac{(k-i)(d-i)}{(i+1)(n-k-d+i+1)} + \frac{i(n-k-d+i)}{(k-i+1)(d-i+1)}\\
&\ge 2\sqrt{\frac{(k-i)(d-i)}{(i+1)(n-k-d+i+1)}\cdot\frac{i(n-k-d+i)}{(k-i+1)(d-i+1)}}\\
&= 2\sqrt{\frac{k-i}{k-i+1}\cdot\frac{d-i}{d-i+1}\cdot\frac{i}{i+1}\cdot\frac{n-k-d+i}{n-k-d+i+1}}\\
&\ge 2\sqrt{\frac{1}{16}} = \frac{1}{2}.
\end{align}
(12) follows easily by applying the above to each summand on the left-hand side. (We can apply the above since $C+1 \le A+1 \le i \le B-1 \le D-1$ on the left-hand side.) ∎

Conquering the Tail Part

In contrast to the body part, in order to bound the tail part, we need to divide into two cases depending on k. The two cases are: 1) $\beta \le k \le n-\beta$; 2) $k < \beta$ or $k > n-\beta$,

,where β is some quantity to be defined. We begin with the first case. d+1 Lemma 4. Let β = dmax{ n−d+1 2d+1 , 2(n−d)+1 }e. For large enough n and β ≤ k ≤ n − β,       k n−k k n−k ≤2 C d−C C +1 d−C −1       k n−k k n−k ≤2 D d−D D−1 d−D+1

Proof. First, note that
$$
\binom{k}{0}\binom{n-k}{d} \le 2\binom{k}{1}\binom{n-k}{d-1}
\iff \frac{n-k-d+1}{d} \le 2k
\iff (2d+1)k \ge n-d+1
\iff k \ge \frac{n-d+1}{2d+1}
$$
and
$$
\binom{k}{k}\binom{n-k}{d-k} \le 2\binom{k}{k-1}\binom{n-k}{d-(k-1)}
\iff \frac{d-k+1}{n-d} \le 2k
\iff (2(n-d)+1)k \ge d+1
\iff k \ge \frac{d+1}{2(n-d)+1}.
$$
Hence, for $k \ge \beta$, we have
$$
\binom{k}{0}\binom{n-k}{d} \le 2\binom{k}{1}\binom{n-k}{d-1}
\quad\text{and}\quad
\binom{k}{k}\binom{n-k}{d-k} \le 2\binom{k}{k-1}\binom{n-k}{d-(k-1)}.
$$
Moreover, note that
$$
\binom{k}{d}\binom{n-k}{0} \le 2\binom{k}{d-1}\binom{n-k}{1}
\iff \frac{k-d+1}{d} \le 2(n-k)
\iff k \le \frac{2dn+d-1}{2d+1} = n - \frac{n-d+1}{2d+1}
$$
and
$$
\binom{k}{d-(n-k)}\binom{n-k}{n-k} \le 2\binom{k}{d-(n-k-1)}\binom{n-k}{n-k-1}
\iff \frac{d-n+k+1}{n-d} \le 2(n-k)
\iff k \le \frac{2n^2+n-1-2dn-d}{2(n-d)+1} = n - \frac{d+1}{2(n-d)+1}.
$$
Hence, for $k \le n-\beta$, we have
$$
\binom{k}{d}\binom{n-k}{0} \le 2\binom{k}{d-1}\binom{n-k}{1}
\quad\text{and}\quad
\binom{k}{d-(n-k)}\binom{n-k}{n-k} \le 2\binom{k}{d-(n-k-1)}\binom{n-k}{n-k-1}.
$$
Since $C = \max\{0, d-(n-k)\}$ and $D = \min\{d, k\}$, the four inequalities above cover both possible values of each boundary term. Therefore, when $k$ is in the range $\beta \le k \le n-\beta$, we get the result.

Now, for the second case, we use the following lemma.

Lemma 5. Let $\beta = \big\lceil \max\big\{\frac{n-d+1}{2d+1}, \frac{d+1}{2(n-d)+1}\big\}\big\rceil$ and $\alpha = \max\big\{\frac{n-d+1}{d}, \frac{d+1}{n-d}\big\}$. For large enough $n$, if $k < \beta$,
$$
\binom{k}{C}\binom{n-k}{d-C} \le \frac{\alpha}{k}\binom{k}{C+1}\binom{n-k}{d-C-1}, \qquad
\binom{k}{D}\binom{n-k}{d-D} \le \frac{\alpha}{k}\binom{k}{D-1}\binom{n-k}{d-D+1},
$$
and $\frac{\alpha}{k} \ge 2$. If $k > n-\beta$,
$$
\binom{k}{C}\binom{n-k}{d-C} \le \frac{\alpha}{n-k}\binom{k}{C+1}\binom{n-k}{d-C-1}, \qquad
\binom{k}{D}\binom{n-k}{d-D} \le \frac{\alpha}{n-k}\binom{k}{D-1}\binom{n-k}{d-D+1},
$$
and $\frac{\alpha}{n-k} \ge 2$.

Proof. Let us first assume $k < \beta$. Since $\beta < n-\beta$, the proof of Lemma 4 gives
$$
\binom{k}{d}\binom{n-k}{0} \le 2\binom{k}{d-1}\binom{n-k}{1}
\quad\text{and}\quad
\binom{k}{d-(n-k)}\binom{n-k}{n-k} \le 2\binom{k}{d-(n-k-1)}\binom{n-k}{n-k-1}.
$$
Next, note that
$$
\frac{\binom{k}{0}\binom{n-k}{d}}{\binom{k}{1}\binom{n-k}{d-1}} = \frac{n-k-d+1}{kd} \le \frac{n-d+1}{dk}
\quad\text{and}\quad
\frac{\binom{k}{k}\binom{n-k}{d-k}}{\binom{k}{k-1}\binom{n-k}{d-(k-1)}} = \frac{d-k+1}{(n-d)k} \le \frac{d+1}{(n-d)k}.
$$
Since
$$
\alpha = \max\Big\{\frac{n-d+1}{d}, \frac{d+1}{n-d}\Big\}
\ge 2\max\Big\{\frac{n-d+1}{2d+1}, \frac{d+1}{2(n-d)+1}\Big\},
$$
we have $\frac{\alpha}{k} \ge 2$ for $k < \beta$ (for large enough $n$). Combining the bounds above, each boundary term is at most $\frac{\alpha}{k}$ times its neighbor, which proves the claim for $k < \beta$.

Next, let us move on to the case $k > n-\beta$. Similarly, the fact that $\beta < n-\beta$ ensures
$$
\binom{k}{0}\binom{n-k}{d} \le 2\binom{k}{1}\binom{n-k}{d-1}
\quad\text{and}\quad
\binom{k}{k}\binom{n-k}{d-k} \le 2\binom{k}{k-1}\binom{n-k}{d-(k-1)}.
$$
Moreover, since
$$
\frac{\binom{k}{d}\binom{n-k}{0}}{\binom{k}{d-1}\binom{n-k}{1}} = \frac{k-d+1}{d(n-k)} < \frac{n-d+1}{d(n-k)}
\quad\text{and}\quad
\frac{\binom{k}{d-(n-k)}\binom{n-k}{n-k}}{\binom{k}{d-(n-k-1)}\binom{n-k}{n-k-1}} = \frac{d-n+k+1}{(n-d)(n-k)} < \frac{d+1}{(n-d)(n-k)},
$$
we obtain the result analogously.

Now, we are ready to prove Lemma 1. As in the previous lemmas, let $\beta = \big\lceil \max\big\{\frac{n-d+1}{2d+1}, \frac{d+1}{2(n-d)+1}\big\}\big\rceil$ and $\alpha = \max\big\{\frac{n-d+1}{d}, \frac{d+1}{n-d}\big\}$. According to Lemma 5, $\frac{\alpha}{k} \ge 2$ for $k < \beta$ and $\frac{\alpha}{n-k} \ge 2$ for $k > n-\beta$. Then (the body part)
$$
\sum_{\substack{i=A+2 \\ i\ \mathrm{even}}}^{B-2} \binom{k}{i}\binom{n-k}{d-i}
\overset{(a)}{\le}
2\binom{k}{A+1}\binom{n-k}{d-A-1}
+ 4\sum_{\substack{A+1 < i < B-1 \\ i\ \mathrm{odd}}} \binom{k}{i}\binom{n-k}{d-i}
+ 2\binom{k}{B-1}\binom{n-k}{d-B+1}
$$
and (the tail part)
$$
\binom{k}{A}\binom{n-k}{d-A} + \binom{k}{B}\binom{n-k}{d-B}
\overset{(b)}{\le}
\begin{cases}
\frac{\alpha}{k}\Big[\binom{k}{A+1}\binom{n-k}{d-A-1} + \binom{k}{B-1}\binom{n-k}{d-B+1}\Big], & k < \beta,\\[2pt]
2\Big[\binom{k}{A+1}\binom{n-k}{d-A-1} + \binom{k}{B-1}\binom{n-k}{d-B+1}\Big], & \beta \le k \le n-\beta,\\[2pt]
\frac{\alpha}{n-k}\Big[\binom{k}{A+1}\binom{n-k}{d-A-1} + \binom{k}{B-1}\binom{n-k}{d-B+1}\Big], & n-\beta < k,
\end{cases}
$$
where (a) comes from Lemma 3 and (b) comes from Lemmas 4 and 5. Hence,
$$
\binom{n}{d} = \sum_{i \le d} \binom{k}{i}\binom{n-k}{d-i}
\le (9) + (10) + (11)
\le
\begin{cases}
\big(\frac{\alpha}{k}+3\big) \cdot \sum_{\substack{i \le d \\ i\ \mathrm{odd}}} \binom{k}{i}\binom{n-k}{d-i}, & k < \beta,\\[2pt]
5 \cdot \sum_{\substack{i \le d \\ i\ \mathrm{odd}}} \binom{k}{i}\binom{n-k}{d-i}, & \beta \le k \le n-\beta,\\[2pt]
\big(\frac{\alpha}{n-k}+3\big) \cdot \sum_{\substack{i \le d \\ i\ \mathrm{odd}}} \binom{k}{i}\binom{n-k}{d-i}, & n-\beta < k,
\end{cases}
$$
and, since $\frac{\alpha}{k} \ge 2$ for $k < \beta$ and $\frac{\alpha}{n-k} \ge 2$ for $k > n-\beta$ (so that $\frac{\alpha}{k}+3 \le \frac{5\alpha}{2k}$ and $\frac{\alpha}{n-k}+3 \le \frac{5\alpha}{2(n-k)}$),
$$
\binom{n}{d} \le
\begin{cases}
\frac{5\alpha}{2k} \cdot \sum_{\substack{i \le d \\ i\ \mathrm{odd}}} \binom{k}{i}\binom{n-k}{d-i}, & k < \beta,\\[2pt]
5 \cdot \sum_{\substack{i \le d \\ i\ \mathrm{odd}}} \binom{k}{i}\binom{n-k}{d-i}, & \beta \le k \le n-\beta,\\[2pt]
\frac{5\alpha}{2(n-k)} \cdot \sum_{\substack{i \le d \\ i\ \mathrm{odd}}} \binom{k}{i}\binom{n-k}{d-i}, & n-\beta < k.
\end{cases}
$$
Therefore, we get the result.

APPENDIX B
PROOF OF LEMMA 2

We show that the proposed construction yields $|R_{\mathrm{res}}| \ge \frac{1}{2}|R_{\mathrm{big}}|$ with high probability. Choose $c = 1/4$, i.e., $|R_{\mathrm{big}}| = \frac{n}{4\log^6 n}$.

Lemma 6. For large enough $n$,
$$
\binom{|R_{\mathrm{big}}|-1}{i-1}\binom{n-|R_{\mathrm{big}}|}{d-i} \le \binom{|R_{\mathrm{big}}|-1}{1}\binom{n-|R_{\mathrm{big}}|}{d-2}
$$
for $2 \le i \le d$.

Proof. It is enough to show
$$
\binom{|R_{\mathrm{big}}|-1}{i-1}\binom{n-|R_{\mathrm{big}}|}{d-i} \le \binom{|R_{\mathrm{big}}|-1}{i-2}\binom{n-|R_{\mathrm{big}}|}{d-i+1}
$$
for each $3 \le i \le d$. On the other hand,
$$
\frac{\binom{|R_{\mathrm{big}}|-1}{i-2}\binom{n-|R_{\mathrm{big}}|}{d-i+1}}{\binom{|R_{\mathrm{big}}|-1}{i-1}\binom{n-|R_{\mathrm{big}}|}{d-i}}
= \frac{(i-1)(n-|R_{\mathrm{big}}|-d+i)}{(|R_{\mathrm{big}}|-i+1)(d-i+1)}
\ge \frac{2(n-|R_{\mathrm{big}}|-d)}{|R_{\mathrm{big}}|d}
\ge \frac{2(n-|R_{\mathrm{big}}|-\log^2 n)}{|R_{\mathrm{big}}|\log^2 n} \to \infty,
$$
as $|R_{\mathrm{big}}| = o(n)$. Hence, we get the inequality.
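Lemma 6 can likewise be sanity-checked numerically in its intended regime, where $|R_{\mathrm{big}}|\,d$ is small compared with $n$. The parameters below (with `m` standing in for $|R_{\mathrm{big}}|$) are illustrative assumptions, not values from the paper:

```python
# Numerical check of Lemma 6's inequality:
#   binom(m-1, i-1) * binom(n-m, d-i) <= binom(m-1, 1) * binom(n-m, d-2)
# for 2 <= i <= d, where m stands in for |R_big| and m*d is small
# relative to n (the regime of the lemma).
from math import comb

n, m, d = 100_000, 50, 10
rhs = comb(m - 1, 1) * comb(n - m, d - 2)

for i in range(2, d + 1):
    lhs = comb(m - 1, i - 1) * comb(n - m, d - i)
    # Equality holds at i = 2; the terms then decrease quickly,
    # mirroring the diverging ratio computed in the proof.
    assert lhs <= rhs, (i, lhs, rhs)

print("Lemma 6 check passed")
```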

Lemma 7. With probability approaching 1 as $n \to \infty$,
$$
|R^{(k)}_{\mathrm{share}}| \le 2|R_{\mathrm{big}}|^2 \frac{d\log^3 n}{n} \quad \text{for all } 2 \le k \le d.
$$

Proof. To begin with, observe that
$$
\binom{|R_{\mathrm{big}}|}{k}\binom{n-|R_{\mathrm{big}}|}{d-k}\, kp
= |R_{\mathrm{big}}| \binom{|R_{\mathrm{big}}|-1}{k-1}\binom{n-|R_{\mathrm{big}}|}{d-k}\, p
\overset{(a)}{\le} |R_{\mathrm{big}}| \binom{|R_{\mathrm{big}}|-1}{1}\binom{n-|R_{\mathrm{big}}|}{d-2}\, p
\le |R_{\mathrm{big}}|^2 \binom{n-2}{d-2}\, p
\le |R_{\mathrm{big}}|^2 \binom{n-2}{d-2} \frac{n\log n}{d\binom{n}{d}}
\le 2|R_{\mathrm{big}}|^2 \frac{d\log n}{n}
$$
for $2 \le k \le d$, where we used Lemma 6 in (a). Because $|F^{(k)}_{\mathrm{share}}|$ is the sum of $\binom{|R_{\mathrm{big}}|}{k}\binom{n-|R_{\mathrm{big}}|}{d-k}$ i.i.d. Bernoulli random variables with parameter $p$,
$$
E|F^{(k)}_{\mathrm{share}}| = \binom{|R_{\mathrm{big}}|}{k}\binom{n-|R_{\mathrm{big}}|}{d-k}\, p,
$$
so
$$
E|F^{(k)}_{\mathrm{share}}| \le 2|R_{\mathrm{big}}|^2 \frac{d\log n}{kn}.
$$
By Markov's inequality, we obtain
$$
\Pr\big(|F^{(k)}_{\mathrm{share}}| \ge \lambda\big) \le \frac{E|F^{(k)}_{\mathrm{share}}|}{\lambda} \le \frac{2|R_{\mathrm{big}}|^2 \frac{d\log n}{kn}}{\lambda},
$$
and putting $\lambda = 2|R_{\mathrm{big}}|^2 \frac{d\log^3 n}{kn}$ yields
$$
\Pr\Big(|F^{(k)}_{\mathrm{share}}| \ge 2|R_{\mathrm{big}}|^2 \frac{d\log^3 n}{kn}\Big) \le \frac{1}{\log^2 n}.
$$
Thus,
$$
\Pr\Bigg(\bigcup_{k=2}^{d} \Big\{|F^{(k)}_{\mathrm{share}}| \ge 2|R_{\mathrm{big}}|^2 \frac{d\log^3 n}{kn}\Big\}\Bigg)
\le \sum_{k=2}^{d} \Pr\Big(|F^{(k)}_{\mathrm{share}}| \ge 2|R_{\mathrm{big}}|^2 \frac{d\log^3 n}{kn}\Big)
\le \frac{d}{\log^2 n} \to 0.
$$
Thus, with probability approaching 1 as $n \to \infty$,
$$
|F^{(k)}_{\mathrm{share}}| < 2|R_{\mathrm{big}}|^2 \frac{d\log^3 n}{kn} \quad \text{for all } 2 \le k \le d.
$$
Because $|R^{(k)}_{\mathrm{share}}| \le k|F^{(k)}_{\mathrm{share}}|$, we get the result.

Now, we are ready to finish the proof:
$$
|R_{\mathrm{res}}| \ge |R_{\mathrm{big}}| - \sum_{k=2}^{d} |R^{(k)}_{\mathrm{share}}|
\ge |R_{\mathrm{big}}| - \sum_{k=2}^{d} 2|R_{\mathrm{big}}|^2 \frac{d\log^3 n}{n}
\ge |R_{\mathrm{big}}| - 2\frac{|R_{\mathrm{big}}|^2 d^2 \log^3 n}{n}
\ge |R_{\mathrm{big}}| - 2\frac{|R_{\mathrm{big}}|^2 f^2(n) \log^3 n}{n}
\ge |R_{\mathrm{big}}| - 2\frac{|R_{\mathrm{big}}|^2 \log^6 n}{n}
= |R_{\mathrm{big}}|\Big(1 - 2\frac{|R_{\mathrm{big}}|\log^6 n}{n}\Big) = \frac{1}{2}|R_{\mathrm{big}}|.
$$
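The arithmetic of this final count can be sanity-checked directly. The sketch below assumes, as the chain above implicitly requires, that $d^2 \le \log^3 n$; the specific `n` and the cutoff for `d` are illustrative assumptions, not values from the paper:

```python
# Arithmetic sanity check of the final bound (illustrative, not a proof).
# With |R_big| = n / (4 log^6 n), the total overlap bound
#   sum_k |R_share^(k)| <= 2 |R_big|^2 d^2 log^3 n / n
# stays below |R_big| / 2 as long as d^2 <= log^3 n.
import math

n = 10**12
logn = math.log(n)
R_big = n / (4 * logn**6)

d = int(logn**1.5)              # largest d with d^2 <= log^3 n (assumed cutoff)
assert d**2 <= logn**3

overlap = 2 * R_big**2 * d**2 * logn**3 / n
assert overlap <= R_big / 2 + 1e-9   # overlap does not exceed |R_big| / 2

# The factored form used in the last step evaluates to exactly 1/2.
assert abs((1 - 2 * R_big * logn**6 / n) - 0.5) < 1e-12

print("final count check passed")
```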
