Topic Mining over Asynchronous Text Sequences

Xiang Wang, Xiaoming Jin, Meng-En Chen, Kai Zhang, and Dou Shen

Abstract—Timestamped texts, or text sequences, are ubiquitous in real-world applications. Multiple text sequences are often related to each other by sharing common topics. The correlation among these sequences provides more meaningful and comprehensive clues for topic mining than those from each individual sequence. However, it is nontrivial to explore the correlation in the presence of asynchronism among multiple sequences, i.e., when documents from different sequences about the same topic have different timestamps. In this paper, we formally address this problem and put forward a novel algorithm based on the generative topic model. Our algorithm consists of two alternating steps: the first step extracts common topics from multiple sequences based on the adjusted timestamps provided by the second step; the second step adjusts the timestamps of the documents according to the time distribution of the topics discovered by the first step. We perform these two steps alternately, and after iterations a monotonic convergence of our objective function is guaranteed. The effectiveness and advantage of our approach were justified through extensive empirical studies on two real data sets, consisting of six research paper repositories and two news article feeds, respectively.

Index Terms—Temporal text mining, topic model, asynchronous sequences.


1 INTRODUCTION

More and more text sequences are being generated in various forms, such as news streams, weblog articles, emails, instant messages, research paper archives, web forum discussion threads, and so forth. To discover valuable knowledge from a text sequence, a first step is usually to extract topics from the sequence with both semantic and temporal information, described by two distributions, respectively: a word distribution describing the semantics of the topic and a time distribution describing the topic's intensity over time [1]–[10]. In many real-world applications, we face multiple text sequences that are correlated with each other by sharing common topics. Intuitively, the interactions among these sequences could provide clues to derive more meaningful and comprehensive topics than those found by using information from each individual sequence alone. This intuition was confirmed by very recent work [11], which utilized the temporal correlation over multiple text sequences to explore the semantic correlation among common topics. The method proposed therein relied on a fundamental assumption that different sequences are always synchronous in time, or, in their term, coordinated, which means that the common topics share the same time distribution over different sequences.

• Xiang Wang, Xiaoming Jin, Meng-En Chen, and Kai Zhang are with the School of Software, Tsinghua University, Beijing 100084, China. Email: [email protected]; [email protected]; [email protected]; [email protected]
• Dou Shen is with Microsoft Adcenter Labs, One Microsoft Way, Redmond, WA, USA. Email: [email protected]

However, this assumption is too strong to hold in all cases. Rather, asynchronism among multiple sequences, i.e., documents from different sequences on the same topic having different timestamps, is actually very common in practice. For instance, in news feeds, there is no guarantee that news articles covering the same topic are indexed by the same timestamps. There can be hours of delay for news agencies, days for newspapers, and even weeks for periodicals, because some sources try to provide first-hand flashes shortly after the incidents, while others provide more comprehensive reviews afterwards. Another example is research paper archives, where the latest research topics are closely followed by newsletters and communications within weeks or months; the full versions may then appear in conference proceedings, which are usually published annually, and at last in journals, which may take more than a year to appear after submission. To visualize this, we plot the relative frequency of the occurrences of two terms, warehouse and mining, in the titles of all research papers published from 1992 to 2006 in SIGMOD (ACM International Conference on Management of Data), a database-related conference, and TKDE (IEEE Transactions on Knowledge and Data Engineering), a database-related journal. The first term identifies the topic data warehouse and the second data mining, which are two common topics shared by the two sequences. As shown in Fig. 1(a), the bursts of both terms in SIGMOD are significantly earlier than those in TKDE, which suggests the presence of asynchronism between these two sequences. Thus, in this paper, we do not assume that given text sequences are always synchronous. Instead, we deal with text sequences that

[Figure 1: four panels of relative-frequency curves; x-axis: Year (1992–2006), y-axis: Relative Frequency. The panels compare warehouse − SIGMOD vs. warehouse − TKDE and mining − SIGMOD vs. mining − TKDE, (a) before synchronization and (b) after synchronization.]

Fig. 1. An illustrative example of the asynchronism between two text sequences and how it is fixed by our method.

share common topics yet are temporally asynchronous. Naturally, we expect that multiple correlated sequences can facilitate topic mining by generating topics of higher quality. However, the asynchronism among sequences brings new challenges to conventional topic mining methods. As shown in Fig. 1(a), if we overlook the asynchronism and apply conventional topic mining methods directly, we are very likely to fail to identify data mining and/or data warehouse as common topics of the two sequences, since the bursts of the topics do not coincide (and therefore the relative frequency of the topical words becomes too low compared to other words). In contrast, as shown in Fig. 1(b), after adjusting the timestamps of documents in the two sequences using our proposed method, the relative frequencies of both warehouse and mining are boosted over certain time ranges. Thus we are more likely to discover both topics from the synchronized sequences. This demonstrates that fixing asynchronism can significantly benefit the topic discovery process. However, as desirable as it is to detect the temporal asynchronism among different sequences and to eventually synchronize them, the task is difficult without knowing beforehand the topics to which the documents belong. A naïve solution is to use a coarse granularity for the timestamps of sequences so that the asynchronism among sequences can be smoothed out. This is obviously unsatisfactory, as it may lead to unbearable loss of the temporal information of common topics, and different topics would inevitably be mixed up. A second way, shifting or scaling the time dimension manually and empirically, may not work either, because the time difference of topics among different sequences can vary greatly and irregularly, and we can never have enough prior knowledge of it.
In this paper, we target the problem of mining common topics from multiple asynchronous text sequences and propose an effective method to solve it. We formally define the problem by introducing a

principled probabilistic framework, based on which a unified objective function can be derived. Then we put forward an algorithm to optimize this objective function by exploiting the mutual impact between topic discovery and time synchronization. The key idea of our approach is to utilize the semantic and temporal correlation among sequences and to build up a mutual reinforcement process. We start with extracting a set of common topics from the given sequences using their original timestamps. Based on the extracted topics and their word distributions, we update the timestamps of documents in all sequences by assigning them to the most relevant topics. This step reduces the asynchronism among sequences. Then, after synchronization, we refine the common topics according to the new timestamps. These two steps are repeated alternately to maximize a unified objective function, which provably converges monotonically. Besides the theoretical justification, our method was also evaluated empirically on two sets of real-world text sequences. The first is a collection of six literature repositories consisting of research papers in the database literature from 1975 to 2006, and the second contains two news feeds of 61 days of news articles between April 1 and May 31, 2007. We show that our method is able to detect and fix the underlying asynchronism among different sequences and effectively discover meaningful and highly discriminative common topics. To sum up, the main contributions of our work are:

• We address the problem of mining common topics from multiple asynchronous text sequences. To the best of our knowledge, this is the first attempt to solve this problem.
• We formalize our problem by introducing a principled probabilistic framework and propose an objective function for our problem.
• We develop a novel alternate optimization algorithm to maximize the objective function with a theoretically guaranteed (local) optimum.
• The effectiveness and advantage of our method are validated by extensive empirical studies on two real-world data sets.

The rest of the paper is organized as follows: related work is discussed in Section 2; we formalize our problem and propose a generative model with a unified objective function in Section 3; we show how to optimize the objective function in Section 4; extensions of our model and algorithm are discussed in Section 5; empirical results are presented in Section 6; we conclude our work in Section 7.

2 RELATED WORK

Topic mining has been extensively studied in the literature, starting with the Topic Detection and Tracking (TDT) project [12], [13], which aimed to find and track


topics (events) in news sequences1 with clustering-based techniques. Later, probabilistic generative models came into use, such as Probabilistic Latent Semantic Analysis (PLSA) [14], Latent Dirichlet Allocation (LDA) [15], and their derivatives [16]–[18]. In many real applications, text collections carry generic temporal information and thus can be considered as text sequences. To capture the temporal dynamics of topics, various methods have been proposed to discover topics over time in text sequences [1]–[9]. However, these methods were designed to extract topics from a single sequence. For example, in [5], [9], which adopted the generative model, the timestamps of individual documents were modeled with a random variable, either discrete or continuous. It was then assumed that, given a document in the sequence, the timestamp of the document was generated conditionally independently of the words. In [1], the authors introduced hyper-parameters that evolve over time in state transition models in the sequence. For each time slice, a hyper-parameter is assigned a state by a probability distribution, given the state of the previous time slice. In [7], the time dimension of the sequence was cut into time slices and topics were discovered from the documents in each slice independently. As a result, in multiple-sequence cases, topics in each sequence can only be estimated separately, and the potential correlation between topics in different sequences, both semantic and temporal, cannot be fully explored. In [16]–[18], the semantic correlation between different topics in static text collections was considered. Similarly, [19] explored common topics in multiple static text collections. A very recent work by Wang et al. [11] was the first to propose a topic mining method that aims to discover common (bursty) topics over multiple text sequences.
Their approach differs from ours in that they tried to find topics that share a common time distribution over different sequences, by assuming that the sequences were synchronous, or coordinated. Based on this premise, documents with the same timestamps are combined together over different sequences so that the word distributions of topics in individual sequences can be discovered. In contrast, in our work, we aim to find topics that are common in semantics while having asynchronous time distributions in different sequences. Asuncion et al. [20] studied a generalized asynchronous distributed learning scheme with applications in topic mining. However, in their work the term "asynchronous" means a set of independent Gibbs samplers that communicate with each other in an asynchronous manner. Therefore, their problem setting is fundamentally different from ours.

1. In the literature of topic mining, timestamped text sequences are also referred to as text streams. In this paper, we use the term sequence to distinguish it from the concept of data stream in the theory literature.

TABLE 1
Symbols and their meanings

Symbol  Description
d       document
t       timestamp
w       word
z       topic
M       number of sequences
T       length of sequences
V       number of distinct words
K       number of topics

We also note that there is a whole literature on similarity measures between time series (sequences). Various similarity functions have been proposed, many of which address the asynchronous nature of time series, e.g., [21], [22]. However, defining an asynchronism-robust similarity measure alone does not necessarily solve our problem. In fact, most of these similarity measures deal with asynchronism implicitly, rather than fixing the asynchronism explicitly, as we do in this work.

3 PROBLEM FORMULATION AND OBJECTIVE FUNCTION

In this section, we formally define our problem of mining common topics from multiple asynchronous text sequences. We introduce a generative topic model that incorporates both the temporal and semantic information in the given text sequences. We then derive our objective function, which maximizes the likelihood subject to certain constraints. The main symbols used throughout the paper are listed in Table 1.

First of all, we define a text sequence as follows:

Definition 1 (Text Sequence): S is a sequence of N documents (d1, ..., dN). Each document d is a collection of words over vocabulary V and is indexed by a unique timestamp t ∈ {1, ..., T}.

Note that in our definition, we allow multiple documents in the same sequence to share a common timestamp, which is usually the case in real applications. Given M text sequences, we aim to extract K common topics from them (K is given by users), which are defined as:

Definition 2 (Common Topic): A common topic Z over text sequences is defined by a word distribution over vocabulary V and a time distribution over timestamps {1, ..., T}.

To find common topics {Zk : 1 ≤ k ≤ K} over text sequences {Sm : 1 ≤ m ≤ M}, we put forward a novel generative model, derived from the topic model family that has been widely used in topic mining tasks. Our generative model is able to capture the interaction between the temporal and semantic information of topics and, as shown later, this interaction can be used to extract common topics from asynchronous sequences with an alternate optimization process.
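For concreteness, Definitions 1 and 2 map onto minimal data structures; the following is a sketch only, and the field names are ours, not the paper's:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Document:
    words: List[str]      # bag of words over the vocabulary V
    timestamp: int        # t in {1, ..., T}

@dataclass
class TextSequence:       # Definition 1: documents over a shared timeline
    docs: List[Document] = field(default_factory=list)

@dataclass
class CommonTopic:        # Definition 2: a word and a time distribution
    p_word: Dict[str, float]   # word -> probability (semantics)
    p_time: Dict[int, float]   # timestamp -> probability (intensity)

# Two documents may legally share a timestamp, as the definition allows:
seq = TextSequence([Document(["data", "mining"], 3),
                    Document(["data", "warehouse"], 3)])
```

Note that, matching the remark after Definition 1, nothing forbids multiple documents in one sequence from carrying the same timestamp.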


Fig. 2. An illustration of our generative model. Shaded nodes denote observable variables and white nodes denote unobservable (latent) variables. Arrows indicate the generative relationships.

The documents {d ∈ Sm : 1 ≤ m ≤ M} are modeled by a discrete random variable d. The words are modeled by a discrete random variable w over vocabulary V. The timestamps are modeled by a discrete random variable t over {1, ..., T}. Finally, the common topics Z are encoded by a discrete random variable z ∈ {1, 2, ..., K}. Note that the semantic information of a topic is encoded by the conditional distribution p(w|z) and its temporal information by p(z|t). The generative process is as follows (also see Fig. 2):
1) Pick a document d with probability p(d).
2) Given the document d, pick a timestamp t with probability p(t|d), where $p(t = \bar{t}\,|\,d) = 1$ for some $\bar{t}$. That is, each document has one and only one timestamp.
3) Given the timestamp t, pick a common topic z with probability p(z|t) ∼ Mult(θ).
4) Given the topic z, pick a word w with probability p(w|z) ∼ Mult(φ).
According to the generative process, the probability of word w in document d is
$$p(w, d) = p(d) \sum_{t,z} p(t|d)\, p(z|t)\, p(w|z).$$
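The four-step process above can be simulated directly; a hedged sketch in which the dimensions, parameter names, and Dirichlet initialization are our assumptions, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: V words, K topics, T timestamps, N documents.
V, K, T, N = 1000, 10, 30, 50

p_z_given_t = rng.dirichlet(np.ones(K), size=T)  # T x K, rows are p(z|t)
p_w_given_z = rng.dirichlet(np.ones(V), size=K)  # K x V, rows are p(w|z)
doc_time = rng.integers(0, T, size=N)            # step 2: one timestamp per document

def generate_document(d, n_words=100):
    """Sample the words of document d following steps 2-4 of the process."""
    t = doc_time[d]                                      # the single timestamp of d
    zs = rng.choice(K, size=n_words, p=p_z_given_t[t])   # step 3: topic per word
    return np.array([rng.choice(V, p=p_w_given_z[z]) for z in zs])  # step 4

words = generate_document(0)
```

Because p(t|d) is degenerate, each document draws all its words through a single column of p(z|t), which is what couples a topic's semantics to its time distribution.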

Consequently, the log-likelihood function over all sequences is
$$L = \sum_w \sum_d c(w, d) \log p(w, d),$$
where c(w, d) is the number of occurrences of word w in document d. Conventional topic mining methods try to maximize the likelihood function L by adjusting p(z|t) and p(w|z) while assuming p(t|d) is known. However, in our work, we need to consider the potential asynchronism among different sequences, i.e., p(t|d) is also to be determined. Thus, besides finding optimal p(z|t) and p(w|z), we also need to decide p(t|d) to further maximize L. In other words, we want to assign a document with timestamp t to a new timestamp g(t) by determining its relevance to the respective topics, so that we can obtain a larger L, or, equivalently, topics of better quality.

Note that the mapping from t to g(t) is not arbitrary. By the term asynchronism, we refer to the time distortion among different sequences. The relative temporal order within each individual sequence is still considered meaningful and generally correct (otherwise the current temporal information in the sequences would be totally useless and should be discarded, and the problem would reduce to mining topics from a collection of texts, not text sequences). Therefore, during each synchronization step, we preserve the relative temporal order of documents in each individual sequence, i.e., a document with an earlier timestamp before adjustment will never be assigned a later timestamp than its successors after adjustment. This constraint aims to protect the local temporal information within each individual sequence while fixing the asynchronism among different sequences. Formally, given two documents d1 and d2 in the same sequence, we require that g(t1) ≤ g(t2) ⇔ t1 ≤ t2. Then we have:

Definition 3 (Asynchronism): Given M text sequences {Sm : 1 ≤ m ≤ M}, in which documents are indexed by timestamps {t : 1 ≤ t ≤ T}, asynchronism means that the timestamps of the documents sharing the same topic in different sequences are not properly aligned.

Finally, our objective is to maximize the likelihood function L by adjusting p(z|t) and p(w|z) as well as p(t|d), subject to the constraint of preserving the temporal order within each sequence. Formally, it writes:
$$\arg\max_{p(t|d),\,p(z|t),\,p(w|z)} L, \tag{1}$$
$$\text{s.t. } \forall d_1, d_2 \in S_m,\ g(t_1) \le g(t_2) \Leftrightarrow t_1 \le t_2,$$
for 1 ≤ m ≤ M, where t1 and t2 are the current timestamps of d1 and d2, respectively, and g(t1) and g(t2) are the timestamps after adjustment.
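The order-preserving constraint is easy to check mechanically. A minimal sketch, implementing the monotone non-decreasing reading (under which distinct timestamps may merge, as the discussion in Section 5.1 allows):

```python
def preserves_order(g):
    """Check that a timestamp mapping g (dict: old timestamp -> new timestamp)
    never reverses relative order within a sequence: t1 <= t2 must imply
    g(t1) <= g(t2). Merging distinct timestamps is permitted."""
    ts = sorted(g)
    return all(g[a] <= g[b] for a, b in zip(ts, ts[1:]))
```

For example, `{1: 1, 2: 1, 3: 2}` passes (two timestamps merge), while `{1: 3, 2: 2, 3: 4}` fails because the first two timestamps swap order.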

4 ALGORITHM

In this section we show how to optimize our objective function in Eq. (1) through an alternate (constrained) optimization scheme. The outline of our algorithm is:
Step 1: Assume the current timestamps of the sequences are synchronous and extract common topics from them.
Step 2: Synchronize the timestamps of all documents by matching them to the most related topics, respectively.
Then we go back to Step 1 and iterate until convergence.

4.1 Topic Extraction

First we assume the current timestamps of all sequences are already synchronous and extract common topics from them. In other words, now p(t|d) is fixed


and we try to maximize the likelihood function by adjusting p(z|t) and p(w|z). Thus we can rewrite the likelihood function as follows:
$$\sum_w \sum_d c(w,d) \log \sum_{t,z} p(d)\,p(t|d)\,p(z|t)\,p(w|z)$$
$$= \sum_w \sum_d c(w,d) \log p(d) \sum_t p(t|d) \sum_z p(z|t)\,p(w|z).$$
Since we have $p(t = \bar{t}\,|\,d) = 1$ for some $\bar{t}$, the above equation can be reduced to
$$\sum_w \sum_t \sum_d c(w,d,t) \log \sum_z p(z|t)\,p(w|z)$$
$$= \sum_w \sum_t c(w,t) \log \sum_z p(z|t)\,p(w|z). \tag{2}$$

Here c(w, d, t) denotes the number of occurrences of word w in document d at time t, and p(d) is summed out because it can be considered a constant in the formula [14]. Eq. (2) can be solved by the well-established EM algorithm [14]. The E-step writes:
$$p(z|w,t) = \frac{p(z|t)\,p(w|z)}{\sum_{z'} p(z'|t)\,p(w|z')}, \tag{3}$$
and the M-step writes:
$$p(z|t) = \frac{\sum_w c(w,t)\,p(z|w,t)}{\sum_z \sum_w c(w,t)\,p(z|w,t)},$$
$$p(w|z) = \frac{\sum_t c(w,t)\,p(z|w,t)}{\sum_w \sum_t c(w,t)\,p(z|w,t)}. \tag{4}$$
The E- and M-steps repeat alternately, and the objective function is guaranteed to converge to a local optimum.
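The E- and M-step updates of Eqs. (3) and (4) vectorize naturally. A hedged NumPy sketch (the array shapes, clipping constant, and convergence check are our implementation choices):

```python
import numpy as np

def em_fit(c_wt, p_z_t, p_w_z, n_iter=100, tol=1e-6):
    """Alternate the E-step (3) and M-step (4) on a V x T count matrix c_wt.
    p_z_t is K x T holding p(z|t); p_w_z is K x V holding p(w|z).
    Returns the updated parameters and the final objective of Eq. (2)."""
    eps = 1e-300
    ll = -np.inf
    for _ in range(n_iter):
        # E-step, Eq. (3): responsibilities p(z|w,t), shape K x V x T
        joint = p_w_z[:, :, None] * p_z_t[:, None, :]
        resp = joint / np.maximum(joint.sum(axis=0, keepdims=True), eps)
        # M-step, Eq. (4): reweight by counts, then renormalize
        weighted = resp * c_wt[None, :, :]               # c(w,t) p(z|w,t)
        p_z_t = weighted.sum(axis=1)
        p_z_t /= np.maximum(p_z_t.sum(axis=0, keepdims=True), eps)  # over z
        p_w_z = weighted.sum(axis=2)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), eps)  # over w
        # objective of Eq. (2) for the convergence test
        mix = (p_w_z[:, :, None] * p_z_t[:, None, :]).sum(axis=0)   # V x T
        new_ll = float((c_wt * np.log(np.maximum(mix, eps))).sum())
        if new_ll - ll < tol:
            ll = new_ll
            break
        ll = new_ll
    return p_z_t, p_w_z, ll
```

By the standard EM argument, the returned objective is non-decreasing across iterations, which is what the convergence claim above relies on.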

4.2 Time Synchronization

Once the common topics are extracted, we match documents in all sequences to these topics and adjust their timestamps to synchronize the sequences. Specifically, now p(z|t) and p(w|z) are assumed known and we try to update p(t|d) to maximize our objective function. Given document d, we denote its current timestamp by t and its timestamp after adjustment by g(t). Then our objective function in Eq. (1) can be rewritten as:
$$\max_{g(t)} \sum_{m=1}^{M} \sum_w \sum_{s=1}^{T} Q(w,s) \sum_{\{d \in S_m :\, g(t)=s\}} c(w,d), \tag{5}$$
$$\text{s.t. } \forall d_1, d_2 \in S_m,\ g(t_1) \le g(t_2) \text{ iff } t_1 \le t_2,$$
where $Q(w,s) = \log \sum_z p(z|s)\,p(w|z)$. And p(t|d) can be decided by p(t = g(t)|d) = 1 and p(t ≠ g(t)|d) = 0. It is obvious that we can solve Eq. (5) by solving the following objective function for each sequence, respectively:
$$\max_{g(t)} \sum_w \sum_{s=1}^{T} Q(w,s) \sum_{\{d :\, g(t)=s\}} c(w,d), \tag{6}$$
$$\text{s.t. } \forall d_1, d_2,\ g(t_1) \le g(t_2) \text{ iff } t_1 \le t_2.$$
Next we define the following function:
$$H(1:i, 1:j) = \max_{g(t)} \sum_w \sum_{s=1}^{j} Q(w,s) \sum_{r=1}^{i} \sum_{d(r,s)} c(w,d),$$
where 1 ≤ i, j ≤ T. Here d(r, s) denotes the set of all documents whose timestamps are changed from r to s, i.e., {d : t = r, g(t) = s}. It is easy to see that our objective function in Eq. (6) equals H(1:T, 1:T). We now show how to compute H(1:T, 1:T) recursively. The basic idea behind our approach is as follows: suppose we already have j timestamps {1, ..., j} and documents whose current timestamps range from 1 to i − 1, i.e., {d : 1 ≤ t ≤ i − 1}; then, given the documents whose current timestamp is i, according to our constraint, their new timestamp g(i) must be no smaller than the new timestamps of the documents in {d : 1 ≤ t ≤ i − 1}. Thus, if the smallest new timestamp of the documents in {d : t = i} is a, then the documents in {d : 1 ≤ t ≤ i − 1} can only match to timestamps from 1 to a. So we can enumerate all possible matchings for 1 ≤ a ≤ j to find an optimal a for H(1:i, 1:j). Formally, we have
$$H(1:T; 1:T) = \max_{g(t)} \sum_w \sum_{s=1}^{T} Q(w,s) \left( \sum_{r=1}^{T-1} \sum_{d(r,s)} c(w,d) + \sum_{d(T,s)} c(w,d) \right)$$
$$= \max_{1 \le a \le T} \max_{g(t)} \left( \sum_w \sum_{s=1}^{a} Q(w,s) \sum_{r=1}^{T-1} \sum_{d(r,s)} c(w,d) + \sum_w \sum_{s=a}^{T} Q(w,s) \sum_{d(T,s)} c(w,d) \right)$$
$$= \max_{1 \le a \le T} \left( H(1:(T-1); 1:a) + \delta(T; a:T) \right),$$
where the second term equals
$$\delta(r; a:T) = \max_{a \le s \le T} \sum_{\{d :\, t = r\}} \sum_w Q(w,s)\,c(w,d),$$
for 1 ≤ r ≤ T, and the first term can be computed recursively as
$$H(1:i, 1:j) = \max_{1 \le a \le j} \left( H(1:(i-1); 1:a) + \delta(i; a:j) \right) \tag{7}$$
for 2 ≤ i ≤ T and 1 ≤ j ≤ T. Specially, we have
$$H(1:1, 1:a) = \max_{1 \le s \le a} \sum_{\{d :\, t = 1\}} \sum_w Q(w,s)\,c(w,d)$$
for 1 ≤ a ≤ T. After H(1:T, 1:T) is computed recursively, it gives the global optimum of our objective function in Eq. (6). Our algorithm is summarized in Algorithm 1. K is the number of topics and is specified by users. The initial values of p(t|d) and c(w, d, t) are counted from the original timestamps in the sequences.


Algorithm 1: Topic mining with time synchronization
Input: K, p(t|d), c(w, d, t)
Output: p(w|z), p(z|t), p(t|d)
repeat
    Update c(w, t) with p(t|d) and c(w, d, t);
    Initialize p(z|t) and p(w|z) with random values;
    repeat
        Update p(z|t) and p(w|z) following Eqs. (3) and (4);
    until convergence;
    for m = 1 to M do
        for j = 1 to T do
            Initialize H(1:1, 1:j);
        end
        for i = 2 to T do
            for j = 1 to T do
                Compute H(1:i, 1:j) as shown in Eq. (7);
            end
        end
        Update p(t|d);
    end
until convergence;

The computational complexity of the topic extraction step (with the EM algorithm) is O(KVT), while the complexity of the time synchronization step is approximately O(VMT^3). Thus the overall complexity of our algorithm is O(VT(K + MT^2)), where V is the size of the vocabulary, T the number of different timestamps, K the number of topics, and M the number of sequences. If we take V, K, and M as constants and only consider the length of the sequence, which is T, the complexity of Algorithm 1 becomes O(T^3). We will show in the next section how to reduce it to O(T^2) with a local search strategy.
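To make the dynamic program concrete, here is a hedged sketch for a single sequence. It uses the equivalent formulation best(i, s) = score(i, s) + max over s' ≤ s of best(i−1, s'), which computes the same optimum as the H/δ recursion of Eq. (7); aggregating counts per original timestamp is valid because g maps timestamps to timestamps, so all documents sharing a timestamp move together. Variable names are ours:

```python
import numpy as np

def synchronize_sequence(counts, Q):
    """Order-preserving timestamp reassignment for one sequence.
    counts: T x V, counts[r] sums c(w,d) over documents currently at time r.
    Q:      T x V, Q[s, w] = log sum_z p(z|s) p(w|z).
    Returns (optimal objective of Eq. (6), new timestamp g[r] for each old r)."""
    T = counts.shape[0]
    score = counts @ Q.T                 # score[r, s]: gain of mapping r -> s
    best = np.full((T, T), -np.inf)      # best[i, s]: rows 0..i with g[i] = s
    back = np.zeros((T, T), dtype=int)
    best[0] = score[0]
    for i in range(1, T):
        m = 0                            # running argmax of best[i-1, :s+1]
        for s in range(T):
            if best[i - 1, s] > best[i - 1, m]:
                m = s
            back[i, s] = m
            best[i, s] = score[i, s] + best[i - 1, m]
    # backtrack the (non-decreasing) assignment
    g = np.zeros(T, dtype=int)
    g[T - 1] = int(np.argmax(best[T - 1]))
    for i in range(T - 1, 0, -1):
        g[i - 1] = back[i, g[i]]
    return float(best[T - 1].max()), g
```

The running maximum keeps the recursion at O(T^2) table entries on top of the O(VT^2) cost of the score matrix, consistent with the complexity analysis above.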

5 DISCUSSIONS AND EXTENSIONS

5.1 The Constraint on Time Synchronization

Recall that in our model, we made a fundamental assumption about the asynchronism among the given sequences: we assume that the original timestamps as given are distorted, while the relative temporal order (the sequential information) between documents is correct in general. This assumption is based on observations from real-world applications. For example, news stories published by different news agencies may vary in absolute timestamps, but their relative temporal order conforms to the order of the occurrences of the events. We then translate this assumption into the formal constraint: g(t1) ≤ g(t2) ⇔ t1 ≤ t2. This constraint can be interpreted as a tradeoff between two extreme cases: 1) strictly obey the original timestamps, which will harm the quality of the resultant topics due to the underlying asynchronism; 2) discard all the given temporal information, which would result in topics without time distributions.

There are subtle variations in how we implement this constraint in practice. To initialize our method, at Iteration 1, t1 and t2 are the original timestamps given to us. Our algorithm then updates them to g(t1) and g(t2), respectively. At Iteration 2, a natural question arises: should we 1) keep using the original timestamps as t1 and t2, or 2) use the current timestamps g(t1) and g(t2) to replace the original t1 and t2? The difference is that the second option gives our algorithm more flexibility in updating timestamps. It may allow two documents to swap their positions along the time dimension after multiple iterations. For instance, suppose we have document d1 with timestamp 3 and d2 with timestamp 5. After the first round of synchronization, both d1 and d2 are mapped to time 4. Now we use 4 as the input value for both d1 and d2, so in the following round d2 may be assigned an earlier timestamp than d1 without violating the constraint. We argue that the second option works better in practice, since real-world data sets are not perfect. Although we assume that the temporal order of the given sequences is correct in general, there will still be a small number of documents that do not conform to this assumption. For example, we may have 10 documents on Topic A, followed by 10 documents on Topic B, and then one outlying document on Topic A but with a timestamp after Topic B. Our iterative updating process and the relaxed constraint help recover this kind of outlying document and assign it to the correct topic. As we will show later in the experimental results (Figs. 7 and 14), in practice documents tend to find new timestamps in the neighborhoods of their original positions, and local swapping of documents' positions often happens, which empirically justifies the flexibility and robustness of our method.

5.2 Convergence

Since both steps of our algorithm guarantee a monotonic improvement of our objective function in Eq. (1), the algorithm converges to a local optimum after a number of iterations. Notice that there is a trivial solution to the objective function, which is to assign all documents to a single (arbitrary) timestamp, and our algorithm could terminate at this local optimum. This local optimum is apparently meaningless, since it is equivalent to discarding all temporal information of the text sequences and treating them as a collection of documents. Nevertheless, this trivial solution only exists in theory. In practice, our algorithm will not converge to it, as long as we use the original timestamps of the text sequences as initial values and have K > 1, where K is the number of topics. As shown in Section 6, the adjusted timestamps of


documents always converge to more than K different time points. Note that this is so even after relaxing the constraint by allowing two documents to swap their temporal order after multiple iterations, as discussed above. This is because our algorithm is essentially a mutual reinforcement process in which we use both semantic and temporal information to identify common topics. The topic extraction step will prevent the algorithm from assigning all documents to a single timestamp, since in that case we may end up with topics of lower quality.

5.3 Cases Where Our Method May Not Work (Well)

Given the assumption we made, our model and algorithm will not work well in the following cases: 1) there is no correlation between the semantic and temporal information of topics, i.e., the time distribution of any topic is random (no bursty behavior); 2) the temporal order of documents as given by their original timestamps differs greatly from the temporal order of the underlying topics, e.g., Topic A appears before Topic B in one sequence, but after B in another. In either case, the better choice would be discarding the original temporal information and treating the text sequences as a collection of documents.

5.4 The Local Search Strategy

In some real-world applications, we can have a quantitative estimation of the asynchronism among sequences, so it is unnecessary to search the entire time dimension when adjusting the timestamps of documents. This gives us the opportunity to reduce the complexity of the time synchronization step without substantial performance loss, by setting an upper bound on the difference between the timestamps of documents before and after adjustment in each iteration. Specifically, given document d with time t, we now look for an optimal g(t) within the ε-neighborhood of t, where ε is the user-specified search range. Accordingly, Eq. (6) becomes:
$$\max_{g(t)} \sum_w \sum_{s=1}^{T} Q(w,s) \sum_{\{d :\, g(t)=s\}} c(w,d),$$
$$\text{s.t. } \forall d,\ g(t) \in [t-\varepsilon,\, t+\varepsilon],\ \forall d_1, d_2,\ g(t_1) \le g(t_2) \Leftrightarrow t_1 \le t_2.$$
This objective function can be solved by Eq. (7) with simple modifications. We can see that the complexity of the synchronization step is thereby reduced to O(VMT^2), and thus the overall complexity is reduced from O(T^3) to O(T^2).

6 EMPIRICAL EVALUATION

We evaluated our method on two sets of real-world text sequences, a set of six research paper repositories

TABLE 2
Statistics of the literature repositories data

ID       Years        #Docs   #Words per doc (title)
DEXA     1990-2006    1477    6.03
ICDE     1984-2006    1957    5.90
IS       1975-2006     939    5.93
SIGMOD   1975-2006    1877    5.40
TKDE     1989-2006    1457    6.29
VLDB     1975-2006    2329    5.67

and a set of two news article feeds. The goal is to see if our method is able to: 1) explore the underlying asynchronism among text sequences and fix it with our time synchronization techniques; 2) find meaningful and discriminative common topics from multiple text sequences; 3) consistently outperform not only the baseline method (without time synchronization) but also an improved competitor with a one-time synchronization process; 4) achieve stable performance under different parameter settings and random initializations.

6.1 Data Sets

The first data set used in our experiments is six research paper repositories extracted from DBLP2, namely DEXA, ICDE, Information Systems (journal), SIGMOD, TKDE (journal), and VLDB. These repositories mainly consist of research papers on database technology. Each repository is considered a single text sequence, where each document is represented by the title of the paper and timestamped by its publication year. The second data set is two news article feeds, which consist of the full texts of daily news reports published on the web sites of the International Herald Tribune3 and People's Daily Online4, respectively, from April 1, 2007 to May 31, 2007. Each document is timestamped by its publication date. The text sequences were preprocessed by TMG5 for stemming and stop-word removal. Words that appear in too many (over 15%) or too few (fewer than 0.5%) of the documents are also removed. After preprocessing, the literature repositories have a vocabulary of 1686 words and the news feeds 3358 words. The basic statistics of the data sets after preprocessing are shown in Tables 2 and 3.
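The document-frequency pruning described above can be sketched as follows (the thresholds come from the text; the function and argument names are ours):

```python
from collections import Counter

def prune_vocabulary(docs, max_df=0.15, min_df=0.005):
    """Keep only words whose document frequency lies within
    [min_df, max_df] as a fraction of the corpus, mirroring the
    preprocessing step described above.
    docs: list of token lists. Returns the retained vocabulary as a set."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    return {w for w, k in df.items() if min_df * n <= k <= max_df * n}
```

For instance, a word appearing in every document is dropped as background noise, while a word appearing in a single document out of ten survives the 0.5% floor.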

6.2 The Baseline Method and Implementation

For simplicity of description, in Section 4 we used the standard PLSA [14] method as the topic extraction

2. http://www.informatik.uni-trier.de/~ley/db/
3. http://www.iht.com/
4. http://english.peopledaily.com.cn/
5. http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/


TABLE 3
Statistics of the news feeds data

ID      #Days  #Docs  #Words per doc
IHT     61     2488   271.9
People  61     6461   65.8

step of our algorithm. In the experiments, however, we introduced two additional techniques, as used in [7], [11], and this modified version of the PLSA algorithm was used as the baseline method for topic extraction.

The first technique is to introduce a background topic p(w|B) into our generative model so that background noise can be removed and we can find more bursty and meaningful topics. Specifically, the objective function in Eq.(2) is rewritten as

∑_w ∑_t c(w,t) log( λ_B p(w|B) + (1 − λ_B) ∑_z p(z|t) p(w|z) ),

where p(w|B) = ∑_t c(w,t) / ∑_w ∑_t c(w,t) is a background topic whose distribution is independent of time, and λ_B ∈ (0,1) is a weighting parameter that decides the strength of the background topic. Empirically λ_B ∈ [0.9, 0.95], as suggested by previous work [7], [11]. In our experiments we set λ_B = 0.9 for the literature streams and λ_B = 0.95 for the news streams, according to their respective characteristics.

The second technique is to impose time dependency on p(z|t) by smoothing the time distribution of each topic between adjacent timestamps:

p(z|t) ← μ·p(z|t−1) / (2(1+μ)) + p(z|t) / (1+μ) + μ·p(z|t+1) / (2(1+μ)),

where μ is a smoothing factor; in our experiments we empirically chose μ = 0.1, following [11]. Note that the introduction of the background topic and the smoothing factor, which take place during each topic extraction step, does not affect the time synchronization step of our algorithm. Also note that the topic extraction step (the EM algorithm) of our method converges quite fast in practice: it normally took about 20 iterations to converge on both of the data sets we used.

6.3 Evaluation Metrics

We evaluated the performance of our method using several different metrics. Recall that in order to optimize our objective function, as shown in Eq.(1), we have three parameters to estimate, namely p(t|d), p(z|t), and p(w|z). Here p(t|d) gives the new timestamps of the documents after adjustment, p(z|t) indicates the time distribution of the extracted topics, and p(w|z) gives their word distributions. All of these parameters are examined in our experiments. Specifically:
• For p(w|z), we evaluate the meaningfulness of the extracted topics by examining their top-ranked topical words. We also compute the pairwise KL-divergence between topics to evaluate how discriminative they are. (In practice, we expect topics that can be easily understood by human users, and we want these topics to be as discriminative as possible, in order to avoid redundant information.)
• For p(z|t), we want to see whether, after synchronization, our method is able to separate different topics along the time dimension, which would eventually improve the quality of the extracted topics.
• For p(t|d), we demonstrate how our method adjusts documents' timestamps and fixes the synchronization among the sequences.

We also computed the log-likelihood of our method and compared it to that of the baseline method. To show the stability of our method against random initialization, we repeated our method 100 times and compared it to the baseline method under two different metrics: the log-likelihood and the pairwise KL-divergence between the word distributions of different topics.

6.4 Results on Literature Repositories

First we applied our method as well as the baseline method to the literature repositories. We extracted 10 common topics from the six sequences. For each topic, the 10 topical words with the highest probability p(w|z) are shown in Figs. 3 and 4. We can see that all topics extracted by our method (sync) are meaningful and easy to understand. For example, #7 includes research topics such as data mining, high-dimensional/multidimensional data, data warehouses, association rules, and workflows, while #10 includes sensor networks, privacy preservation, classification, ontologies, and top-k queries. All of these topical words accurately reflect the most important research topics in the database area. Comparing the topics extracted by our method to those extracted by the baseline method (no sync), we can see that our method produced highly discriminative topics. In contrast, the baseline method suffered from the asynchronism in the sequences and extracted many duplicated topical words (see Fig. 4). In asynchronous sequences, documents related to different topics may be indexed by the same timestamp, and documents related to the same topic may appear at different timestamps. As a result, common topics discovered by the conventional method contain redundant information, whereas our method is able to fix the asynchronism and discover highly discriminative topics. To further verify that our time synchronization technique helped to generate more discriminative topics, we computed the pairwise KL-divergence between


Top-10 topical words (sorted by probability)
1   file, data, language, abstract, relational, program, model, base, access, user
2   design, schema, theory, conceptual, methodology, CODASYL, specific, paper, tool, practice
3   distribute, concurrency, control, relational, hash, performance, extend, recursive, evaluation, depend
4   knowledge, expert, transaction, transit, replicate, closure, protocol, product, intelligence, hypertext
5   object, orient, deductive, parallel, database, multi-database, language, model, buffer, persistent
6   active, server, multimedia, heterogeneous, time, real, constraint, architecture, maintain, federal
7   mining, spatial, warehouse, association, dimension, workflow, high, business, scalable, video
8   web, search, similarity, cache, service, sequence, multi-dimensional, mobile, nearest, extract
9   XML, stream, peer, pattern, document, continuous, adaptive, approximate, XQuery, move
10  network, privacy, sensor, preserve, match, XPath, ranking, classification, ontology, top-k

Fig. 3. Common topics extracted by our method (sync) from literature repositories (K = 10).

Top-10 topical words (sorted by probability)
1   data, base, file, abstract, relational, language, level, large, conversation, structural
2   base, design, data, theory, paper, relational, CODASYL, practice, methodology, language
3   database, relational, design, distribute, file, recursive, hash, concurrency, control, extend
4   object, knowledge, orient, system, expert, transaction, transit, parallel, hypertext, deductive
5   object, orient, parallel, knowledge, database, deductive, multi-database, system, expert, language
6   object, rule, active, orient, server, parallel, heterogeneous, database, multimedia, transaction
7   mining, web, warehouse, multimedia, spatial, index, workflow, scalable, dimension, high
8   XML, cache, web, efficiency, service, similarity, search, mobile, mining, association
9   XML, stream, web, peer, mining, service, XQuery, P2P, adaptive, pattern
10  XML, network, stream, efficiency, privacy, pattern, peer, classification, web, clustering

Fig. 4. Common topics extracted by the baseline method (no sync) from literature repositories (K = 10). Underlined are duplicated topical words.

Fig. 5. The pairwise KL-divergence between topics extracted from the literature repositories (K = 10): (a) sync; (b) no sync.

topics as follows:

KL(z1, z2) = ∑_w p(w|z1) log( p(w|z1) / p(w|z2) ).

Note that a larger KL-divergence indicates that the two topics are more discriminative with respect to each other, and a divergence of 0 means the two topics are identical. We present the results in Fig. 5, where darker blocks mean smaller KL-divergence values. We can see that our method extracted much more discriminative topics than the baseline method did. As discussed above, this is because our method successfully fixed the asynchronism in the data set. The time distribution of the extracted topics is shown in Fig. 6. Note that we used p(t|z), which can be computed from p(z|t) by Bayes' rule, as the

Fig. 6. The time distribution of topics extracted from the literature repositories (K = 10): (a) sync; (b) no sync.

y-axis so that we can see more clearly the distribution of a given topic z over time t. We can see that without synchronization the extracted topics overlapped significantly over time (Fig. 6(b)), while our method substantially reduced the overlap between topics by fixing the asynchronism (Fig. 6(a)). This explains why our method was able to find more discriminative topics. We further provide a detailed view of how our method adjusted the timestamps of documents. Fig. 7 shows the mapping from the documents' original timestamps to those assigned by our method (sync). We can see that our synchronization technique, on the one hand, preserved the temporal order of the original text sequences and, on the other hand, discovered temporally adjacent documents belonging to the same topic and assigned them the same timestamp. Moreover, for the documents at each timestamp, we

Fig. 7. The mapping from documents' original timestamps (upper axis) to those determined by our method (lower axis): (a) DEXA; (b) ICDE; (c) IS; (d) SIGMOD; (e) TKDE; (f) VLDB. The boldness of lines indicates the number of documents belonging to that mapping.

Fig. 8. Normalized average time offset of papers at each year: (a) DEXA; (b) ICDE; (c) IS; (d) SIGMOD; (e) TKDE; (f) VLDB. Positive offset indicates that most papers in the corresponding year were assigned to a later timestamp, which means that they addressed common topics earlier than those papers with negative offset.

computed the difference between their original timestamps and their final timestamps after synchronization (g(t) − t). The offsets were then normalized so that they sum to 0. Finally, the average offset at each timestamp is shown in Fig. 8. Note that a positive time offset means that most documents at that timestamp were assigned a later timestamp after synchronization; in other words, documents with positive time offsets addressed common topics earlier than documents with negative time offsets. In Fig. 8 we can see that papers from ICDE, SIGMOD, and VLDB had positive time offsets at most timestamps, while papers from IS and TKDE mostly had negative time offsets. This means that common topics were addressed earlier in ICDE, SIGMOD, and VLDB than in IS and TKDE, which conforms to our knowledge that the latest research results in this area normally appear in conference proceedings years before they appear in journals.
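This offset statistic can be sketched as follows (an illustrative reading, not the authors' code; we realize the normalization by centering the per-timestamp averages so they sum to 0):

```python
import numpy as np

def average_time_offsets(orig_ts, new_ts):
    """Average per-timestamp offset between original and adjusted timestamps.

    orig_ts, new_ts: integer arrays of the same length, one entry per
    document (original timestamp t and synchronized timestamp g(t)).
    Returns a dict mapping each original timestamp to the mean offset
    g(t) - t of the documents originally stamped there, centered so
    the averages sum to zero overall.
    """
    orig_ts = np.asarray(orig_ts)
    new_ts = np.asarray(new_ts)
    offsets = new_ts - orig_ts
    avg = {t: offsets[orig_ts == t].mean() for t in np.unique(orig_ts)}
    center = np.mean(list(avg.values()))  # normalize: averages sum to 0
    return {t: v - center for t, v in avg.items()}
```

Plotting the returned values against the timestamps gives curves of the kind shown in Fig. 8.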

We also examined the semantic stability of the topics extracted by our method under random initialization. Specifically, we took the topics extracted with K = 10 over 100 rounds of random initialization. The 10 topics from Run 1 were chosen as the benchmark, and the topics from the other 99 runs were re-ordered to match the topics of Run 1 using a greedy algorithm, i.e., each topic was matched to its most similar topic in Run 1, with the similarity function defined by KL-divergence. We thus obtained 99 similarity matrices constructed from the KL-divergence values between the re-ordered topics and the benchmark topics, which we then averaged into one matrix. We repeated the above process 100 times so that every run was chosen once as the benchmark run. The average KL-divergence is shown in Fig. 9(a). This matrix suggests that a large percentage of topics have similar word distributions across different rounds of random initialization; in other words, the topics extracted by our method are semantically stable. As shown in Fig. 10(a), our method converges quickly after a small number of timestamp updates.
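The greedy re-ordering step can be sketched as follows (one plausible reading; the paper's greedy matching may additionally be without replacement, and the names here are illustrative):

```python
import numpy as np

def match_topics(benchmark, run):
    """Greedily align the topics of one run to the benchmark topics.

    benchmark, run: (K, W) arrays of topic-word distributions
    (rows strictly positive, summing to 1). Each topic in `run` is
    matched to the benchmark topic with the smallest divergence
    KL(run_topic || benchmark_topic).
    Returns a list: entry k is the benchmark index matched to run topic k.
    """
    matches = []
    for p in run:
        kl = np.array([np.sum(p * np.log(p / q)) for q in benchmark])
        matches.append(int(np.argmin(kl)))  # smallest divergence = most similar
    return matches
```

After re-ordering, the KL values between matched pairs populate one of the 99 similarity matrices described above.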

6.5 Results on News Feeds

Now we present the performance of our method on the news feeds data set. We extracted 15 common topics (K = 15) from the two news feeds, which consist of 61 days' news reports with full texts. Note that for efficiency, here we used the local search

Fig. 9. The average pairwise KL-divergence between topics extracted by our method (sync) over 100 rounds of random initialization: (a) Literature (K = 10); (b) News (K = 15). Each panel plots the benchmark topics (Run 1) against the re-ordered topics (Runs 2 to 100).

Fig. 13. The pairwise KL-divergence between topics extracted from the news feeds (K = 15): (a) sync; (b) no sync.

Fig. 14. The mapping from documents' original timestamps (upper axis) to those determined by our method (lower axis) in the news streams: (a) IHT; (b) People.

Fig. 10. The convergence behavior of our method on the literature and news data sets: (a) Literature; (b) News.

strategy for time synchronization, as described in Section 4. The local search radius was set to 3, since we assumed that the time difference between (online) news articles belonging to the same topic normally does not exceed 3 days. The topic extraction step remained the same. We list in Fig. 11 the topical words of all 15 common topics extracted by our method (sync), and those extracted by the baseline method (no sync) in Fig. 12. Comparing these two sets of results, we can see that both methods discovered some common topics in the sequences, e.g., the British sailors captured in Iran, the campus shooting at Virginia Tech, the French presidential election, and the Darfur crisis. Moreover, our method was able to find better-focused and more discriminative topics, while the baseline method found some confusing and duplicated topics. For instance, #10 of our method clearly and uniquely describes the French presidential election. In contrast, relevant topical words appear repeatedly in several different topics found by the baseline method (see the underlined words in #8, #9, #11, and #12). Similarly, #12 of our method discusses the situation in the Middle East, which is also discussed by #14 and #15 of the baseline method; these two topics are essentially duplicates. Besides duplicated topical words, some of the topics found by the baseline method contain keywords about different (and unrelated) news events, which may confuse users. For example, #8 of the baseline method mentions both the campus shooting at Virginia Tech and the French presidential election. In contrast, the topics extracted by our method are much better focused. Moreover, since our method is able to fix the asynchronism in the sequences and discover better-focused and

discriminative topics, it can eventually extract more information than the baseline method. In our case, given the same number of common topics (K = 15), our method found in #9 the resignation of the President of the World Bank, which was not properly captured by the baseline method. Fig. 13 confirms quantitatively that the topics extracted by our method (sync) are much more discriminative with respect to each other than those extracted by the baseline method (no sync). Figs. 14 and 15 show how our method adjusted the timestamps of documents in both news sequences, which is consistent with its behavior on the literature repositories: it automatically discovered documents related to the same topic, after considering their semantic as well as temporal information, and assigned them the same timestamp. Fig. 9(b) shows that the semantics of the topics extracted by our method with different random initial values are stable. The results on the news feeds show that our method performs well on different kinds of data. They also show that the local search strategy, which reduces the complexity of our method from O(T^3) to O(T^2), does not harm the performance of the method, as long as we have a rough estimate of the level

Fig. 15. Normalized average time offset of news articles at each day: (a) IHT; (b) People.

Top-10 topical words (sorted by probability)
1   British, Iranian, Iran, sailor, Britain, water, captive, marine, personnel, seize
2   church, Somalia, prison, Somali, Mogadishu, tax, Ethiopian, ship, Timor, Muslim
3   English, language, company, China, learn, test, oil, watch, native, speaker
4   student, shoot, Virginia, campus, Tech, Cho, gunman, university, victim, classroom
5   gun, Korean, mental, Korea, Cho, blame, firearm, happen, society, kid
6   company, billion, share, market, price, stock, game, Hong, Kong, sale
7   Arab, Nigeria, Baghdad, Maliki, car, gate, wall, Sunny, Sadr, neighborhood
8   Russia, missile, Russian, Putin, Moscow, Yeltsin, NATO, Japan, ab, Czech
9   bank, Wolfowitz, bill, senate, Republican, Olmert, resign, committe, board, Turkey
10  Sarkozy, France, French, Royal, socialist, Bayrou, Nicolas, Segolene, candidate, voter
11  Afghan, Taliban, Blair, Afghanistan, Pakistan, Pakistani, church, Musharraf, abort, justice
12  Palestinian, Hamas, Gaza, Isra, Israel, Fatah, rocket, camp, Lebanese, Lebanon
13  Syria, climate, Pelosi, emission, Yushchenko, warm, Damascus, Yanukovich, environment, water
14  Iraqi, Iran, Baghdad, nuclear, wound, Sadr, Shiite, insurgency, Sunni, explosion
15  Darfur, African, Africa, Sudan, Sudanese, rebel, DPRK, peacekeeper, north, Thai

Fig. 11. Common topics extracted by our method (sync) from news feeds (K = 15).

Top-10 topical words (sorted by probability)
1   water, Syria, Pelosi, emission, Damascus, sailor, environment, music, diplomat, gas
2   British, Iranian, Iran, sailor, water, Britain, marine, personnel, captive, seize
3   Baghdad, church, tax, Sadr, Timor, desert, prison, ship, gas, catholic
4   English, language, learn, native, speaker, speak, oil, culture, method, gas
5   Darfur, nuclear, Sudan, Sudanese, Africa, north, Arab, bank, Thai, tribune
6   student, shoot, campus, Virginia, gunman, gun, Tech, bear, hall, classroom
7   gun, Korean, Cho, mental, Korea, student, Virginia, blame, killer, happen
8   gun, France, mental, thing, Bayrou, (Le)Pen, video, man, Cho, Don
9   wall, Royal, round, voter, Bayrou, Nigeria, candidate, ballot, (Le)Pen, Sunni
10  Yeltsin, Russian, rose, George, treaty, Putin, ab, Soviet, Chinese, Japanese
11  Olmert, debate, Royal, oil, labor, Mccain, resign, governor, candidate, veto
12  Sarkozy, France, French, Royal, socialist, Nicolas, Segolene, Chirac, voter, Paris
13  Afghan, Cheney, abort, Taliban, Kosovo, depart, drug, justice, church, (Ramos-)Horta
14  Hamas, Fatah, camp, Gaza, Lebanese, rocket, Palestinian, Lebanon, military, Islam
15  Hamas, Isra, Iran, Iraqi, Palestinian, Gaza, rocket, camp, Israel, arrest

Fig. 12. Common topics extracted by the baseline method (no sync) from news feeds (K = 15). Underlined are duplicated topical words.

of asynchronism. Fig. 10(b) shows the convergence behavior of our method against timestamp updates.

6.6 Fixing Asynchronism is Non-Trivial

The results of our method on both data sets, compared with the no sync baseline method, show that fixing the asynchronism among text sequences can significantly improve the quality of the extracted common topics. It is therefore always desirable to synchronize multiple text sequences when mining common topics. However, one may argue that a simpler one-time preprocessing step could synchronize the text sequences more efficiently. For example, given M text sequences, we can choose one of them to be fixed and align the remaining M − 1 sequences to the chosen one. Specifically, suppose we fix S; then for every document d′ ∈ S′, S′ ≠ S, we traverse the documents in S that are within δ time segments of d′, compute the similarity between each of them and d′, and update the timestamp of d′ to that of its nearest neighbor. Formally,

t_new(d′) = t(d*), ∀d′ ∈ S′, where d* = argmax_{d ∈ S, t(d) ∈ [t(d′)−δ, t(d′)+δ]} sim(d, d′).
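This one-time alignment can be sketched as follows (a minimal illustration using inner-product similarity on term-count vectors; the function and variable names are hypothetical):

```python
import numpy as np

def one_time_align(fixed_docs, fixed_ts, other_docs, other_ts, delta):
    """One-time alignment of a sequence S' to a fixed reference sequence S.

    fixed_docs: (N, W) term-count vectors of the fixed sequence S
    fixed_ts:   length-N timestamps of S
    other_docs: (M, W) term-count vectors of the sequence S' to align
    other_ts:   length-M timestamps of S'
    delta:      search radius in time units

    Each document d' in S' is re-stamped with the timestamp of its most
    similar document in S within [t(d') - delta, t(d') + delta]; documents
    with no candidate in that window keep their original timestamp.
    """
    fixed_ts = np.asarray(fixed_ts)
    new_ts = np.array(other_ts, copy=True)
    for i, (d, t) in enumerate(zip(other_docs, other_ts)):
        mask = np.abs(fixed_ts - t) <= delta
        if mask.any():
            sims = fixed_docs[mask] @ d          # inner-product similarity
            best = np.flatnonzero(mask)[np.argmax(sims)]
            new_ts[i] = fixed_ts[best]
    return new_ts
```

Unlike the iterative method, this alignment is applied once and never revisited, which is exactly what the comparison in this subsection probes.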

sim is a predefined similarity function between two documents; in our implementation, it is the inner product. To show how this preprocessing approach compares to our method, we implemented it and applied it to both the literature and news data sets. From Fig. 16 we can see that the preprocessing method improved the quality of the topics (in terms of pairwise KL-divergence) over the no sync baseline method, which indicates that the one-time alignment can somewhat


Fig. 17. The pairwise KL-divergence between topics extracted from the news feeds (K = 15), with comparison between the baseline method (a: no sync), the preprocessing method (b: δ = 7), and our method (c: sync). The IHT sequence was chosen to be fixed.

relieve the problem of asynchronism. However, it was still far outperformed by our method, which uses iterative timestamp updating to refine the results. On the other hand, as shown in Fig. 17, the preprocessing method did not help much on the news feeds data, and its results were as poor as the baseline method's. Note that we tried different values of K, ranging from 5 to 30, and different values of δ; the results were similar to the two instances reported here.


Fig. 18. The log-likelihood curves of our method and the baseline methods, with different K and 100 rounds of random initialization.


Fig. 16. The pairwise KL-divergence between topics extracted from the literature repositories (K = 10), with comparison between the baseline method (a: no sync), the preprocessing method (b: δ = 5), and our method (c: sync). The SIGMOD sequence was chosen to be fixed.

word−only no−sync one−sync sync preproc

−8.34

10

Average Held−out Likelihood

10

x 10

−8.32

−3.44

200

5

7 8

50

9

250

150 6

7

−3.4

2 3

200

150 6

5

300 1

2

Log−likelihood

2

Log−likelihood

1

6.7 The Impact of the Parameter K

Fig. 19. The log-likelihood on held-out data, with 10-fold cross validation and different K.

When it comes to topic mining, it has always been subtle to choose a proper value for K, the number of topics, which is the only parameter of our method. On one hand, the quality of the extracted topics certainly varies with K. The most practical and effective way to decide K is to introduce domain knowledge; other approaches include various model selection principles, which are beyond the scope of this paper. On the other hand, as we show next, our method consistently improves the quality of topics across different values of K. In other words, the benefit of fixing the asynchronism is insensitive to the choice of K.

Figs. 18(a) and 18(b) show the log-likelihood curves. We used values of K ranging from 5 to 30 and, for each K, ran all methods 100 times with random initialization. The log-likelihood was defined as in Eq.(1). For the baseline method, we simply used the original timestamps of the documents. We can see in Figs. 18(a) and 18(b) that our method (sync) consistently outperformed the baseline method (no sync) by a large margin. In addition, our method outperformed the one sync method, the one-time synchronization version of our method; this verifies the improvement in the objective function gained from iterating the synchronization step. We also introduced the word only method, which discards all temporal information and treats the given sequences as a static collection of documents. It performed the worst in terms of likelihood, which suggests that temporal information can indeed facilitate topic mining. Figs. 19(a) and 19(b) show the predictive power of our method via the likelihood on held-out data (10-fold cross validation); our method with full synchronization (sync) outperformed all competitors on both data sets over different K. Figs. 20 and 21 show that our method generated topics of higher semantic quality than the baseline method, over different K. For each value of K (from 5 to 30), we repeated our algorithm 100 times with random initialization and computed the average pairwise KL-divergence between the extracted topics, as shown in Fig. 22. The results again confirm that the advantage of our method is stable not only under random initialization but also across different values of K, the only parameter of our algorithm.

Finally, we provide a method for choosing K in practice. Previous work [10] suggested starting with a small K and gradually increasing it while monitoring the likelihood value, then picking the K that maximizes the overall likelihood. We adopt the same mechanism here, except that instead of monitoring


Fig. 23. The average cross-topic KL-divergence over different K, using our method (sync): (a) Literature; (b) News.


Fig. 20. The quality of extracted topics from the literature repositories over different K, with comparison between our method and the baseline method: (a)-(c) no sync with K = 10, 20, 30; (d)-(f) sync with K = 10, 20, 30.

the overall likelihood6, we monitor the average KL-divergence between different topics (which indicates how distinguishable the topics are). As shown in Fig. 23, the average KL-divergence of our method peaks at around K = 100 on both data sets, which suggests setting K around that value.
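This selection heuristic can be sketched as follows (illustrative only; fit_topics is a hypothetical stand-in for running the full algorithm at a given K and returning smoothed topic-word distributions):

```python
import numpy as np

def avg_pairwise_kl(topic_word):
    """Mean KL(z_i || z_j) over all ordered pairs i != j of topic rows."""
    K = topic_word.shape[0]
    total = 0.0
    for i in range(K):
        for j in range(K):
            if i != j:
                total += np.sum(topic_word[i] * np.log(topic_word[i] / topic_word[j]))
    return total / (K * (K - 1))

def choose_k(fit_topics, candidate_ks):
    """Pick the K whose fitted topics are most mutually distinguishable.

    fit_topics: callable K -> (K, W) array of strictly positive
    topic-word distributions, one row per topic.
    """
    scores = {k: avg_pairwise_kl(fit_topics(k)) for k in candidate_ks}
    return max(scores, key=scores.get)
```

In practice one would sweep K over a grid (as in Fig. 23) and inspect the whole curve rather than trust a single argmax.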


Fig. 21. The quality of extracted topics from the news feeds over different K, with comparison between our method and the baseline method: (a)-(c) no sync with K = 10, 20, 30; (d)-(f) sync with K = 10, 20, 30.


7 CONCLUSION

In this paper we tackled the problem of mining common topics from multiple asynchronous text sequences. We proposed a novel method that can automatically discover and fix potential asynchronism among sequences and consequently extract better common topics. The key idea of our method is to introduce a self-refinement process that exploits the correlation between the semantic and temporal information in the sequences. It performs topic extraction and time synchronization alternately to optimize a unified objective function, and a local optimum is guaranteed by our algorithm. We justified the effectiveness of our method on two real-world data sets, with comparison to a baseline method. The empirical results suggest that 1) our method is able to find meaningful and discriminative topics from asynchronous text sequences; 2) our method significantly outperforms the baseline method, both qualitatively and quantitatively; and 3) the performance of our method is robust and stable under different parameter settings and random initialization.

ACKNOWLEDGMENTS


The work was supported by National Natural Science Foundation of China (90924003, 60973103), and China research project 2010ZX01042-002-002.


Fig. 22. The average (and variance) of the pairwise KL-divergence between topics over 100 rounds of random initialization with different K values. Our method has a higher average KL-divergence, i.e., more distinguishable topics, compared to the baseline methods.

REFERENCES

[1] D. M. Blei and J. D. Lafferty, "Dynamic topic models," in ICML, 2006, pp. 113-120.
[2] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu, "Parameter free bursty events detection in text streams," in VLDB, 2005, pp. 181-192.
[3] J. M. Kleinberg, "Bursty and hierarchical structure in streams," in KDD, 2002, pp. 91-101.

6. The reason is that in our formulation the likelihood is determined not only by the semantic quality of the topics but also by the timestamp assignment of the documents, which changes as K increases.


[4] A. Krause, J. Leskovec, and C. Guestrin, "Data association for topic intensity tracking," in ICML, 2006, pp. 497-504.
[5] Z. Li, B. Wang, M. Li, and W.-Y. Ma, "A probabilistic model for retrospective news event detection," in SIGIR, 2005, pp. 106-113.
[6] Q. Mei, C. Liu, H. Su, and C. Zhai, "A probabilistic approach to spatiotemporal theme pattern mining on weblogs," in WWW, 2006, pp. 533-542.
[7] Q. Mei and C. Zhai, "Discovering evolutionary theme patterns from text: an exploration of temporal text mining," in KDD, 2005, pp. 198-207.
[8] R. C. Swan and J. Allan, "Automatic generation of overview timelines," in SIGIR, 2000, pp. 49-56.
[9] X. Wang and A. McCallum, "Topics over time: a non-markov continuous-time model of topical trends," in KDD, 2006, pp. 424-433.
[10] T. L. Griffiths and M. Steyvers, "Finding scientific topics," PNAS, vol. 101, no. Suppl 1, pp. 5228-5235, 2004.
[11] X. Wang, C. Zhai, X. Hu, and R. Sproat, "Mining correlated bursty topic patterns from coordinated text streams," in KDD, 2007, pp. 784-793.
[12] J. Allan, R. Papka, and V. Lavrenko, "On-line new event detection and tracking," in SIGIR, 1998, pp. 37-45.
[13] Y. Yang, T. Pierce, and J. G. Carbonell, "A study of retrospective and on-line event detection," in SIGIR, 1998, pp. 28-36.
[14] T. Hofmann, "Probabilistic latent semantic indexing," in SIGIR, 1999, pp. 50-57.
[15] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," in NIPS, 2001, pp. 601-608.
[16] D. M. Blei and J. D. Lafferty, "Correlated topic models," in NIPS, 2005.
[17] W. Li and A. McCallum, "Pachinko allocation: DAG-structured mixture models of topic correlations," in ICML, 2006, pp. 577-584.
[18] D. M. Mimno, W. Li, and A. McCallum, "Mixtures of hierarchical topics with pachinko allocation," in ICML, 2007, pp. 633-640.
[19] C. Zhai, A. Velivelli, and B. Yu, "A cross-collection mixture model for comparative text mining," in KDD, 2004, pp. 743-748.
[20] A. Asuncion, P. Smyth, and M. Welling, "Asynchronous distributed learning of topic models," in NIPS, 2008, pp. 81-88.
[21] D. J. Berndt and J. Clifford, "Using dynamic time warping to find patterns in time series," in KDD Workshop, 1994, pp. 359-370.
[22] H. Sakoe, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, pp. 43-49, 1978.
Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in KDD Workshop, 1994, pp. 359–370. H. Sakoe, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, pp. 43–49, 1978.


Xiang Wang received his BSc degree in Mathematics in 2004 and Master of Software Engineering degree in 2008, both from Tsinghua University, China. He is currently a PhD student in Computer Science at University of California, Davis. His research interests are data mining and machine learning.


Xiaoming Jin is an associate professor in the School of Software, Tsinghua University. He received his doctor’s degree from Tsinghua University in 2003. His research interests are data mining and machine learning.

Meng-En Chen received the BS degree in automation from Zhejiang University of Technology, China, in 2008. He is currently working toward the ME degree in the School of Software, Tsinghua University, China. His research interests include machine learning, data mining, and information retrieval.

Kai Zhang received the BSc degree in automation from Beihang University in 2006 and the master's degree in software engineering from Tsinghua University in 2009. He is currently working as a quantitative investment analyst at Yinhua Fund Management Co., Beijing. His research interests include data mining, especially text mining, statistical models, and their applications.

Dou Shen is an Applied Researcher at Microsoft. He received his PhD degree in Computer Science from the Hong Kong University of Science and Technology (HKUST). His research interests are data mining and machine learning, information retrieval, and computational advertising. During his study at HKUST, he led a team participating in the KDD Cup and won all three prizes. Dr. Shen has published about 40 journal and conference papers and holds 10 patents. He is serving as a program committee member for the major conferences in the field (KDD, SIGIR, WWW, WSDM, AAAI, SDM, ICDM). He co-organized the data mining and audience intelligence for advertising workshops in conjunction with KDD in 2007, 2008, 2009, and 2010.