Finding Near Duplicates in Short Text Messages in Singlish Using MapReduce and Phonetic Matching

Jophia Yi Wen Soh
School of Information Technology, Nanyang Polytechnic
180 Ang Mo Kio Avenue 8, Singapore 569830
Email: [email protected]

Kenny Zhuo Ming Lu
School of Information Technology, Nanyang Polytechnic
180 Ang Mo Kio Avenue 8, Singapore 569830
Email: kenny [email protected]

Abstract—We consider the near-duplicate detection problem in the context of English tweets localized with Singapore dialects. We study and implement the detection algorithm using the MinHash and SimHash hashing algorithms in the MapReduce framework. To handle localized terms (a.k.a. Singlish), we extend the algorithm by applying phonetic-based matching to the out-of-vocabulary terms. The empirical results show that this approach increases the accuracy.

I. INTRODUCTION

Twitter is one of the major social media sources indispensable for many analytics applications. Many analyses of Twitter data require tweets to be categorized, to improve the accuracy of the analysis and the clarity of the data visualization. In this work, we target the detection of near-duplicate tweets found in Singapore cyberspace. This subset of tweets exhibits interesting features thanks to Singlish, a localized dialect of English with mixed vocabulary and grammar structures influenced by Mandarin Chinese, Chinese dialects, Malay, and Tamil.

Near-duplicate detection is a well-studied topic [12], [8], [10]. It has been applied in the domains of document crawling, indexing, retrieval, classification, plagiarism detection, etc. To the best of our knowledge, however, there is no work addressing localized English variants such as Singlish. Scalability is one of the main concerns of our project: our implementation is able to scale up to terabytes of data with a MapReduce-oriented [5] design. In this paper, we specifically exclude duplicates caused by retweeting, which are indicated via the "RT" keyword in the text or the Twitter API result attribute. We focus on identifying repetitive tweets generated by third-party apps (such as games and mobile advertising apps) and by human spamming. Our contributions are as follows:
1) We formulate a near-duplicate detection algorithm in MapReduce using MinHash and SimHash.
2) We develop a localized dictionary to handle Singlish phrases.
3) We propose a novel phonetic-based matching algorithm for Singlish phrases based on regular expression patterns.

The paper is organized as follows. In Section II, we go through a set of challenging examples which motivate the need for the algorithm and its extensions. In Section III, we present a concise formulation of the MinHash-SimHash algorithm in the MapReduce framework. In Section IV, we extend the algorithm to handle the Singlish terms arising from the tweets. In Section V, we present the experimental results. In Section VI, we discuss related work. We conclude in Section VII.

II. EXAMPLES

To motivate the key ideas, let us consider some examples which are regarded as near duplicates. The following set of tweets was generated by a game application:

1 has reached level 25 in Race or Die Online for the iPhone! Check it out: http://bit.ly/2drIz6
2 has reached level 250 in Race or Die Online for the iPhone! Check it out: http://bit.ly/2drIz6
3 has reached level 5 in Race or Die Online for the iPhone! Check it out: http://bit.ly/2drIz6

There are only minor differences among them, as they are the output of a third-party game application reporting the players' progress. This information is not interesting to most followers, nor does it generate much business-analytics value (except for the game publisher and developers). Standard algorithms such as MinHash [3] and SimHash [4] are able to identify the similarity among the above tweet snippets. In essence, both algorithms operate by converting the input into a set of features; the similarity comparison is performed over the feature set. Since it is a set, the order of the features is not taken into consideration. Suppose we map each word into a feature; the algorithms are then unable to distinguish contrived cases like the following:

1 Before I die, I'm going to bang a black guy in a bear costume in the library. Not too much I'm asking for.
2 I'm going to bang a black guy in bear costume in the library before I die. I'm not asking for too much.

To eliminate such false positives, we consider taking bi-gram or tri-gram features when we apply the algorithms. A more detailed discussion follows in Section III-C. The task becomes subtler when we consider tweets that contain Singlish phrases. Consider the following:

1 [English] The soup was fantastic!
2 [Singlish] The soup was shiok!
3 [English] That place is so isolated that you could hardly see anyone there.
4 [Singlish] That place is so ulu that you could hardly see anyone there.
5 [English] Excuse me, do you think it would be possible for me to enter through this door to the HR department?
6 [Singlish] Scuse me, is it possible for me to pass this door to the HR department?
7 [English] The Malaysian government concludes that no one is still alive on MH370.
8 [Singlish] The Msian gahment concludes that no one is still alive on MH370.

where the odd rows are tweets in standard English and the even rows are their Singlish counterparts. It is obviously challenging for the "vanilla" MinHash-SimHash algorithm to handle these Singlish phrases. One may suggest that such false-negative instances can be fixed either by adjusting the threshold of the similarity comparison or by adopting a pre-processing phase in which the dialect terms are translated statically to their English equivalents. However, adjusting the threshold introduces more false positives than true positives and leads to a drop in accuracy, while a static translation dictionary is limited by its fixed vocabulary and by the time lag in the introduction of new phrases. Our implementation incorporates not only a pre-processing phase that normalizes known Singlish terms to their English equivalents, but also a phonetic-based post-processing matching phase for the out-of-vocabulary Singlish terms which can be derived phonetically from their English counterparts. For instance, terms such as "Scuse", "Msian" and "gahment" are handled without the need for entries in the static dictionary. The gist of the idea is to use regular expression patterns to model the pronunciation of these words and to compare them via regular expression matching. The matching result is used to moderate the duplicate detection scores generated by the MinHash-SimHash algorithm.

III. MAPREDUCE-BASED NEAR DUPLICATE DETECTION USING MINHASH AND SIMHASH

In this section, we consider a MapReduce-based algorithm for near-duplicate detection. The main idea is to use MinHash for clustering and SimHash for similarity calculation. Naively, directly applying SimHash [4] to calculate the distance between any two tweets requires C(n, 2) comparison operations given n tweets. We follow Manku's [8] and Seshasai's [12] works to develop the near-duplicate detection algorithm in the MapReduce framework. For scalability, we apply MinHash [3] to cluster tweets sharing some common features into groups; as a result, we only need to perform the comparison between tweets falling into the same cluster. Let k be the number of clusters; the total time required is then reduced to roughly n + k · C(n/k, 2). Before we go into the details of the MapReduce-based algorithm, let us walk through the MinHash and SimHash algorithms.
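To make the savings concrete, here is a rough calculation (an illustration under the assumption of equally sized clusters, not a measurement from the paper): with n = 10^6 tweets and k = 10^4 clusters of about 100 tweets each, naive pairwise comparison costs C(10^6, 2) ≈ 5 × 10^11 operations, whereas clustering first costs about 10^6 + 10^4 · C(100, 2) = 10^6 + 10^4 · 4950 ≈ 5.05 × 10^7, an improvement of roughly four orders of magnitude.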

A. MinHash

We describe the standard MinHash algorithm in Figure 1. {·} denotes a set. The function min takes a set of integers and returns the minimum value. The MinHash algorithm approximates the Jaccard distance by applying a set of randomly generated hash functions to the feature sets. The idea is to simulate different orderings among the features; the minimum values of the orderings are selected as representative shingles to compute the final hash result, which is obtained by applying a bit-wise exclusive-or operation (⊕) to all shingles. As a known result of the MinHash algorithm, the more hash functions used in the computation, the better the approximation to the actual Jaccard distance. In the context of checking near duplicates among tweets, we set the number of hash functions used in the MinHash clustering to 10. This is based on the assumption that the average length of English words is 5.1 characters [2]. With the maximum length of a tweet being 140 characters, and taking the inter-word spaces into account, we deduce that the maximum number of words (features) in a tweet is around 20. To restrict tweets falling into the same cluster to those with only minor differences, we set the number of hash functions to be 10.

B. SimHash

In Figure 2, we describe the SimHash algorithm in ML-style pseudo code. let x = e′ in e denotes a let-binding expression, where x = e′ stands for one or more bindings. For each binding x = e′, if the evaluation of e′ leads to a value, the value is bound to the variable x, which can be used in the evaluation context in scope. [·] denotes a list whose elements are indexed by integers; v[i] returns the i-th element of the list v, and [ei | i ∈ {0, .., 63}] constructs a list of size 64 whose i-th element is ei. We overload lists containing 64 bit elements as 64-bit integers. The function hash64 takes a feature as input and returns a 64-bit integer as the output digest. The function sum takes a set of integers and returns the sum of all elements. The simhash function applies the hash function hash64 to all the features; the resulting intermediate digests are aggregated into a final digest as follows. Let i ∈ {0, ..., 63}. The i-th bit in the final digest is determined by counting the numbers of 0s and 1s at the i-th bit position of the intermediate digests: the i-th bit of the final digest is 1 if there are more 1s than 0s, and 0 otherwise. One important result of SimHash is that given two similar feature sets fv1 and fv2, and a hashing function that hashes a feature into a 64-bit digest, simhash(fv1) and simhash(fv2) differ in only a few bits.

C. Feature Selection

As mentioned in Section II, MinHash and SimHash do not take into account the ordering among the features. For instance, let f1 and f2 be features; we have

minhash([f1, f2]) = minhash([f2, f1])
simhash([f1, f2]) = simhash([f2, f1])

As a result, using the individual words in a tweet as features causes two tweets with the same vocabulary set but different structures to share the same digest; see the example in Section II. In the context of MinHash clustering this is still acceptable, as the final result is decided by the pair-wise SimHash comparison. In the context of SimHash, feature selection needs to be considered with extra care. We consider multiple options, i.e. single-word features, bigram features, and n-gram features. From this point onwards, the string-to-feature-set conversions in minhash(·) and simhash(·) are left implicit: strings are converted into lists of words before being applied to minhash, and into lists of bigrams before being applied to simhash.
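Since the bigram conversion is used throughout, and the toFeature helper in the appendix listings is left unspecified, here is one possible Scala definition; the name toFeature matches the appendix, but the body is our sketch, not the authors' code.

object Features {
  // One possible definition of the toFeature helper used in the appendix
  // listings: turn a token list into bigram features by pairing each word
  // with its successor, preserving local word order inside each feature.
  def toFeature(words: List[String]): List[String] =
    if (words.size < 2) words
    else words.zip(words.tail).map { case (a, b) => a + " " + b }
}

For example, Features.toFeature(List("the", "soup", "was", "shiok")) yields List("the soup", "soup was", "was shiok"), so the two re-ordered tweets of Section II no longer share a feature set.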

D. Hamming Distance

The Hamming distance function compares two given digests by counting the number of differing bits. The definition is straightforward, hence we omit the details here.

E. MinHash and SimHash in MapReduce

MapReduce was first proposed in [5]. It became one of the most popular programming design patterns for its high-level parallelization. We present a simplified formulation in Figure 3 to illustrate the gist of MapReduce. A task can be implemented in the MapReduce framework if the input consists of a set of homogeneous elements {i1, ..., in} and there is a uniform operation map which transforms all the inputs into a set of homogeneous intermediate outputs {(k1, v1), ..., (kn, vn)}, where ki is the key used for grouping in a later step and vi is the value.
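A minimal sketch of the omitted definition of III-D over 64-bit digests (our code, not the paper's): XOR the two digests and count the set bits.

// Hamming distance between two 64-bit SimHash digests.
def hammingDist(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)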

minhash(features)  = ⊕ { min({g(f) | f ∈ features}) | g ∈ hash_functions }
hash_functions     = { g | g is a distinct hash function mapping features to integers }

Fig. 1. MinHash Algorithm

simhash(features)  = let codes   = {hash64(f) | f ∈ features}
                         bits    = {bit(c) | c ∈ codes}
                         vectors = vector(bits)
                     in [if (vectors[i] < 0) 0 else 1 | i ∈ {0, ..., 63}]
bit(c)             = [if (c[i] == 1) 1 else (−1) | i ∈ {0, ..., 63}]
vector(bits)       = [sum({b[i] | b ∈ bits}) | i ∈ {0, ..., 63}]

Fig. 2. SimHash Algorithm

(k1, v1) = map(i1)   ...   (kn, vn) = map(in)
reduce(ki)([vi1, ..., vim])   for i ∈ {1, .., n}

Fig. 3. A task implemented in MapReduce (simplified)
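The scheme of Fig. 3 can be rendered as a toy single-machine Scala function (an illustration only; real deployments use Hadoop/Scalding as in the appendix):

// Toy rendering of Fig. 3: map every input to a (key, value) pair, group the
// values by key, then reduce each group independently.
def mapReduce[I, K, V, O](inputs: List[I])(map: I => (K, V))(reduce: (K, List[V]) => O): List[O] =
  inputs.map(map)
        .groupBy { case (k, _) => k }
        .toList
        .map { case (k, kvs) => reduce(k, kvs.map { case (_, v) => v }) }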

The final output is computed by aggregating {vi1, ..., vim}, the values sharing the same key ki. The MinHash-SimHash combo naturally fits into the MapReduce framework. In Figure 4, we sketch a near-duplicate detection algorithm in MapReduce using MinHash and SimHash. The map function computes the MinHash and SimHash digests from the text of the tweet. The MinHash digest is used as the key in the reducing step. The reduce function aggregates all SimHash digests that share the same MinHash digest: the aggregation performs a pair-wise hamming distance comparison among the SimHash digests and retains those pairs whose distance falls below the predefined threshold.

IV. EXTENSIONS

A. Preprocessing

As a standard practice, we normalize the input before submitting it through the algorithm's main pipeline. The steps include removing the punctuation, the hash-tags and the URLs, and converting the words into lower case. In addition, we replace Singlish terms with their English equivalents (if a definite translation exists) by looking them up in a pre-compiled static dictionary. As motivated in the earlier sections, a pre-compiled static dictionary is never sufficient to catch the growing set of Singlish terms. To push the limit a bit further, we consider a post-processing phase which allows us to apply a phonetic matching step to cover the unknown Singlish terms that are derived phonetically.

B. Postprocessing

To improve the accuracy, we post-process the tweet pairs whose hamming distances exceed the pre-defined threshold marginally. The objective is to recover the false negatives introduced by the limitation of the pre-compiled static dictionary used in the pre-processing phase. The idea is to capture newly invented Singlish terms by searching for the out-of-vocabulary (OOV) terms in the text.

1) OOV: A word w is considered an out-of-vocabulary (OOV) term if
• w's inverse document frequency (IDF) is equal to or greater than some pre-defined threshold, and
• w is tagged as a noun or an adjective by the POS-tagger.

We apply a phonetic-based matching algorithm to the OOVs and their English counterparts. Given an OOV term whose preceding word is wp and whose trailing word is wt, we search for its English counterpart in the compared tweet by looking for words that are surrounded by wp and wt. If the phonetic similarity score between the OOV term and the English counterpart falls below some pre-defined threshold, they are considered the same and the overall hamming distance score is reduced.

2) Phonetic-based Matching: Now let us walk through our novel phonetic matching algorithm based on regular expression matching. In general, there are phonetic-based algorithms that measure the similarity of terms based on pronunciation; our phonetic-based matching algorithm is a special instance of this kind. Given an English term and a Singlish term, the matching algorithm determines their similarity by
1) normalizing the inputs into sequences of sound symbols, i.e. consonants and vowels;
2) assuming the Singlish term possesses fewer syllables than its English equivalent, constructing from the English term a regular expression pattern whose symbols are associated with confidence levels; and
3) computing the final confidence score by running the regular expression pattern over the Singlish sound sequence.
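Before walking through an example, here is a minimal Scala sketch of the IV-A normalization, under stated assumptions: the URL and punctuation regexes follow the appendix listings, and the two dictionary entries are the illustrative pairs from Section II; the real static dictionary is much larger.

object Normalize {
  // Illustrative entries only, taken from the Section II examples; the
  // pre-compiled static dictionary in the prototype is far larger.
  val singlishDict: Map[String, String] =
    Map("shiok" -> "fantastic", "ulu" -> "isolated")

  def norm(text: String): List[String] =
    text.toLowerCase
      .replaceAll("http://[a-zA-Z./]+", "") // strip URLs
      .replaceAll("[^\\w\\s]", "")          // strip punctuation and the '#' of hash-tags
      .split("\\s+").toList
      .filter(_.nonEmpty)
      .map(w => singlishDict.getOrElse(w, w)) // static dictionary lookup
}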

(k1, v1) = map(tweet1)   ...   (kn, vn) = map(tweetn)
reduce(ki)([vi1, ..., vim])   for i ∈ {1, .., n}

map(tweet) = (minhash(tweet.text), (tweet.id, simhash(tweet.text)))

reduce(key)(values) = { (tid1, tid2) | (tid1, simhash1) ∈ values ∧
                                       (tid2, simhash2) ∈ values ∧ tid1 ≠ tid2 ∧
                                       hammingDist(simhash1, simhash2) ≤ threshold }

Fig. 4. A MinHash-SimHash in MapReduce

For instance, consider the English term "government" and its Singlish counterpart "gahmen". The latter is a short-hand due to lazy pronunciation among Singaporeans. The algorithm first normalizes the terms into ['g', 'o', 'v', 'er', 'n', 'm', 'e', 'n', 't'] and ['g', 'ah', 'm', 'e', 'n'], where 'g' and 'v' are consonants and 'o' and 'er' are vowels. Note that, for simplicity, we use ASCII symbols to represent the syllables instead of IPA phonetic symbols. For every sound symbol in the English sequence, we derive the set of similar sounds. This is implemented as a simple lookup in a SoundMap, which assigns a confidence of similarity between two sounds, ranging from 0 to 1; e.g., the confidence score assigned to 'g' and 'g' is 1, while the score assigned to 'g' and 'k' is 0.6. In the current prototype, the SoundMap is generated by manual collection and examination; there is an opportunity to apply statistical techniques such as SVM, which we consider as future work. For instance, the similar sound set of 'g' is

{g : 1, gh : 0.9, j : 0.4, q : 0.4, k : 0.6}    (1)

where l : n denotes a sound literal l associated with confidence level n. When n = 0, l : n is simplified as l.
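The SoundMap can be represented as a nested map from a sound to its similar sounds with confidences; the entry below is exactly the set (1) given in the text, the rest of the table being the manually compiled part we do not reproduce.

// SoundMap: for each sound symbol, the set of similar sounds and the
// confidence of the similarity (0 to 1). Only the 'g' row from (1) is shown.
val soundMap: Map[String, Map[String, Double]] = Map(
  "g" -> Map("g" -> 1.0, "gh" -> 0.9, "j" -> 0.4, "q" -> 0.4, "k" -> 0.6)
)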

We construct a regular expression pattern by running through the following steps.

1) Turn every set of similar sounds with positive confidence into a choice pattern, with each sound becoming a literal associated with its confidence score. E.g., the set in (1) is converted into

(g : 1 | gh : 0.9 | j : 0.4 | q : 0.4 | k : 0.6)    (2)

Note that we use POSIX-style regular expression syntax, in which | denotes the choice operator. Including the empty sequence ǫ in the choice pattern, (2) is augmented as follows:

(g : 1 | gh : 0.9 | j : 0.4 | q : 0.4 | k : 0.6 | ǫ)    (3)

Let us use rg to denote the choice pattern (3). We apply the same step to obtain ro, rv, rer, rm, re, rn and rt, defined in equations (4) to (10):

ro  = (o : 1 | u : 0.6 | al : 0.8 | aw : 0.6 | or : 0.8 | ol : 0.9 | ow : 0.9 | ul : 0.4 | ough : 0.9 | ǫ)    (4)
rv  = (f : 0.8 | ph : 0.8 | v : 1 | w : 0.2 | wh : 0.2 | ǫ)    (5)
rer = (er : 1 | el : 0.8 | ir : 0.9 | ur : 0.9 | ǫ)    (6)
rm  = (m : 1 | ǫ)    (7)
re  = (a : 0.9 | e : 1 | i : 0.9 | y : 0.6 | er : 0.9 | ir : 0.9 | ǫ)    (8)
rn  = (n : 1 | kn : 0.9 | ǫ)    (9)
rt  = (d : 0.6 | t : 1 | th : 0.8 | ǫ)    (10)

2) Concatenate all the choice patterns, interleaving them with the wild card pattern (. : −1)∗, a zero-or-more repetition of the wild card literal . associated with the negative score −1. We explain shortly why it is given a negative confidence. For brevity, we write (−1)∗ as a shorthand for (. : −1)∗. For instance,

rg (−1)∗ ro (−1)∗ rv (−1)∗ rer (−1)∗ rn (−1)∗ rm (−1)∗ re (−1)∗ rn (−1)∗ rt    (11)

Note that we use r1 r2 to denote the concatenation of r1 and r2; the formal definition of r will be given shortly.

3) As the final step, we match the sound sequence of the Singlish term against the above pattern generated from the English word.

Before we dive into the details, let us consider the ordinary regular expression matching problem. Regular expressions have been used for sequence validation and extraction in many real-world applications such as awk, grep and sed. In Figure 5, we define the syntax of regular expressions. The matching problem is, given a word w and a regular expression r, to validate whether w matches r. Let match(w)(r) be a regular expression matching algorithm which yields true or false given the input word w and regular expression r. For example, matching the string "abaac" with the regular expression (a|ab)(baa|a)(ac|c) yields true, whereas matching a word outside the language of this pattern fails.

Words:
w ::= ǫ        Empty word
    | l ∈ Σ    Letters
    | lw       Concatenation

Regular expressions:
r ::= r|r      Choice
    | rr       Concatenation
    | r∗       Kleene star
    | ǫ        Empty word
    | φ        Empty language
    | l ∈ Σ    Letters

Fig. 5. Regular Expression Syntax

Regular expression matching is ambiguous if we consider the sub-matches, i.e. it may yield more than one valid match. For instance, reusing the above regular expression, let i ∈ {1, 2, 3} refer to the sub-expressions (a|ab), (baa|a) and (ac|c), and let (i, wi) denote the sub-matches. The running example actually has two possible matches, {(1, ab), (2, a), (3, ac)} and {(1, a), (2, baa), (3, c)}.

Now let us take one step further and extend the regular expression matching algorithm to handle sound symbols with confidence scores. In Figure 6, we extend the regular expression pattern with support for phonetic symbols and confidence scores; the extension is minimal. The input word becomes a sound sequence whose elements are either consonants or vowels. The literal regular expression pattern becomes l : n, a sound symbol l associated with the confidence score n. Recalling the running example, g : 1 is the sound symbol g associated with score 1.0. The wild card pattern . : −1 is used to capture sound symbols that cannot be matched by any of the similar-sound sets generated from the target English term; the associated confidence score −1 penalizes these "unmatched" sounds. The matching algorithm for regular expression sound patterns outputs an additional result: the cumulative score of the match.

Sound Sequences:
s ::= ǫ      Empty sequence
    | l      Sound Symbol
    | ls     Concatenation

Sound Symbols:
l ::= c      Consonants
    | v      Vowels

Regular expressions:
r ::= r|r    Choice
    | rr     Concatenation
    | r∗     Kleene star
    | ǫ      Empty sequence
    | φ      Empty language
    | l : n  Sound symbol with confidence score

Score:
n ∈ {−1.0} ∪ [0.0, 1.0]

Fig. 6. Regular Expression Phonetic Pattern
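The grammar of Fig. 6 maps directly onto a Scala algebraic data type; the following sketch (our names, not the authors') also shows step 1) of the construction, turning a similar-sound set into a choice pattern that ends in ǫ.

// Phonetic patterns of Fig. 6 as an ADT. The wild card is Sym(".", -1.0).
sealed trait RegexP
case object Phi                            extends RegexP // empty language
case object Eps                            extends RegexP // empty sequence
case class  Sym(l: String, n: Double)      extends RegexP // sound : confidence
case class  Choice(r1: RegexP, r2: RegexP) extends RegexP
case class  Concat(r1: RegexP, r2: RegexP) extends RegexP
case class  Star(r: RegexP)                extends RegexP

// Step 1): a similar-sound set (e.g. soundMap("g")) becomes a choice pattern
// augmented with epsilon, as in equations (2) and (3).
def choiceOf(sounds: Map[String, Double]): RegexP =
  sounds.foldRight(Eps: RegexP) { case ((l, n), acc) => Choice(Sym(l, n), acc) }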

Recall our running example: we would like to match ['g', 'ah', 'm', 'e', 'n'] with

rg (−1)∗ ro (−1)∗ rv (−1)∗ rer (−1)∗ rn (−1)∗ rm (−1)∗ re (−1)∗ rn (−1)∗ rt    (12)

Since

rg = (g : 1 | gh : 0.9 | j : 0.4 | q : 0.4 | k : 0.6 | ǫ)    (13)

'g' matches rg, yielding 1.0 as the cumulative score. Note that the pattern is ambiguous: because rg possesses ǫ, we can also match 'g' with the subsequent wild card pattern (. : −1)∗, which yields −1.0 as the cumulative result. We carry on matching the rest of the sound sequence ['ah', 'm', 'e', 'n'] with the remaining pattern. Our algorithm finds all the possible matches, and the match yielding the highest cumulative score is selected as the final result. Running through the remaining sound sequence with the above pattern, we obtain the highest match score of 3.0, since 'ah' has to match the wild card pattern. We divide the final score by the size of the Singlish sound sequence to obtain 0.6 as the score.

3) Integration: In Figure 7, we present the MinHash-SimHash algorithm integrated with the pre- and post-processing steps in MapReduce. The norm function used in the map step performs the normalization described in IV-A. In the reduce step, we introduce two threshold values: threshold_l, the strict threshold, which is the same as the original threshold in Figure 4; and threshold_h, the hypothesized threshold, where threshold_h ≥ threshold_l. We collect tweet pairs whose hamming distances fall below threshold_l. In addition, we further examine the tweet pairs whose hamming distances are greater than threshold_l but still below threshold_h. The extended examination applies the function phoneticMatch to the tweet pairs; this function searches for OOV words in the tweet pairs. The resulting score is a value ranging between 0 and 1: the lower the score, the higher the phonetic similarity between the OOVs and their English counterparts. We multiply the phonetic matching result by the hamming distance to obtain the final score. Implementation details can be found in Appendix B.

V. EXPERIMENT

In Table I, we evaluate the performance of the proposed algorithm by measuring the sensitivity, precision and accuracy. We benchmark the vanilla MinHash-SimHash algorithm against the other configurations, which consist of the extensions. The data set used in the experiment is a tweet corpus captured between December 2013 and April 2014, specifically on the topic of the GCE O-Level results 2014 in Singapore; it consists of 8191 tweets. The sensitivity, precision and accuracy are computed by benchmarking the set of potential near duplicates generated by the algorithm against the set generated by manual identification. In this experiment, we only consider the bigram feature set during the SimHash computation. Based on some preliminary tests, we set the threshold of the hamming distance results to 6. In addition, the phonetic matching extension re-verifies those potential duplicates whose hamming distance scores fall into the range between 6 and 8.

TABLE I. EXPERIMENT RESULTS

Configuration          | Sensitivity (%) | Precision (%) | Accuracy (%)
Vanilla                | 96.0688474      | 97.1505984    | 99.9659533
Static dict            | 96.0688474      | 97.1505984    | 99.9659533
Phonetic               | 96.141404       | 97.1210778    | 99.9662932
Static dict & Phonetic | 96.141404       | 97.1210778    | 99.9662932

(k1, v1) = map(tweet1)   ...   (kn, vn) = map(tweetn)
reduce(ki)([vi1, ..., vim])   for i ∈ {1, .., n}

map(tweet) = (minhash(norm(tweet.text)), (tweet, simhash(norm(tweet.text))))

reduce(key)(values) = { (t1.id, t2.id) | (t1, simhash1) ∈ values ∧
                                         (t2, simhash2) ∈ values ∧ t1.id ≠ t2.id ∧
                                         hDist = hammingDist(simhash1, simhash2) ∧
                                         ( (hDist ≤ threshold_l) ∨
                                           (hDist ≤ threshold_h ∧
                                            phoneticMatch(t1, t2) × hDist ≤ threshold_l) ) }

Fig. 7. Integrating Phonetic Matching into the MinHash-SimHash algorithm in MapReduce

As reported in Table I, the static dictionary preprocessing extension alone does not contribute any improvement to the algorithm. With the phonetic matching extension enabled, the algorithm is able to identify more true positives, hence the sensitivity increases. This leads to a drop in precision, due to the introduction of false positives. Overall, the accuracy increases, as more true positives are introduced than false positives. Note that the increase appears marginal due to the high proportion of true negatives. We plan to include more data sets in the benchmark in the near future.

VI. RELATED WORK

Seshasai's work [12] gives a clear comparison among different techniques for checking document similarity. In addition, he studied a number of techniques for optimizing SimHash performance, such as feature selection and weight settings. We consider incorporating the weight-assignment technique as a possible future extension. Han and Baldwin [7] address the issue of unparsable texts found in SMS and Twitter by creating a pre-processor that normalizes these unparsable tokens (a.k.a. ill-formed words) into their original counterparts. In [11], Pinto et al. evaluate the effectiveness of phonetic algorithms such as Soundex in information retrieval using SMS as queries; their target languages are English and Spanish. We plan to apply our phonetic matching approach in the domain of IR. Soundex [6] and Metaphone [9] are two classic techniques for phonetic matching and searching of names. The gist of their idea is to build an index of words based on the pronunciation of the words' consonants. Our phonetic matching is inspired by them; focusing on comparison instead of searching, we also take the vowels into account. We use regular expressions as a domain-specific language to denote sound patterns, so that sound sequence matching follows naturally from the regular expression matching algorithm.

VII. CONCLUSION

We applied the MinHash and SimHash algorithms to search for near-duplicate tweets in Singlish using MapReduce. The experimental results confirm that this algorithm combo is very accurate. We further improved the algorithm by incorporating a phonetic matching post-processing step; empirical results suggest an effective improvement. As future work, we plan to look into incorporating statistical modeling techniques such as SVM to generate an "organic" sound map database for our phonetic matching algorithm.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/.
[2] Average word length in English. http://www.wolframalpha.com/input/?i=average+english+word+length.
[3] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pages 327–336, New York, NY, USA, 1998. ACM.
[4] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC '02, pages 380–388, New York, NY, USA, 2002. ACM.
[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[6] Patrick A. V. Hall and Geoff R. Dowling. Approximate string matching. ACM Comput. Surv., 12(4):381–402, December 1980.
[7] Bo Han and Timothy Baldwin. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[8] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 141–150, New York, NY, USA, 2007. ACM.
[9] L. Philips. Hanging on the metaphone. Computer Language Magazine, 7(12):38–44, 1990.
[10] Bingfeng Pi, Shunkai Fu, Weilei Wang, and Song Han. SimHash-based effective and efficient detecting of near-duplicate short messages. In Proceedings of the Second Symposium on International Computer Science and Computational Technology, ISCSCT '09, pages 20–25, 2009.
[11] David Pinto, Darnes Vilariño, Yuridiana Alemán, Helena Gómez, and Nahun Loya. The Soundex phonetic algorithm revisited for SMS-based information retrieval. In Proceedings of the II Spanish Conference on Information Retrieval, CERI 2012, 2012.
[12] Shreyes Seshasai. Efficient near duplicate document detection for specialized corpora. Master's thesis, 2008.
[13] Martin Sulzmann and Kenny Zhuo Ming Lu. Regular expression sub-matching using partial derivatives. In Danny De Schreye, Gerda Janssens, and Andy King, editors, PPDP, pages 79–90. ACM, 2012.
[14] Twitter Scalding. https://github.com/twitter/scalding.

object MinHash {
  val random = new Random(2147483647)
  val coefs = for { _ <- Stream.from(1) } yield random.nextInt()
  val hashFuns = for { (a, b) <- coefs.zip(coefs.drop(1)) }
                 yield ((x: Int) => (a + b * x))

  def hashMin(f: Int => Int)(xs: List[Int]): Int = {
    def hashSecondMin(a: Int, b: Int): Int = a.min(f(b))
    xs.foldLeft(Int.MaxValue)(hashSecondMin)
  }

  def shingle(n: Int)(nums: List[Int]): List[Int] =
    for { f <- hashFuns.take(n).toList } yield hashMin(f)(nums)

  def minhash[A](fv: List[A], numOfHash: Int): Int = {
    val is = fv.map(_.hashCode())
    val sh = shingle(numOfHash)(is)
    sh.foldLeft(0)((x, y) => x ^ y)
  }
}

Fig. 8. MinHash algorithm in Scala

Tsv("input.txt").read
  .map('text -> 'ntext) { text: String => text.split("\\s+").toList }
  .map('ntext -> 'minhash) { ls: List[String] =>
    MinHash.minhash(ls, 10)
  }
  .map('ntext -> 'simhash) { ls: List[String] =>
    val fv = toFeature(ls)
    SimHash.simhash(fv)(hashcode64)
  }
  .groupBy('minhash) {
    _.toList[(String, Long)](('rowid, 'simhash) -> 'id_hash)
  }
  .map('id_hash -> 'id_hash) {
    idHashes: List[(String, Long)] => findNearDup(idHashes)
  }
  .flatMap('id_hash -> ('orig_id, 'dup_id)) {
    tup2: ((String, Long), (String, Long)) => tup2 match {
      case ((id, hash), (dupId, dupHash)) => (id, dupId)
    }
  }
  .project('orig_id, 'dup_id)
  .write(Tsv("output"))

Fig. 10. Near Duplicate detection in MapReduce using Scalding (Simplified)

object SimHash {
  def simhash[A](fv: List[A])(hash: A => Long): Long = {
    val hashCodes = fv.map(hash)
    val bitSets =
      hashCodes.map(x => (0 to 63).map(z => if (testBit(x, z)) 1 else -1).toList)
    val zeros = (0 to 63).map(_ => 0).toList
    val wtVector = bitSets.fold(zeros)(_.zip(_).map(xy => xy._1 + xy._2))
    (0 to 63).zip(wtVector).foldLeft(0L)((x, bv) => sign(bv._2)(x, bv._1))
  }

  def testBit(num: Long, i: Int): Boolean = ((num & (1L << i)) != 0)
  def clearBit(num: Long, i: Int): Long = (num & ~(1L << i))
  def setBit(num: Long, i: Int): Long = (num | (1L << i))
  def sign(v: Int): ((Long, Int) => Long) = if (v < 0) { clearBit } else { setBit }

  def hammingDistance(a: Long)(b: Long): Int =
    (0 to 63).filter(x => testBit(a, x) != testBit(b, x)).length
}

Fig. 9. SimHash algorithm in Scala

APPENDIX

A. Implementation

1) MinHash in Scala: In Figure 8, we implement the MinHash algorithm (see Figure 1) in Scala. The minhash function takes a list of features as input. Its computation approximates the Jaccard distance by applying a set of randomly generated hash functions to the feature set, simulating different orderings among the features. The shingle function yields a set of integers, each corresponding to the minimal value obtained by applying a particular hash function to the feature set.

2) SimHash in Scala: In Figure 9, we implement the SimHash algorithm in Scala. The function simhash takes a feature set and a hash function as arguments. This is an implementation of what is specified by Figure 2.

3) Near Duplicate Detection in Scalding DSL: In Figure 10, we sketch the near-duplicate detection algorithm using MapReduce, MinHash and SimHash. For brevity, we adopt Scalding [14], a domain-specific language extension of Apache Hadoop MapReduce [1]. In the Scalding DSL, MapReduce tasks are viewed as different stages connected by pipes. Tsv("input.txt").read creates an input pipe from the tab-separated file "input.txt". Similarly,

write(Tsv("output.txt"))

writes the output of the preceding pipe into an output file named "output.txt". In our algorithm, there are four basic operations carried out along the pipes. The map operation creates a new column or updates an existing column by applying a lambda function to every row of data flowing through the pipe. For instance,

map('text -> 'ntext) { text: String => text.split("\\s+").toList }

creates a new column n_text from text by splitting the text by spaces. Despite the confusing name, the map operation in the Scalding DSL is not the same as the mapper in MapReduce: the former is a Scalding operation that transforms data in a pipe; the latter is a process that can be executed in parallel. This particular Scalding map operation will be compiled into a mapper in the generated MapReduce code, since no aggregation is required. Similarly, the following two map operations compute the MinHash and SimHash digests correspondingly. The helper function hashCode64 hashes a string into a 64-bit Long value. The function toFeature converts the list of strings into the right feature sets, as discussed in Section III-C. Both map operations will be compiled into mappers in MapReduce. The next operation in the pipe is groupBy; similar to "group by" in SQL, it groups rows based on the given column. In our algorithm, we group the rows by the MinHash digest. The rowid and simhash fields from the grouped rows are collected into a list of pairs. With the next map operation, we search for near duplicates within the grouped list of rowid-simhash pairs using findNearDup. This function computes the pair-wise hamming distance among all the SimHash digests and outputs the pairs whose distance falls under a threshold.
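The helper findNearDup used in Fig. 10 is not defined in the paper; a plausible definition consistent with the surrounding description is the pair-wise comparison below (the threshold value 6 follows Section V).

// Pair-wise Hamming comparison within one MinHash bucket: emit each unordered
// pair of (rowid, simhash) entries whose digests differ in at most `threshold`
// bits. Our sketch of the helper referenced in Fig. 10.
def findNearDup(idHashes: List[(String, Long)],
                threshold: Int = 6): List[((String, Long), (String, Long))] =
  for {
    a <- idHashes
    b <- idHashes
    if a._1 < b._1 // visit each unordered pair exactly once
    if java.lang.Long.bitCount(a._2 ^ b._2) <= threshold
  } yield (a, b)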

val table = new Tsv(input_tsv, ('rowid, 'text)) // 'rowid is the row id
val mainpipe = table.read
  .filter('text) { c2: String => !isEmpty(c2) }
  // remove RT and non-ascii rows
  .filter('text) { c2: String => !isRT(c2) && !isNotAscii(c2) }
  // compute minhash
  .map('text -> 'minhash) { c2: String =>
    minhash(cond_rp(c2.replaceAll("http://[a-zA-Z.]+", "")
                      .replaceAll("[^\\w\\s]", "")
                      .split("\\s+").toList), numOfHashFunc) }
  // compute simhash
  .map('text -> 'simhash) { c2: String =>
    simhash(bigram(cond_rp(c2.replaceAll("http://[a-zA-Z.]+", "")
                             .replaceAll("[^\\w\\s]", "")
                             .split("\\s+").toList)))(strHashCodeL) }
  // cluster by minhash
  .groupBy('minhash) { _.toList[(String, Long)](('rowid, 'simhash) -> 'c3) }
  // compute hamming distance based on simhash
  .map('c3 -> 'c3) { c3: List[(String, Long)] => pairWiseHammingDistance(c3) }
  // flatten result
  .flatMap('c3 -> ('outrowid, 'spamrowid, 'hammingDist)) {
    c3: List[(String, String, Long)] => c3
  }
  .unique('outrowid, 'spamrowid, 'hammingDist)
  .project('outrowid, 'spamrowid, 'hammingDist)

// text pipe to be joined
val textpipe1 = table.read
  .map('rowid -> 'rowid2) { rowid: String => rowid }
  .map('text -> 'text2) { c2: String => c2 }
  .project('rowid2, 'text2)

// text pipe to be joined
val textpipe2 = table.read
  .map('rowid -> 'rowid3) { rowid: String => rowid }
  .map('text -> 'text3) { c2: String => c2 }
  .project('rowid3, 'text3)

Fig. 11. Near Duplicate Detection Integrated with Phonetic Matching and Static Dictionary (Part 1)

The flatMap flattens the output lists of duplicates. The project operation is similar to the SELECT clause in SQL, projecting the wanted columns as output. The Scalding DSL compiles the sequence of operations after groupBy into a reduce task.

4) Near Duplicate Detection Integrated with Phonetic Matching and Static Dictionary: The full near-duplicate detection algorithm integrated with static dictionary preprocessing and phonetic matching post-processing is presented in Figure 11 and Figure 12. The Scalding snippet in Figure 11 is similar to the version presented in Figure 10, except that we incorporate some preprocessing steps, such as removing URLs from tweets and removing empty tweets, non-ASCII tweets and RT tweets. Apart from the mainpipe, we additionally set up two text pipes which provide the textual data used in the phonetic matching. In Figure 12, in the snippet between line 3 and line 37, we build an IDF dictionary which maps words to their IDF scores. From line 37 onwards, we join the mainpipe with the two text pipes and the idf pipe. Tweet pairs whose hamming distance falls below the lower threshold (6) are returned immediately. Tweet pairs whose hamming distance falls between the lower and the upper threshold (8) are matched phonetically. Note that the phonetic_match method applies the phonetic matching algorithm of Section IV-B to all the OOV words and their English counterparts. The aggregated phonetic score ranges from 0 to 1: the lower the score, the less the difference. The product of the phonetic score and the original hamming distance is used as the final hamming distance.

1  val lowerBoundScore = 6
2  val upperBoundScore = if (args.boolean("upperbound")) { Integer.parseInt(args("upperbound")) } else { 13 }
3  // idf pipe
4  val idfpipe = table.read.limit(lim)
5    .filter('text) { c2: String => !isRT(c2) && !isNotAscii(c2) }
6    .map('text -> 'text) { c2: String =>
7      c2.replaceAll("http://[a-zA-Z.]+", "").replaceAll("[^\\w\\s]", " ") }
8    .flatMap('text -> ('word, 'one, 'count)) {
9      text: String => {
10       val words = text.toLowerCase.split("\\s+")
11       val empty = Map(): Map[String, Int]
12       val counts = words.foldLeft(empty)(
13         (m: Map[String, Int], word: String) => {
14           m.get(word) match {
15             case Some(i) => m.+(word -> (i + 1))
16             case None => m.+(word -> 1)
17           }
18         })
19       counts.toList.map((x: (String, Int)) => { (x._1, 1, x._2) })
20     }
21   }
22   .groupBy('word) { _.toList[(Int, Int)](('one, 'count) -> 'one_and_count) }
23   .map('one_and_count -> 'idf) { one_and_count: List[(Int, Int)] => {
24     one_and_count.foldLeft((0, 0))((p: (Int, Int), fc: (Int, Int)) => (p._1 + fc._1, p._2 + fc._2)) match {
25       case (freq, count) => log10(lim / (freq * 1.0))
26     }
27   }
28   }
29   .groupAll { _.sortBy('idf) }
30   .project('word, 'idf)
31   .groupAll { _.toList[(String, Double)](('word, 'idf) -> 'list) }
32   .mapTo('list -> 'idf_dict) {
33     (l: List[(String, Double)]) =>
34       l.foldLeft(Map(): Map[String, Double])((m: Map[String, Double], p: (String, Double)) => m + (p._1 -> p._2))
35   }
36   .project('idf_dict)
37 // mainpipe joining with the rest
38 mainpipe.joinWithSmaller('outrowid -> 'rowid2, textpipe1, joiner = new cascading.pipe.joiner.LeftJoin)
39   .map('text2 -> 'outtext) { x: String => x }
40   .project('outrowid, 'spamrowid, 'outtext, 'hammingDist)
41   .joinWithSmaller('spamrowid -> 'rowid3, textpipe2, joiner = new cascading.pipe.joiner.LeftJoin)
42   .map('text3 -> 'spamtext) { x: String => x }
43   .project('outrowid, 'spamrowid, 'outtext, 'spamtext, 'hammingDist)
44   .crossWithTiny(idfpipe)
45   .map(('outtext, 'spamtext, 'hammingDist, 'idf_dict) -> 'score) {
46     x: (String, String, Int, Map[String, Double]) => x match {
47       case (outtext, spamtext, hammingDist, idf_map) if phonetic_flag =>
48         if ((hammingDist > lowerBoundScore) && (hammingDist <= upperBoundScore)) {
49           val phonetic_score = phonetic_match(outtext, spamtext, idf_map)
50           (phonetic_score * hammingDist).toInt
51         } else { hammingDist }
52       case (outtext, spamtext, hammingDist, idf_map) => { hammingDist }
53     }
54   }
55   .filter('score) { score: Int => score <= lowerBoundScore }
56   .project('outrowid, 'spamrowid, 'score)
57   .write(Tsv(args("output")))

Fig. 12. Near Duplicate Detection Integrated with Phonetic Matching and Static Dictionary (Part 2)


B. Regular Expression Matching Algorithm Extended with Sound and Confidence

In this section, we consider the details of the regular expression matching algorithm extended with sound symbols and confidence scores. We follow the partial derivative algorithm of [13], extending it to operate over sound sequences and regular expression sound patterns. In Figure 13, we define the partial derivative operation over ordinary regular expressions and a literal. In a nutshell, given a regular expression r and a literal l, r \p l computes the set of all possible "residuals" of r after removing the leading l. For those familiar with automata theory, the partial derivative operation produces transitions among regular expressions that eventually lead to an NFA construction. For the expression A∗ we find

A∗ \p A = {ǫA∗} =simplification {A∗}

We omit the details of simplification rules such as ǫr −→ r, which can be found in [13]. In Figure 14, we define the regular language: a regular expression r describes a regular language L(r), which denotes a set of words. Let match(w)(r) be a regular expression matching algorithm; its soundness property is as follows:

match(w)(r) iff w ∈ L(r)

The main idea of the partial-derivative-based regular expression matching is the fact that

lw ∈ L(r) iff w ∈ ∪_{ri ∈ r \p l} L(ri)

which indicates that a word lw is in the language of r if and only if the suffix w is in the language of some partial derivative of r with respect to l. From this, the regular expression matching algorithm can be derived as stated in Figure 15. The result from [13] states that match(w)(r) iff w ∈ L(r).

In Figure 16, we extend the partial derivative operation to regular expression patterns with sound symbols and confidence scores. The operation now takes a regular expression pattern and a sound symbol as inputs and returns a set of pairs, each consisting of a partial derivative and a score update function. Consider the sound symbol pattern case l1 : n. When l1 is equal to l2, or l1 is the wild card symbol, the operation emits a singleton set whose pair contains the empty sequence and a lambda function that increments the current score by n; otherwise, an empty set is returned. The remaining cases are straightforward. In Figure 17, we extend the regular expression matching algorithm to handle phonetic symbols and confidence scores. The function loop(·)(·) takes a sound sequence and a set of regular expression pattern-confidence pairs as input. The function returns the input set when the input sound sequence is empty. For a non-empty sound sequence, for every pair (R, n) in the input set, we apply the extended partial derivative operation to R to retrieve the residual R′ as well as the score update function f; f is applied to the cumulative score n. The main function match(·)(·) calls loop(·)(·) with the inputs and the initial score 0. Note that we compute all the possible matches by filtering the results of loop(s)({(R, 0)}) with the epsilon-possession test ǫ ∈ R′. Finally, we return the maximum score among all the matches.

·\p· :: r → l → {r}

φ \p l        = {}
ǫ \p l        = {}
l1 \p l2      = {ǫ}  if l1 == l2
                {}   otherwise
(r1|r2) \p l  = r1 \p l ∪ r2 \p l
(r1 r2) \p l  = {(r r2) | r ∈ r1 \p l} ∪ r2 \p l   if ǫ ∈ L(r1)
                {(r r2) | r ∈ r1 \p l}             otherwise
r∗ \p l       = {(r′ r∗) | r′ ∈ r \p l}

Fig. 13. Regular Expression Partial Derivatives

Regular languages:
L(r1|r2)  = L(r1) ∪ L(r2)
L(r1 r2)  = {w1 w2 | w1 ∈ L(r1), w2 ∈ L(r2)}
L(r∗)     = {ǫ} ∪ {w1 ... wn | i ∈ {1, .., n}, wi ∈ L(r)}
L(ǫ)      = {ǫ}
L(φ)      = {}
L(l)      = {l}

Fig. 14. Regular Expression Language

loop(·)(·) :: w → {r} → {r}
loop(ǫ)(rs)   = rs
loop(lw)(rs)  = loop(w)({r′ | r ∈ rs, r′ ∈ r \p l})

match(·)(·) :: w → r → Bool
match(w)(r)   = ∃r′ ∈ loop(w)({r}). ǫ ∈ r′

ǫ ∈ ǫ        ǫ ∈ r∗
ǫ ∈ r1 r2 iff ǫ ∈ r1 ∧ ǫ ∈ r2
ǫ ∈ (r1|r2) iff ǫ ∈ r1 ∨ ǫ ∈ r2

Fig. 15. Regular Expression Matching Algorithm

·\p· :: R → l → {(R, n → n)}

φ \p l          = {}
ǫ \p l          = {}
(l1 : n) \p l2  = {(ǫ, λx. n + x)}  if l1 == l2 or l1 == .
                  {}                otherwise
(r1|r2) \p l    = r1 \p l ∪ r2 \p l
(r1 r2) \p l    = {(R r2, f) | (R, f) ∈ r1 \p l} ∪ r2 \p l   if ǫ ∈ L(r1)
                  {(R r2, f) | (R, f) ∈ r1 \p l}             otherwise
R∗ \p l         = {(R′ R∗, f) | (R′, f) ∈ R \p l}

Fig. 16. Regular Expression Partial Derivatives With Sound Symbols and Confidence

loop(·)(·) :: s → {(R, n)} → {(R, n)}
loop(ǫ)(rns)   = rns
loop(ls)(rns)  = loop(s)({(R′, f(n)) | (R, n) ∈ rns, (R′, f) ∈ R \p l})

match(·)(·) :: s → R → n
match(s)(R)    = max({n | (R′, n) ∈ loop(s)({(R, 0)}), ǫ ∈ R′})

ǫ ∈ ǫ        ǫ ∈ R∗
ǫ ∈ r1 r2 iff ǫ ∈ r1 ∧ ǫ ∈ r2
ǫ ∈ (r1|r2) iff ǫ ∈ r1 ∨ ǫ ∈ r2

Fig. 17. Regular Expression Matching Algorithm With Sound Symbols and Confidence Score
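For completeness, here is a compact Scala sketch of Figs. 16 and 17 over the RegexP ADT introduced after Fig. 6 (our reconstruction, not the authors' implementation): pderiv pairs each residual pattern with a score-update function, and matchScore folds the sound sequence through all derivatives, returning the best cumulative score among states that can accept ǫ.

// Can the pattern accept the empty sequence (the epsilon-possession test)?
def nullable(r: RegexP): Boolean = r match {
  case Eps          => true
  case Star(_)      => true
  case Concat(a, b) => nullable(a) && nullable(b)
  case Choice(a, b) => nullable(a) || nullable(b)
  case _            => false // Phi, Sym
}

// Partial derivative of Fig. 16: residual patterns paired with score updates.
def pderiv(r: RegexP, l: String): List[(RegexP, Double => Double)] = r match {
  case Sym(l1, n) if l1 == l || l1 == "." => List((Eps, (x: Double) => x + n))
  case Choice(a, b) => pderiv(a, l) ++ pderiv(b, l)
  case Concat(a, b) =>
    val left = pderiv(a, l).map { case (r1, f) => (Concat(r1, b): RegexP, f) }
    if (nullable(a)) left ++ pderiv(b, l) else left
  case Star(a) => pderiv(a, l).map { case (r1, f) => (Concat(r1, Star(a)): RegexP, f) }
  case _ => Nil // Phi, Eps, non-matching Sym
}

// Matching of Fig. 17: thread every (pattern, score) state through the sound
// sequence and keep the maximum score among the accepting states.
def matchScore(s: List[String], r: RegexP): Option[Double] = {
  val finals = s.foldLeft(List((r, 0.0))) { (states, l) =>
    for { (ri, n) <- states; (rj, f) <- pderiv(ri, l) } yield (rj, f(n))
  }
  val accepted = finals.collect { case (ri, n) if nullable(ri) => n }
  if (accepted.isEmpty) None else Some(accepted.max)
}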

