Term Project Proposal for 995202088: Mining Spatial Patterns in Mixed-Quality Text Databases

Joseph Chang
National Central University
No.300, Jhongda Rd., Jhongli City, Taoyuan County 32001, Taiwan
+886-3-4227151 #35200, #35250

[email protected]

ABSTRACT
In this proposal, we present a novel data mining research direction: text mining on mixed-quality free-text databases. Possible methodologies and preliminary experimental results are also presented to suggest the feasibility of the proposed method. Our main idea is based on the belief that a mixed-quality corpus, i.e. the Web, can hide interesting, realistic co-location information that can be made useful in practice. Our goal is to develop a solid method for free-text database summarization and interesting pattern extraction, discovering co-locations and high-frequency co-locations in an anomaly set that possibly includes grammatical or semantic errors. The planned method involves automatic summarization of free-text databases into n-gram models and spatial language models, automatic filtering of the spatial models using three statistically based interestingness measurements, and automatic extraction of interesting co-location patterns based on the language models. To minimize noise, we plan to rely on data-intensive computing, i.e. to use very large databases to reduce the effect of noise. For that, we will use the Hadoop [1] parallel programming framework to execute MapReduce [2] algorithms for manipulating and storing [12] a large dataset (the Google Web1T N-grams corpus [3]). Possible applications that may benefit from the mined data include automatic local grammatical error identification and correction, and real-time writing assistance.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data mining, Statistical databases. I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis, Language models.

General Terms
Algorithms, Human Factors, Languages.

Keywords
Web, text data mining, spatial co-location mining, frequent pattern mining, corpus analysis, information extraction.

1. INTRODUCTION
Free-text databases, or corpora, often contain valuable non-trivial information that can only be discovered through text-mining techniques. Unlike structured databases, free-text data requires its own summarization methods to be transformed into accessible, structured formats. Much work has been done on text mining for various goals; however, most of these methods target mining patterns from high-quality, sometimes domain-specific, free-text databases. We present a novel direction for text mining. Instead of mining valuable information from a high-quality corpus, e.g. Wikipedia [4], the British National Corpus [5], or the Wall Street Journal Archive [6], we target databases of general English usage that contain textual errors, i.e. the Web. These errors may include grammatical errors, semantic errors, or, more importantly, collocation convention mistakes. Based on these errors and the textual co-locations found in the free-text database, we try to mine information that is hidden in the errors. Consider the expressions "listen to music" and "go home"; non-native English users often incorrectly write them as "listen music" and "go to home", respectively. Furthermore, non-grammatical errors such as "acquire knowledge" being written as "learn knowledge" are even more difficult for state-of-the-art word processing suites to detect. These language usage conventions, present in the database as co-locations, are particularly difficult for non-native speakers to pick up, but they are a reasonable target for knowledge discovery using data mining techniques over intensive data to find high-frequency patterns. To suggest the effectiveness of our proposed method, preliminary experimental results on these particular cases are presented in later sections. Our method targets both correct collocations and frequent errors found in the corpus.

In the proposed method, we first treat the input free-text data as segments of words and transform the input into n-gram language models for further processing. We then further summarize the database into spatial models based on word distance; these spatial patterns are represented as histograms/probability distributions. Three previously proposed interestingness measurements based on statistical theory are used to filter out irrelevant or uninteresting patterns. Using these spatial models as a summary of the free-text database, we then extract co-location patterns from the corpus. A further interestingness measurement is designed to select, from the newly extracted patterns, those indicative of collocations and frequent errors.

The remainder of the proposal is organized as follows. In Section 2 we describe previous research related to our proposed method. We introduce existing resources, mainly corpora and concept hierarchies, that we plan to make use of in Sections 3 and 4. In Section 5 we describe the proposed method in detail. In Section 6 we present preliminary experimental results that suggest the feasibility and effectiveness of the proposed method. In Section 7, we describe a possible application that benefits from the expected uncovered patterns, to show the potential usefulness of this research direction.

2. Related Work (Draft)

3. Databases of Consideration
To discover frequent language usage mistakes by mining free-text databases, we need a database that contains a considerable amount of errors. However, for an empirical method, if the errors in the database overwhelm correct usages, the results can be disastrous. We therefore consider using the Wikipedia dump [4], the British National Corpus [5], and the Google Web 1T corpus [3].

3.1 Web Data (Google Web 1T)
In 2006, the search company Google published n-gram models of the Web through the Linguistic Data Consortium. They have also provided several time-limited offerings of free distribution and shipping to universities around the world for research purposes. The Google Web 1T corpus is a 24GB (gzip-compressed) corpus consisting of n-grams, ranging from unigrams to five-grams, generated from approximately 1 trillion words of publicly accessible Web pages. These Web pages may include a considerable amount of errors; in this work, however, we aim to exploit these errors and mine valuable knowledge from them.
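The corpus ships as sorted, gzip-compressed files of n-gram/count pairs. As a minimal sketch (assuming the tab-separated layout of the LDC distribution, with space-separated tokens followed by a raw count), the files can be streamed as follows:

```python
import gzip

def read_ngram_counts(path):
    """Yield (tokens, count) pairs from one gzip'd Web1T n-gram file.

    Assumes one n-gram per line: tokens separated by single spaces,
    then a tab and the count. Adjust if the local copy differs.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            ngram, _, count = line.rstrip("\n").rpartition("\t")
            yield ngram.split(" "), int(count)
```

Streaming the files this way keeps memory flat, which matters at this scale; in the planned Hadoop setting the same per-line parsing would sit inside a map function.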

3.2 Wikipedia (Freebase WEX)
Wikipedia is a free online encyclopedia compiled by millions of volunteers around the world. Anyone on the Internet can freely edit existing entries and/or create new entries in Wikipedia. Owing to its large number of participants, Wikipedia has achieved both high quantity and high quality. In fact, as of August 12, 2009, the English Wikipedia consists of over 2,990,000 articles, nearly 1 billion words, and is considered to have quality comparable to traditional encyclopedias compiled by experts [7] (J. Giles, 2005). For these reasons, Wikipedia has become one of the most widely used reference tools. In an effort to make its information publicly available, Wikipedia also provides raw database dumps for download in various formats. The raw format of Wikipedia articles is Wikipedia markup syntax, which is rather time consuming to parse; however, the Freebase Wikipedia Extraction (WEX) [8] provides a parsed Wikipedia dump in several formats, including plain text and structured XML.

3.3 British National Corpus (BNC)
The British National Corpus (BNC) is an English corpus maintained by Oxford University. Compared to Wikipedia, it is a smaller corpus, containing approximately 100 million words. However, its sources are mainly published and well-maintained materials, which guarantees the quality of the corpus. Furthermore, 10% of the BNC consists of transcribed spoken English. Due to its high quality, the BNC will probably not be suitable for our method, since we depend on the errors in the database; however, we may use the BNC as a reference to confirm correct collocations.

Database  | Source                                                                              | Size        | Quality
Web1T     | English Web pages                                                                   | 10^12 words | Lower
Wikipedia | Online collaborative encyclopedia                                                   | 10^9 words  | Medium
BNC       | Oxford Univ.; newspapers, books, journals (90%); meetings, radio shows, phone calls (10%) | 10^8 words  | Higher

Table 1. Comparison of considered free-text databases.

4. Concept Hierarchy of Consideration
With the help of semantic concept hierarchies, we can extract patterns that consist not only of words but also of conceptual classes. For example, if we see the phrase "to absorb knowledge", we can roll up the word absorb (take up mentally) to the more general concept learn (gain knowledge or skills), or we can drill down to a more specific concept imbibe (receive into the mind and retain). In application development, we can also make use of concept hierarchies to find closely related words for substitution suggestions in an error detection system.

4.1 WordNet
WordNet [9] is a freely available, handcrafted lexical semantic database for English. Its development started in 1985 at Princeton University with a team of cognitive scientists, and WordNet was originally intended to support psycholinguistic research. Over the years, WordNet has become increasingly popular in the fields of text data mining, information extraction, natural language processing, and artificial intelligence. Through successive releases, WordNet has grown into a comprehensive database of concepts in the English language. As of today, the latest version, WordNet 3.0, contains a total of 207,000 semantic relations between 150,000 words organized into over 115,000 concepts. Concepts in WordNet are represented as synonym sets (synsets); a synset contains one or more words that can express the same meaning. WordNet also records various semantic relations between its concepts. The hypernym relation between concepts makes it possible to view WordNet as a concept hierarchy or, in linguistic terminology, an ontology.
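As a small illustration of the roll-up and drill-down operations of Section 4, the following sketch uses NLTK's WordNet interface (a real system would first need to disambiguate to the correct sense of the word):

```python
# Roll-up / drill-down over WordNet's hypernym hierarchy via NLTK.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

for s in wn.synsets("absorb", pos=wn.VERB):
    rolled_up = [h.name() for h in s.hypernyms()]     # more general concepts
    drilled_down = [h.name() for h in s.hyponyms()]   # more specific concepts
    print(s.name(), "| up:", rolled_up, "| down:", drilled_down[:3])
```

For the mental sense of absorb, the hypernym list includes a learn-like synset, matching the roll-up example above.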

4.2 Roget's Thesaurus
Roget's Thesaurus [10] is an English thesaurus widely used as a reference and research resource since the nineteenth century. Created by Dr. Peter Mark Roget in 1805 and released to the public in 1852, Roget's Thesaurus uses a three-level hierarchical structure to divide words into multiple classes. There are six primary classes at the top level, each composed of multiple divisions. At the bottom level, words are divided into over one thousand sections under the divisions. The most general word in each section is labeled as the headword. These sections of word clusters are not all strict synonyms; they can also be semantically closely related words.

5. Proposed Method
We propose a three-phase process to uncover interesting co-location patterns in a free-text database that contains language usage errors. In the first phase (Section 5.1), we transform each corpus into n-gram language models and summarize the database into spatial patterns from those models. In the second phase (Section 5.2), we use different interestingness measurements to select interesting patterns. In the third and final phase (Section 5.3), we use the selected spatial models to discover co-location patterns in the n-gram models.


5.1 Spatial Pattern Extraction
In phase one, we first transform each of the unstructured free-text databases into n-gram models. Due to the limitations of the Google Web1T N-grams corpus, we use n-grams of lengths from unigrams to 5-grams.

We define our spatial window D as 8 words (from -4 to +4, excluding 0). From the n-gram models, we count the distances between any pair of words, from -4 to +4, as spatial patterns. For the example word pair (LISTEN, MUSIC), the distance in "listen music" is 1, in "listen to music" is 2, and in "listen to the music" is 3. Figures 1 and 2 show two spatial patterns extracted from the Google Web1T corpus for the word pairs (GO, HOME) and (LISTEN, MUSIC).

Figure 1. Spatial pattern of the word pair (GO, HOME) extracted from the Google Web 1T 5-gram Corpus. The red line shows the number of appearances of the phrase "go to home".

Figure 2. Spatial pattern of the word pair (LISTEN, MUSIC) extracted from the Google Web 1T 5-gram Corpus. The red line shows the number of appearances of the phrase "listen to music".
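A minimal sketch of the histogram accumulation is shown below; read_ngram_counts is the hypothetical Web1T reader sketched in Section 3.1, and the handling of overlapping n-grams (which would otherwise double-count co-occurrences) is deliberately ignored here:

```python
from collections import defaultdict

def spatial_histogram(five_grams, w1, w2, window=4):
    """Accumulate a distance histogram for the word pair (w1, w2).

    five_grams yields (tokens, count) pairs. A distance d means w2
    occurs d words after w1 (negative d: before); d ranges over
    -window..+window, excluding 0.
    """
    hist = defaultdict(int)
    for tokens, count in five_grams:
        for i, t in enumerate(tokens):
            if t != w1:
                continue
            for j, u in enumerate(tokens):
                d = j - i
                if u == w2 and d != 0 and -window <= d <= window:
                    hist[d] += count
    return hist

# e.g. spatial_histogram(read_ngram_counts("5gm-0001.gz"), "listen", "music")
```

In a MapReduce formulation, the inner loops would emit ((w1, w2, d), count) key-value pairs from the map phase, with the reduce phase summing counts per key.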

5.2 Interestingness of Spatial Patterns
As shown in Figures 1 and 2, the histograms of related word pairs usually have one or more peaks and very high appearance counts in the database. In contrast, Figure 3 shows a much less related word pair, (GO, MUSIC), which has low appearance counts in the database and no significant peaks compared to (GO, HOME). Figure 3 also compares the collocation (ACQUIRE, KNOWLEDGE) with the frequently mistaken usage (LEARN, KNOWLEDGE).

Figure 3. Comparison of Left: (LEARN, KNOWLEDGE) and (ACQUIRE, KNOWLEDGE); Right: (GO, HOME) and (GO, MUSIC).

To identify interesting spatial patterns of word pairs, we use three interestingness measurements, based on statistical theory, proposed by Smadja in 1993 [11]. These measurements score the patterns by their strength, spread, and interesting distances, respectively.

Figure 4. The three interestingness measurements proposed by Smadja.

In the next three sub-sections, we explain each of these measurements in detail.

5.2.1 C1: Measuring Strength
The strength measurement is a mutual-information-like score that evaluates the relatedness of two words in the database; more precisely, how often they appear no further apart than the size of the defined spatial window. The strength measurement is based on the statistical z-score (standard score). To calculate the strength of a given word pair (W, Wi), we first calculate the average frequency f̄ and standard deviation σ over the set {(W, Wx) for all Wx in the database}. We then take the difference between the frequency of the word pair (W, Wi) and the average frequency f̄, expressed in numbers of standard deviations σ, as the strength score.
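Written out explicitly (our reconstruction from the description above, following Smadja's z-score formulation), the strength of a word pair is:

```latex
\mathrm{strength}(W, W_i) = \frac{\mathit{freq}_i - \bar{f}}{\sigma},
\qquad
\bar{f} = \frac{1}{n}\sum_{x=1}^{n} \mathit{freq}_x,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{x=1}^{n}\left(\mathit{freq}_x - \bar{f}\right)^2}
```

where freq_x is the frequency of the pair (W, Wx) within the spatial window and n is the number of distinct words co-occurring with W.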

5.2.2 C2: Measuring Spread
The spread measurement summarizes the shape of each pattern based on the normalized variance of its histogram. In the formula, pij denotes the number of appearances of (W, Wi) in the database that are j words apart, and D denotes the window size.

Figure 5. Spread Measurement of (W, Wi)
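The formula in Figure 5, reconstructed from the definitions above (and consistent with Smadja's formulation), is:

```latex
\mathrm{spread}(W, W_i) = \frac{1}{D}\sum_{j=-4,\; j\neq 0}^{4}\left(p_{ij} - \bar{p}_i\right)^2,
\qquad
\bar{p}_i = \frac{1}{D}\sum_{j=-4,\; j\neq 0}^{4} p_{ij}
```

where j ranges over the D = 8 positions from -4 to +4, excluding 0.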

If the spread is low, the shape of the histogram tends to be flat, indicating that Wi does not often appear at any particular position around W in the database.

5.2.3 C3: Evaluating Interesting Distances
The two aforementioned interestingness measurements eliminate uninteresting rows in the database. From the histogram patterns selected by the previous two measurements, the last measurement identifies the interesting distance(s) of each selected word pair.

5.3 Co-location Discovery and Scoring
Collocations in English may contain more than just two words; they are often patterns containing multiple words. For example, from the spatial model of the word pair (DO, FAVOR), we can probably extract the pattern "do a favor". In the final phase of the proposed method, we expand the word pairs discovered in the previous phases inward, into collocation patterns. For this, we propose a three-step process, described in the next three sub-sections.

5.3.1 Anchor Words Identification
In the first step, we try to find the anchor words between the given word pairs at the given distances, e.g. to discover that the second word between the word pair (DO, FAVOR, distance=3) is almost always the word "a". This can be achieved simply by counting. Likelihood probability estimates can also be assigned to the anchor words. This forms the pattern "do * a favor".
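A sketch of the counting step (the five_grams iterator and the 0.9 dominance threshold are illustrative assumptions, not fixed parts of the method):

```python
from collections import Counter, defaultdict

def anchor_words(five_grams, w1, w2, distance, threshold=0.9):
    """For each position between w1 and w2 (at the given distance),
    count which words fill it; a position whose most frequent filler
    dominates the mass yields an anchor word with its likelihood."""
    fillers = defaultdict(Counter)
    for tokens, count in five_grams:
        for i in range(len(tokens) - distance):
            if tokens[i] == w1 and tokens[i + distance] == w2:
                for k in range(1, distance):
                    fillers[k][tokens[i + k]] += count
    anchors = {}
    for k, counter in fillers.items():
        total = sum(counter.values())
        word, freq = counter.most_common(1)[0]
        if total and freq / total >= threshold:
            anchors[k] = (word, freq / total)
    return anchors
```

For (DO, FAVOR, distance=3), this should yield something like {2: ("a", 0.95)}, i.e. the pattern "do * a favor".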

5.3.2 Missing Column Generalization
In the second step, we try to generalize columns for which no anchor word is found, i.e. where no specific word is used frequently in that position. As an example, there is no single high-frequency word for the position directly after "do" in the word pair (DO, FAVOR, distance=3), but many appearances of him, me, yourself, or names of a person. With a proper concept hierarchy, we can generalize these terms to a tag <PERSON>, forming the co-location pattern "do <PERSON> a favor".
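A hedged sketch of this generalization test; is_person stands in for a membership lookup against a real concept hierarchy such as WordNet's person subtree, and the 0.8 coverage threshold is an illustrative assumption:

```python
def generalize_column(filler_counts, is_person, coverage=0.8):
    """If no single word dominates a column but most fillers map to
    one concept (here: PERSON), replace the column with that tag."""
    total = sum(filler_counts.values())
    person_mass = sum(c for w, c in filler_counts.items() if is_person(w))
    if total and person_mass / total >= coverage:
        return "<PERSON>"
    return None
```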

5.3.3 Co-location Pattern Clustering
Among the many co-location patterns discovered in Step 2, some patterns may be closely related to, or even contain, other patterns. We define the following three rules to associate related patterns (a code sketch for rule 1 follows at the end of this sub-section):

1. Containment: two patterns that can be made identical by adding or removing elements from one of the patterns.

2. Rolling-up: two patterns that can be made identical by generalizing words from one of the patterns using the concept hierarchy.

3. Generalization: multiple patterns that all generalize to a new pattern.

As an example, consider the following four patterns extracted by Step 2:

1. do a favor

2. do a small favor

3. do a big favor

4. do a <SIZE> favor

The following operations can be made to relate these patterns:

1. Pattern 2 contains Pattern 1, since Pattern 2 can be obtained from Pattern 1 by adding the word small.

2. Pattern 2 and Pattern 3 can both be rolled up to Pattern 4.

3. Pattern 2 and Pattern 3 can be generalized to form Pattern 4.

Instead of clustering, we can also try to produce a single pattern that represents all four patterns, i.e. "do a {<SIZE>} favor" ({} indicates an optional element).
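As a small illustration of rule 1, containment can be tested as a token-subsequence check (a sketch; rolling-up and generalization would additionally consult the concept hierarchy):

```python
def contains(longer, shorter):
    """True if `shorter` can be obtained from `longer` by deleting
    tokens, i.e. `shorter` is a subsequence of `longer`."""
    it = iter(longer)
    return all(tok in it for tok in shorter)

assert contains("do a small favor".split(), "do a favor".split())
assert not contains("do a favor".split(), "do a small favor".split())
```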

6. Preliminary Results
We use the aforementioned word pair (DO, FAVOR) as the mining target for a preliminary study. The results are promising.

6.1.1 Corpus Comparison using Spatial Patterns

Figure 6. Spatial patterns of (LISTEN, MUSIC) and (GO, HOME) extracted from both Wikipedia (blue) and Web1T (green).

Here we show the spatial patterns of two word pairs extracted from Wikipedia and Web1T. The results show similar histogram shapes for the same word pair in the two free-text databases, indicating that spatial language models are a solid approach to summarizing free-text databases.
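To make such cross-corpus comparisons concrete, each histogram can be normalized to a probability distribution before comparing shapes; a simple sketch (Euclidean distance is an arbitrary choice here, any divergence measure would do):

```python
import math

POSITIONS = (-4, -3, -2, -1, 1, 2, 3, 4)

def normalize(hist):
    """Turn raw distance counts into a distribution over POSITIONS."""
    total = sum(hist.get(p, 0) for p in POSITIONS) or 1
    return [hist.get(p, 0) / total for p in POSITIONS]

def shape_distance(h1, h2):
    """Euclidean distance between two normalized spatial histograms."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(normalize(h1), normalize(h2))))
```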

6.1.2 Analysis of (DO, FAVOR)
We first look at the histogram of (DO, FAVOR) generated from the Google Web 1T N-grams database:



Figure 7. Histogram of (DO, FAVOR)

In the histogram, we find distances 2, 3, and 4 to be much more frequent than the others. Therefore, we further investigate these distances using the n-gram models. Below we show the top 8 n-grams for each of the three distances.

D=2: do not favor, do you favor, do a favor, do the favor, do this favor, do I favor, do they favor, do me favor

D=3: do yourself a favor, do me a favor, do themselves a favor, do us a favor, do you a favor, do me the favor, do everyone a favor, do him a favor

D=4: do us all a favor, do the world a favor, do yourself a big favor, do yourself a huge favor, do me a big favor, do us both a favor, do your self a favor, do me a huge favor

Figure 8. Detailed analysis of the 3-, 4-, and 5-grams of the word pair (DO, FAVOR) at the corresponding distances of 2, 3, and 4.

These results are consistent with the assumptions made in Section 5.3.3. Furthermore, they show concentrated peaks in the frequency distribution, which strongly suggests that the proposed method can be effective.
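The drill-down behind Figure 8 is a straightforward filter over the (distance + 1)-gram counts; a sketch, reusing the hypothetical reader from Section 3.1:

```python
import heapq

def top_k_at_distance(ngrams, w1, w2, distance, k=8):
    """Return the k most frequent n-grams that start with w1 and
    end with w2 exactly `distance` positions later."""
    matches = ((count, " ".join(tokens)) for tokens, count in ngrams
               if len(tokens) == distance + 1
               and tokens[0] == w1 and tokens[-1] == w2)
    return heapq.nlargest(k, matches)

# e.g. top_k_at_distance(read_ngram_counts("4gm-0001.gz"), "do", "favor", 3)
# should surface phrases like "do yourself a favor".
```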

7. Application – Local Error Correction
Local error identification and correction is an obvious practical use of the mined data. Our main idea for such an application is to identify low-frequency patterns in the input free text and suggest that the user change them to an alternate, higher-frequency pattern within the same pattern cluster, or in another related pattern cluster. This way, we can help non-native English users choose words conventionally, i.e. as most people would use them.
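A sketch of the suggestion loop under these assumptions: pattern_freq is a precomputed pattern-to-frequency table and clusters maps each pattern to its related patterns, as produced by the clustering of Section 5.3.3 (both hypothetical; the 1,000 threshold is illustrative):

```python
def suggest(pattern, pattern_freq, clusters, min_freq=1000):
    """If the observed pattern is rare, suggest the most frequent
    related pattern from the same cluster, if one exists."""
    if pattern_freq.get(pattern, 0) >= min_freq:
        return None  # looks conventional; no suggestion needed
    candidates = clusters.get(pattern, [])
    best = max(candidates, key=lambda p: pattern_freq.get(p, 0), default=None)
    if best is not None and pattern_freq.get(best, 0) >= min_freq:
        return best
    return None

# e.g. suggest("listen music", pattern_freq, clusters) -> "listen to music"
```

Keeping the lookup table local makes the check cheap enough for real-time writing assistance, which is the scenario named in the abstract.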

8. REFERENCES
[1] Apache Hadoop. http://hadoop.apache.org/
[2] Dean, J. and Ghemawat, S. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM (2008).
[3] Google Web 1T 5-gram, the Linguistic Data Consortium. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
[4] Wikipedia: Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download
[5] British National Corpus. http://www.natcorp.ox.ac.uk/
[6] WSJ Corpus, the Linguistic Data Consortium. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T43
[7] Giles, J. 2005. Internet encyclopedias go head to head. Nature (2005).
[8] Freebase Wikipedia Extraction (WEX). http://download.freebase.com/wex/
[9] Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. The MIT Press (1998).
[10] Roget, P. M. 1852. Roget's Thesaurus. http://poets.notredame.ac.jp/Roget/contents.html
[11] Smadja, F. 1993. Retrieving collocations from text: Xtract. Computational Linguistics (1993).
[12] Chang, F., Dean, J., Ghemawat, S., et al. 2008. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 26, 2 (June 2008); also in Proc. 7th USENIX Symposium on Operating Systems Design and Implementation (2006), 205-218.
