Term Project for 995202088: Mining Interesting Ngrams based on Collocations for Prepositional Error Correction

Joseph Chang
National Central University
No.300, Jhongda Rd., Jhongli City, Taoyuan County 32001, Taiwan
+886-3-4227151 #35200, 35250

[email protected]

ABSTRACT

We present a novel data mining research direction of text mining on a web-scale text database, mainly the Google Web 1T n-grams, ranging from unigrams to 5-grams. Methodologies, preliminary results, the problem definition, and a formal evaluation are presented to show the effectiveness of the method. Our main idea is based on the belief that a very large corpus, i.e. the Web, contains realistic collocation information that can be made useful in practice. Our goal is to use data mining techniques to develop a solid method for free-text database summarization and interesting pattern extraction, discovering co-locations and high-frequency co-locations in an anomaly set. Previous methods for prepositional error correction or candidate selection either fail to reach real web scale or rely on very intensive queries against enormous web-scale databases, which makes them impossible to run on a single commodity computer system. By using data mining techniques, we can effectively extract only the most valuable information from the large corpus, thereby achieving both web scale and light weight. The method involves automatic summarization of free-text databases into n-gram models and spatial language models, automatic filtering of the spatial models using three statistically based interestingness measurements, and automatic extraction of interesting co-location patterns based on the language models. To minimize noise, we depend on data-intensive computing, i.e. we use very large databases to reduce the effect of noise. For that, we use the Hadoop [1] parallel programming framework for executing MapReduce [2] algorithms to manipulate and store [12] the large dataset (the Google Web 1T n-grams corpus [3]). In an implementation of the proposed method, our system, PrepAnnotator, detects potential prepositional errors in an input article and gives automatic suggestions for replacing, deleting, or inserting prepositions. In the training stage of PrepAnnotator, we make use of the Cassandra [13] distributed NoSQL database system to achieve real-time error correction. In evaluation, PrepAnnotator shows reasonable precision and recall while relying on a database of under 100 megabytes.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications – Data mining, Statistical databases. I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis, Language models.

General Terms

Algorithms, Human Factors, Languages.

Keywords

Web, Text data mining, Spatial co-location mining, Frequent pattern mining, Corpus analysis, Information extraction.

1. INTRODUCTION

Prepositional error correction has been shown to be a difficult real-life task that is crucial especially for non-native language learners. In an investigation into the nature of the task, Tetreault and Chodorow [17] asked two human annotators to fill in prepositions in a text from which the prepositions had been removed. The two annotators achieved only 75% agreement with each other and with the original text. The problem is usually divided into three parts: error detection, candidate generation, and candidate selection. In the Related Work section we introduce several previous methods with the same goal; however, most of them deal only with the last part of the problem.

Free-text databases, or corpora, often contain valuable non-trivial information that can only be discovered through text-mining techniques. Unlike structured databases, free text requires its own summarization methods to transform the data into accessible, structured formats. Much work has been done on text mining for various goals. However, many of these methods target mining patterns from high-quality, sometimes domain-specific free-text databases. We present a novel direction of text mining. Instead of mining valuable information from a high-quality corpus, e.g. Wikipedia [4], the British National Corpus [5], or the Wall Street Journal archive [6], we target a database of general English usage that contains textual errors, i.e. the Web. These errors may include grammatical errors, semantic errors, or, more importantly, violations of collocation conventions. By using data-intensive data mining methods, we can handle web-scale data while reducing the effects of this noise.

Consider the expressions “listen to music” and “go home”: they are often incorrectly written by non-native English users as “listen music” and “go to home” respectively. Furthermore, non-grammatical errors such as “acquire knowledge” being written as “learn knowledge” are even more difficult for state-of-the-art word processing suites to detect. These language usage conventions, present in the database as co-locations, are particularly difficult for non-native speakers to pick up, but they are a reasonable target for knowledge discovery using data mining techniques on intensive data to find high-frequency patterns. To suggest the effectiveness of our proposed method, preliminary experimental results on these particular cases are presented in later sections.

Our method targets extracting the most valuable information for the task of prepositional error detection based on collocations. We first treat the input free-text data as segments of words and transform it into n-gram language models for further processing. We then further summarize the database into spatial models based on word distance; these spatial patterns are represented as histograms/probability distributions. Three previously proposed interestingness measurements based on statistical theories are used to filter out irrelevant or uninteresting patterns. Using these spatial models as a summary of the free-text database, we then extract co-location patterns from the corpus.

The remainder of the proposal is organized as follows. In Section 2 we describe previous research related to the proposed method. We introduce existing resources, mainly corpora and concept hierarchies, that we plan to make use of in Sections 3 and 4. In Section 5 we describe the proposed method in detail. In Sections 6 through 8 we present our classification model, preliminary experimental results, and a formal evaluation that suggest the feasibility and effectiveness of the proposed method, showing the potential usefulness of this research direction. The last section concludes.

2. Related Work

Several web-scale approaches have been proposed for prepositional error correction.

2.1 Smart Query [15]

In 2010, Gamon et al. at Microsoft Research investigated the potential of using web data for prepositional error correction. With a corpus that contains grammatical errors and human-annotated corrections, the proposed system takes both the errors and the answers and tests whether web data can distinguish right from wrong. To form more effective queries, POS tagging and parsing techniques are applied to the input sentences to select relevant words surrounding the erroneous preposition. After formulating the two queries, the Google Search API, the Microsoft Bing API, and Google Web 1T are queried respectively to see whether the frequency of the query containing the answer is higher than that of the query containing the error. The method focuses only on the third part of the task, candidate selection. The three databases achieved different precision and recall: the Google Search API, the Microsoft Bing API, and Google Web 1T achieved 81%, 74%, and 88% precision respectively, and 81%, 71%, and 69% recall respectively. Considering that the answers are given as input, the proposed method is not a complete solution to the problem. However, it does suggest that web data is a good research direction for prepositional error correction. Other limitations of the method include the need for POS tagging and parsing, and the reliance on querying large web-scale databases.

2.2 Web-Scale Ngram Models for Lexical Disambiguation [18]

In 2009, Bergsma et al. proposed a method of using the Google Web 1T n-grams for lexical disambiguation. The main idea is to use a sliding window of different sizes to form all possible queries, from short n-grams up to 5-grams, that contain the target position w0. For example:

    system tried to decide w0
           tried to decide w0 the
                 to decide w0 the two
                    decide w0 the two confusable
                           w0 the two confusable words

The method is also evaluated on prepositional error correction, addressing both the second and third parts of the problem and using a set of 34 prepositions as candidates for all labeled errors. It achieved an impressive 75% precision and recall when the New York Times corpus was used for training and evaluation, and 58% when using only the web-scale database. However, to achieve high precision and recall, the method uses the web-scale data only as a reference for deciding the weights of different features; the vocabulary of the model is therefore limited to the training corpus, i.e. the New York Times. Although the method requires no parsing or POS tagging, it requires 476 queries over all possible n-grams.
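To make the windowing concrete, the following is a minimal sketch (not Bergsma et al.'s actual implementation) of how the candidate queries around a target slot can be enumerated; the function name and the clipping behavior are our own illustrative choices.

    def window_queries(tokens, target_index, max_n=5):
        """Enumerate every n-gram (2 <= n <= max_n) that covers the target slot.

        tokens[target_index] is the candidate word (or the slot to fill),
        mirroring the w0 position in the example above.
        """
        queries = []
        for n in range(2, max_n + 1):
            # The window may start anywhere that still covers target_index.
            for start in range(target_index - n + 1, target_index + 1):
                end = start + n
                if start < 0 or end > len(tokens):
                    continue
                queries.append(tuple(tokens[start:end]))
        return queries

    # Example: enumerate the query windows around the slot w0.
    sent = "the system tried to decide w0 the two confusable words".split()
    qs = window_queries(sent, sent.index("w0"))
    print(len(qs), qs[:3])

With a full context on each side this enumerates 2 + 3 + 4 + 5 = 14 windows per candidate word, which for the 34 candidate prepositions mentioned above gives the 476 queries cited in the text.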

2.3 Interestingness of Spatial Patterns [11]

In 1993, Smadja [11] proposed a novel way of using histograms to summarize free-text data, indicating the spatial relation of each word pair in the database. Smadja also proposed three interestingness measurements for selecting collocations from the summarized database. As shown in Figures 1 and 2, the histogram of a related word pair usually has one or more peaks and a very high appearance count in the database. On the contrary, Figure 3 (right) shows a much less related word pair, (GO, MUSIC), which has a low appearance count in the database and no significant peaks in its histogram compared to (GO, HOME). The other example in Figure 3 (left) compares the collocation (ACQUIRE, KNOWLEDGE) with the frequently mistaken usage (LEARN, KNOWLEDGE).

Figure 3. Comparison of Left: (LEARN, KNOWLEDGE) and (ACQUIRE, KNOWLEDGE); Right: (GO, HOME) and (GO, MUSIC). The x-axis is the signed word distance (-4 to +4) and the y-axis the appearance count.

In order to identify interesting spatial patterns of word pairs, we describe the three interestingness measurements, based on statistical theory, proposed by Smadja in 1993 [11]. These measurements score the patterns according to their strength, their spread, and their interesting distances, respectively.

Figure 4. The three interestingness measurements proposed by Smadja.

In the next three sub-sections, we explain each of these measurements in detail.

2.3.1 C1: Measuring Strength

The strength measurement is a mutual-information-like score that evaluates the relatedness of two words in the database; more precisely, how often they appear no further apart than the size of the defined spatial window. The strength measurement is based on the statistical z-score (or standard score). To calculate the strength of a given word pair (W, Wi), we first calculate the average frequency and the standard deviation σ over the set {(W, Wx) for all Wx in the database}. We then express the difference between the frequency of the word pair (W, Wi) and the average frequency, in units of the standard deviation σ, as the strength score.
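Written out, the strength score described above is the z-score of the pair's frequency. This is a reconstruction in LaTeX notation from the prose; freq_i denotes the frequency of (W, W_i) within the window, and \bar{f} and \sigma are the mean and standard deviation over all pairs (W, W_x):

    \mathrm{strength}(W, W_i) = \frac{\mathit{freq}_i - \bar{f}}{\sigma}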

2.3.2 C2: Measuring Spread

The spread measurement summarizes the shape of each pattern based on the normalized variance of the histogram. In the formula, p_ij denotes the number of appearances of (W, Wi) in the database at a distance of j words, and D denotes the window size.

Figure 5. Spread measurement of (W, Wi).

If the spread is low, the shape of the histogram tends to be flat, indicating that Wi does not appear at any particular position around W more often than elsewhere.
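For reference, the spread can be written as the variance of the histogram over the D window positions. This is a reconstruction from the prose and from Smadja [11]; \bar{p}_i denotes the mean of the p_{ij} over the D non-zero distances:

    U_i = \frac{1}{D} \sum_{j} \left( p_{ij} - \bar{p}_i \right)^2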

2.3.3 C3: Evaluating Interesting Distances

The two aforementioned interestingness measurements eliminate uninteresting rows in the database. From the histogram patterns selected by the previous two measurements, the last measurement identifies the interesting distance(s) of each selected word pair.
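The formula is not reproduced here; in our reconstruction of Smadja's original formulation, which this project only needs as a reference point since the feature extraction in Section 5.3.2 uses the spread measure alone, a distance j is kept as interesting when its count stands out from the histogram mean by a tunable factor k_1 of the spread:

    p_{ij} \geq \bar{p}_i + k_1 \sqrt{U_i}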

3. Free-text Databases

To discover frequent language usage mistakes by mining free-text databases, we need a database that contains a considerable amount of errors. However, for an empirical method, if the errors in the database overwhelm the correct usages, the results can be disastrous. For that, we consider using the Wikipedia dump [4], the British National Corpus [5], and the Google Web 1T corpus [3].

3.1 Web Data (Google Web 1T)

In 2006, the search company Google published their n-gram models of the Web through the Linguistic Data Consortium. They have also provided several time-limited offers of free distribution and shipping to universities around the world for research purposes. The Google Web 1T corpus is a 24 GB (gzip-compressed) corpus consisting of n-grams, ranging from unigrams to five-grams, generated from approximately 1 trillion words of publicly accessible Web pages. These Web pages may include a considerable amount of errors; in this work, however, we aim to exploit these errors and mine valuable knowledge from them.

3.2 Wikipedia (Freebase WEX)

Wikipedia is a free online encyclopedia compiled by millions of volunteers around the world. Anyone on the Internet can freely edit existing entries and/or create new entries to add to Wikipedia. Owing to the size of its community, Wikipedia has achieved both high quantity and quality. In fact, as of August 12, 2009, the English Wikipedia consists of over 2,990,000 articles and nearly 1 billion words, and is considered to be of similar quality to traditional encyclopedias compiled by experts [7] (J. Giles, 2005). For these reasons, Wikipedia has become one of the most referenced tools. In an effort to make its information publicly available, Wikipedia also provides raw database dumps for download in various formats. The raw format of a Wikipedia article is Wikipedia markup syntax, which is rather time-consuming to parse. However, the Freebase Wikipedia Extraction (WEX) [8] provides parsed Wikipedia dumps in various formats, including plain text and structured XML.

3.3 British National Corpus (BNC)

The British National Corpus (BNC) is an English corpus maintained by Oxford University. Compared to Wikipedia, it is a smaller corpus, containing approximately 100 million words. However, the sources of the corpus are mainly published and well-maintained materials, which guarantees its quality. Furthermore, 10% of the BNC consists of spoken English. Owing to its high quality, the BNC will probably not be suitable for our method, since we depend on the errors in the database; we may, however, use the BNC as a reference to confirm correct collocations.

Table 1. Comparison of free-text databases

Database | Source | Size | Quality
Web1T | English Web pages | 10^12 words | Lower
Wikipedia | Online collaborative encyclopedia | 10^9 words | Median
BNC | Oxford Univ.: newspapers, books, journals (90%); meetings, radio shows, phone calls (10%) | 10^8 words | Higher
Brown | Newspapers, books, governmental documents, reports | 10^6 words | Higher

4. Concept Hierarchies

With the help of semantic concept hierarchies, we can extract patterns that consist not only of words but also of conceptual classes. For example, if we see the phrase “to absorb knowledge”, we can roll up the word absorb (take up mentally) to the more general concept learn (gain knowledge or skills), or we can drill down to a more specific concept such as imbibe (receive into the mind and retain). In application development, we can also make use of concept hierarchies to find closely related words for substitution suggestions in an error detection system.

4.1 WordNet

WordNet [9] is a freely available, handcrafted lexical semantic database for English. Its development started back in 1985 at Princeton University with a team of cognitive scientists, and WordNet was originally intended to support psycholinguistic research. Over the years, WordNet has become increasingly popular in the fields of text data mining, information extraction, natural language processing, and artificial intelligence. Through its releases, WordNet has grown into a comprehensive database of concepts in the English language. As of today, the latest 3.0 version of WordNet contains a total of 207,000 semantic relations between 150,000 words organized in over 115,000 concepts.

Concepts in WordNet are represented as synonym sets (synsets). A synset contains one or more words that can express the same meaning. WordNet also records various semantic relations between its concepts. The hypernym relation between concepts makes it possible to view WordNet as a concept hierarchy, or, in linguistic terminology, an ontology.
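As an illustration of the roll-up operation described in Section 4, the following sketch uses NLTK's WordNet interface to walk from the verb senses of “absorb” to their hypernyms and hyponyms; it simply prints every verb sense, since exact sense numbering varies across WordNet versions.

    # pip install nltk; then run nltk.download('wordnet') once.
    from nltk.corpus import wordnet as wn

    # All verb senses of "absorb"; one of them is the "take up mentally" sense.
    for synset in wn.synsets('absorb', pos=wn.VERB):
        hypernyms = synset.hypernyms()      # roll-up: more general concepts
        hyponyms = synset.hyponyms()        # drill-down: more specific concepts
        print(synset.name(), synset.definition())
        print('  roll-up:   ', [h.name() for h in hypernyms])
        print('  drill-down:', [h.name() for h in hyponyms])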

4.2 Roget's Thesaurus

Roget's Thesaurus [10] is an English thesaurus widely used as a reference and research resource for over 200 years. Created by Dr. Peter Mark Roget in 1805 and released to the public in 1852, Roget's Thesaurus uses a three-level hierarchical structure to divide words into multiple classes. There are six primary classes at the top level, each composed of multiple divisions. At the bottom level, words are divided into over one thousand sections under the different divisions. The most general word in each section is labeled as the headword. These sections of word clusters are not all strict synonyms; they can also be semantically closely related words.

5. Proposed Method

We propose a three-phase process to uncover interesting co-location patterns in a free-text database that contains language usage errors. In the first phase (Section 5.3.1), we transform each corpus into n-gram language models and summarize the database into spatial patterns from the models. In the second phase (Section 5.3.2), we use interestingness measurements to select interesting patterns. In the third and final phase, we use the selected spatial models to discover co-location patterns, i.e. interesting n-grams, in the n-gram models. Collocations in English may contain more than just two words; they are often patterns of multiple words. For example, from the spatial model of the word pair (DO, FAVOR), we can probably extract the pattern “do a favor”. In the final phase of the proposed method we expand inward from the word pairs discovered in the previous phases into collocation patterns. For this, we propose a three-step process described in the next three sub-sections.

5.1 Problem Definition

5.1.1 Preposition Detection
Given a corpus with prepositions removed, predict the locations that in the original corpus contained a prepositional word or phrase, maximizing precision and recall.

5.1.2 Preposition Selection
Given a corpus with annotated locations of missing prepositions, rank the most probable prepositions to insert at each location, maximizing precision, recall, and/or MRR.

5.1.3 Preposition Error Correction
Given a corpus that contains prepositional errors, identify and correct all errors.

5.2 Preposition Detection

Unlike most problems in natural language processing, training data for prepositional error correction comes for free: we simply use the appearances of prepositions in a high-quality corpus as answers. We train two taggers to predict preposition appearances in a prepositions-removed corpus: a specialized POS tagger and a preposition-appearance tagger.

The specialized POS tagger is used to POS-tag input data that is expected to contain prepositional errors. It is trained on the POS-annotated Brown corpus with preposition words removed. We experimented with HMM, CRF, and the SVM-based Yamcha tagger, and achieved 93.17% precision.

The second tagger makes use of the POS data and the words themselves to predict the appearance of prepositions. For example, the sentence “I was listening to music by the band.” with prepositions removed becomes “I was listening music the band.”; with an implementation of the method, we obtain the result “I:0.0 was:0.0 listening:0.0 music:0.53 the:0.98 band:0.0 .:0.0”, indicating that the positions before “music” and “the” are highly probable places to insert a preposition. We use CRF and SVM to train the preposition prediction tagger and achieve over 95% precision. However, the ratio between prepositions and non-prepositions is biased, and we are only concerned with finding prepositions, so in the Evaluation section we evaluate the results based on the precision and recall of prepositions.
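The training data for this tagger can be derived mechanically from any corpus, as described above. Below is a minimal sketch of that preparation step (our own illustration, not the project's code): it strips prepositions from a tokenized sentence and labels each remaining token with whether a preposition originally preceded it, which is the kind of label sequence a CRF or SVM tagger would be trained on.

    # A tiny, hypothetical preposition list; the real system derives the set
    # from the Brown corpus POS annotations.
    PREPOSITIONS = {"to", "by", "of", "in", "on", "for", "with", "at", "from"}

    def make_training_instance(tokens):
        """Return (preposition-stripped tokens, labels).

        labels[i] == 1 means a preposition originally appeared immediately
        before the i-th remaining token.
        """
        stripped, labels = [], []
        pending = 0
        for tok in tokens:
            if tok.lower() in PREPOSITIONS:
                pending = 1          # remember that we dropped a preposition here
            else:
                stripped.append(tok)
                labels.append(pending)
                pending = 0
        return stripped, labels

    sent = "I was listening to music by the band .".split()
    print(make_training_instance(sent))
    # (['I', 'was', 'listening', 'music', 'the', 'band', '.'], [0, 0, 0, 1, 1, 0, 0])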

5.3 Preposition Selection

5.3.1 Spatial Pattern Extraction

In phase one, we first transform each of the unstructured free-text databases into n-gram models. Owing to the limitations of the Google Web 1T N-grams corpus, we use n-grams of lengths from unigrams to 5-grams. We therefore define our spatial window D as 8 words (from -4 to +4, excluding 0). From the n-gram models, we count the distances between any pair of words, from -4 to 4, as spatial patterns. For the example word pair (LISTEN, MUSIC), the distance in “listen music” is 1, in “listen to music” it is 2, and in “listen to the music” it is 3. Figures 1 and 2 show two spatial patterns extracted from the Google Web 1T corpus for the word pairs (GO, HOME) and (LISTEN, MUSIC).

Figure 1. Spatial pattern of the word pair (GO, HOME) extracted from the Google Web 1T 5-gram corpus. The red lines show the number of appearances of the phrase “go to home”.

Figure 2. Spatial pattern of the word pair (LISTEN, MUSIC) extracted from the Google Web 1T 5-gram corpus. The red lines show the number of appearances of the phrase “listen to music”.
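As a concrete illustration of this summarization step, the sketch below accumulates the (W, Wi, distance) histogram from lines in the Web 1T format (an n-gram, a tab, then a count). The file name and the in-memory dictionary are illustrative simplifications of the Hadoop/Cassandra pipeline described in the abstract, and counting pairs inside every 5-gram over-counts relative to a sentence-level pass; the sketch only shows the shape of the summary, not the exact counting strategy.

    from collections import defaultdict

    def add_ngram(spatial, tokens, count):
        """Add one n-gram's contribution to the (w1, w2, distance) histogram."""
        for i, w1 in enumerate(tokens):
            for j, w2 in enumerate(tokens):
                d = j - i
                if d != 0 and -4 <= d <= 4:      # window D = 8, distances -4..4
                    spatial[(w1.lower(), w2.lower(), d)] += count

    def build_spatial_model(path):
        spatial = defaultdict(int)
        with open(path, encoding="utf-8") as f:
            for line in f:
                ngram, count = line.rstrip("\n").split("\t")
                add_ngram(spatial, ngram.split(), int(count))
        return spatial

    # Hypothetical usage on a Web1T-style 5-gram file:
    # model = build_spatial_model("5gms-0000")
    # print(model[("listen", "music", 2)])   # count of "listen <w> music"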

5.3.2 Features Extraction

We extract features based on the Smadja 1993 spatial model. To maximize coverage of the vocabulary, we use only the second interestingness measure, spread, to find interesting collocations. The first strategy is to extract collocations consisting of a preposition and a non-prepositional word that exhibit a distinct distance pattern. Compared to the second strategy, this first set of features gives 100% coverage with lower precision. The table below shows examples of the features extracted for the word “Listen”.

Features extracted for the word “Listen”.

Word | Preposition | Histogram [distance, count]
Listen | For | [1, 338208], [4, 140275], [-3, 84219], ...
Listen | Of | [4, 404576], [3, 219687], [-3, 75100], ...
Listen | On | [3, 167359], [1, 150102], [2, 121642], ...
Listen | To | [1, 13453974], [-1, 6362783], [2, 358088], ...

The second set of features is similar to the first, but with a different constraint. In this strategy, we limit the spatial pattern to two non-prepositional words that contain a preposition between them. For example, “listen to music” is formulated as (listen, music, d=2, prep=to); “listen to rock music” is formulated as (listen, music, d=3, prep=to) and (listen, rock, d=2, prep=to). This strategy yields higher precision with lower coverage.

6. Classification Model

We follow the model used by Yarowsky (1992) [14] for class-based word sense disambiguation to build our classifier.

Naïve Bayes Classifier

A Naïve Bayes-like classifier is used, which is simply an NB classifier with the prior removed, summing the logs of all the conditional probabilities.
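A minimal sketch of the scoring rule just described (prior dropped, log-probabilities summed); the feature representation and the smoothing constant are illustrative assumptions rather than the system's actual parameters.

    import math
    from collections import defaultdict

    class NaiveBayesLike:
        """Score classes by the sum of log P(feature | class), with no prior."""

        def __init__(self, alpha=1.0):
            self.alpha = alpha                      # add-alpha smoothing (assumed)
            self.feat_counts = defaultdict(lambda: defaultdict(int))
            self.class_totals = defaultdict(int)
            self.vocab = set()

        def train(self, examples):
            # examples: iterable of (features, label), e.g. (["listen", "music_d2"], "to")
            for features, label in examples:
                for f in features:
                    self.feat_counts[label][f] += 1
                    self.class_totals[label] += 1
                    self.vocab.add(f)

        def score(self, features, label):
            denom = self.class_totals[label] + self.alpha * len(self.vocab)
            return sum(
                math.log((self.feat_counts[label][f] + self.alpha) / denom)
                for f in features
            )

        def predict(self, features):
            return max(self.class_totals, key=lambda c: self.score(features, c))

    clf = NaiveBayesLike()
    clf.train([(["listen", "music_d2"], "to"), (["go", "home_d1"], "")])
    print(clf.predict(["listen", "music_d2"]))     # -> "to"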

7. Preliminary Results

We use the aforementioned word pair (DO, FAVOR) as a mining target for preliminary research. The results are promising.

7.1.1 Corpus Comparison using Spatial Patterns

Figure 6. Spatial patterns of (LISTEN, MUSIC) and (GO, HOME) extracted from both Wikipedia (blue) and Web1T (green), with counts normalized to [0, 1] over distances -4 to +4.

Here we show the spatial patterns of two word pairs extracted from Wikipedia and Web1T. The results show similar histogram distributions for the same word pair in the two different free-text databases, indicating that the spatial language model is a solid approach for summarizing free-text databases.

7.1.2 Analysis of (DO, FAVOR)

We first look at the histogram of (DO, FAVOR) generated from the Google Web 1T N-grams database.

Figure 7. Histogram of (DO, FAVOR), showing appearance counts at distances -4 to +4.

In the histogram, we find that distances 2, 3, and 4 are much more frequent than the others. We therefore investigate these distances further using the n-gram models. Figure 8 shows the top 8 n-grams for each of the three distances.

D=2: do not favor; do you favor; do a favor; do the favor; do this favor; do I favor; do they favor; do me favor.
D=3: do yourself a favor; do me a favor; do themselves a favor; do us a favor; do you a favor; do me the favor; do everyone a favor; do him a favor.
D=4: do us all a favor; do the world a favor; do yourself a big favor; do yourself a huge favor; do me a big favor; do us both a favor; do your self a favor; do me a huge favor.

Figure 8. Detailed analysis of the 3-, 4-, and 5-grams of the word pair (DO, FAVOR) at the corresponding distances of 2, 3, and 4.

Here we find the results consistent with the assumptions made in Section 5.3. Furthermore, the results show concentrated peaks in the frequency distribution. This strongly suggests that the proposed method can be effective.

8. Evaluation

In the implementation of the proposed method, we use the higher-quality Brown corpus for training the specialized POS tagger and the preposition prediction model. For the correction phase, we use the entire Google Web 1T N-gram corpus. For evaluation, we use a separate part of the Brown corpus with no POS tags, with all prepositions removed.

8.1 Preposition Appearances Detection

We use HMM and Yamcha (SVM-based) to train the specialized POS tagger, which targets text with all prepositions removed. As shown in the following table, similar results are achieved. The SVM-based Yamcha tagger takes much longer to train and yields slightly better precision.

Model | Precision
Yamcha | 93.17%
Yamcha (approximate) | 92.76%
HMM | 92.94%

Specialized part-of-speech tagger using different models.

We use CRF and Yamcha to train the preposition prediction tagger; both achieve over 95% precision. However, the ratio between preposition and non-preposition words is biased by about eight to one, and we are targeting finding all prepositions, so we evaluate the results using precision and recall at different thresholds, as shown below. Yamcha took nearly 50 hours to train, while CRF took only 20 minutes and achieved similar results.

Results of preposition prediction using different models.

8.2 Preposition Appearances Selection

We evaluate the two aforementioned feature selection strategies as well as a combined approach.

Method | Precision | Coverage
Feature Set 1 | 46.8% | 100%
Feature Set 2 | 57.4% | 56.1%
Feature Set 1+2 | 51.4% | 100%
Frequency Baseline | 20% | 100%
Human | 75% | 100%

Results for preposition selection. Approach 2 yields about ten percentage points higher precision than approach 1, but covers only about half of the missing prepositions. By combining the two approaches, 51.4% precision is achieved with 100% coverage.

9. Conclusion

We present a novel and truly web-scale approach to prepositional error detection and correction, or candidate selection. By using data mining techniques to find spatial relations of words with identifiable distance patterns, we are able to extract from an enormous web-scale database only a small, highly concentrated portion for preposition selection, while still maintaining a large vocabulary coverage. Compared to previous web-scale methods, our system achieves similar precision with full coverage (51.4% vs. 58%), but with a much smaller runtime database (~500 MB vs. >10 GB). Even though we cannot yet outperform previous non-web-scale methods, our system proves to be lightweight and effective.

10. REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/

[2] Dean, J. and Ghemawat, S. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM (2008).

[3] Google Web 1T 5-gram, the Linguistic Data Consortium. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

[4] Wikipedia: Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download

[5] British National Corpus. http://www.natcorp.ox.ac.uk/

[6] WSJ Corpus, the Linguistic Data Consortium. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T43

[7] Giles, J. 2005. Internet encyclopaedias go head to head. Nature (2005).

[8] Google, Freebase Wikipedia Extraction (WEX). http://download.freebase.com/wex/

[9] Fellbaum, C. 1998. WordNet: An electronic lexical database. The MIT Press (1998).

[10] Roget, P. M. 1852. Roget's Thesaurus. http://poets.notredame.ac.jp/Roget/contents.html

[11] Smadja, F. 1993. Retrieving collocations from text: Xtract. Computational Linguistics (1993).

[12] Chang, F., Dean, J., Ghemawat, S., et al. 2008. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (Volume 26, Issue 2, June 2008); also in the 7th USENIX Symposium on Operating Systems Design and Implementation (2006), pp. 205-218.

[13] Apache Cassandra. http://cassandra.apache.org/

[14] Yarowsky, D. 1992. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. Proceedings of the 14th Conference on Computational Linguistics (COLING), Volume 2, pp. 454-460.

[15] Gamon, M. and Leacock, C. 2010. Search right and thou shalt find... Using web queries for learner error detection. Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 37.

[16] Elghafari, A., Meurers, D., and Wunsch, H. 2010. Exploring the data-driven prediction of prepositions in English. COLING 2010.

[17] Tetreault, J. R. and Chodorow, M. 2008. The ups and downs of preposition error detection in ESL writing. COLING 2008.

[18] Bergsma, S. and Lin, D. 2009. Web-scale N-gram models for lexical disambiguation. Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09).
