Introduction General intrinsic properties Text cleaning and processing Wordlist-based methods Syntactic functions Conclusions

Intrinsic Methods for Comparison of Corpora
Vít Baisa and Vít Suchomel
Natural Language Processing Centre, Faculty of Informatics, Masaryk University

December 6, 2013

A need for comparison of corpora

There are large textual corpora from the web... but do we know what is inside?

Which corpus is generally better? Comparison based on inner properties of corpora.

Which corpus is better for a specific task? Comparison based on external use of corpus data.

Intrinsic vs. extrinsic comparison

The paper describes 8 intrinsic methods of corpus comparison divided into the following groups: general intrinsic properties, text cleaning and processing, wordlist-based methods and syntactic analysis.

Extrinsic methods will be explored in a future paper.

Data used in the experiment

The proposed methods were applied to compare two recent very large web-based Czech text corpora: Hector (Spoustová et al., 2010) and czTenTen12 (Suchomel, 2012).

Most of the presented methods are language independent, but both corpora being compared must be in the same language.

Size

A simple rule: the bigger the better, because we need very large corpora to provide evidence about rare phenomena (Pomikálek et al., 2009).

The measured numbers of words, tokens and sentences depend on the tokenization and sentence detection algorithms used for processing the corpus data.

CORPUS       BYTES   TOKENS     WORDS      SENTENCES
Hector       17 GB   3.285 bn   2.607 bn   219 m
czTenTen12   31 GB   5.437 bn   4.458 bn   303 m

Diversity of sources

The more diverse the sources of the data, the better coverage of the language by the corpus may be expected.

Hector: constructed from manually selected web sites with large textual content of good enough quality (e.g. news servers, blog sites, discussion fora).

czTenTen12: a general Czech web crawl.

Constraining the sources of a monolingual corpus to the corresponding national TLD is useful in the case of Czech.

CORPUS       PAGES       DOMAINS   AVG   MED   TLDS
czTenTen12   9,747,315   233,122   42    4     97.6 % cz
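The figures in this table can be approximated from a list of page URLs. The sketch below is only an illustration, not the tool used for the experiment: it treats the URL hostname as the "domain" and reads AVG and MED as the mean and median number of pages per domain, both of which are assumptions. The function name `source_diversity` is hypothetical.

```python
from collections import Counter
from statistics import mean, median
from urllib.parse import urlsplit


def source_diversity(urls):
    """Summarise how pages are spread over domains and top-level domains.

    Assumption: the hostname of each URL is taken as its domain.
    """
    hosts = [(urlsplit(u).hostname or "").lower() for u in urls]
    hosts = [h for h in hosts if h]
    if not hosts:
        raise ValueError("no URL with a hostname was given")

    pages_per_domain = Counter(hosts)
    # The TLD is the last dot-separated label of the hostname.
    tlds = Counter(h.rsplit(".", 1)[-1] for h in hosts)
    counts = list(pages_per_domain.values())

    return {
        "pages": len(hosts),
        "domains": len(pages_per_domain),
        "avg": mean(counts),      # mean pages per domain
        "med": median(counts),    # median pages per domain
        "tld_share": {t: n / len(hosts) for t, n in tlds.items()},
    }


if __name__ == "__main__":
    sample = [
        "http://a.cz/1", "http://a.cz/2", "http://a.cz/3",
        "http://b.cz/1", "http://c.com/x",
    ]
    print(source_diversity(sample))
```

A large gap between the mean and the median, as in the table, suggests that a few domains contribute a large share of the pages.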

Sentence length

Sentence border detection: different solutions observed.

[Figure: distribution of sentence lengths in czTenTen12 and Hector; x-axis: words (0 to 250); y-axis: millions of sentences (logarithmic scale)]

Data duplicity

The fewer duplicate texts in a corpus, the better. However, very strict deduplication needlessly removes usable data.

Hector: paragraphs containing more than 30% seen 8-grams were removed.

czTenTen12: paragraphs containing more than 50% seen 7-grams were removed.

onion was used to remove sentences consisting of 50% seen 5-grams (with smoothing disabled).

CORPUS       BYTES     TOKENS    SENTENCES
Hector       -23.3 %   -25.8 %   -23.6 %
czTenTen12   -17.6 %   -18.7 %   -18.4 %
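The idea of removing text units with too many already seen n-grams can be sketched as follows. This is a simplified illustration and not the algorithm implemented in onion: whitespace tokenization, the choice to record n-grams only from kept units, and the names `ngrams` and `deduplicate` are all assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(units, n=7, threshold=0.5):
    """Keep units (paragraphs or sentences) whose share of seen n-grams
    does not exceed `threshold`.

    Units shorter than n tokens have no n-grams and are always kept.
    """
    seen = set()
    kept = []
    for unit in units:
        grams = ngrams(unit.split(), n)  # assumption: whitespace tokens
        if grams:
            dup_ratio = sum(g in seen for g in grams) / len(grams)
            if dup_ratio > threshold:
                continue  # too much duplicated content, drop the unit
        kept.append(unit)
        seen.update(grams)  # assumption: only kept units are remembered
    return kept


if __name__ == "__main__":
    texts = ["a b c d e f g h", "a b c d e f g h", "x y z"]
    print(deduplicate(texts, n=3, threshold=0.5))
```

With `n=8, threshold=0.3` or `n=7, threshold=0.5` the parameters mirror the settings reported for the two corpora; a lower threshold removes more text and is therefore the stricter setting.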

The test

The fewer paragraphs full of text in an unwanted language, the better. However, some level of foreign words cannot be avoided, e.g. in developers' blogs, movie or music reviews.

[Figure: positions (log10 scale, 10 to 10M) of case variants of the word "the" (the, The, THE, THe, tHe, ThE) in czTenTen12 and Hector]
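One way to reproduce such a measurement is to look up where the case variants of "the" fall in a frequency-sorted wordlist. The sketch below assumes that "position" means rank in a wordlist ordered by descending frequency; the function name `variant_ranks` is hypothetical.

```python
from collections import Counter


def variant_ranks(tokens, word="the"):
    """Return the 1-based frequency rank of every case variant of `word`.

    Assumption: rank 1 is the most frequent word form in the corpus.
    """
    freq = Counter(tokens)
    ranked = [w for w, _ in freq.most_common()]
    return {w: i + 1 for i, w in enumerate(ranked) if w.lower() == word}


if __name__ == "__main__":
    sample = "a a a a the the the The The b".split()
    print(variant_ranks(sample))
```

The higher the variants of "the" rank in a Czech corpus, the more English text is likely to have passed the language filter.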

Filtering wordlists

Unknown words were filtered out of the corpus wordlists by a morphological analyser. The larger the remaining part of the wordlist, the better. The fast Czech analyser Majka was used.

[Figure: % of wordlist (y-axis) by size of wordlist (x-axis, 100 to 10M) for Hector and czTenTen12]
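The share of known words at increasing wordlist sizes can be computed as sketched below. The analyser is represented by a caller-supplied predicate `is_known`; this is a stand-in and not Majka's actual interface, and the function name `known_share` is hypothetical.

```python
from itertools import accumulate


def known_share(wordlist, is_known, sizes):
    """Share of known words among the top-n entries of a wordlist.

    `wordlist` must be sorted by descending frequency. `is_known` is a
    hypothetical predicate standing in for a morphological analyser.
    Sizes larger than the wordlist are skipped.
    """
    cumulative = list(accumulate(1 if is_known(w) else 0 for w in wordlist))
    return {n: cumulative[n - 1] / n
            for n in sizes if 0 < n <= len(cumulative)}


if __name__ == "__main__":
    lexicon = {"a", "je", "se"}          # toy lexicon for illustration
    words = ["a", "je", "xx", "se", "qq"]
    print(known_share(words, lexicon.__contains__, [1, 2, 4, 5, 10]))
```

Running it on the wordlists of two corpora with the same list of sizes gives two curves that can be compared point by point.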

Keyword comparison

Following Kilgarriff's work, lowercase keywords were extracted to reveal in which words these corpora differ the most. Both recent web corpora contain more data from internet message boards and fewer news documents than the Czech National Corpus.

Hector vs. SYN2000: taky, teda, ahoj, holky, mám, fakt, moc, sem, dneska, takže, blog, nevím, máš, super, ráda, ahojky (discussions of women).

czTenTen12 vs. SYN2000: taky, můžete, moc, děkuji, takže, cca, mám, dobrý, opravdu, dle, ahoj, bych, jestli, díky, hodně, super (discussions).

SYN2000 vs. Hector and czTenTen12: praha, včera, korun, procent, české, vlády, státní, miliónů, zákona, trhu, ministr, ředitel, výstava, společnost, nato, prezident, čtk (standard language, news, Prague).
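A minimal sketch of keyword extraction between a focus and a reference corpus follows. The slides do not state the exact formula; the score used here, the ratio of smoothed per-million frequencies, is an assumption based on Kilgarriff's "simple maths" approach, and the smoothing value and the name `keywords` are illustrative.

```python
from collections import Counter


def keywords(focus_tokens, ref_tokens, smoothing=100.0, top=10):
    """Rank lowercase words of the focus corpus by keyness.

    Assumed score: (fpm_focus + smoothing) / (fpm_ref + smoothing),
    where fpm is the frequency per million tokens.
    """
    focus = Counter(t.lower() for t in focus_tokens)
    ref = Counter(t.lower() for t in ref_tokens)
    focus_total = sum(focus.values())
    ref_total = sum(ref.values())
    if focus_total == 0 or ref_total == 0:
        raise ValueError("both corpora must contain at least one token")

    def score(word):
        focus_pm = focus[word] * 1e6 / focus_total
        ref_pm = ref[word] * 1e6 / ref_total  # 0 for unseen words
        return (focus_pm + smoothing) / (ref_pm + smoothing)

    return sorted(focus, key=score, reverse=True)[:top]


if __name__ == "__main__":
    web = "Ahoj ahoj ahoj praha".split()
    news = "Praha praha praha ahoj".split()
    print(keywords(web, news))
    print(keywords(news, web))
```

A larger smoothing value favours more frequent words, while a small one lets rare words dominate the list.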

Syntactic functions

Syntactically correct sentences are good. The presence of the main syntactic roles, subject and predicate, was checked. That rules out web garbage (navigation and labels, tables, program code, SEO keywords, link spam, generated texts, ...) but also syntactically problematic but otherwise quite common and understandable sentences. SET was used to carry out the experiment.

CORPUS       NCL      NSEN     PNSEN
Hector       36.6 %   19.0 %   23.7 %
czTenTen12   39.6 %   23.6 %   29.2 %

Future work

Explore other intrinsic methods: perplexity of language models, finding topics, measuring homogeneity and heterogeneity.

Develop extrinsic methods: word sketch evaluation (submitted to LREC 2014), morphological segmentation (Morfessor).

Conclusion

Eight methods for a general systematic comparison of text corpora were developed and described. The methods were applied to two recent very large Czech web corpora.

The related tools can be downloaded from the website of the project: http://nlp.fi.muni.cz/projekty/corpora_comparison

Intrinsic Methods for Comparison of Corpora - raslan 2013
