Methods for detection of word usage over time
Ondřej Herman
FI MUNI

7. 12. 2013

Ondřej Herman (FI MUNI)

Detection of word usage over time

7. 12. 2013

1 / 19


Motivation

natural language is not a static object
word usage changes over time
natural language corpora provide relevant data

[Figure: yearly occurrences of the word 'ant'. Panels: (a) OEC, (b) BNC, (c) Google ngrams]

difficult to interpret

Overview

classical least-squares regression analysis
robust regression methods


Linear regression

[Figure: 'slight' - Google ngrams]

simple linear model: $y = a + bx + \varepsilon$
regression line calculated using the least-squares method, that is, by minimizing the value of $e = \sum_{i=1}^{n} \varepsilon_i^2$
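A minimal sketch of the least-squares fit for the simple linear model, using the closed-form solution for $a$ and $b$. The function names and the yearly frequencies are made up for illustration and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """The quantity e that the least-squares method minimises."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical yearly frequencies of one word (invented numbers).
years = [2000, 2001, 2002, 2003, 2004]
freqs = [10.0, 12.0, 11.0, 14.0, 15.0]

a, b = fit_line(years, freqs)
e = sum_squared_residuals(years, freqs, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, e = {e:.3f}")
```

The slope `b` is the quantity of interest here: its sign and size describe how the usage of the word changes per year.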

Linear regression

[Figure: 'slight' - Google ngrams]

polynomial model
coefficient of determination ($R^2$)
adjusted $R^2$

Weighted linear regression

[Figure: 'evil' - British National Corpus. Panels: (a) W = total counts, (b) W = log(total counts)]

linear model
directly using the total counts as the weights skews the results
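A small sketch of weighted least squares, comparing raw total counts with their logarithms as weights. It assumes the weights enter the objective as $\sum_i w_i (y_i - a - b x_i)^2$; the slides do not spell out the exact weighting, and all numbers below are invented.

```python
import math


def fit_weighted_line(xs, ys, ws):
    """Return (a, b) minimising sum(w * (y - a - b*x)**2)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: relative frequency per year and corpus size per year.
years = [1970, 1975, 1980, 1985, 1990]
freqs = [2.0, 2.5, 2.2, 3.0, 4.5]
totals = [1_000, 5_000, 20_000, 400_000, 9_000_000]

# (a) W = total counts: the largest years dominate the fit almost completely.
a_raw, b_raw = fit_weighted_line(years, freqs, totals)

# (b) W = log(total counts): large years still count more, but far less so.
a_log, b_log = fit_weighted_line(years, freqs, [math.log(t) for t in totals])

print(f"slope with raw weights: {b_raw:.3f}")
print(f"slope with log weights: {b_log:.3f}")
```

With the raw counts, the two most recent years carry nearly all of the weight, so the fitted slope is essentially the slope between those two points.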

Weighted linear regression

[Figure: 'Chernobyl' - Google ngrams. Panels: (a) adjusted $R^2$, (b) model with maximal $R^2_{adj}$]

$R^2$, the coefficient of determination, is the fraction of variance explained by the regression model
$R^2$ increases with the degree of the regression model
kitchen sink regression
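A short sketch of $R^2$ and adjusted $R^2$. The adjustment used is the standard textbook formula $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ for $n$ observations and $p$ predictors; this is assumed to be the definition the slides rely on. Function names are illustrative.

```python
def r_squared(ys, predictions):
    """Fraction of the variance of ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalise R^2 for the number of predictors p (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# The same raw R^2 is worth less when it took more parameters to reach it.
n = 20
for degree in (1, 3, 9):
    print(degree, round(adjusted_r_squared(0.80, n, degree), 3))
```

For a polynomial model of degree $k$ there are $p = k$ predictors, so choosing the degree by maximal adjusted $R^2$ counteracts the automatic growth of $R^2$ with the degree.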

Weighted linear regression

[Figure: 'slight' - Google ngrams]

linear model
logarithmic transformation of frequencies


Linear regression - significance testing

t-test
    tests the significance of a single regression coefficient

F-test
    tests the significance of the whole model


Linear regression - significance testing

[Figure: example F-test p-values. Panels: (a) 'steep', Oxford English Corpus, p = 0.414; (b) 'carrot', Google ngrams, p = $4.3 \times 10^{-10}$]

$H_0$: the mean predicts the behavior of the series well
$H_1$: the given regression model predicts the behavior well
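A sketch of the F statistic that compares the regression model against the mean-only model, in its usual form $F = \frac{(SS_{tot} - SS_{res}) / p}{SS_{res} / (n - p - 1)}$. The data and the fitted values are invented. Turning $F$ into a p-value needs the F distribution (for example `scipy.stats.f.sf(F, p, n - p - 1)`), which is left out here to keep the sketch free of dependencies.

```python
def f_statistic(ys, predictions, p):
    """F statistic of a model with p predictors against the mean-only model."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                   # error of H0
    ss_res = sum((y - q) ** 2 for y, q in zip(ys, predictions))   # error of H1
    return ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))


# Hypothetical observations and the values predicted by a fitted line.
freqs = [10.0, 12.0, 11.0, 14.0, 15.0]
fitted = [10.0, 11.2, 12.4, 13.6, 14.8]

F = f_statistic(freqs, fitted, p=1)
print(f"F = {F:.3f} with (1, {len(freqs) - 2}) degrees of freedom")
```

A large $F$ means the regression model reduces the squared error far more than chance would explain, which corresponds to a small p-value and a rejection of $H_0$.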

Robust regression

Moore-Wallis test
Mann-Kendall test
Spearman's ρ
Theil-Sen method


Moore-Wallis test

also known as the sign-difference test

[Figure: two example series]

no trend is detected in the first series, a downward trend is detected in the second series
asymptotically optimal
on short series the power of the test is low
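A minimal sketch of the sign-difference test. It counts the positive first differences and compares the count with its distribution under the no-trend hypothesis, which for $n$ independent observations without ties has mean $(n-1)/2$ and variance $(n+1)/12$. The normal approximation and the example series are assumptions made for illustration.

```python
import math


def difference_sign_test(ys):
    """Return (positives, z, two-sided p-value) for the sign-difference test."""
    n = len(ys)
    # Ties (zero differences) are simply not counted as positive here.
    positives = sum(1 for prev, cur in zip(ys, ys[1:]) if cur > prev)
    mean = (n - 1) / 2
    variance = (n + 1) / 12
    z = (positives - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
    return positives, z, p_value


# Hypothetical series: one oscillating without trend, one steadily falling.
oscillating = [1, 3, 2, 4, 3, 5, 4, 6, 5]
falling = list(range(16, -1, -1))

print(difference_sign_test(oscillating))
print(difference_sign_test(falling))
```

Only the signs of consecutive differences are used, so a slow trend hidden under strong oscillation goes unnoticed. This is one way to see why the power of the test is low on short series.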

Theil-Sen estimator

defined as the median of the pairwise slopes of the samples:

$$\hat{b} = \operatorname{med} \left\{ \frac{y_i - y_j}{x_i - x_j} \right\}, \quad i \neq j$$

[Figure: Behavior of the Theil-Sen estimator for words encountered in the British National Corpus. Panels: (a) 'spice', (b) 'snow']
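A direct sketch of the estimator as the median of all pairwise slopes. The intercept is taken as the median of $y_i - \hat{b} x_i$, which is a common choice but an assumption here, since the slides define only the slope. The data are invented.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Return (intercept, slope) using the median of pairwise slopes."""
    slopes = [
        (yj - yi) / (xj - xi)
        for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2)
        if xj != xi  # pairs with equal x have no defined slope
    ]
    slope = median(slopes)
    # Assumed intercept rule: median of the residual offsets.
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope


# Hypothetical series with one gross outlier in the last position.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 2, 3, 100]

print(theil_sen(xs, ys))
```

The single outlier affects only four of the ten pairwise slopes, so the median stays at the slope of the remaining points, whereas a least-squares fit would be pulled strongly upward.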

Mann-Kendall test

used to test the significance of a regression model fitted using the Theil-Sen estimator

$$S = \sum_{i=1}^{n} \sum_{j=1}^{i} \operatorname{sgn}(x_i - x_j) \operatorname{sgn}(y_i - y_j)$$

[Figure: Words from the British National Corpus tested using the Mann-Kendall test with the trend line fitted using the Theil-Sen estimator. Panels: (a) 'oil', p = 0.021; (b) 'disk', p = 0.009; (c) 'slow', p = 0.821]
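A sketch of the Mann-Kendall statistic $S$ and an approximate p-value. The variance $n(n-1)(2n+5)/18$ holds when there are no ties, and the normal approximation with continuity correction is the usual choice for moderately long series; both are assumptions, as the slides do not state how the p-values were computed.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def mann_kendall(xs, ys):
    """Return (S, z, two-sided p-value); assumes no ties in xs or ys."""
    n = len(ys)
    s = sum(
        sign(xs[i] - xs[j]) * sign(ys[i] - ys[j])
        for i in range(n)
        for j in range(i)
    )
    variance = n * (n - 1) * (2 * n + 5) / 18
    # Continuity correction: shrink |S| by one before standardising.
    if s > 0:
        z = (s - 1) / math.sqrt(variance)
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p_value


# Hypothetical yearly frequencies with a noisy upward drift.
years = list(range(1980, 1990))
freqs = [2.1, 2.4, 2.2, 2.9, 3.1, 3.0, 3.6, 3.8, 4.1, 4.0]

print(mann_kendall(years, freqs))
```

$S$ counts concordant minus discordant pairs, so it depends only on the ordering of the observations and not on their magnitudes.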

Spearman’s ρ

calculated as the correlation coefficient of a linear model obtained by using the rank of the observations instead of the actual value
yields almost the same results as the Mann-Kendall test
the distribution of the test scores is more difficult to calculate
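A sketch of Spearman's ρ computed as described: replace each observation by its rank, then take the ordinary correlation coefficient of the ranks. Giving tied values their average rank is a common convention and an assumption here. The example data are invented.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(us, vs):
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    norm_u = math.sqrt(sum((u - mean_u) ** 2 for u in us))
    norm_v = math.sqrt(sum((v - mean_v) ** 2 for v in vs))
    return cov / (norm_u * norm_v)


def spearman_rho(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))


# A monotone but clearly non-linear hypothetical series.
years = [1, 2, 3, 4, 5]
freqs = [1, 4, 9, 16, 25]
print(spearman_rho(years, freqs))
```

Because only ranks are used, any strictly increasing series gives the maximal value of ρ, regardless of the shape of the increase.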


Slope normalization

the slope estimates are not directly comparable, they need to be normalized

$$d = \frac{\hat{b}}{\bar{y}}$$

where $\hat{b}$ is the estimated slope and $\bar{y}$ is the mean of $y$, the observed frequencies.

On the next slide: the slopes obtained from Google ngrams of the 50 most common words from the Oxford English Corpus, ordered by the slope relative to the mean, $d$

word    d        $\hat{b}$    word    d        $\hat{b}$
which   −1.256   −36.231      in      −0.302   −50.594
been    −0.862   −13.977      have    −0.249   −6.764
his     −0.836   −23.698      who     −0.209   −2.853
he      −0.804   −20.531      are     −0.206   −8.2
It      −0.744   −9.081       more    −0.2     −3.146
were    −0.713   −16.541      from    −0.178   −6.051
be      −0.69    −33.798      to      0.0      0.0
by      −0.669   −31.319      and     0.0      0.0
there   −0.645   −7.4         a       0.0      0.0
was     −0.601   −31.296      for     0.0      0.0
has     −0.572   −9.966       on      0.0      0.0
of      −0.527   −171.39      with    0.0      0.0
had     −0.512   −12.061      as      0.0      0.0
would   −0.5     −7.105       an      0.0      0.0
all     −0.496   −8.423       or      0.0      0.0
but     −0.451   −8.853       they    0.0      0.0
one     −0.427   −7.862       we      0.0      0.0
not     −0.422   −14.476      their   0.0      0.0
the     −0.4     −194.423     said    0.0      0.0
it      −0.381   −15.137      that    0.098    7.687
will    −0.359   −4.771       up      0.279    2.488
is      −0.346   −30.998      I       0.488    15.796
at      −0.338   −10.113      about   0.504    5.583
The     −0.326   −20.124      can     0.678    10.541
this    −0.308   −8.888       you     1.523    23.465

The slopes obtained from Google ngrams of the 50 most common words
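A tiny sketch of the normalization $d = \hat{b} / \bar{y}$, showing why raw slopes of frequent and rare words cannot be compared directly. The two series and their slopes are invented, and any scaling factor applied to $d$ in the table is not reproduced.

```python
def normalized_slope(slope, ys):
    """Slope relative to the mean observed frequency."""
    mean_y = sum(ys) / len(ys)
    return slope / mean_y


# Two hypothetical words, both growing by the same proportion per step.
common_freqs, common_slope = [100.0, 110.0, 120.0], 10.0
rare_freqs, rare_slope = [1.0, 1.1, 1.2], 0.1

d_common = normalized_slope(common_slope, common_freqs)
d_rare = normalized_slope(rare_slope, rare_freqs)
print(f"raw slopes: {common_slope} vs {rare_slope}")
print(f"normalized: {d_common:.4f} vs {d_rare:.4f}")
```

The raw slopes differ by a factor of one hundred, while the normalized values agree, which is the property that makes ordering words by $d$ meaningful.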

Future work

anomaly detection
piecewise linear model


Conclusion

the Mann-Kendall test together with the Theil-Sen estimator gives the best results
the standard linear regression model gives satisfactory results most of the time

