simple linear model: $y = a + bx + \varepsilon$
regression line calculated using the least-squares method, that is, by minimizing the value of $e = \sum_{i=1}^{n} \varepsilon_i^2$
Ondřej Herman (FI MUNI)
Detection of word usage over time
7. 12. 2013
4 / 19
Linear regression
[Figure: ’slight’ - Google ngrams, 1980 to 2000]
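A minimal sketch of the least-squares fit for the simple linear model, using only the Python standard library. The function name `least_squares_line` and the sample frequencies are illustrative assumptions and are not taken from the slides or from the Google ngrams data.

```python
def least_squares_line(xs, ys):
    """Return (a, b) of the line y = a + b*x minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b


# Hypothetical yearly frequencies (assumption, for illustration only).
years = [1980, 1985, 1990, 1995, 2000]
freqs = [30.0, 26.0, 21.0, 17.0, 11.0]
a, b = least_squares_line(years, freqs)
print(f"slope per year: {b:.3f}")
```

The sign of the slope `b` indicates whether the usage of the word is rising or falling over the observed period.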
polynomial model
coefficient of determination ($R^2$)
adjusted $R^2$
linear model
directly using the total counts as the weights skews the results
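To illustrate how weights can skew a fit, here is a sketch of weighted least squares, which minimizes $\sum_i w_i \varepsilon_i^2$. The function name, the data, and the weights are hypothetical; the slides do not state which weighting scheme was finally used.

```python
def weighted_least_squares_line(xs, ys, ws):
    """Return (a, b) of y = a + b*x minimizing sum(w_i * residual_i**2)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical series: flat at first, with a jump in the last period.
xs = [0, 1, 2, 3]
ys = [0.0, 0.0, 0.0, 3.0]

# Equal weights reproduce the ordinary least-squares fit.
_, slope_equal = weighted_least_squares_line(xs, ys, [1, 1, 1, 1])

# Assumed "total counts" that grow sharply over time: the last two
# periods dominate, so the fitted line nearly passes through them.
_, slope_counts = weighted_least_squares_line(xs, ys, [1, 1, 1000, 1000])

print(slope_equal, slope_counts)
```

With the raw counts as weights, the early observations barely influence the line, which is one way the results can be skewed.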
Weighted linear regression
[Figure: ’Chernobyl’ - Google ngrams. (a) adjusted $R^2$; (b) model with maximal $R^2_{adj}$, 1980 to 2000]
$R^2$, the coefficient of determination, is the fraction of variance explained by the regression model
$R^2$ increases with the degree of the regression model
kitchen sink regression
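The following sketch shows why plain $R^2$ cannot be used to choose the polynomial degree, and how adjusted $R^2$ penalizes extra terms. It assumes the usual definition $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ with $p$ equal to the polynomial degree; all function names and the data are illustrative assumptions.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    size = degree + 1
    normal = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(normal, rhs)


def r_squared(xs, ys, coeffs):
    """Fraction of variance explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )
    return 1.0 - ss_residual / ss_total


def adjusted_r_squared(r2, n, degree):
    """Penalize R^2 for the number of fitted terms (degree = number of predictors)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - degree - 1)


# Hypothetical data: a quadratic trend on centered years plus small noise.
xs = list(range(-5, 6))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.3, 0.4, -0.1, 0.2, -0.2]
ys = [x * x + e for x, e in zip(xs, noise)]

r2_by_degree = {}
adjusted_by_degree = {}
for degree in range(1, 5):
    coeffs = fit_polynomial(xs, ys, degree)
    r2 = r_squared(xs, ys, coeffs)
    r2_by_degree[degree] = r2
    adjusted_by_degree[degree] = adjusted_r_squared(r2, len(xs), degree)

best_degree = max(adjusted_by_degree, key=adjusted_by_degree.get)
print(best_degree, adjusted_by_degree[best_degree])
```

Centering the years before fitting, as in the sample data, keeps the normal equations reasonably well conditioned; fitting raw calendar years at higher degrees would need a more careful solver.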
[Figure: example F-test p-values. (a) ’steep’, Oxford English Corpus, $p = 0.414$; (b) ’carrot’, Google ngrams, $p = 4.3 \times 10^{-10}$]
$H_0$: the mean predicts the behavior of the series well
$H_1$: the given regression model predicts the behavior well
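A sketch of the F statistic that compares the two hypotheses for a simple linear model: the mean-only model ($H_0$) against the fitted line ($H_1$). The function name and the data are assumptions. The p-value is not computed here, because it requires the F distribution; with SciPy installed it could be obtained as `scipy.stats.f.sf(f_value, df_model, df_residual)`.

```python
def f_statistic_linear(xs, ys):
    """Return (F, df_model, df_residual) for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    fitted = [mean_y + b * (x - mean_x) for x in xs]

    ss_total = sum((y - mean_y) ** 2 for y in ys)                # error of the mean (H0)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error of the line (H1)

    df_model = 1         # one predictor
    df_residual = n - 2  # observations minus fitted parameters
    # Note: a perfect fit (ss_residual == 0) would divide by zero here.
    f_value = ((ss_total - ss_residual) / df_model) / (ss_residual / df_residual)
    return f_value, df_model, df_residual


# Hypothetical yearly frequencies (assumption, for illustration only).
years = [1980, 1985, 1990, 1995, 2000]
freqs = [30.0, 26.0, 21.0, 17.0, 11.0]
f_value, df_model, df_residual = f_statistic_linear(years, freqs)
print(f_value, df_model, df_residual)
```

A large F value means the line explains far more variance than the mean alone, which corresponds to a small p-value and rejection of $H_0$.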
Robust regression
Moore-Wallis test
Mann-Kendall test
Spearman’s ρ
Theil-Sen method
Moore-Wallis test
also known as the sign-difference test
[Figure: two example series]
no trend is detected in the first series, a downward trend is detected in the second series
asymptotically optimal
on short series the power of the test is low
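A minimal sketch of the sign-difference test. It counts the positive first differences and compares the count with its expectation under the no-trend hypothesis, $(n - 1)/2$, with variance $(n + 1)/12$, using a normal approximation. The function name and both series are hypothetical and are not the series plotted on the slide; ties between neighbouring values are assumed not to occur.

```python
import math


def sign_difference_test(series):
    """Return (positive_count, z, two_sided_p) for the sign-difference test."""
    n = len(series)
    diffs = [later - earlier for earlier, later in zip(series, series[1:])]
    positives = sum(1 for d in diffs if d > 0)
    expected = (n - 1) / 2   # mean of the count when there is no trend
    variance = (n + 1) / 12  # variance of the count when there is no trend
    z = (positives - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return positives, z, p_value


# Hypothetical series (assumptions, for illustration only).
falling = list(range(16, -1, -1))           # steadily decreasing
zigzag = [5 + (i % 2) for i in range(17)]   # alternating up and down

print(sign_difference_test(falling))
print(sign_difference_test(zigzag))
```

Because only the signs of neighbouring differences are used, a series that rises overall while oscillating from point to point can go undetected, which is consistent with the low power noted above.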
Theil-Sen estimator
defined as the median of the pairwise slopes of the samples:
$b' = \operatorname{med}_{i<j} \frac{y_j - y_i}{x_j - x_i}$
[Figure: Behavior of the Theil-Sen estimator for words encountered in the British National Corpus]
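The definition translates almost directly into code. This sketch computes the median of all pairwise slopes with the standard library; the function name and the data, including the deliberately planted outlier, are assumptions for illustration.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(xs, ys):
    """Median of the slopes between every pair of points with distinct x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return median(slopes)


# Hypothetical data on the line y = 2x + 1, with one gross outlier at x = 3.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 100, 9, 11, 13]
slope = theil_sen_slope(xs, ys)
print(slope)
```

Only 6 of the 21 pairwise slopes involve the outlier, so the median still recovers the slope of the underlying line; a least-squares fit on the same data would be pulled towards the outlier.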
Mann-Kendall test
used to test the significance of a regression model fitted using the Theil-Sen estimator
$S = \sum_{i=1}^{n} \sum_{j=1}^{i} \operatorname{sgn}(x_i - x_j) \operatorname{sgn}(y_i - y_j)$
[Figure: Words from the British National Corpus tested using the Mann-Kendall test with the trend line fitted using the Theil-Sen estimator, 1976 to 1992. (a) ’oil’, p = 0.021; (b) ’disk’, p = 0.009; (c) ’slow’, p = 0.821]
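A sketch of the $S$ statistic with a p-value obtained from the common normal approximation. The approximation, the variance formula $n(n-1)(2n+5)/18$ (valid without ties), and the continuity correction are assumptions here; the slides do not say how the p-values were computed. The function names and the data are hypothetical.

```python
import math


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def mann_kendall(xs, ys):
    """Return (S, z, two_sided_p) for the Mann-Kendall trend test."""
    n = len(xs)
    # Pairs with j == i contribute zero, so only j < i is enumerated.
    s = sum(
        sign(xs[i] - xs[j]) * sign(ys[i] - ys[j])
        for i in range(n)
        for j in range(i)
    )
    variance = n * (n - 1) * (2 * n + 5) / 18  # assumes no tied values
    if s > 0:
        z = (s - 1) / math.sqrt(variance)      # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p_value


# Hypothetical steadily rising frequencies over ten years.
years = list(range(1983, 1993))
rising = [1.0, 1.2, 1.5, 1.9, 2.0, 2.6, 3.1, 3.3, 4.0, 4.4]
s, z, p_value = mann_kendall(years, rising)
print(s, z, p_value)
```

For short series an exact distribution of $S$ is generally preferable to the normal approximation used in this sketch.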
Spearman’s ρ
calculated as the correlation coefficient of a linear model obtained by using the rank of the observations instead of the actual value
yields almost the same results as the Mann-Kendall test
the distribution of the test scores is more difficult to calculate
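A sketch of the rank-based calculation: both variables are replaced by their ranks and the ordinary correlation coefficient is computed on the ranks. Giving tied values their average rank is an assumption of this sketch, and the function names and data are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Ordinary (Pearson) correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Correlation coefficient computed on ranks instead of raw values."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical monotone but strongly non-linear growth.
years = [1976, 1980, 1984, 1988, 1992]
freqs = [1.0, 8.0, 27.0, 64.0, 125.0]
rho = spearman_rho(years, freqs)
print(rho)
```

Because only the ordering matters, any strictly increasing series gives the same ρ, regardless of how non-linear the growth is.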
Slope normalization
the slope estimates are not directly comparable, they need to be normalized
$d = \frac{b'}{\bar{y}}$
where $b'$ is the estimated slope and $\bar{y}$ is the mean of $y$, the observed frequencies.
On the next slide: the slopes obtained from Google ngrams of the 50 most common words from the Oxford English Corpus ordered by the slope relative to the mean $d$
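A short worked example of the normalization. Both words, their frequencies, and their slopes are hypothetical; the point is that a common and a rare word growing at the same relative rate have very different raw slopes but the same $d$.

```python
def relative_slope(slope, frequencies):
    """Normalize a slope estimate by the mean observed frequency: d = slope / mean(y)."""
    mean_frequency = sum(frequencies) / len(frequencies)
    return slope / mean_frequency


# Hypothetical yearly frequencies; each series grows linearly,
# so the slope per year can be read off directly.
common_word = [1000.0, 1100.0, 1200.0, 1300.0]  # slope 100 per year
rare_word = [10.0, 11.0, 12.0, 13.0]            # slope 1 per year

d_common = relative_slope(100.0, common_word)
d_rare = relative_slope(1.0, rare_word)
print(d_common, d_rare)
```

After normalization the two words can be ranked on the same scale, which is what ordering by $d$ requires.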
[Figure: The slopes obtained from Google ngrams of the 50 most common words from the Oxford English Corpus. Word labels recovered from the axis: which, been, his, he, It, were, be, by, there, was, has, of, had, would, all, but, one, not, the, it, will, is, at, The, this]
Future work
anomaly detection
piecewise linear model
Conclusion
the Mann-Kendall test together with the Theil-Sen estimator gives the best results
the standard linear regression model gives satisfactory results most of the time