Statistical parameters of Ivan Franko’s novel Perekhresni stežky (The Cross-Paths)
Solomija Buk and Andrij Rovenchak
1 Introduction
The year 2006 marks the 150th anniversary of the birth of Ivan Franko (1856–1916), the prominent Ukrainian writer, poet, publicist, philosopher, sociologist, economist, translator-polyglot and public figure. His incomplete collected works have been published in 50 volumes (Franko 1976–86). His name is connected to the notion of national identity in Western Ukraine. Franko’s works are characterised by intense plots and interesting topics.

In this paper, we analyze the novel Перехресні стежки (The Cross-Paths, also referred to as The Crossroads in Encyclopædia Britannica). The events of the novel unfold at the turn of the 20th century. The story is about a young lawyer, Evgenij Rafalovyč, who comes to a provincial town in Halytchyna (Galicia) to continue his practice. There he meets his former teacher, Stalski, who tells Evgenij about his matrimonial life, and in particular, that he has not spoken to his wife for ten (!) years as punishment. Stalski turns out to be married to Regina, Evgenij’s Jugendliebe . . . The novel is about lawlessness and justice, meanness and nobleness, consciousness and subconsciousness. Social motifs, psychologism, love and tragedy are intertwined here in an intricate way. The novel has been translated into French (Franko 1989; see also some excerpts in Anthologie 2004) and into Russian (Franko 1956).

The present paper is the first attempt in Ukrainian linguistics to make a comprehensive quantitative study of a particular literary work using modern techniques. Previous word-indices of Ukrainian writers (Vaščenko 1964; Žovtobrjukh 1978; Kovalyk et al. 1990; Luk’janjuk 2004) were compiled manually, with the aim of establishing the number of occurrences of a particular word rather than analyzing such data. Some initial efforts in the quantitative study of Ivan Franko’s fairy-tales were made recently; see Holovatch & Palchykov (2005).
The present study is based on the frequency dictionary compiled by the authors using the edition Franko (1979), applying the principles consistent with those described in our recent
paper (Buk & Rovenchak 2004). We have also analyzed the main differences between this edition and the first one (Franko 1900).
2 Basic principles of the text analysis
We consider a token to be a word in any form (a letter or alphanumeric sequence between two spaces), irrespective of the language. Thus, ‘1848’, ‘60ий’, ‘§136’ were each treated as one token.

We have partially restored the use of the letter ґ [g], eliminated from the Ukrainian alphabet in 1933 during Stalin’s rule as a step toward removing the differences between Ukrainian and Russian orthography. The letter г was left to denote both the [ɦ] and the [g] sounds, although the distinction can discriminate meaning: гніт ‘oppression’ versus ґніт ‘wick’, грати ‘to play’ versus ґрати ‘bars’, etc. The letter ґ was reintroduced into Ukrainian orthography in 1993, but to a much narrower extent. We have tried to restore the use of this letter using the edition of Franko (1900) and following modern Ukrainian orthographical tendencies: first, in proper names: Реґіна ‘Regina’, Ваґман ‘Wagman’, Рессельберґ ‘Resselberg’; second, in loan words from Polish, German and Latin: ґратулювати ← gratulować, ґешефт ← Geschäft, морґ ← Morgen (a measure of area), абнеґація ← abnegation, etc.; and, of course, in those words which are now traditionally written with ґ: ґанок, ґатунок, ґрасувати, ґрунт, etc.
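The token definition above can be illustrated with a minimal Python sketch. It is not the procedure used to compile the frequency dictionary: the function name `tokenize`, the decision to strip punctuation from the edges of each whitespace-delimited chunk, and the choice to keep the sign ‘§’ are assumptions made for this illustration only.

```python
import unicodedata

# Assumption for this sketch: the paragraph sign is part of tokens such as '§136',
# so it is never peeled off, whatever Unicode category it is assigned.
KEEP_AT_EDGE = {"§"}


def _is_edge_punctuation(ch):
    """True for punctuation that should be stripped from the ends of a chunk."""
    return unicodedata.category(ch).startswith("P") and ch not in KEEP_AT_EDGE


def tokenize(text):
    """Split on whitespace, then strip punctuation from both ends of each chunk."""
    tokens = []
    for chunk in text.split():
        start, end = 0, len(chunk)
        while start < end and _is_edge_punctuation(chunk[start]):
            start += 1
        while end > start and _is_edge_punctuation(chunk[end - 1]):
            end -= 1
        if start < end:  # chunks consisting only of punctuation are dropped
            tokens.append(chunk[start:end])
    return tokens


# Hypothetical sample sentence containing the three token types quoted in the text.
sample = "У 1848, за §136, уже 60ий раз - так."
print(tokenize(sample))
```

Under these assumptions, purely numeric, alphanumeric and sign-prefixed sequences each come out as a single token, in line with the examples quoted above.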
2.1 Euphony
In Ukrainian, some words appear in different phonetic variants conditioned by the ‘phonetic environment’ (i.e., the notion of euphony; cf. Polish w/we, z/ze, Russian с/со, к/ко, also the English indefinite article a/an or initial consonant mutations in Irish). These are initial в/у, від/відо, з/із/зі/зо, і/й with the respective prepositions and conjunctions, and final -ся/-сь. Such word variants were joined under one (the most frequent) form. By contrast, vernacular variants are given separately, for instance адука(н)т and адвокат (‘advocate’), переграф and параграф (‘paragraph’), the second form in each pair being the normative one.
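The joining of euphonic variants can be sketched as a simple lookup table. The sketch below is only illustrative: the name `EUPHONIC_HEAD`, the function `normalize_token` and the choice of head forms (taken from the order in which the variants are listed above) are assumptions, and only free-standing prepositions and conjunctions are covered. Word-initial в/у inside longer words and final -ся/-сь need lexical or part-of-speech information and are deliberately left out.

```python
# Illustrative mapping: euphonic variant -> head form under which it is counted.
# Which variant is the most frequent one is an assumption of this sketch.
EUPHONIC_HEAD = {
    "у": "в",
    "із": "з",
    "зі": "з",
    "зо": "з",
    "й": "і",
    "відо": "від",
}


def normalize_token(token):
    """Lower-case a token and map free-standing euphonic variants to one form."""
    low = token.lower()
    return EUPHONIC_HEAD.get(low, low)


def merge_counts(raw_counts):
    """Add up frequencies of variants that share the same head form."""
    merged = {}
    for token, count in raw_counts.items():
        head = normalize_token(token)
        merged[head] = merged.get(head, 0) + count
    return merged


# Made-up counts, used only to show the merging step.
print(merge_counts({"в": 5, "у": 3, "У": 1, "стежки": 2}))
```

Vernacular variants such as адукат and адвокат are not in the table, so they stay separate entries, as described above.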
2.2 Homonyms
The problem of homonyms is one of the most complicated problems slowing down automatic text processing. It is connected to the very high frequency of auxiliary parts of speech which have the same form in Ukrainian (as well as in other Slavic languages). For instance, in the text under investigation there are 1 956 occurrences of що, distributed as follows: 1 360 as a conjunction (‘that’, ‘which’), 495 as a pronoun (‘what’) and 101 as a particle. The token а occurs 1 065 times as a conjunction (‘and’, ‘but’), 33 times as a particle and 6 times as an interjection. The token як is found 389 times as a conjunction (‘as’), 125 times as an adverb (‘how’) and 55 times as a particle. Note, however, that the translations are very approximate due to the wide range of the word meanings. The ‘full-meaning’ words occupy rather lower ranks: мати appears 35 times as the verb ‘to have’ in Infinitive, versus four occurrences as the noun ‘mother’ in Nominative Singular.

While the above examples are standard and expected for Ukrainian, we have also encountered some specific parallel forms: на́ймити, the noun ‘hireling’ in Plural Nominative, and найми́ти, the verb ‘to hire’ in Infinitive; густі́, the adjective ‘dense’ in Singular Accusative Feminine, and гу́сті (in fact, ґу́сті), the noun ‘taste’ in Singular Genitive.

The analysis of the homonyms could not be carried out fully automatically, and even a contextual analysis was not sufficient; manual control was therefore necessary. Interestingly, the problem of homonyms appears even in the small subset of words written in the Latin script: we had to distinguish Latin and German in, the German definite articles die (Plural and Feminine), and Latin maxima (adjective, Feminine of maximus, and noun, Plural of maximum).
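One way to keep homonyms apart in a frequency dictionary is to key the counts by the pair (word-form, part of speech). The following sketch stores the counts quoted above in that way; the names `homonym_counts`, `form_total` and `pos_shares` are hypothetical and are not taken from the authors' software.

```python
from collections import Counter

# Word-forms discussed in the text.
SHCHO, A_FORM, YAK = "що", "а", "як"

# Counts as quoted in the text, keyed by (word-form, part of speech).
homonym_counts = Counter({
    (SHCHO, "conjunction"): 1360,
    (SHCHO, "pronoun"): 495,
    (SHCHO, "particle"): 101,
    (A_FORM, "conjunction"): 1065,
    (A_FORM, "particle"): 33,
    (A_FORM, "interjection"): 6,
    (YAK, "conjunction"): 389,
    (YAK, "adverb"): 125,
    (YAK, "particle"): 55,
})


def form_total(form):
    """Number of occurrences of a word-form, all parts of speech together."""
    return sum(n for (f, _pos), n in homonym_counts.items() if f == form)


def pos_shares(form):
    """Share of each part of speech among the occurrences of a word-form."""
    total = form_total(form)
    return {pos: n / total for (f, pos), n in homonym_counts.items() if f == form}


for form in (SHCHO, A_FORM, YAK):
    print(form, form_total(form), pos_shares(form))
```

The three parts of speech of що add up to the 1 956 occurrences quoted above, which gives a quick consistency check on manually disambiguated counts.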
3 Statistical data
– The text size N is 93 885 tokens. In the novel, 45 tokens are alphanumeric and 208 are written in the Latin script (German: 87, Latin: 55, Polish: 38, French: 14, Czech: 9, Yiddish: 4, plus a single letter ‘S’ used to describe the shape of a river); all the remaining tokens are Ukrainian.
– The number of different word-forms is 19 391; the number of different words (lemmas), i.e. the vocabulary size V, is 9 962.
– Mean word length is 4.83 letters and mean sentence length is 9.8 words.
– Vocabulary richness (the variety index), calculated as the ratio of the number of words to the text size, equals V/N = 0.106.
– Vocabulary density is calculated as the ratio of the text size to the vocabulary size, N/V = 9.42. In other words, a new word is encountered at every 9–10th word.
– The number of hapax legomena V1 is 4 901, thus making up 49.2 per cent of the vocabulary and 5.22 per cent of the text. These parameters are also known as the exclusiveness indices of vocabulary and text, respectively.
– The concentration indices are connected with the number of words with an absolute frequency equal to or higher than 10: N10 = 74 965 for the text and V10 = 1 128 for the vocabulary. The concentration indices are therefore N10/N = 79.6% for the text and V10/V = 11.3% for the vocabulary.

The main feature of the Franko (1900) edition that influences the statistical parameters of the novel in comparison with the modern text (Franko 1979) is the usage of the verbal reflexive particle -ся. In modern Ukrainian it is written together with the respective verb, whereas under the orthographical rules of 1900 it was written separately (the shortened variant -сь is written as one word with the verb in both older and modern texts). In the novel, this particle is used 2 496 times with 1 485 different verbal forms. This frequency corresponds to the second (!) highest rank, after і/й ‘and’ (3 211). Such a result correlates with, e.g., modern Polish, where the corresponding word się also belongs to the most frequent ones (PWN 2005).
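The vocabulary indices listed above are simple ratios of five counts and can be recomputed as follows. This is a minimal sketch; the function name `vocabulary_indices` and the dictionary keys are chosen here for illustration only. With the counts as quoted, N10/N evaluates to about 79.8 per cent rather than the 79.6 per cent given in the list, so one of the two quoted figures may be rounded differently or misprinted.

```python
def vocabulary_indices(n_tokens, n_lemmas, n_hapax, n10_tokens, n10_lemmas):
    """Return the richness, density, exclusiveness and concentration indices."""
    return {
        "richness": n_lemmas / n_tokens,                 # V / N
        "density": n_tokens / n_lemmas,                  # N / V
        "exclusiveness_vocabulary": n_hapax / n_lemmas,  # V1 / V
        "exclusiveness_text": n_hapax / n_tokens,        # V1 / N (each hapax occurs once)
        "concentration_text": n10_tokens / n_tokens,     # N10 / N
        "concentration_vocabulary": n10_lemmas / n_lemmas,  # V10 / V
    }


# Counts quoted in the text for the novel.
indices = vocabulary_indices(
    n_tokens=93_885, n_lemmas=9_962, n_hapax=4_901,
    n10_tokens=74_965, n10_lemmas=1_128,
)
for name, value in indices.items():
    print(f"{name}: {value:.4f}")
```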
4 Distributions and linguistic laws
We have analyzed the distribution of word-forms with respect to the number of letters and found that this dependence has two maxima; see Figure 1a. As the size of our sample is quite large, this fact can signify that some other unit must be considered as the proper, or natural, one. A phoneme (sound) and a syllable appeared to be appropriate alternatives. The fraction W of word-forms containing exactly $\varphi$ phonemes can be approximated by the following (empirical) formula:

\[ W = A\,\varphi^{b} e^{-\alpha\varphi^{2}}, \qquad A = \frac{2\alpha^{(b+1)/2}}{\Gamma\!\left(\frac{b+1}{2}\right)}. \tag{1} \]
In (1), the value of A is obtained from the normalization condition

\[ 1 = \int_{0}^{\infty} A\,\varphi^{b} e^{-\alpha\varphi^{2}}\,d\varphi. \tag{2} \]
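For completeness, the closed form of A in (1) follows from the normalization condition (2) by a standard substitution. The derivation below is a sketch that assumes b > −1 and α > 0, so that the integral converges; the auxiliary variable t is introduced here only for this step.

```latex
\begin{align*}
\int_{0}^{\infty} \varphi^{b} e^{-\alpha\varphi^{2}}\,d\varphi
  &= \frac{1}{2}\,\alpha^{-(b+1)/2}
     \int_{0}^{\infty} t^{\frac{b+1}{2}-1} e^{-t}\,dt
     \qquad \bigl(t = \alpha\varphi^{2}\bigr) \\
  &= \frac{1}{2}\,\alpha^{-(b+1)/2}\,\Gamma\!\left(\frac{b+1}{2}\right),
\end{align*}
so that
\[
A = \left[\int_{0}^{\infty} \varphi^{b} e^{-\alpha\varphi^{2}}\,d\varphi\right]^{-1}
  = \frac{2\alpha^{(b+1)/2}}{\Gamma\!\left(\frac{b+1}{2}\right)}.
\]
```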
The fitting parameters are as follows: b = 0.6347, α = 0.0258; see Figure 1b. The fitting results in this work were obtained using the nonlinear least-squares Marquardt–Levenberg algorithm implemented in the GnuPlot utility, version 3.7 for Linux.

Figure 1: The distribution of word-forms (fraction of unity, vertical axis) with respect to the number of constituting units: (a) letters (horizontal axis: number of letters), (b) sounds (horizontal axis: number of sounds)

If a syllable is used as a length unit, we have utilized a formula similar to the Altmann–Menzerath law (Altmann 1980), with the argument shifted by unity:

\[ W = B\,(s+1)^{d} e^{-\gamma(s+1)}, \qquad B = \frac{\gamma^{d+1}}{\Gamma(d+1)}. \tag{3} \]
In the above formula, W is the fraction of word-forms containing exactly s syllables. The reason for introducing the shift is the high frequency of non-syllabic words (particles б, ж, prepositions в, з, conjunction й), which were not treated as proclitics, in contrast to, e.g., the approach by Grzybek & Altmann (2002) for similar Russian words. We have set the length of such words to zero. Thus, the distribution function has to be non-zero at the origin (s = 0). The fitting parameters are as follows: d = 5.805, γ = 2.245; see Figure 2a.

In order to check the validity of the Menzerath law, we have also studied the dependence of the mean syllable length M on the word length s (measured in syllables). We have used the formula

\[ M = M_{\infty} + B\,s^{c}. \tag{4} \]
The constant M∞ denotes a possible asymptotic value of the mean syllable length in a very long word, and the exponent c is a negative number. In this way, we also obtain an infinite value of the syllable length for non-syllabic words (s = 0). The fitting parameters are as follows: M∞ = 1.984, B = 1.464, c = −1.119; see Figure 2b. The right-most point was excluded from the fit due to poor statistical reliability.
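The three fitted curves of this section, (1), (3) and (4), can be evaluated directly from the quoted parameters. The sketch below is an illustration and not the GnuPlot fitting procedure itself; the function names are hypothetical, and the default arguments are the fitted values reported in the text.

```python
import math


def phoneme_fraction(phi, b=0.6347, alpha=0.0258):
    """Eq. (1): fraction of word-forms consisting of exactly phi phonemes."""
    a_norm = 2.0 * alpha ** ((b + 1) / 2) / math.gamma((b + 1) / 2)
    return a_norm * phi ** b * math.exp(-alpha * phi ** 2)


def syllable_fraction(s, d=5.805, rate=2.245):
    """Eq. (3): fraction of word-forms with exactly s syllables (s = 0 allowed)."""
    b_norm = rate ** (d + 1) / math.gamma(d + 1)
    x = s + 1  # argument shifted by unity, so that the value at s = 0 is non-zero
    return b_norm * x ** d * math.exp(-rate * x)


def mean_syllable_length(s, m_inf=1.984, b_coef=1.464, c=-1.119):
    """Eq. (4): mean syllable length in a word of s syllables (s >= 1 here)."""
    return m_inf + b_coef * s ** c


if __name__ == "__main__":
    for s in range(1, 11):
        print(s, round(syllable_fraction(s), 4), round(mean_syllable_length(s), 3))
    # The continuous curves are normalized by construction, so the sums over
    # integer lengths should come out close to (not exactly equal to) one.
    print(sum(phoneme_fraction(p) for p in range(1, 31)))
    print(sum(syllable_fraction(s) for s in range(0, 15)))
```

Calling `mean_syllable_length(0)` raises an error in Python, which mirrors the infinite value of (4) at s = 0 noted above.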
Figure 2: The distributions regarding syllabic structure of the words: (a) fraction of word-forms with respect to constituting syllables (vertical axis: fraction of word-forms; horizontal axis: number of syllables), (b) Menzerath’s law (vertical axis: mean syllable length; horizontal axis: number of syllables)
The form $M = A s^{b} e^{cs}$ (see, e.g., Köhler 2002) appeared to give a poorer fit, leading in particular to a large mean syllable length for long words due to the exponential increase.

We have calculated the parameters of the Zipf law fitting our frequency data in different ranges of ranks. The word frequency F is connected with its rank r via the simple relation $F(r) = A/r^{z}$. The values of the exponent z can be related to the different types of vocabulary. Visually, in Figure 3a we can see three such rank domains: 10 < r < 200 (z = −0.999), 200 < r < 1000 (z = −1.05), r > 1000 (z = −1.20). The parameters of the Zipf–Mandelbrot law $F(r) = A/(r + C)^{b}$ were also calculated for the whole rank domain: A = 25000, b = 1.14, C = 5.2; see Figure 3a for the results.

The portion of text T covered by the first r ranked words can be fitted by the dependence $T(r) = k \ln r + T_{0}$. While for 10 < r < 200 the growth of the text coverage is characterized by k = 0.133, it slows a bit for 200 < r < 2000 with k = 0.1155 and even more for the larger values of r, with k = 0.0833 for r > 2000; see Figure 3b for details.

Figure 3: The transition to different regimes: (a) Zipf’s law (frequency versus rank; labels in the plot: z = –0.9988, z = –1.116, z = –1.202), (b) text coverage (text coverage versus rank; labels in the plot: k = 0.1328, k = 0.1153, k = 0.08346)
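The rank–frequency relations of this section can be explored with a few lines of Python. The sketch below is an illustration only: the function names are hypothetical, the slope is obtained by ordinary least squares on logarithmic axes rather than by the nonlinear GnuPlot fit used in the paper, and the frequency list in the example is synthetic. The function `loglog_slope` returns the slope of ln F against ln r, which is negative for a decaying frequency list (compare the negative values quoted for z above).

```python
import math


def zipf_mandelbrot(r, a=25000.0, b=1.14, c=5.2):
    """Zipf-Mandelbrot law F(r) = A / (r + C)^b with the parameters quoted in the text."""
    return a / (r + c) ** b


def loglog_slope(freqs, r_min, r_max):
    """Least-squares slope of ln F versus ln r for ranks r_min..r_max (1-based)."""
    xs = [math.log(r) for r in range(r_min, r_max + 1)]
    ys = [math.log(freqs[r - 1]) for r in range(r_min, r_max + 1)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx


def text_coverage(freqs):
    """T(r): share of the text covered by the r most frequent words."""
    total = sum(freqs)
    covered, running = [], 0
    for f in freqs:
        running += f
        covered.append(running / total)
    return covered


# Synthetic rank-frequency list following an exact power law (not the novel's data).
synthetic = [1000.0 / r ** 1.2 for r in range(1, 2001)]
print(loglog_slope(synthetic, 10, 200))
print(text_coverage(synthetic)[99])
print(zipf_mandelbrot(1))
```

With real data, `freqs` would be the list of word frequencies sorted in descending order, and the slope would be computed separately for each of the rank domains named above.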
5 Comparison
To complete our paper, we present a comparison of the top-ranked words in five different languages (see Table 1). The Ukrainian text is the novel under consideration, the English one is Ulysses by James Joyce (Ulysses n.d.), the Japanese one is Kokoro by Natsume Sōseki¹, the Russian data correspond to the vocabulary of Lermontov (FDL n.d.), and the Polish data are from PWN (2005).

As expected, the majority of these words are auxiliary parts of speech, irrespective of language. Interestingly, the texts of individual works (the Ukrainian, English and Japanese examples) share some common features: the names of the characters have a very high frequency, which brings them to the highest ranks, together with the forms of address пан, Mr., S. Also, nouns denoting parts of the human body are quite frequent, in particular ‘hand’, which is found in all but the Polish list (the reason is probably the large fraction of journalistic texts in the PWN corpus). These phenomena require additional interlingual studies.
1. The frequency data on Kokoro by Natsume Sōseki were kindly granted by Dr. Katsutoshi Ohtsuki (NTT Cyber Space Laboratories, Yokosuka-shi, Kanagawa, Japan).
Table 1: The top-ranked words, with percentage frequencies in the column to the right of each word. The particle -ся is listed without a rank. The Japanese word-forms are not legible in this copy; only their frequencies are given.

r | Ukrainian | % | Polish | % | Russian | % | English | % | Japanese, %
1 | і/й | 3.420 | w/we | 3.237 | и | 4.117 | the | 5.653 | 4.923
  | -ся | 2.659 |  |  |  |  |  |  |
2 | він | 2.632 | i | 2.589 | я | 3.207 | he | 3.356 | 3.966
3 | не | 2.394 | być | 2.104 | в | 2.523 | of | 3.074 | 1.945
4 | в/у | 2.304 | się | 2.069 | он | 2.441 | and | 2.726 | 1.695
5 | я | 1.842 | z/ze | 1.779 | не | 2.239 | a/an | 2.710 | 1.490
6 | на | 1.606 | na | 1.689 | быть | 1.294 | be | 2.263 | 1.331
7 | з/із/зі/зо | 1.598 | nie | 1.535 | на | 1.290 | to | 1.877 | 1.101
8 | що (conj) | 1.449 | on | 1.437 | она | 1.260 | in | 1.869 | 1.090
9 | бути | 1.388 | do | 1.178 | ты | 1.158 | I | 1.381 | 0.840
10 | той | 1.299 | ten | 1.109 | с | 1.079 | she | 1.104 | 0.764
11 | сей/цей | 1.222 | to | 1.005 | как | 0.990 | that | 0.985 | 0.749
12 | до | 1.143 | że | 0.853 | этот | 0.838 | with | 0.950 | 0.639
13 | а (conj) | 1.134 | a | 0.773 | но | 0.762 | it | 0.892 | 0.622
14 | вона | 0.962 | który | 0.650 | весь | 0.738 | on | 0.799 | 0.619
15 | пан | 0.937 | o | 0.644 | мой | 0.722 | say | 0.756 | 0.525
16 | ви | 0.898 | mieć | 0.585 | вы | 0.716 | for | 0.731 | 0.517
17 | але | 0.749 | jak (adv) | 0.501 | они | 0.645 | have | 0.723 | 0.483
18 | що (pron) | 0.685 | tak | 0.445 | что (conj) | 0.634 | you | 0.716 | 0.477
19 | свій | 0.649 | ja | 0.441 | что (pron) | 0.557 | they | 0.639 | 0.463
20 | (в/у)весь | 0.590 | co | 0.436 | тот | 0.551 | all | 0.500 | 0.440
21 | вони | 0.559 | rok | 0.424 | к | 0.512 | at | 0.488 | 0.406
22 | за | 0.536 | od | 0.398 | свой | 0.453 | by | 0.480 | 0.395
23 | Євгеній | 0.534 | po | 0.394 | а (conj) | 0.444 | as | 0.448 | 0.383
24 | знати | 0.456 | ale | 0.384 | так | 0.436 | from | 0.411 | 0.372
25 | такий | 0.456 | taki | 0.325 | бы | 0.432 | do | 0.377 | 0.369
26 | який | 0.445 | móc | 0.323 | один | 0.430 | or | 0.363 | 0.366
27 | би/б | 0.425 | przez | 0.321 | за | 0.410 | Bloom | 0.351 | 0.358
28 | як (conj) | 0.414 | za | 0.309 | мочь | 0.396 | out | 0.339 | 0.349
29 | мати (v) | 0.406 | dla | 0.288 | мы | 0.393 | what | 0.338 | 0.335
30 | про | 0.406 | już | 0.265 | у | 0.353 | not | 0.338 | 0.332
31 | мовити | 0.395 | czy | 0.262 | же | 0.348 | my | 0.317 | 0.309
32 | ще | 0.381 | bardzo | 0.262 | знать | 0.329 | up | 0.313 | 0.301
33 | себе | 0.379 | tylko | 0.255 | сказать | 0.322 | one | 0.279 | 0.301
34 | ну | 0.374 | swój | 0.241 | твой | 0.304 | like | 0.277 | 0.295
35 | ж/же | 0.373 | no | 0.240 | от | 0.301 | their | 0.273 | 0.290
36 | коли | 0.371 | to | 0.232 | нет | 0.294 | Mr. | 0.271 | 0.287
37 | могти | 0.367 | wszystko | 0.227 | по | 0.292 | there | 0.266 | 0.281
38 | по | 0.365 | wiedzieć | 0.215 | ли | 0.275 | but | 0.263 | 0.275
39 | ми | 0.361 | inny | 0.207 | рука | 0.275 | no | 0.258 | 0.267
40 | то (conj) | 0.349 | bo | 0.203 | который | 0.274 | come | 0.245 | 0.261
41 | ти | 0.343 | czas | 0.187 | когда | 0.273 | so | 0.233 | 0.258
42 | від | 0.342 | człowiek | 0.187 | из | 0.269 | then | 0.219 | 0.253
43 | один | 0.328 | sam | 0.184 | ни | 0.266 | when | 0.209 | 0.247
44 | так (adv) | 0.323 | praca | 0.177 | любить | 0.264 | man | 0.209 | 0.230
45 | мій | 0.320 | oraz | 0.176 | уже | 0.257 | if | 0.205 | 0.230
46 | щоб/щоби | 0.304 | jeden | 0.175 | хотеть | 0.248 | about | 0.202 | 0.224
47 | сам | 0.291 | mówić | 0.174 | о (prep) | 0.248 | which | 0.196 | 0.221
48 | Стальський | 0.278 | też | 0.168 | душа | 0.242 | Stephen | 0.191 | 0.219
49 | говорити | 0.274 | lub | 0.168 | кто | 0.237 | old | 0.185 | 0.213
50 | то (part) | 0.269 | jeszcze | 0.164 | для | 0.230 | your | 0.184 | 0.210
51 | та (conj) | 0.258 | przy | 0.161 | если | 0.228 | who | 0.182 | 0.210
52 | тілько/-и | 0.256 | przed(e) | 0.154 | чтобы | 0.222 | hand | 0.175 | 0.207
53 | ні | 0.245 | my | 0.153 | о (int) | 0.219 | down | 0.171 | 0.202
54 | для | 0.244 | pan | 0.153 | говорить | 0.214 | this | 0.169 | 0.202
55 | рука | 0.241 | można | 0.152 | себя | 0.214 | over | 0.168 | 0.199
References

Altmann, Gabriel
  1980 “Prolegomena to Menzerath’s law”. In: Glottometrika 2. Bochum: Brockmeyer, 1–10.
Anthologie
  2004 Anthologie de la littérature ukrainienne du XIème au XXème siècle. Paris / Kyiv: Société Scientifique Ševčenko en Europe.
Buk, Solomija; Rovenchak, Andrij
  2004 “Rank–Frequency Analysis for Functional Style Corpora of Ukrainian”. In: Journal of Quantitative Linguistics, 11; 161–171.
FDL
  n.d. Častotnyj slovar’ jazyka M. Yu. Lermontova. [Frequency dictionary of Lermontov’s language]. [http://feb-web.ru/feb/lermenc/lre-lfd/lre/lre-7172.htm]
Franko, Ivan
  1900 Perekhresni stežky. [Cross-paths]. Lviv: Vydanje red. “Lïteraturno-naukovoho vistnyka”.
  1956 “Razdorož’e.” [Crossroads]. In: Ivan Franko, Sočinenija v 10-ti tomach. T. 5. [Works in 10 volumes. Vol. 5]. Moskva: Goslitizdat, 161–486.
  1976–86 Zibrannja tvoriv u 50-ty tomakh. [Collected works in 50 volumes]. Kyiv: Naukova Dumka.
  1979 “Perekhresni stežky.” [Cross-paths]. In: Ivan Franko, Zibrannja tvoriv u 50-ty tomakh. T. 20. [Collected works in 50 volumes. Vol. 20]. Kyiv: Naukova Dumka, 173–459.
  1989 Les Chemins croisés: Roman / Trad. de l’ukrainien par G. Maxymovytch. Kyiv: Dnipro.
Grzybek, Peter; Altmann, Gabriel
  2002 “Oscillation in the frequency-length relationship”. In: Glottometrics, 5; 97–107.
Holovatch, Yurii; Palchykov, Vasyl
  2005 “Lys Mykyta and Zipf Law”. In: Statistical Physics 2005: Modern Problems and New Applications, August 28–30, 2005, Lviv, Ukraine: Book of abstracts; 136. [http://www.physics.wups.lviv.ua/Franko/lys.pdf]
Köhler, Reinhard
  2002 “Power Law Models in Linguistics: Hungarian”. In: Glottometrics, 5; 51–61.
Kovalyk, Ivan; Oščypko, Iryna; Poljuha, Levko
  1990 Leksyka poetyčnych tvoriv Ivana Franka. [Vocabulary of Ivan Franko’s poetry]. Lviv: Lviv University Press.
Luk’janjuk, Kornij M. (Ed.)
  2004 Jurij Fedjkovyč: Slovopokažčyk movy tvoriv pysjmennyka. [Jurij Fedkovyč: Word-index of the writer’s language]. Chernivtsi: Misto.
PWN
  2005 Korpus Języka Polskiego Wydawnictwa Naukowego PWN. [Polish language corpus of scientific publishing house PWN]. [http://korpus.pwn.pl/stslow_en.php]
Ulysses
  n.d. Ulysses by James Joyce. A Ranked Concordance. [http://www.doc.ic.ac.uk/~rac101/concord/texts/ulysses/ulysses_ranked.html]
Vaščenko, Vasylj (Ed.)
  1964 Slovnyk movy Ševčenka. T. 1 & 2. [Vocabulary of Shevchenko’s language. Vols. 1 & 2]. Kyiv: Naukova dumka.
Žovtobrjukh, Mykhajlo (Ed.)
  1978–79 Slovnyk movy H. Kvitky-Osnov’janenka. [Vocabulary of Kvitka-Osnov’janenko’s language]. Kharkiv: Kharkiv University Press.
Bibliographic description:
Buk, Solomija and Andrij Rovenchak (2007): Statistical parameters of Ivan Franko’s novel Perekhresni stežky (The Cross-Paths). In: Grzybek, Peter and Reinhard Köhler (eds.), Exact methods in the study of language and text: dedicated to Professor Gabriel Altmann on the occasion of his 75th birthday (Quantitative Linguistics; 62). Berlin; New York: Mouton de Gruyter, pp. 39–48.