STABILITY OF LANGUAGE FEATURES: A COMPARISON OF THE WALS AND JM TYPOLOGICAL DATABASES

Oleg Belyaev

ABSTRACT

The problem of the stability of linguistic features is not new in linguistic typology. Already raised by Greenberg (1978), it has been explored extensively in recent decades, perhaps most prominently by Nichols (1992). The creation of typological databases such as WALS (Haspelmath et al. 2005) and “Jazyki Mira” (JM; Polyakov and Solovyev 2006), which contain hundreds or thousands of languages, allows stability metrics to be computed on substantially larger samples. One of the first explorations of these new possibilities is Wichmann and Holman (pending), which is based on WALS data. This paper presents the results of applying the metric from that paper to the JM data as a means of comparing the two databases. The resulting values are also compared with statements in the literature; this comparison shows close agreement between WALS and JM.[1]

KEYWORDS

stability, typology, Jazyki Mira, WALS, database, computational linguistics

INTRODUCTION

Having a model of how a language's grammar evolves, based on more or less precise principles, has been a persistent goal of modern linguistics. It has been approached from various sides and with different methods. One of them deals with typological features and their susceptibility to change through time, that is, ideally, with the relative tendency of a given feature to change or to remain unchanged. One of the first typologists to formulate the notion of stability explicitly was Joseph Greenberg (1978: 76), who is known for emphasizing diachronic explanations of typological distributions:

If a particular phenomenon can arise very frequently and is highly stable once it occurs, it should be universal or near universal (...).
If it tends to come into existence often and in various ways, but its stability is low, it should be found fairly often but distributed relatively evenly among genetic linguistic stocks (...). If a particular property rarely arises but is highly stable when it occurs, it should be fairly frequent on a global basis but be largely confined to a few linguistic stocks (...). If it occurs only rarely and is unstable when it occurs, it should be highly infrequent or nonexistent and sporadic in its geographical and genetic distribution (...)

[1] The author would like to deeply thank Søren Wichmann, Eric W. Holman, Vladimir Polyakov, Valery Solovyev, and Dmitry Egorov for their collaboration and advice on this project, without which such a comparison would never have been possible.

A thorough analysis of the problem was carried out in the classic work by Johanna Nichols (1992), where a stability metric (consisting, in effect, of several submetrics) was proposed and calculated on a sample of languages. Notable recent works on the topic include Dahl (2004) and Maslova (2004). There is, however, room for improvement both in the metrics used and in the data to which the calculations are applied. In the latter respect, progress has been especially significant since the appearance of large typological databases, which allow research to be based not on a relatively limited and hand-picked set of languages and features, but on hundreds or thousands of fully described database entries.

Prominent examples of such databases are the World Atlas of Language Structures (Haspelmath et al. 2005) in its digital form and “Jazyki Mira” (cf. Polyakov and Solovyev 2006 for a thorough description). The first contains 142 features, with 2-9 values each, for 2650 languages covering the whole world, and about 58000 datapoints. “Jazyki Mira”, on the other hand, has a smaller selection of 317 languages, covering only Eurasia, but contains 3827 binary features. While the binary structure certainly inflates the feature count, it can still be asserted that JM's language descriptions are more thorough than those of WALS, although they are lacking in some important areas of grammar. In both databases, not all features are attested for all languages; however, WALS records whether a given feature has been assigned a value for a given language, while JM, due to its binary structure, does not. Moreover, the significantly different approaches to language description make any direct comparison between the two questionable, as will be elaborated on further in this paper.
In spite of these issues and differences, the two databases present an amount of data larger than anything used before in typological studies. One of the first papers to explore the inherent properties of features on the basis of this data is Wichmann and Holman (pending), which deals with the notion of stability. It both introduces a new metric and applies it to the data of the WALS database. The present paper applies the same metric to the data of JM and compares the results with those obtained from WALS and with statements in the literature. It is organized as follows. In Section 2, an overview of the metric proposed by Wichmann and Holman is presented; in Section 3, the WALS and JM results are compared and discussed; in Section 4, our findings are compared with the same statements in the literature that were used for WALS; finally, in Section 5, preliminary conclusions are drawn and prospects for future research are outlined.

SECTION 2. DESCRIPTION OF THE METRIC USED

Wichmann and Holman define stability as the probability that a given feature remains unchanged during an arbitrary period of time. As a definition, this does reflect the notion of stability as it is typically understood; however, one should keep in mind that the values that can be computed from typological distributions can never correspond to probabilities in a true mathematical sense; at best, they are approximate measures of how one feature is
prone to change compared to another. This is especially so in our case, where, as will soon become clear, negative values for stability are possible, which would be completely ruled out in a strictly probabilistic interpretation. It can also be objected that defining only one measure of 'stability' is too crude, and that different values are required for a feature's susceptibility to change as a result of language contact and as a result of a language's internal development. Devising such metrics should certainly be a focal point of later studies, though this does not in any way render a 'general' understanding of stability meaningless.

Before explaining in detail how the calculations are carried out, it is important to describe the structure of the metric. It contains two components: one (R) for the proportion of same-valued pairs among related languages (within genera), and the other (U) for the proportion of same-valued pairs among unrelated languages. The rationale is that, while a given feature may be likely to have the same value in pairs of related languages, this does not imply high stability if the value is just as often the same in pairs of unrelated languages. This is perceived to be more precise than a simple frequency metric, and the point is supported by the simulations described in Wichmann and Holman (pending). It should also be noted that for JM, all features are considered to be attested for all languages, since there is no means to verify attestation for most of them.

The metric itself is calculated as follows. For a given feature, in each genus (with genera as defined by Dryer (2001), which essentially correspond to the genera in WALS), all the pairs of languages for which this feature is attested are taken. Then the proportion of those pairs that have the same value is calculated. A weighted average of these proportions (r_i) over all genera is taken, with the weights (w_i) being the square roots of the numbers of attested languages in the corresponding genera:

\[ R = \frac{\sum_i r_i w_i}{\sum_i w_i} \tag{1} \]

This gives the value for R. U is calculated in a similar way. All the pairs of unrelated languages for which the feature is attested are taken, and the proportion of those pairs that have the same value for the feature is calculated. This gives us U. The final value for stability is calculated as follows:

\[ S = \frac{R - U}{1 - U} \tag{2} \]

According to Wichmann and Holman, this is a baseline correction of a kind widely used in the literature. It does, however, allow negative values of S, which does not agree well with the notion that stability is a probability. A negative stability value nevertheless has a clear meaning: the feature is more likely to have the same value in pairs of unrelated languages than in pairs of related ones. As expected, negative values are not very frequent; in our JM calculations, most of them occur for features where U exceeds 0.9, i.e. where nearly all pairs of languages share the same value and the statistical basis is therefore very thin. The results of our computations and their comparison with the WALS data are discussed in the next section.
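The calculation described in this section can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not Wichmann and Holman's implementation: the input layout (a list of `(family, genus, value)` tuples, with `None` marking an unattested value), the function names, and the reading of "unrelated" as "belonging to different families" are all hypothetical choices made for the example.

```python
from itertools import combinations
from math import sqrt


def same_value_proportion(value_pairs):
    """Proportion of (a, b) pairs with a == b; None if there are no pairs."""
    value_pairs = list(value_pairs)
    if not value_pairs:
        return None
    return sum(a == b for a, b in value_pairs) / len(value_pairs)


def stability(languages):
    """Return (R, U, S) for one feature, following equations (1) and (2).

    languages: list of (family, genus, value); value None means unattested
    (hypothetical layout, assumed for this sketch).
    """
    attested = [lang for lang in languages if lang[2] is not None]

    # R: weighted average over genera of the same-value proportions r_i,
    # with weights w_i = sqrt(number of attested languages in the genus).
    by_genus = {}
    for _, genus, value in attested:
        by_genus.setdefault(genus, []).append(value)
    numerator = denominator = 0.0
    for values in by_genus.values():
        r_i = same_value_proportion(combinations(values, 2))
        if r_i is None:  # a genus with a single attested language has no pairs
            continue
        w_i = sqrt(len(values))
        numerator += r_i * w_i
        denominator += w_i
    if denominator == 0:
        raise ValueError("no genus contains a pair of attested languages")
    R = numerator / denominator

    # U: same-value proportion over all pairs of languages from different families.
    U = same_value_proportion(
        (a[2], b[2]) for a, b in combinations(attested, 2) if a[0] != b[0]
    )
    if U is None or U == 1:
        raise ValueError("U is undefined or equal to 1; S cannot be computed")

    S = (R - U) / (1 - U)
    return R, U, S


# Toy data: related languages always differ, unrelated ones agree half the time.
toy = [("F1", "G1", 1), ("F1", "G1", 0), ("F2", "G2", 1), ("F2", "G2", 0)]
R, U, S = stability(toy)
print(R, U, S)  # R = 0.0, U = 0.5, S = -1.0
```

The toy data illustrate how a negative S arises: R is 0 because the two languages in each genus differ, while half of the cross-family pairs agree, so R is smaller than U.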
SECTION 3. RESULTS OF STABILITY MEASURES ON JM DATA COMPARED TO WALS RESULTS Since JM contains 3827 different features, with only some of them really applicable to stability metrics, it would be impossible to present the result without some kind of direction as to what exactly we should select. Therefore, JM results are presented in this paper only insofar as they are compared to WALS results or statements in the literature. This chapter deals with the former. One of the first issues which arose while creating such a comparison is the choice of what features of one database exactly correspond to features of an another. It is obvious that while measuring stability, for any such correspondence to make any sense, it is important that it be as precise as possible. While there are a few cases, like the order of Subject, Verb and Object, where it is almost ideal, for the most part feature correspondences are rather dubious, which puts under question the whole idea of such a comparison. Nevertheless, the results gained for JM do mean something, and it is worthwhile to publish them. In the original paper, stabilities where U was more than 90% were delimited with brackets to indicate that statistics for such features is too low for the stability metric to really make sense. Due to space constraints only the features where U is less than 90% both for WALS and JM are included in the following table. The four-way categorization of features is taken from the original paper, where it was very stable: 50.6 – 80.8 stable: 34.1 – 48.3 unstable: 22.6 – 32.9 very unstable: -24.9 – 22.0 for 'full' WALS features and very stable: 51.8 – 100.0 stable: 32.8 – 51.7 unstable: 19.2 – 32.7 very unstable: -62.8 – 18.9 for their binary values. Since all features in JM are binary, only the second scheme is used for it. Also the stabilities for JM are presented rounded, though in our kind of comparison it is not that important. WALS feature
| WALS feature | WALS stability | JM feature | JM stability |
|---|---|---|---|
| 10 Vowel nasalization | 57% (very stable) | 29 ..назализованные/неназализованные (nasals) | 38% (stable) |
| 1. Contrastive nasal vowels present | 57% (very stable) | | |
| 2. Contrastive nasal vowels absent | 57% (very stable) | | |
| 13 Tone | 48.3% (stable) | 484 .тон (tone) | 39% (stable) |
| 14 Fixed Stress Locations | 26.1% (unstable) | 458 ..фиксированность (fixed stress) | |
| 1. No fixed stress | 11.5% (very unstable) | 459 ...несвязанное (no fixed stress) | 5% (very unstable) |
| 2. Initial: 1st syllable | 37% (stable) | 461 ....начальный слог (1st syllable) | 13% (very unstable) |
| 7. Ultimate: last syllable | 41.1% (stable) | 468 ....конечный слог (last syllable) | 32% (unstable) |
| 26 Prefixing vs. Suffixing in Inflectional Morphology | 41.5% (stable) | 3442 .модели словоформы (word form models) | |
| 2. Predominantly suffixing | 66.8% (very stable) | 3446 ..преимущественно суффиксальная (pred. suf.) | 21% (unstable) |
| 31 Sex-based and Non-sex-based Gender Systems | 80.8% (very stable) | 1187 ..мотивация согласовательных классов (motivation for gender) | |
| 2. Sex-based gender system | 81.1% (very stable) | 1203 ...пол (sex-based motivation) | 59% (very stable) |
| 33 Coding of Nominal Plurality | 41.3% (stable) | 1258 .способы выражения числа (coding of plurality; not explicitly nominal, however) | |
| 1. Plural prefix | 63.2% (very stable) | 1260 ..аффиксация (affix) | 34% (stable) |
| 7. Plural word | 25.4% (unstable) | 1271 ..служебные слова (functional words) | 18% (very unstable) |
| 54 Distributive Numerals | 44.5% (stable) | 1329 ..распределительные (разделительные) (distributives) | 38% (stable) |
| 55 Numeral Classifiers | 38.7% (stable) | 1265 ..классификаторы (classifiers) | 33% (stable) |
| 65 Perfective/Imperfective Aspect | 36% (stable) | 1845 ...совершенный/несовершенный (perfective/imperfective) | 22% (unstable) |
| 73 The Optative | 56.7% (very stable) | 2362 ...желательность (optative) | 20% (unstable) |
| 81 Order of Subject, Object and Verb | 53.3% (very stable) | 3657 .линейный порядок членов предложения (sentence word order) | |
| Subject-object-verb (SOV) | 69.5% (very stable) | 3662 ...SOV | 47% (stable) |
| Subject-verb-object (SVO) | 59.2% (very stable) | 3661 ...SVO | 63% (very stable) |
| Lacking a dominant word order | 24.5% (unstable) | 3659 ..свободный (free word order) | 6% (very unstable) |
| 85 Order of Adposition and Noun Phrase | 70.8% (very stable) | no correspondence | |
| 1. Postpositions | 80.1% (very stable) | 2888 ..послелог (postpositions) | 66% (very stable) |
| 2. Prepositions | 76.5% (very stable) | 2890 ..предлог (prepositions) | 71% (very stable) |
| 87 Order of Adjective and Noun | 50.6% (very stable) | no correspondence | |
| Modifying adjective precedes noun (AdjN) | 59.2% (very stable) | 3669 ...адъективное опред. предшествует определяемому (AdjN) | 14% (very unstable) |
| 100 Alignment of Verbal Person Marking | 34.1% (stable) | 3640 .строй (sentence structure; probably refers to alignment in VPs) | - |
| 2. Accusative alignment | 35.8% (stable) | 3644 ..номинативный (accusative) | 63% (very stable) |
| 3. Ergative alignment | 41.8% (stable) | 3653 ..эргативный (ergative) | 76% (very stable) |
| 102 Verbal Person Marking | 19.3% (very unstable) | 3718 ..актант определяет форму предиката (predicate agrees w/agent) | 27% (unstable) |
| Person marking of only the A argument | 38.9% (stable) | 3721 ....по лицу <- 3719 ...субъектное согласование (person subject agreement) | 30% (unstable) |
| Person marking of only the P argument | -24.4% (very unstable) | 3729 ....по лицу <- 3727 ...объектное согласование (person object agreement) | 39% (stable) |
| 106 Reciprocal Constructions | 21.5% (very unstable) | no correspondence | |
| 2. All reciprocal constructions are formally distinct from reflexive constructions. | 17.9% (very unstable) | 1790 ...реципрок (reciprocal, distinctness probably implied) | 53% (very stable) |
| 4. The reciprocal and reflexive constructions are formally identical. | 16.2% (very unstable) | 1789 ...рефлексив-реципрок (reflexive-reciprocal) | 24% (unstable) |
| 107 Passive Constructions | 28.3% (unstable) | 1780 ..пассив (passive) | 33% (unstable) |
| 112 Negative Morphemes | 27.1% (unstable) | 2810 ..морфологическое <- 2804 .отрицание (morphological negation) | 34% (stable) |
| Negative affix | 36.8% (stable) | 2812 ..отрицательные аффиксы (negative affixes) | 32% (unstable) |
| 117 Predicative Possession | 34.9% (stable) | no correspondence | |
| 2. Genitive Possessive | 8.1% (very unstable) | 1526 ..генитив <- 1521 .падежное оформление посессивного отношения (genitive poss. marking) | 8% (very unstable) |

Table 1. WALS and JM stabilities compared
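The four-way categorization used for binary features, and the informal comparison of categories across the two databases, can be sketched as follows. This is a hypothetical illustration: the function names are invented, and the lower bound of each published range is treated as the category threshold, which is an assumption (the published ranges leave small gaps between categories). Because the JM values in Table 1 are rounded, a value that lies close to a boundary may receive a different label here than in the table.

```python
# Lower bounds (in percent) of the binary-value scheme quoted before Table 1.
# Treating them as thresholds is an assumption made for this sketch.
BINARY_SCHEME = [
    (51.8, "very stable"),
    (32.8, "stable"),
    (19.2, "unstable"),
]

# Categories ordered from least to most stable.
CATEGORIES = ["very unstable", "unstable", "stable", "very stable"]


def categorize(stability_percent):
    """Map a stability value (in percent) to one of the four categories."""
    for lower_bound, label in BINARY_SCHEME:
        if stability_percent >= lower_bound:
            return label
    return "very unstable"


def compare(category_a, category_b):
    """Return (adjacent_or_equal, same_half) for two category labels."""
    i = CATEGORIES.index(category_a)
    j = CATEGORIES.index(category_b)
    adjacent_or_equal = abs(i - j) <= 1
    same_half = (i >= 2) == (j >= 2)  # both stable-ish or both unstable-ish
    return adjacent_or_equal, same_half


# SOV: 69.5% in WALS, 47% in JM (values from Table 1).
wals_cat, jm_cat = categorize(69.5), categorize(47)
print(wals_cat, jm_cat, compare(wals_cat, jm_cat))
```

For the SOV row this yields "very stable" for WALS and "stable" for JM, which are adjacent categories in the same half of the spectrum.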
It can be seen that, in terms of pure numbers, there is not much correlation, if any, between the two sets of values. Except for an occasional coincidence (as in the last feature, 117.2 Genitive Possessive), they are quite different. On the other hand, if one takes the four-way categorization, the similarity becomes much more apparent. In fact, for most of the features the categories are adjacent or lie in the same half of the spectrum. For the features where this is not the case, explanations can differ. The cause can be a wrong correspondence, bad or incompletely filled data on one of the sides, or a fundamental difference between the databases, perhaps owing to the Eurasian orientation of JM (which may mean that stability could in fact be areal). In any case, it is hard to tell without thoroughly analyzing each feature's distribution in both databases and without documentation for all of the JM features, which at the moment is, unfortunately, not available. Some kind of empirical testing of the JM results, like the simulation carried out in Wichmann and Holman (pending), would certainly be in order.

SECTION 4. COMPARISON OF JM RESULTS WITH STATEMENTS IN THE LITERATURE

A comparison with statements in the literature is substantially different from the previous one, since it is no longer necessary to compare precise numbers, but only more or less rough categorizations with subjective statements by typologists. The comparison from the original paper has been taken as a base, keeping only those statements for which JM has some feature that could reflect them. The requirements on correspondences are more relaxed in this case, since no equivalence between WALS and JM features is needed per se: the only condition is that the JM feature be able to represent the subject matter of the statement.
| Statement in the literature | WALS feature | WALS stability | WALS agreement | JM feature | JM stability | JM agreement |
|---|---|---|---|---|---|---|
| Place of adposition appears to be stable (Nichols 1995: 352); adpositions are stable (Croft 1996: 206-7) | 85 Order of Adposition and Noun Phrase | 70.8% (very stable) | Yes | | | |
| | 1. Postpositions | 80.1% (very stable) | | 2888 ..послелог (postpositions) | 65% (very stable) | Yes |
| | 2. Prepositions | 76.5% (very stable) | | 2890 ..предлог (prepositions) | 71% (very stable) | Yes |
| SVO is possibly stable (Nichols 2003: 286, 305; Croft 1996: 206-7) | Subject-verb-object (SVO) | 59.2% (very stable) | Yes | 3661 ...SVO | 63% (very stable) | Yes |
| SOV is possibly stable (Croft 1996: 206-7) | Subject-object-verb (SOV) | 69.5% (very stable) | Yes | 3662 ...SOV | 47% (stable) | Yes |
| Verb-initial word order is stable (Croft 1996: 206-7) | Verb-subject-object (VSO) | 44.5% (stable) | Yes | 3663 ...VSO | (4%) [very unstable] | [No] |
| Tones are stable but also areal (Nichols 1995: 343; 2003: 307) | 13 Tone | 48.3% (stable) | Yes | 484 .тон (tone) | 39% (stable) | Yes |
| Genders are stable (Nichols 1995: 343) | 30 Number of Genders | 72.9% (very stable) | Yes [not clear whether comparable] | 1153 ..род (gender) | 76% (very stable) | Yes |
| Accusative alignment is stable (Nichols 2003: 286) | 99.2 Alignment of Case Marking of Pronouns: Nominative-accusative (standard) | 37.0% (stable) | Yes and no | | | |
| | 100.2 Alignment of Verbal Person Marking: Accusative alignment | 35.8% (stable) | | 3644 ..номинативный (accusative) | 63% (very stable) | Yes, but less data than in WALS |
| | 100.3 Alignment of Verbal Person Marking: Ergative alignment | 41.8% (stable) | | 3653 ..эргативный (ergative) | 76% (very stable) | |
| Nasal vowels are unstable (Greenberg 1978: 76, 1995: 151; Croft 1996: 206-7; 2003: 235) | 10.1 Vowel nasalization: Contrastive nasal vowels present | 57% (very stable) | (No) [not clear whether comparable] | 29 ..назализованные/неназализованные (nasals) | 38% (stable) | No |
| Numeral classifiers do not have high probabilities for inheritance (Nichols 2003: 299) | 55 Numeral Classifiers | 38.7% (stable) | No (but see values 2 and 3) | 1265 ..классификаторы (classifiers) | 33% (unstable) | Yes |
| | 1 Numeral classifiers are absent | 53.6% (very stable) | | | | |
| | 2 Numeral classifiers are optional | 31.4% (unstable) | | | | |
| | 3 Numeral classifiers are obligatory | 24.3% (unstable) | | | | |
| Head-dependent marking in S is stable (Nichols 1995: 343) | 23 Locus of Marking in the Clause | 27.5% (unstable) | No | 3718 ..актант определяет форму предиката (actor determines the predicate's form) | 27% (unstable) | No |
| | | | | 3739 ..предикат определяет форму актанта (predicate determines the actor's form) | (9%) [very unstable] | |
| Singular/plural opposition (vs. neutralization thereof) in noun inflection is stable (Nichols 1995: 343) | 34.6 Occurrence of Nominal Plurality: Plural in all nouns, always obligatory | 18.6% (very unstable) | No | 3370 ...число (number in noun inflection) | 33% (unstable) | No |
| Ergativity has a low probability of inheritance (Nichols 2003: 295) | 99.4 Alignment of Case Marking of Pronouns: Ergative-absolutive | 74.9% (very stable) | No | 1393 ...эргатив/абсолютив (ergative/absolutive case present) | 59% (very stable) | No |
| A-removing inflection (or very regular derivation) on verbs (passive, etc.) is stable (Nichols 1995: 343) | 107 Passive Constructions | 28.3% (unstable) | No | 1780 ..пассив (passive) | 33% (unstable) | No |

Table 2. WALS and JM stabilities compared with statements in the literature
It can clearly be seen that the yes-no assessments are the same in almost all cases, with just one exception (numeral classifiers, which is perhaps due to a misrepresentation of this feature in JM). A result of this kind cannot be ignored. It shows that, while the structures of the databases are fundamentally different, the results one gets from applying the same metric to both are almost the same at the level of categories. Curiously, even in the cases where one database does not agree with the literature, neither does the other. There may, after all, be some similarities in the way the WALS and JM feature sets are organized, or at least filled, but it would seem that the main conclusion to be drawn from these results is that the metric applied is indeed objective, and that stability in this sense is predominantly not dependent on area. Whether the metric is right or wrong in its assessments is, however, a completely different matter. There do exist methods for verifying the performance of a metric more or less objectively, like the simulation carried out by Wichmann and Holman, while the same cannot be said of most statements made in the literature.
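The yes-no assessment discussed here can be made explicit with a small sketch. The rule used below (a claim of stability agrees with the categories "stable" and "very stable", a claim of instability with the other two) is an assumption made for illustration, and the function and variable names are hypothetical; the rows are a subset of Table 2 with abbreviated statement labels.

```python
STABLE_CATEGORIES = {"stable", "very stable"}


def assess(claim_is_stable, category):
    """Return 'Yes' if the computed category agrees with the claim, else 'No'."""
    return "Yes" if (category in STABLE_CATEGORIES) == claim_is_stable else "No"


# (statement, claim_is_stable, WALS category, JM category), taken from Table 2.
rows = [
    ("SVO is possibly stable", True, "very stable", "very stable"),
    ("Tones are stable", True, "stable", "stable"),
    ("Passive is stable", True, "unstable", "unstable"),
    ("Ergativity has low inheritance", False, "very stable", "very stable"),
    ("Numeral classifiers have low inheritance", False, "stable", "unstable"),
]


def disagreements(rows):
    """Statements for which the WALS and JM assessments differ."""
    return [
        statement
        for statement, claim, wals_category, jm_category in rows
        if assess(claim, wals_category) != assess(claim, jm_category)
    ]


print(disagreements(rows))  # ['Numeral classifiers have low inheritance']
```

On this subset, the only statement on which the two databases yield different assessments is the one about numeral classifiers.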
SECTION 5. CONCLUSION AND FUTURE RESEARCH

As a key to understanding how languages develop, stability will undoubtedly continue to be one of the main focuses of quantitative research for some years to come. Applying an identical metric to two substantially different databases and obtaining similar results in many areas suggests that such a metric is an objective assessment of some fundamental properties of linguistic features. This does not mean that the metric tells us everything about how the features of a language change over time. It is possible that what we call 'stability' is actually an amalgamation of two or more different linguistic traits. Nevertheless, it is important to confirm that the computed numbers are not mere constructs, but do reflect something in the behaviour of language.

On the other hand, one cannot ignore the apparent absence of correlation between the numerical values of the two sets of results. This can be explained in different ways: the differences in the databases' structures may be to blame, or, in some cases, the data may not yet be fully fit for such analyses. Both possibilities should be explored and assessed.

One can conclude that these first steps look quite promising. The areas for improvement are clear: the databases need to be constantly updated, their numbers of languages and features increased, and their data corrected; furthermore, new metrics and methods, building on the existing ones, should be devised for the analysis of the immense data linguists now have at their disposal.

REFERENCES

1. Dahl, Östen (2004). The Growth and Maintenance of Linguistic Complexity. Amsterdam: John Benjamins.
2. Dryer, Matthew S. (2001). List of genera. Available at: http://linguistics.buffalo.edu/people/faculty/dryer/dryer/genera .
3. Greenberg, Joseph H. (1978). Diachrony, synchrony and language universals. In Universals of Human Language, Vol. III: Word Structure, ed. by Joseph H. Greenberg, Charles A. Ferguson, and Edith A. Moravcsik, pp. 47–82. Stanford: Stanford University Press.
4. Haspelmath, Martin, Matthew Dryer, David Gil, and Bernard Comrie (eds.) (2005). The World Atlas of Language Structures. Oxford: Oxford University Press. Available online in database form at: http://www.wals.info .
5. Maslova, Elena S. (2004). Динамика типологических распределений и стабильность языковых типов. [Dynamics of typological distributions and stability of language types.] Voprosy jazykoznanija 5, pp. 3–16.
6. Nichols, Johanna (1992). Linguistic Diversity in Space and Time. Chicago: The University of Chicago Press.
7. Polyakov, Vladimir N. and Valery D. Solovyev (2006). Компьютерные модели и методы в типологии и компаративистике. [Computer models and methods in typology and comparative linguistics.] Kazan.
8. Wichmann, Søren, and Eric W. Holman (pending publication). Assessing temporal stability for linguistic typological features. Available online: http://email.eva.mpg.de/~wichmann/WichmannHolmanIniSubmit.pdf .