Journal of Archaeological Science 38 (2011) 3006–3018


Discriminant function analyses in archaeology: are classification rates too good to be true?

Kris Kovarovic a,*, Leslie C. Aiello b, Andrea Cardini c,d,e, Charles A. Lockwood f,1

a Department of Anthropology, Durham University, South Road, Durham DH1 3LE, UK
b The Wenner-Gren Foundation, 470 Park Avenue South, 8th Floor, New York City, NY 10016, USA
c Dipartimento di Biologia, Università di Modena e Reggio Emilia, via Campi 213, 41100 Modena, Italy
d Hull York Medical School, University of Hull & University of York, Heslington, York YO10 5DD, UK
e Centre for Forensic Science, The University of Western Australia, 35 Stirling Highway, Crawley WA 6009, Australia
f Department of Anthropology, University College London, 14 Taviton Street, London WC1H 0BW, UK

Article history: Received 24 June 2010; Received in revised form 19 June 2011; Accepted 23 June 2011

Keywords: Discriminant function analysis; Resampling; Over-fitting; Cross-validation; Classification accuracy

Abstract

The use of discriminant function analyses (DFA) in archaeological and related research is on the increase; however, many of the assumptions of this method receive a mixed treatment in the literature. Statisticians frequently use complex statistical models to investigate analytical parameters, but such idealised datasets may be hard to relate to "real-life" examples and the literature difficult to assess. Using two faunal datasets that are more typical of archaeological and related research, one comprised of size-corrected linear measurements of bovid humeri and another of 3D geometric morphometric (GMM) shape data of African monkey skulls, and two simulated datasets, we illustrate some of the most important but often ignored issues of DFA. We specifically show why it is paramount to address "over-fitting" by cross-validation when applying this method and how the probability of correctly classifying cases by chance can be properly and explicitly taken into account.

Crown Copyright © 2011 Published by Elsevier Ltd. All rights reserved.

doi:10.1016/j.jas.2011.06.028

* Corresponding author. Tel.: +44 (0) 191 334 1628; fax: +44 (0) 191 334 1615. E-mail address: [email protected] (K. Kovarovic).
1 Deceased.
2 Discriminant Analysis (DA) is a common synonym. The analysis is also similar to Canonical Discriminant Analysis (CDA) and Canonical Variate Analysis (CVA).

1. Introduction

Discriminant function analysis (DFA)2 is an increasingly common analytical tool employed in archaeology and allied disciplines when the determination of group membership is the aim of the analysis. It is certainly difficult to deny the utility of a method that can predict group affiliation in disciplines dealing fundamentally with unknowns in the archaeological, palaeontological or geological record. DFA can be used for studies ranging from the determination of the style or point of origin of ceramics (e.g. Carrano et al., 2009; Abbott and Watts, 2010) to the identification of site activity areas (e.g. Hjulström and Isaksson, 2009; Wilson et al., 2009). The application of DFA is also a well-established practice with regard to faunal material, where it is increasingly used to predict taxonomic affiliation, habitat preference, dietary group, body size or sex (e.g. Elton, 2001; Mendoza and Palmqvist, 2006; Kovarovic and Andrews, 2007; Palmqvist et al., 2008; Plummer et al., 2008; Germonpré et al., 2009; Sapir-Hen et al., 2009; Klein et al., 2010). The method is naturally comparative: data regarding modern fauna, for which the taxonomy and behaviour are known, provide the basis for making predictions about unknown or unclassified individuals, i.e. the fossil or archaeological specimens.

For example, ecomorphology studies, in which specific dietary and locomotor adaptations to different environments are evaluated, are particularly reliant on DFA. This technique is used by zooarchaeologists and palaeoecologists investigating the habitat preferences of representatives of mammalian families commonly found at archaeological or palaeontological sites, particularly (but certainly not limited to) bovids (Kappelman, 1991; Plummer and Bishop, 1994; Kappelman et al., 1997; Sponheimer et al., 1999; Kovarovic and Andrews, 2007). The aim is to predict the habitat preferences of a diverse and abundant aspect of the mammal community in order to reconstruct the distribution of habitats that could have supported that community in the past. Ecomorphology has experienced something of a renaissance in the literature in the past several years, with a number of new models for habitat prediction (DeGusta and Vrba, 2003, 2005a, 2005b; Kovarovic and Andrews, 2007; Weinand, 2007; Plummer et al., 2008). DFAs are also frequently utilised in geometric morphometrics (GMM), the modern extension of traditional morphometrics (Adams et al., 2004). GMM is more efficient in separating size and shape, is considered statistically more powerful and is highly effective in the visualization of results (Rohlf and Marcus, 1993). It is enjoying increasing success in archaeology and zooarchaeology, where it is used to understand differences between groups such as populations, species and sexes (e.g. Buck and Strand Viðarsdóttir, 2004; Bignon et al., 2005; Larson et al., 2007; Escudé et al., 2008; Elewa, 2010). Shape coordinates from these analyses can be used to successfully predict the group affiliation of unknowns (Strand Viðarsdóttir et al., 2002; Berge and Penin, 2004; Wallace, 2006).

As with all statistical analyses, DFA involves a series of assumptions regarding its implementation, and there are a number of factors that limit its use under certain circumstances. One problem in particular has recently been highlighted by DeGusta and Vrba (2003), who present an ecomorphological model of habitat prediction based on bovid astragali. They call attention to the fact that a DFA must provide a group assignment for each individual and that the analysis is designed to maximise differences between the groups. This results in an overall tendency to correctly assign individuals at a higher rate than that expected by chance alone, even if the dataset itself is comprised of predictor variables that in fact bear no real relationship to the groups one is trying to discriminate. A result like this can be mistakenly interpreted as representing the successful attribution of individuals to their groups on the basis of meaningful functions, when the correct assignments are merely the result of an inherent property of the analysis.

The example provided by DeGusta and Vrba (2003) illustrates that for their dataset of modern bovid astragali, distributed unequally across four habitat categories, a DFA of their eight measurements successfully predicts the correct habitat affiliation of 67% of their specimens. They subsequently randomly assigned their specimens to incorrect habitat groups in the same proportions as the correct group sizes and generated new functions that successfully predicted, on average, only 45% of the sample. The implication is that only when a model correctly predicts more than 45% of their individuals does it reflect a true correspondence between astragalus morphology and habitat. They refer to this cut-off point (i.e., the average 45% accuracy in random groups of the same size as the original ones) as the "baseline of accuracy". Their example illustrates how a "meaningless" analysis performs better than chance, since the basic rule of probability suggests that only 25% would be correctly assigned if four groups are under consideration (e.g. 1/4 = 25%). In general, when the results have not been cross-validated (a procedure we describe in more detail below), this inevitably results in a correct classification rate that is much higher than chance expectations, a problem referred to as "over-fitting". Also, DeGusta and Vrba's (2003) example was specific to one dataset and was not generalised to others; there are a number of parameters that in fact have an effect on the accuracy rate, including the number of groups, the number of predictor variables and the sample sizes in each group.

Structured resampling experiments can be useful for investigating these analytical parameters in DFA (e.g. White and Ruttenberg, 2007). These are especially helpful for social and historical scientists like archaeologists, who may not be aware of the purely theoretical statistical literature. In this respect, resampling provides more tangible examples of, for instance, potential errors in accuracy rates, and can be applied to either simulated or real datasets. Although simulated datasets allow accurate modelling of parameters, real datasets (as in DeGusta and Vrba, 2003) have the advantage of exemplifying the analysis on data which better reflect the reality of the smaller, imperfect assemblage-based sets of observations with which we tend to work in archaeology.

Researchers vary tremendously not only in the questions they address with the DFA method and the importance that this analysis has in the context of an overall study, but, more significantly, in the ways in which the assumptions are expressed, explored and accounted for. We conducted a cursory survey of the variety of research projects making use of DFA in the recent archaeological literature. Using the Journal of Archaeological Science's ScienceDirect website with "discriminant function analysis" as the search parameter in "all fields", we looked for articles published or made available online between January 2009 and March 2010. The search returned 15 results for 2009 and another 15 for the first three months of 2010; of these 30 articles, 27 used DFA (the remaining three mentioned the analysis in the context of previous work or possible approaches). A summary of the titles in question can be obtained directly from us. This survey was clearly not a systematic, in-depth investigation of all relevant literature, but it does illustrate the breadth of research questions that can be approached with this method. Furthermore, a brief reading of each piece illustrated the range of ways in which the statistical considerations are addressed; in fact, only nine papers reported cross-validated results, which we will argue are the more important results in a predictive DFA. Additionally, further surveys conducted in the same fashion for each year in the past decade illustrate an equally important fact: DFA is becoming increasingly important (Fig. 1). Given its relative simplicity compared to more sophisticated and younger methods such as neural networks (Bell and Croson, 1997; Delicado, 2000; Bescoby et al., 2004), it is likely that this trend will continue.

Building on the limited example provided by DeGusta and Vrba (2003) and the general need to explore the assumptions of this increasingly used technique, our aims here are: 1) to illustrate the problem of over-fitting (i.e., when the model fits a given set of data well but performs poorly outside it) using two datasets that typify archaeological and related research, as well as two simulated datasets; and 2) to consider a variation of this problem by controlling for the parameter that can most easily vary (especially when using geometric morphometric methods, where sets of dozens or even hundreds of variables are not uncommon): the number of predictors of group membership relative to sample size. We do not investigate overall differences in sample size, but it has been shown previously that this parameter exacerbates the problems we are exploring in our study (e.g. Titus et al., 1984; White and Ruttenberg, 2007). Results from analyses and simulations will be used to illustrate major problems and suggest straightforward procedures for their mitigation. Other known issues of DFA will be concisely discussed to provide readers with a more complete picture of the potential caveats of this method.



Fig. 1. The number of articles in Journal of Archaeological Science published or made available online between January 2001 and March 2010 which mention DFA.



2. Discriminant analysis: principles and problems

Advice on how to implement the analysis in many programs can be found in the statistical literature and many textbooks (a few suggestions include Neff and Marcus, 1980; Albrecht, 1992; McGarigal et al., 2000; Brace et al., 2006; Tabachnick and Fidell, 2007; Strauss, 2010). For the uninitiated, the main principles and assumptions of the technique are summarized here in qualitative terms, with special emphasis on the aspects that are the main focus of this study.

2.1. Classification functions and probabilities

When DFA is used to assign individuals to groups (predictive discriminant analysis; Huberty and Hussein, 2003; Everitt and Howell, 2005), it utilises predictor variables to determine the linear dimensions along which known groups are best separated. This new set of variables, known as discriminant functions, is derived by building linear combinations of the original variables that maximize the ratio of between-group to within-group variance. The number of functions is equal to the number of groups minus one and, unless there are fewer predictor variables than groups, they account only for the variance which best discriminates groups according to the model. An interesting property of the method is that, as the multivariate space is scaled by the inverse of the pooled within-group variation, distances (usually Mahalanobis distances) in this transformed space are independent of the scale of measurement, and differences are expressed in units of standard deviations.

The cases in each group cluster around that group's centroid, its mean discriminant score on each function. Classification is based on each case's proximity to the groups' centroids in the new multivariate statistical space, and probabilities are calculated to express the likelihood that the case belongs to each of the groups. The smallest distance and highest probability determine the case's group assignment. Probabilities derived from Mahalanobis distances are called posterior probabilities; they can be weighted according to previous knowledge about the samples and the populations from which they come (a priori probability; see below), and they are generally standardized so that across all groups they sum to 1. If they are not standardized, however, they are called typicality probabilities and provide an estimate of whether an observation is a multivariate outlier. To put it simply, a low typicality probability means that the observation is so far from the closest group centroid that it is unlikely to belong to that or any other group in the analysis.
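To make these mechanics concrete, the following minimal sketch (ours, not taken from the original paper; all names are illustrative) classifies a case by its Mahalanobis distances to the group centroids, using the pooled within-group covariance matrix, and derives both standardized posterior probabilities and chi-square-based typicality probabilities. It assumes NumPy and SciPy are available.

import numpy as np
from scipy.stats import chi2

def pooled_within_cov(X, labels):
    # Pooled within-group covariance: the scaling that makes DFA distances
    # scale-free and expressed in standard deviation units.
    groups = np.unique(labels)
    p = X.shape[1]
    S = np.zeros((p, p))
    for g in groups:
        Xg = X[labels == g]
        S += (len(Xg) - 1) * np.cov(Xg, rowvar=False)
    return S / (len(X) - len(groups))

def classify_case(x, X, labels, priors=None):
    # Squared Mahalanobis distance from case x to each group centroid.
    groups = np.unique(labels)
    if priors is None:
        priors = np.full(len(groups), 1.0 / len(groups))  # equal a priori probabilities
    Sinv = np.linalg.inv(pooled_within_cov(X, labels))
    centroids = np.array([X[labels == g].mean(axis=0) for g in groups])
    d2 = np.array([(x - c) @ Sinv @ (x - c) for c in centroids])
    post = priors * np.exp(-0.5 * d2)  # posterior, up to a constant shared by all groups
    post = post / post.sum()           # standardized: sums to 1 across groups
    typ = chi2.sf(d2, df=X.shape[1])   # typicality: small values flag multivariate outliers
    return groups[int(np.argmax(post))], post, typ

A case is always assigned to the nearest group by the posterior rule; but if even its largest typicality probability is tiny, it is probably an outlier that belongs to none of the sampled groups.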
2.2. Predictor variables

As a rule of thumb, the number of predictor variables should not exceed the number of individuals in the smallest group considered in a given dataset (Hair et al., 1998; Tabachnick and Fidell, 2007), and a ratio of 5–20 individuals per variable is sometimes suggested (Hair et al., 1998; Strauss, 2010). From a mathematical point of view, the minimum requirement is given by the simple formula (N − G) ≥ V, where N is the total number of individuals in the entire sample, G is the number of groups and V is the number of variables. It is easy to imagine situations in which one must reduce the number of predictor variables entered into the analyses, either because samples are small, as is not uncommon in archaeology or taxonomy (Marcus, 1990), or because of a large number of variables, as is often the case in geometric morphometrics (Adams et al., 2004).

Although this is sometimes not advisable, as it inevitably implies a loss of information and potentially increases the dependency of the model on a specific set of data (i.e., reduces its generalizability), two main strategies are used to achieve this goal. One may opt to use a stepwise DFA procedure. The basic direct method enters all predictors into the equations simultaneously; the stepwise method enters (or removes) them one by one and, using a specified statistical criterion, evaluates each variable's contribution to the overall discrimination. At each step the resulting model includes only variables that have contributed a greater discrimination than the model without that variable in the previous step. In this way the best set of predictors should be identified, although this method is sensitive to small variations in the predictors and may overstate their importance to the discriminatory model. Indeed, several authors have criticised the generalizability of functions from a set of predictors selected this way, raising doubts about the appropriateness of stepwise DFA (e.g., Huberty, 1984; Huberty and Hussein, 2003).

Alternatively, predictors can be reduced empirically by first conducting a data reduction analysis, such as a principal components analysis, and then using a number of the first principal components large enough to minimize the loss of information to build the discriminant functions. Even if there is no guarantee that the variance which maximally discriminates groups resides in the first principal components, Cardini and Elton (2008a) found that results from DFA on the first principal components were fully congruent with a classification based on full shape distances.

2.3. More assumptions and considerations

As anticipated, DFAs are not "universal" in their application, because the functions are specific to the dataset and it can be difficult to apply the results or extend the interpretations to other samples. DFA performs better when the predictor variables are not highly correlated (i.e. collinear), assumes both independent observations and multivariate normally distributed data (i.e., normality of each of the variables and all their possible linear combinations), and is sensitive to the presence of outliers. DFA further assumes equality of within-group variance-covariance matrices (homoscedasticity), commonly evaluated using Box's M test for homogeneity or an equivalent chi-square test (Tabachnick and Fidell, 2007; Rohlf, 2009). This assumption is, however, frequently violated in archaeological and taxonomic applications (Marcus, 1990), or untestable because at least one of the groups may have a sample size equal to or smaller than the number of variables; when this happens, test statistics like the chi-square test for homogeneity of variance-covariance matrices are undefined (Rohlf, 2009). Some statisticians argue that the analysis is robust to this violation, especially when sample sizes are large (e.g. Tabachnick and Fidell, 2007); however, many practitioners disagree (e.g. Williams, 1983). Some advocate the use of a quadratic discriminant analysis, in which the variance-covariance matrix is not pooled across groups, under these circumstances (e.g. Plummer et al., 2008). Unfortunately, quadratic functions require larger samples than linear discriminant functions, making them somewhat difficult to apply in an archaeological context, particularly where taxonomy is considered (Naylor and Marcus, 1994).

In addition to these issues, one may also want to consider whether the probability of group membership should be calculated based on an equal chance of each case belonging to each group, or whether that chance is, for instance, proportional to the group sizes themselves. We do not address group size in our resampling experiments below, but here and above we acknowledge that group sizes are frequently unequal in archaeological and palaeo-datasets.
Variations in this parameter result in an additional layer of complexity, which generally makes potential problems even more serious, as briefly mentioned in the Discussion.

2.4. Assessing the success of predictive DFAs

The assumptions and considerations outlined above all play into one major question: how to assess the success of a predictive DFA?


The most straightforward way to assess the model is to consider the classification rate of the total sample, usually reported as a percentage. This is a simple and easily understood statistic that may appear useful when comparing the success rates of different analyses. However, the success of different analyses cannot be directly measured and compared in this manner, because differences in sample composition, the number of groups and the variables all potentially have consequences for the results. For instance, 50% accuracy could be good for 10 groups of equal sample size, but is clearly no better than chance for just two groups. Thus, classification rates should be corrected for chance expectations and the amount of over-fit assessed.

In order to determine how tightly the model is bound to the comparative dataset, it is often suggested that a "test sample" is held out, sometimes 10–20% of the total, although it could be 50% or more (Hair et al., 1998). This test sample is then entered into the analysis as a group of unknowns to see how well the resulting functions can classify these cases into their known groups. The test sample can be a random selection of cases, an equal number of each type of case, or it may be more directly selective, such as the removal of entire species or types, etc., to determine if they are driving the analysis.

An alternative statistical cross-validation procedure is the leave-one-out or jackknife approach. In this type of analysis one case is left out and the discriminant functions are re-calculated, effectively stripping the analysis of that one case's potential contribution to the group discrimination and thus avoiding issues of circular reasoning (using functions derived from data which are the same data classified by the functions) and over-fitting. The case that has been withheld is then assigned to one of the groups on the basis of these new functions. This is repeated for every case in the dataset, and the percentage of the total cases correctly assigned to their known group is reported as the cross-validated success rate. This will almost always be a more conservative approach, with lower but more realistic success rates. Indeed, a cross-validated success rate that is much lower than the original result suggests that the discriminant functions from those samples are "too good to be true", and unlikely to be valid for accurate predictions of the group affiliation of unknowns.

When group sample sizes are the same, the proportion of specimens correctly classified by chance is simply 1/G, where G is the number of groups. However, when the number of individuals varies across groups, and one assumes proportional a priori probability, the proportional chance criterion may be used (Sanchez, 1974). This statistic provides a way to compare the success rates of different analyses regardless of differences in group composition. A few indices have also been proposed which directly correct accuracy rates for the effect of chance (McGarigal et al., 2000). For instance, the TAU statistic described in the next section takes chance and prior probabilities into account and is more directly comparable across different analyses. An alternative, and potentially more accurate, way involves complex randomization procedures (such as that in White and Ruttenberg, 2007), but since these represent a computationally intensive approach that is not available in common statistical software, they are unlikely to feature regularly in the literature.
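As a concrete illustration of the leave-one-out procedure and the proportional chance criterion, here is a short sketch under the assumption that scikit-learn is available (the original analyses used PASW/SPSS, not this code):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def assess_dfa(X, y):
    lda = LinearDiscriminantAnalysis()
    resub = lda.fit(X, y).score(X, y)  # resubstitution rate: optimistic, over-fitted
    # Jackknife (leave-one-out) hit rate: each case is classified by functions
    # built without it, avoiding the circularity described above.
    loo = cross_val_score(lda, X, y, cv=LeaveOneOut()).mean()
    # Proportional chance criterion: expected accuracy when assigning cases at
    # random in proportion to observed group sizes (sum of squared proportions).
    _, counts = np.unique(y, return_counts=True)
    c_pro = np.sum((counts / counts.sum()) ** 2)
    return resub, loo, c_pro

A leave-one-out rate far below the resubstitution rate is the "too good to be true" signature discussed in this paper.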
3. Methods

3.1. Real datasets (bovids and monkeys)

Two real datasets of modern bovid and monkey specimens were analysed. Both datasets have been and continue to be used in ongoing research. These data do not meet the assumption of homoscedasticity, and a few variables did not follow a normal distribution. In the bovid dataset, there was collinearity among some of the predictors. Thus, these datasets present a more realistic scenario in terms of how assumptions of DFA may not be met in real-life data but still provide interesting clues about group prediction. Collinearity could have been reduced by first performing a principal component analysis and then taking only a number of the first axes (as in the monkey dataset). This was not done, however, so that one of the two datasets would exemplify DFAs on the real ('size-corrected') variables instead of their linear combinations. For simplicity, we did not address the issue of missing data, which we acknowledge is common when working with fragmentary archaeological material.

The first dataset is an unpublished bovid ecomorphological dataset comprised of 20 linear measurements of the humerus. The sample originally includes 300 adult specimens from 73 species found globally, with anywhere from one to eight individuals in each species. They were assigned to a six-habitat classification system representing increasing amounts of vegetation cover: grassland/tree-less (G/T), wooded-bushed grassland (WBG), light woodland-bushland (LWB), heavy woodland-bushland (HWB), forest (F) and, providing an element of vertical terrain, a montane category (M). There were 37–60 individuals in each habitat group. The data were log-transformed and size-corrected according to the species geometric mean.

The second dataset is derived from previous work investigating patterns of morphological evolution and the phylogenetic signal in Old World monkeys (Cardini and Elton, 2008a, 2008b, 2008c). It is made up of 3D anatomical landmark coordinates of originally more than 1000 guenon (Primates, Cercopithecini) skulls. Data collected using a 3D digitizer include 86 landmarks on both the cranium and the mandible (the landmark configuration and specimen provenance are described in the series of publications mentioned above). Data were analysed using Procrustes-based geometric morphometrics (Adams et al., 2004; Sanfilippo et al., 2009).

To control for factors other than the number of variables relative to sample size, and to keep the design of the analysis simple and easy to interpret, we created a fully balanced experimental design. In order to do this, both datasets were reduced to 100 individuals in five groups from the original total bovid and monkey samples of more than 300 and 1000 specimens, respectively. Thus, 20 individuals were randomly selected to represent variation within each group. Groups were five of the original six habitat classes for the bovids ("forest" was randomly chosen for exclusion) and five species of guenons (Cercopithecus ascanius, Cercopithecus cephus, Cercopithecus mitis, Cercopithecus nictitans and Cercopithecus petaurista, all of them representing different branches of the arboreal radiation of this group of monkeys). For the geometric morphometric dataset, shape coordinates were subjected to a principal component analysis and the first 20 PCs were selected to summarize total shape variation (Cardini et al., 2010). These were finally used to classify skulls according to species.
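For readers wishing to reproduce this kind of preprocessing, the log-transformation and geometric-mean size correction can be sketched as follows. This is our reconstruction of the generic Mosimann-style procedure, not code from the original study; the published dataset was corrected by species geometric means, and the simpler per-specimen variant is shown here only for brevity.

import numpy as np

def size_correct(measurements):
    # Log-transform, then subtract each row's mean log value; this is equivalent
    # to dividing each measurement by the specimen's geometric mean before logging.
    logged = np.log(measurements)  # shape: (n_specimens, n_variables)
    return logged - logged.mean(axis=1, keepdims=True)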
3.2. Simulated datasets

Analyses were also performed on two simulated datasets consisting of 20 variables and 100 specimens subdivided into five groups. Variables were built by creating randomly distributed numbers from a normal distribution with a mean of zero and a unit standard deviation; the variables were thus normally distributed. They also satisfied the assumption of homogeneity of variance-covariance matrices. In the first dataset, different random numbers, again extracted from a normal distribution with zero mean and unit standard deviation, were added to each group and each variable to create small differences in means. For instance, for the first random variable, 0.2 was added to the 20 individuals of group 1, 0.4 to the



20 individuals of group 2, and so on for all groups and all variables. Differences among means were almost always smaller than within-group variation, being on average about one-tenth of within-group standard deviations (differences: mean, 0.120; SD, 0.308; range, 0.002–1.515; 95th percentile, 0.952). Thus, this dataset simulates a case where differences are small but consistent, and all assumptions of DFA are met. The second dataset was created like the previous one but without adding any quantity to the random variables. Thus, this dataset simulates a case where all assumptions of DFA are met and no differences at all, other than those due to sampling error, are present across groups. As all datasets (real and simulated) are the same size and the design is perfectly balanced, the expected number of cases correctly classified by pure chance is simply the total sample size divided by the number of groups; expressed as a percentage of the total sample, this is equal to 20%.

3.3. Analyses

On each dataset the same set of analyses was carried out:

1) First, the dataset was analysed using the original ('observed') groups, both without and with jackknife cross-validation. This analysis was repeated by first including all 20 predictor variables and then removing two of them at a time. For the geometric morphometric dataset, the higher-order principal components (19th and 20th) were removed first, and then the others were removed sequentially until only the first and second were left. The sequential exclusion of higher-order PCs in DFAs using shape data is a standard procedure to test the sensitivity of the analysis to the number of variables while maximizing information (Sheets et al., 2006). Thus, the first series of analyses represents the results of DFA on the observed data using a different number of variables. When all 20 predictors are included, more potentially useful information is considered to discriminate groups, but samples become small relative to the number of variables, whereas when fewer predictors are used one is in the opposite situation. However, as all analyses are balanced and meet the desirable minimum of N(within group) ≥ V, we are in a far better situation than a worst-case, but not uncommon, scenario with very small and heterogeneous sample sizes which only satisfy the minimum mathematical requirement of (N − G) ≥ V. This implies that the issues we will show using these examples can only be more serious in those instances, as convincingly demonstrated by Titus et al. (1984) and White and Ruttenberg (2007).

2) In the second stage, group affiliation was randomly reassigned within each dataset. This eliminates group differences, if there were any. Then, as in 1), the same series of DFAs was done, either with or without a jackknife cross-validation, on datasets with all predictor variables or their subsets (i.e., all except two, four, six, eight, etc.). The randomization procedure was repeated 100 times for each analysis (a sketch of this procedure is given after this list). If there were no issue of over-fitting, with five groups of 20 individuals these analyses should never produce classification rates higher than the 20% random chance baseline.
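A sketch of how datasets of this kind and the label-randomization baseline can be generated follows (ours; the original used NTSYSpc and PASW, and the offset scheme below is only a rough analogue of the one described above):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
n_groups, n_per_group, n_vars = 5, 20, 20
labels = np.repeat(np.arange(n_groups), n_per_group)

# Dataset 1: standard normal noise plus small random per-group, per-variable offsets.
offsets = 0.2 * rng.standard_normal((n_groups, n_vars))
with_diffs = rng.standard_normal((n_groups * n_per_group, n_vars)) + offsets[labels]

# Dataset 2: identical construction but with no offsets (no real group differences).
no_diffs = rng.standard_normal((n_groups * n_per_group, n_vars))

def loo_hit_rate(X, y):
    return cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut()).mean()

# Empirical random chance baseline: repeat the cross-validated DFA on 100
# random permutations of the group labels, as in step 2) above.
null_rates = [loo_hit_rate(with_diffs, rng.permutation(labels)) for _ in range(100)]
print(np.mean(null_rates), np.percentile(null_rates, 95))  # expect roughly 0.20 on average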

For each dataset, results were summarized by plotting the average classification rates against the number of variables used in the analysis for the 1) observed and 2) randomized datasets. For the series of DFAs with randomized group affiliation, the average and the 95th percentile confidence interval of the 100 repeats are shown. Profile plots were also drawn for 'chance-corrected' rates using TAU (McGarigal et al., 2000):

TAU = [Nc − Σ(i=1..G) Pi × Ni] / [N − Σ(i=1..G) Pi × Ni]

where: N = total sample size; Nc = total number of cases correctly assigned by the DFA; Pi = prior probability of group membership in the i-th of the G groups; Ni = number of cases in the i-th group. Thus, in our example with perfectly balanced samples, N is 100, the prior probabilities Pi are the same for all five groups (i.e., 1/G = 1/5) and Ni is also constant (i.e., 20). Then, if, for instance, 80 individuals (Nc) are correctly classified overall:

TAU = [80 − 5 × (1/5 × 20)] / [100 − 5 × (1/5 × 20)] = 60/80 = 0.75

or, as a percentage, 75% 'chance-corrected' classification accuracy. This means that when the effect of predicting the group correctly by mere chance is taken into account, the total classification accuracy is 75% instead of 80%. It can easily be seen that when Nc = 20, i.e., only 20 individuals are correctly classified on average by the DFA, TAU is zero, which is our expectation when there is no improvement at all above the 'random chance baseline'.
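The TAU computation is simple enough to script directly; a minimal sketch (ours) that reproduces the worked example above:

import numpy as np

def tau(n_correct, group_sizes, priors=None):
    # Chance-corrected classification accuracy (TAU; McGarigal et al., 2000).
    sizes = np.asarray(group_sizes, dtype=float)
    n = sizes.sum()
    if priors is None:
        priors = np.full(len(sizes), 1.0 / len(sizes))  # equal a priori probabilities
    expected = np.sum(np.asarray(priors) * sizes)       # cases correct by chance alone
    return (n_correct - expected) / (n - expected)

print(tau(80, [20] * 5))  # 0.75, as in the worked example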

All analyses were done in PASW Statistics 18 (2009). Datasets with randomized group affiliation were built in NTSYSpc 2.1c (Rohlf, 2009) using a resampling routine with 100 permutations. Profile plots were drawn and TAUs computed using OpenOffice and Gnumeric. Scatterplots of DFA scores and 95% confidence ellipses were obtained in PAST 2.00 (Hammer et al., 2001).

4. Results

Scatterplots (also called ordinations; Sneath and Sokal, 1973) of discriminant scores on the first two discriminant axes are shown in Fig. 2 for the first set of analyses (step 1 of the methods). Although these plots only illustrate the two main axes (about 69–87% of between-group variance, depending on the set of data) of a multidimensional space, they help to show evident violations of assumptions. For instance, in the bovid dataset (Fig. 2a) heavy woodland-bushland (crosses) shows a pattern of variation which is clearly different compared to montane (empty circles) in magnitude and to light woodland-bushland (filled squares) in direction. Also, elongated ellipses suggest non-circular variation around means, whereas roughly circular variation would be expected if the assumptions of DFA were satisfied. Thus, for this dataset the scatterplots indicate non-homogeneity of variance-covariance matrices, congruent with the results of Box's M tests (results not shown). The other datasets (Fig. 2b–d) are less strongly affected by heteroscedasticity, having roughly circular ellipses of similar size. This is congruent with non-significant Box's M tests for the simulated datasets (Fig. 2c–d) but not with the significant Box's M of the monkey dataset. A closer look reveals that both C. petaurista and C. cephus do in fact have smaller and slightly elongated ellipses. Other deviations from circularity might be evident on the third and fourth DFA axes.

4.1. Bovids

(1) The observed classification accuracy (hit rate) is slightly less than 80% when all 20 predictors are included (Fig. 3a, solid black line). With fewer predictors, the hit rate initially remains about constant and then decreases to about 65% when only 10–12 of the original variables remain.



Fig. 2. Scatterplots of discriminant scores for the observed data on the first two discriminant axes (DF1, DF2), with 95th percentile confidence ellipses based on these two DFs for each group. (a) Bovids: plus, grassland/tree-less (G/T); empty square, wooded-bushed grassland (WBG); filled square, light woodland-bushland (LWB); cross, heavy woodland-bushland (HWB); empty circle, montane (M). (b) Monkeys: plus, Cercopithecus ascanius; empty square, C. cephus; filled square, C. mitis; cross, C. nictitans; empty circle, C. petaurista. (c–d) Simulated data (c) with and (d) without group differences: plus, group 1; empty square, group 2; filled square, group 3; cross, group 4; empty circle, group 5.

With even fewer predictors, hit rates become smaller and smaller, reaching approximately 40% classification accuracy when only two variables are left. This trend is mirrored by the cross-validated analysis (Fig. 3b, solid black line), but everything is shifted towards lower hit rates, ranging from little more than 60% to less than 40% when the smallest number of predictors is used. (2) When the analyses are repeated after randomizing group affiliation (Fig. 3a–b, grey lines), hit rates are virtually always above the 20% accuracy expected by chance, but drop to the expected average of about 20% after cross-validation.

TAU conveys the same information in a more effective way (Fig. 3c–d). Even after correcting for chance, discriminant analyses are better than the expected zero percent accuracy not only for the real data (Fig. 3c, black solid lines) but also for those where differences were eliminated by randomizing affiliation (Fig. 3c, grey lines). However, cross-validated TAUs are better than 0% only for the real data (Fig. 3d, black solid line); randomized data have an average of about zero. Cross-validated real data always perform better than average expected chance (20% random chance baseline or 0% TAU), and also better than the majority of analyses from datasets with randomized group affiliation (dotted grey line corresponding to the upper limit for 95% percentiles). This suggests that they are indeed significantly better than chance.

4.2. Monkeys

(1) The outcome of the analysis of the monkey dataset (Fig. 4) mirrors that of the bovids. The main difference is that observed hit rates are remarkably large (above 80% before cross-validation, or only slightly below it after cross-validation) and only drop below 60–80% when the number of predictors is about the same as or less than the number of groups (i.e., between six and two). (2) For randomized data, accuracy is better than chance when no cross-validation is done, but drops to an average of less than 20% (0% TAU) after cross-validation. As for the bovids, cross-validated accuracy is always significantly better than chance (upper grey lines, Fig. 4b and d).

4.3. Simulated dataset with differences

(1) Despite the small (compared to within-group variance) differences, accuracy is close to 100% when a large number of variables is used in a simulated dataset which satisfies DFA assumptions (Fig. 5, solid black lines). Reducing the number of predictors reduces accuracy, with a sharp drop when fewer than four variables are analysed. (2) Randomizing group affiliation produces hit rates/TAUs which are better than chance if no cross-validation is performed. After cross-validation, the average accuracy is slightly below random chance expectations (20% hit rate, 0% TAU). As in previous examples, real data are significantly better than chance (above the upper confidence percentile).

4.4. Simulated dataset without differences

(1–2) In the last analyses, where assumptions are met as in the previous example but no differences are present (Fig. 6), the non-cross-validated analyses are better than random chance (Fig. 6a and c) and the cross-validated ones are on average no better (Fig. 6b and d).



Observed accuracy (solid line) is within the range of accuracy from randomized data, which is expected given that there were no differences in the first place (i.e., even before randomizing group affiliation).

5. Discussion

Results of all analyses consistently indicate that over-fitting occurs in DFA. Not only was classification accuracy always higher in non-cross-validated DFAs than in cross-validated ones, but the accuracy of predictions for randomized groups was also virtually always better than the 20% chance baseline (0% for TAU). In contrast, after cross-validation, only datasets where differences were present had classification accuracies well above chance. The empirical random chance baseline was 20% or slightly less on average. When it was less, as in the monkey and the simulated dataset with differences, this was likely due to the effect of removing a specimen (the 'left-out' of the cross-validation) from fairly small samples, which may have increased the chance of having it classified in one of the other groups. The upper limit for the empirical random chance baseline was always below the observed cross-validated classification accuracy, with the exception of the simulated dataset where no differences were present; this had an accuracy which fell clearly within the range of 'random chance accuracy'. In this respect, it does seem that empirical random chance baselines can help not only to say whether the percentage of correctly classified individuals is larger than expected by chance (about 20%) but also whether it is significantly larger (i.e., an outlier compared to the upper limit of the distribution of hit rates from randomized groups).

As expected, when the number of predictor variables decreased, accuracy also decreased. Indeed, all else being equal, one must be aware that increasing the number of predictors may increase classification accuracy and also group separation in scatterplots of non-cross-validated DFAs, even if those predictors are random numbers which add no relevant information on group differences. If, in the fourth example, where no group differences are present as variables are random numbers from the same normal distribution, we had added another 30 variables similarly made of random numbers, a clear group separation would have been evident in the scatterplots, and the classification accuracy in a non-cross-validated DFA (i.e., the one generating the scores used in the scatterplot) would have been 89%. Both the apparent separation and the high accuracy, which drops to about 20% after cross-validation, are the result of over-fitting.
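This over-fitting effect is easy to reproduce; a sketch (ours, assuming scikit-learn; seed and dimensions are arbitrary, so exact values will vary around those reported):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))  # 20 original plus 30 added pure-noise variables
y = np.repeat(np.arange(5), 20)     # five arbitrary "groups" of 20

lda = LinearDiscriminantAnalysis()
print(lda.fit(X, y).score(X, y))                            # resubstitution: far above 0.20
print(cross_val_score(lda, X, y, cv=LeaveOneOut()).mean())  # cross-validated: about 0.20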

Fig. 3. Classification accuracy (hit rates) of the bovid humerus DFAs plotted against the number of predictor variables, including (a) non-cross-validated analyses, (b) cross-validated analyses, (c) the TAU statistic for non-cross-validated analyses and (d) the TAU statistic for cross-validated analyses. Solid black line: observed hit rates; grey line: mean of the randomised group affiliation DFAs; dashed grey line: 95th percentiles for the analyses of DFAs with randomised group affiliation.


In contrast, when meaningful predictor variables are added, predicting their effect on classification accuracy in cross-validated DFAs is more complicated. They may or may not improve accuracy. One should consider the possibility that the cost of adding predictors may not be justified in terms of the parsimony of the model, if the improvement in classification is tiny. There may also be a peaking effect whereby, paradoxically, adding more variables beyond a given threshold can make the classification worse (Jain and Waller, 1978). Where the threshold lies depends on how informative the added predictors are relative to those already in the analysis and on the group sample sizes. Thus, the predictor variables should be selected with the greatest care.

For the purpose of variable selection, stepwise DFAs should be interpreted with caution, or even avoided, if one is mainly concerned about generalizability. For traditional morphometrics, highly collinear variables should be avoided (another assumption often violated, as in our example dataset on bovids). In geometric morphometrics, one often performs a principal component analysis of the Procrustes shape coordinates, thus creating variables which are uncorrelated. Then, if samples are large enough, one can use all shape information (i.e., all principal components). If not (as often happens with landmark data), one could reduce dimensionality by including only a given number of the first principal components, as we did in the monkey example.

3013

However, this is always a risky operation, as it is hard to know what was lost by excluding those higher-order components. Indeed, a principal component analysis does not "know" anything about a priori groups, and its axes are simply built to maximize total sample variance. Ideally, one should try to avoid dimensionality reduction and instead try to increase sample size, which would also help to make the model more accurate and generalizable (Jain and Waller, 1978). When this is not possible, one may need to explore alternative methods with less strict requirements (e.g., Cardini and Elton, 2008a). At the very least, a sensitivity analysis should be performed to test how strongly the outcome of the analysis is affected by changes in the number of predictor variables. When this was done (results not shown) in our monkey dataset, we found that cross-validated hit rates were highest and fairly stable when approximately 6–25 of the first principal components were included in the DFAs. This observation, together with the selection procedure, suggests that the summary of shape variation included in the first 20 PCs was adequate to pick up the main group differences.
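A sensitivity analysis of this kind takes only a few lines; here is a sketch assuming scikit-learn is available (the 6–25 PC window reported for the monkey data came from the authors' own analyses, not from this code):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def pc_sensitivity(X, y, max_pcs=25):
    # Cross-validated hit rate as a function of the number of leading PCs retained;
    # max_pcs is assumed not to exceed the number of available components.
    scores = PCA().fit_transform(X)
    rates = {}
    for k in range(2, max_pcs + 1):
        lda = LinearDiscriminantAnalysis()
        rates[k] = cross_val_score(lda, scores[:, :k], y, cv=LeaveOneOut()).mean()
    return rates  # look for a plateau of stable, high hit rates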

Fig. 4. Classification accuracy (hit rates) of the monkey skull DFAs plotted against the number of predictor variables, including (a) non-cross-validated analyses, (b) cross-validated analyses, (c) the TAU statistic for non-cross-validated analyses and (d) the TAU statistic for cross-validated analyses. Solid black line: observed hit rates; grey line: mean of the randomised group affiliation DFAs; dashed grey line: 95th percentiles for the analyses of DFAs with randomised group affiliation.



DFA can be used not only with a predictive purpose but also with a descriptive one (Huberty and Hussein, 2003; Everitt and Howell, 2005). Although these two aspects are related, in the case of a descriptive DFA the question is not whether predictors accurately classify specimens into groups, but whether and how groups differ in terms of the set of variables being analysed. Thus, specimens are plotted onto the discriminant axes using different symbols for groups, after testing that group differences are statistically significant in a one-way multivariate analysis of variance (Everitt and Howell, 2005). This multivariate test is customary and often part of the output of a DFA in many statistical applications (e.g., SPSS and PAST).

The use of DFA in a descriptive context raises another issue related to over-fitting: the space of the DFA axes is a statistical space. That means that distances are 'distorted' compared to the space of the original variables in order to maximize between-group relative to within-group differences. Therefore, it may happen that three groups whose means are equidistant in the space of the original variables could look as if two of them are much closer than the third in the space of a DFA (Klingenberg and Monteiro, 2005, see Fig. 1). For the same reason, one expects circular scatters around means if the assumptions of DFA are not violated; this, however, will only be evident as long as one is using the same scale on all axes when plotting scores of a DFA. Finally, as plots summarise group differences in the space of the discriminant functions based on the whole sample, those plots have the same problem as non-cross-validated classification rates: they are internally biased, as the ordination results are produced using functions derived from the same data to which they are applied (Everitt and Howell, 2005). Thus, scatterplots should be used with caution or even replaced by principal components of the original variables, which preserve the geometry of the observed space. If there is good group separation on principal components, especially if it is on some of those explaining more variance, those groups might truly be different. If one cannot see clear differences on principal components, the interpretation has to be done with the greatest care, as these variables only maximize total sample variance and could therefore miss differences which do not happen to align with any of the axes of greatest variance.

To control for factors other than the number of predictors we chose to analyse datasets where all groups have the same sample size, but this is seldom the situation in archaeological research. Heterogeneous sample size, as it is easy to guess, makes problems even more serious. It is known that unequal group sizes may sometimes lead to very high classification accuracy, even though the improvement over random correct classification may be very small (Titus et al., 1984). Larger groups often capture a greater amount of variation between individuals in the group (especially when multiple species are considered in faunal analyses), potentially attracting more specimens and inflating classification accuracy (Jain and Waller, 1978; Daniels and Darcy, 1983; Titus et al., 1984; Hair et al., 1998; White and Ruttenberg, 2007).

Fig. 5. Classification accuracy (hit rates) of the simulated data (with group differences) DFAs plotted against the number of predictor variables, including (a) non-cross-validated analyses, (b) cross-validated analyses, (c) the TAU statistic for non-cross-validated analyses and (d) the TAU statistic for cross-validated analyses. Solid black line: observed hit rates; grey line: mean of the randomised group affiliation DFAs; dashed grey line: 95th percentiles for the analyses of DFAs with randomised group affiliation.


White and Ruttenberg (2007, p. 303) provide a concise explanation using their resampling experiments for two groups: "when n = 200 with a 4:1 ratio of sample sizes, reclassification success must be about 80% to exceed null expectations. This deviation from 1/G appears to result from increased chance reclassification success for the larger group. With both sets of priors [uniform or empirical, i.e., proportional to within-group sample size], increasing sample size reduces the variance in the null expectation but has minimal effect on the deviation from 1/G". Thus, 1/G is not an appropriate random chance baseline when sample sizes are fairly heterogeneous, as correct classifications by chance are likely to increase.

Regardless of whether sample sizes are the same or different, TAU will provide the same correction for chance correct classification if prior probabilities are equal; however, the correction will be stronger (i.e., leading to a more pronounced reduction in the percentage of specimens correctly classified above chance expectations) if prior probabilities are empirical. This means that with equal a priori probability (0.5) for two groups and a cross-validated hit rate of 80%, one would get the same TAU (60%) regardless of whether sample size was the same (e.g., 50 individuals in each group) or different (e.g., 80 versus 20) across groups.


To take into account the fact that it is more likely to get groups right by chance when sample sizes are highly heterogeneous, one could employ prior probabilities proportional to group sample size in computing TAU (then, in the example above, it would drop to 37.5%). If within-group sample sizes are highly heterogeneous, however, the most accurate way of computing a random chance baseline is likely to be to rely on randomizations, as we have done.
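The arithmetic in this paragraph can be checked directly with the illustrative tau() sketch given in the Methods section above (equal versus empirical priors for an 80:20 split and an 80% cross-validated hit rate):

# Reusing the illustrative tau() function defined earlier.
print(tau(80, [80, 20]))                     # equal priors (0.5, 0.5): 0.60
print(tau(80, [80, 20], priors=[0.8, 0.2]))  # empirical priors: 0.375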

Fig. 6. Classification accuracy (hit rates) of the simulated data (without group differences) DFAs plotted against the number of predictor variables, including (a) non-cross-validated analyses, (b) cross-validated analyses, (c) the TAU statistic for non-cross-validated analyses and (d) the TAU statistic for cross-validated analyses. Solid black line: observed hit rates; grey line: mean of the randomised group affiliation DFAs; dashed grey line: 95th percentiles for the analyses of DFAs with randomised group affiliation.



When DFAs have a predictive purpose, one should not only consider the available groups to decide affiliation (as we did for simplicity, given the large number of analyses) but also the possibility that a specimen has only a small chance of belonging to any of those groups. As we mentioned, these probabilities are called typicality probabilities and they provide crucial information, especially when unknowns must be classified. For instance, in the bovid dataset there are four individuals for which heavy woodland-bushland is considered the most likely habitat group, but whose typicality probability is below 0.01. This means that even if all other groups are less likely than this habitat, these four individuals are outside the 99% confidence intervals around the multivariate mean of that group when all discriminant functions are simultaneously considered. If they were archaeological or fossil specimens, we could conclude that they have a low chance of truly belonging to any of the putative habitats. However, a second possibility is that they represent natural variation in their groups which has been poorly sampled. This speaks to the need for large sample sizes and a sampling strategy that accurately encompasses the breadth of variation within each group. Where outliers are archaeological specimens, clearly one should revisit the specimens to look for evidence that measurements were incorrectly recorded or obscured by breakages or adhering matrix.

A consideration of typicality probabilities may help address another issue highlighted by DeGusta and Vrba (2003) and critically assessed by Plummer et al. (2008): "Imagine two astragali, both predicted by the function to belong to the Forest … group. The confidence value of one prediction might be 99%, while … the other prediction might be only 50%. So even though both predictions are the 'same' (Forest) … it would be a mistake to assume that they are of equal reliability" (DeGusta and Vrba, 2003, p. 1015). Their solution is to find a probability cut-off point for each dataset that results in a subset of specimens which has only a 5% total misclassification rate; for their dataset, this is 75%. In other words, every specimen with a confidence value for its predicted habitat that is below 75% is ignored. This is criticised by Plummer et al. (2008) because computational inaccuracies potentially spuriously reduce the misclassification rate, and also because a cut-off may differentially affect a priori groups, some ending up with few or no specimens which pass through the cut-off filter. We agree that focusing only on predictive accuracy, even when this is cross-validated, is somewhat limited, as the differences in the probability of belonging to a given group are not considered. The use of an arbitrary cut-off to take into account the differences in the confidence of classifying a specimen in one or the other group is, however, unnecessary. By using typicality probabilities from cross-validated analyses to exclude individuals outside the 95% confidence interval around the mean of the most likely group, their lower accuracy and the fact that they might not actually belong to any of the available a priori groups are acknowledged. This is based on the same multivariate normal model behind the DFA and makes simple and consistent use of its standard output.
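The typicality-based rule suggested here amounts to a one-line filter on the standard DFA output; a sketch (ours), where d2_best holds each case's squared Mahalanobis distance to its most likely group and n_vars is the number of predictors:

import numpy as np
from scipy.stats import chi2

def atypical(d2_best, n_vars, level=0.95):
    # True where a case falls outside the given confidence region of its most
    # likely group, i.e. where its typicality probability is below 1 - level.
    return chi2.sf(np.asarray(d2_best), df=n_vars) < (1.0 - level)

# Cases flagged True arguably belong to none of the a priori groups and are
# better reported separately than forced into an assignment.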
Lastly, there is one issue that may have a strong impact on the application of DFAs in archaeology, but which was beyond the scope of our study: observations used to build discriminant functions must be independent. However, when groups are species or made up of species, they are highly unlikely to be independent because of the phylogenetic hierarchy (Harvey and Pagel, 1991; Rohlf, 2001). For instance, if a group representing an open habitat consists of 10 closely related antelopes and just one species of bison, the number of independent observations is less than 11, and actually likely to be closer to two if the antelopes radiated recently compared to the time since the common ancestor of both antelopes and bison. Thus, although the antelopes are analysed as 10 independent observations, they will be very similar, as most of their traits were inherited from a recent common ancestor. In fact, their contribution to the amount of variation characteristic of the open habitat will be about the same as that of the single bison species. To address this issue of phylogenetic non-independence, comparative methods have been developed (Felsenstein, 1985; Harvey and Pagel, 1991; Rohlf, 2001; Garland et al., 2005). Like other commonly employed statistical methods (e.g., correlation and regression), DFA does not take phylogeny into account. However, to our knowledge, there has not yet been an attempt to address the problem in the context of DFA or to understand the generalizability of DFAs using multiple-species data. Theoretical studies are thus desirable which may help us understand if and how non-independence can be taken into account in DFAs. Predictive DFAs may require special care, as the archaeological or fossil specimens one wishes to affiliate to a group would first need to be placed in a phylogenetic context using data other than the predictor variables. This is likely to be particularly difficult for fossils.

6. Summary and recommendations

6.1. Issues directly addressed in our study

The main messages of our paper are:

1) Cross-validated DFA results are more realistic and should always be reported.
2) The random chance baseline should be clearly stated, or 'chance-corrected' classification rates and associated indices (e.g., tau) should be used.
3) The choice and number of predictor variables needs to be carefully considered: adding too many may amplify over-fitting or the peaking effect, whereas excluding some may discard potentially interesting information.

A minimal illustration of points 1 and 2 is sketched below.
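The following sketch contrasts resubstitution with leave-one-out cross-validated accuracy and computes a chance-corrected rate in the spirit of tau (Titus et al., 1984). The use of scikit-learn and the simulated three-group dataset are our own assumptions, not the software or data of this study.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)

# Hypothetical data: 3 groups of 30 cases, 8 predictor variables,
# with modest real separation between groups.
X = rng.normal(size=(90, 8))
y = np.repeat([0, 1, 2], 30)
X[y == 1] += 0.8
X[y == 2] -= 0.8

lda = LinearDiscriminantAnalysis()

# 1) Resubstitution accuracy is optimistic: cases are classified by a
# function fitted to those same cases.
resub = lda.fit(X, y).score(X, y)

# Leave-one-out cross-validation classifies each case with a function
# built without it, giving a more realistic estimate.
pred = cross_val_predict(lda, X, y, cv=LeaveOneOut())
loo = np.mean(pred == y)

# 2) Chance-corrected rate: proportion correct beyond what random
# allocation in proportion to group sizes would achieve.
n = len(y)
n_g = np.bincount(y)
expected = np.sum(n_g * (n_g / n))   # chance-expected number correct
tau = (loo * n - expected) / (n - expected)

print(f"resubstitution={resub:.2f}  cross-validated={loo:.2f}  tau={tau:.2f}")
```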

6.2. Related issues

1) One should test assumptions and state whether they are violated, even when the aim is purely predictive and the goodness of the model is judged by its results (i.e., the cross-validated accuracy). Violations of assumptions may have a direct impact on results and may also affect the generalizability of the functions.
2) The discriminant space is a transformed statistical space, which tends to bias ordinations of group differences upwards. Scatterplots of discriminant scores must be appropriately drawn and interpreted with the greatest caution when used to search for patterns of similarity among groups (for example, see Fig. 1 in Klingenberg and Monteiro, 2005).
3) Unequal sample sizes across groups and small samples are likely to increase over-fitting and reduce the generalizability of results; large samples should therefore be used whenever possible. Sample size heterogeneity also has consequences for the estimate of the random chance baseline. An elegant, although time consuming, way to take this into account is to use resampling statistics, as exemplified in this paper and sketched after this list; this method can also provide confidence intervals for classification accuracy.
4) The possibility that a case, especially a poorly known and rare fossil or archaeological specimen, may not belong to any of the a priori groups should be taken into account. Typicality probabilities should therefore be carefully inspected to detect potential outliers.
5) There is a strong need for comparative methods in DFA: especially when working with different species, observations are unlikely to be independent, violating one of the most basic assumptions of statistical analyses.
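The resampling approach mentioned in point 3 can be outlined as follows. This is a sketch of the idea rather than the exact routine used in this paper: group labels are repeatedly shuffled and the cross-validated analysis re-run, yielding a null distribution (and hence a confidence interval) for the classification accuracy expected by chance. The scikit-learn calls and the unequal-group toy data are our assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def chance_baseline(X, y, n_perm=200, seed=0):
    """Null distribution of cross-validated classification accuracy,
    built by shuffling group labels and re-running the analysis.
    Deliberately slow but simple: n_perm * n cases LDA fits."""
    rng = np.random.default_rng(seed)
    lda = LinearDiscriminantAnalysis()
    null = np.empty(n_perm)
    for i in range(n_perm):
        y_perm = rng.permutation(y)  # break any real group structure
        pred = cross_val_predict(lda, X, y_perm, cv=LeaveOneOut())
        null[i] = np.mean(pred == y_perm)
    return null

# Hypothetical data: two groups of unequal size (heterogeneity shifts
# the chance baseline away from the naive 1/k value).
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
y = np.repeat([0, 1], [40, 20])

null = chance_baseline(X, y)
print("chance accuracy, 95% interval:", np.percentile(null, [2.5, 97.5]))
```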


Acknowledgements

We are grateful to the many people who had stimulating discussions with us about this work, including Peter Andrews, Nick Walton, Carlo Meloro, Chris Klingenberg, Jim Rohlf, Norm Campbell, David Sheets and Sarah Elton. AC is deeply grateful to all the museum curators and collection managers who allowed him to study their specimens and who greatly helped during the data collection, which was funded by a grant from The Leverhulme Trust. More information on the monkey specimens and the museums where they were measured can be found in Cardini and Elton (2008a). KK thanks those who facilitated the bovid data collection in Europe and the US over several years: The Natural History Museum, London: Paula Jenkins, Daphne Hills, Louise Tomsett, Rob Kruszynski; Smithsonian National Museum of Natural History, Washington DC: Linda Gordon; American Museum of Natural History, NYC: Bob Randall, Eileen Westwig, Neil Duncan; Field Museum, Chicago: Bill Stanley; Powell-Cotton Museum, Kent, UK: Malcolm Harmon; Zoological Museum, Copenhagen, Denmark: Hans Baagoe, Mogens Andersen; Swedish Museum of Natural History, Stockholm: Per Ericson, Olavi Gronwall, Lars Werdelin; Natural History Museum Vienna, Austria: Barbara Herzig, Helen Jousse, Alexander Bibl; Naturalis, Leiden, The Netherlands: Lars van der Hoek Ostende, John de Vos, Hein van Grouw; Museum of Natural History, Berlin, Germany: Frieder Meyer, Detlef Willborn, Irene Mann; Royal Museum of Central Africa, Tervuren, Belgium: Emmanuel Gilissen, Wim Wendelin, Garin Cael; Hungarian Natural History Museum, Budapest: Gabor Csorba, Laszlo Peregovits, Zoltan Vos. Funding for the bovid data collection was provided by The Leverhulme Trust. This research also received support from the SYNTHESYS Project (http://www.synthesys.info/), which is financed by the European Community Research Infrastructure Action under the FP6 "Structuring the European Research Area" Programme, and from the Royal Society International Joint Project. AC thanks Sarah Elton for the use of their joint dataset and the many museum curators who facilitated the guenon data collection (see the companion papers Cardini and Elton, 2008a, 2008b). The original guenon project was supported by The Leverhulme Trust. Finally, we wish to acknowledge our colleague, co-author and friend, Charles Lockwood. He did not live to see the completion of so much of the work in which he was involved, but we hope that he would be happy with, and proud of, this final product. We are grateful for having known such a remarkable human being.

References

Abbott, D.R., Watts, J., 2010. Identical rock types with different chemistry: sourcing phyllite-tempered Hohokam pottery from the Phoenix basin, Arizona. J. Archaeol. Sci. 37, 1612–1622.
Adams, D.C., Rohlf, F.J., Slice, D.E., 2004. Geometric morphometrics: ten years of progress following the 'revolution'. Ital. J. Zool. 71, 5–16.
Albrecht, G., 1992. Assessing the affinities of fossils using canonical variates and generalized distances. J. Hum. Evol. 7, 49–69.
Bell, S., Croson, C., 1997. Artificial neural networks as a tool for archaeological data analysis. Archaeometry 40, 139–151.
Berge, C., Penin, X., 2004. Ontogenetic allometry, heterochrony, and interspecific differences in the skull of African apes, using tridimensional Procrustes analysis. Am. J. Phys. Anthropol. 124, 124–138.
Bescoby, D.J., Cawley, G.C., Chroston, P.N., 2004. Enhanced interpretation of magnetic survey data from archaeological sites using artificial neural networks: a case study from Butrint, Southern Albania. Archaeological Prospection 11, 189–199.
Bignon, O., Baylac, M., Vigne, J.-D., Eisenmann, V., 2005. Geometric morphometrics and the population diversity of Late Glacial horses in Western Europe (Equus caballus arcelini): phylogenetic and anthropological implications. J. Archaeol. Sci. 32, 375–391.
Brace, N., Kemp, R., Snelgar, R., 2006. SPSS for Psychologists, third ed. Palgrave Macmillan, London.
Buck, T.J., Strand Viðarsdóttir, U., 2004. A proposed method for the identification of race in sub-adult skeletons: a geometric morphometric analysis of mandibular morphology. J. Forensic Sci. 49, 1159–1164.
Cardini, A., Elton, S., 2008a. Variation in guenon skulls I: species divergence, ecological and genetic differences. J. Hum. Evol. 54, 615–637.
Cardini, A., Elton, S., 2008b. Variation in guenon skulls II: sexual dimorphism. J. Hum. Evol. 54, 638–647.
Cardini, A., Elton, S., 2008c. Does the skull carry a phylogenetic signal? Evolution and modularity in the guenons. Biol. J. Linn. Soc. Lond. 93, 813–834.
Cardini, A., Diniz Filho, J.A.F., Polly, P.D., Elton, S., 2010. Biogeographic analysis using geometric morphometrics: clines in skull size and shape in a widespread African arboreal monkey. In: Elewa, A.M.T. (Ed.), Morphometrics for Nonmorphometricians, Lecture Notes in Earth Sciences, vol. 124. Springer-Verlag Publishers, Berlin, pp. 191–218.
Carrano, J.L., Girty, G.H., Carrano, C.J., 2009. Re-examining the Egyptian colonial encounter in Nubia through a compositional, mineralogical, and textural comparison of ceramics. J. Archaeol. Sci. 36, 785–797.
Daniels, M.R., Darcy, R., 1983. Notes on the use and interpretation of discriminant analysis. Am. J. Pol. Sci. 27, 359–381.
DeGusta, D., Vrba, E., 2003. A method for inferring paleohabitats from the functional morphology of bovid astragali. J. Archaeol. Sci. 30, 1009–1022.
DeGusta, D., Vrba, E., 2005a. Methods for inferring paleohabitats from the functional morphology of bovid phalanges. J. Archaeol. Sci. 32, 1099–1113.


DeGusta, D., Vrba, E., 2005b. Methods for inferring paleohabitats from discrete traits of the bovid postcranial skeleton. J. Archaeol. Sci. 32, 1115–1123.
Delicado, P., 2000. Statistics in archaeology: new directions. In: Barcelo, J.A., Briz, I., Vila, A. (Eds.), New Techniques for Old Times, Computer Applications and Quantitative Methods in Archaeology, Proceedings of the CAA98 Conference. ArcheoPress (BAR International Series 757), Oxford, pp. 29–37.
Elewa, A.M.T. (Ed.), 2010. Morphometrics for Nonmorphometricians, Lecture Notes in Earth Sciences, vol. 124. Springer-Verlag Publishers, Berlin.
Elton, S., 2001. Locomotor and habitat classifications of cercopithecoid postcranial material from Sterkfontein member 4, Bolt's farm and Swartkrans members 1 and 2, South Africa. Palaeontologia Africana 37, 115–126.
Escudé, E., Montuire, S., Desclaux, E., Quéré, J.-P., Renvoisé, E., Jeannet, M., 2008. Reappraisal of 'chronospecies' and the use of Arvicola (Rodentia, Mammalia) for biochronology. J. Archaeol. Sci. 35, 1867–1879.
Everitt, B., Howell, D., 2005. Encyclopedia of Statistics in Behavioral Science. Wiley & Sons, Chichester, Sussex.
Felsenstein, J., 1985. Phylogenies and the comparative method. Am. Nat. 125, 1–15.
Garland Jr., T., Bennett, A.F., Rezende, E.L., 2005. Phylogenetic approaches in comparative physiology. J. Exp. Biol. 208, 3015–3035.
Germonpré, M., Sablin, M.V., Stevens, R.E., Hedges, R.E.M., Hofreiter, M., Stiller, M., Després, V.R., 2009. Fossil dogs and wolves from Palaeolithic sites in Belgium, the Ukraine and Russia: osteometry, ancient DNA and stable isotopes. J. Archaeol. Sci. 36, 473–490.
Hair, J.F., Anderson, R.E., Tatham, R.L., Black, W.C., 1998. Multivariate Data Analysis. Prentice Hall, New Jersey.
Hammer, Ø., Harper, D.A.T., Ryan, P.D., 2001. PAST: paleontological statistics software package for education and data analysis. Palaeontol. Electronica 4, 9. http://palaeo-electronica.org/2001_1/past/issue1_01.htm.
Harvey, P.H., Pagel, M.D., 1991. The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford.
Hjulström, B., Isaksson, S., 2009. Identification of activity area signatures in a reconstructed Iron Age house by combining element and lipid analyses of sediments. J. Archaeol. Sci. 36, 174–183.
Huberty, C.J., 1984. Issues in the use and interpretation of discriminant analysis. Psychol. Bull. 95, 156–171.
Huberty, C.J., Hussein, M.H., 2003. Some problems in reporting use of discriminant analyses. J. Exp. Educ. 71, 177–191.
Jain, A.K., Waller, W.G., 1978. On the optimal number of features in the classification of multivariate Gaussian data. Pattern Recognit. 10, 365–374.
Kappelman, J., 1991. The palaeoenvironment of Kenyapithecus at Fort Ternan. J. Hum. Evol. 20, 95–129.
Kappelman, J., Plummer, T., Bishop, L., Duncan, A., Appleton, S., 1997. Bovids as indicators of Plio-Pleistocene palaeoenvironments in East Africa. J. Hum. Evol. 32, 229–256.
Klein, R.G., Franciscus, R.G., Steele, T.E., 2010. Morphometric identification of bovid metapodials to genus and implications for taxon-free habitat reconstruction. J. Archaeol. Sci. 37, 389–401.
Klingenberg, C.P., Monteiro, L.R., 2005. Distances and directions in multidimensional shape spaces: implications for morphometric applications. Syst. Biol. 54, 678–688.
Kovarovic, K., Andrews, P., 2007. Bovid postcranial ecomorphological survey of the Laetoli paleoenvironment. J. Hum. Evol. 52, 663–680.
Larson, G., Cucchi, T., Fujita, M., Matisoo-Smith, E., Robins, J., Anderson, A., Rolett, B., Spriggs, M., Dolman, G., Kim, T.H., Thuy, N.T., Randi, E., Doherty, M., Due, R.A., Bollt, R., Djubiantono, T., Griffin, B., Intoh, M., Keane, E., Kirch, P., Li, K.T., Morwood, M., Pedriña, L.M., Piper, P.J., Rabett, R.J., Shooter, P., Van den Bergh, G., West, E., Wickler, S., Yuan, J., Cooper, A., Dobney, K., 2007. Phylogeny and ancient DNA of Sus provides insights into Neolithic expansion in Island Southeast Asia and Oceania. Proc. Natl. Acad. Sci. U.S.A. 104, 4834–4839.
Marcus, L.E., 1990. Traditional morphometrics. In: Rohlf, F.J., Bookstein, F. (Eds.), Proceedings of the Michigan Morphometrics Workshop. University of Michigan Museum of Zoology Special Publication 1, pp. 95–130.
McGarigal, K., Cushman, S., Stafford, S., 2000. Multivariate Statistics for Wildlife and Ecology Research. Springer-Verlag, New York.
Mendoza, M., Palmqvist, P., 2006. Characterizing adaptive morphological patterns related to habitat use and body mass in Bovidae. Acta Zool. Sin. 52, 971–987.
Naylor, G.J.P., Marcus, L.F., 1994. Identifying isolated shark teeth of the genus Carcharhinus to species: relevance for tracking phyletic change through the fossil record. Am. Mus. Novit. 3109, 1–53.
Neff, N.A., Marcus, L.F., 1980. A Survey of Multivariate Methods for Systematics. Privately published.
Palmqvist, P., Pérez-Claros, J.A., Janis, C.M., Figueirido, B., Torregrosa, V., Gröcke, D.R., 2008. Biogeochemical and ecomorphological inferences on prey selection and resource partitioning among mammalian carnivores in an early Pleistocene community. Palaios 23, 724–737.
Plummer, T.W., Bishop, L.C., 1994. Hominid paleoecology at Olduvai Gorge, as indicated by antelope remains. J. Hum. Evol. 27, 47–75.
Plummer, T.W., Bishop, L.C., Hertel, F., 2008. Habitat preference of extant African bovids based on astragalus morphology: operationalizing ecomorphology for palaeoenvironmental reconstruction. J. Archaeol. Sci. 35, 3016–3027.
Rohlf, F.J., 2001. Comparative methods for the analysis of continuous variables: geometric interpretations. Evolution 55, 2143–2160.
Rohlf, F.J., 2009. NTSYSpc, Version 2.21c. Exeter Software, Setauket, New York.
Rohlf, F.J., Marcus, L.F., 1993. Geometric morphometrics: reply to M. Corti. Trends Ecol. Evol. 8, 339.
Sanchez, P.M., 1974. The unequal group size problem in discriminant analysis. J. Acad. Market. Sci. 2, 629–633.


Sanfilippo, P., Cardini, A., Mackey, D., Hewitt, A., Crowston, J., 2009. Optic disc morphology - rethinking shape. Prog. Retin. Eye Res. 28, 227–248.
Sapir-Hen, L., Bar-Oz, G., Khalaily, H., Dayan, T., 2009. Gazelle exploitation in the early Neolithic site of Motza, Israel: the last of the gazelle hunters in the southern Levant. J. Archaeol. Sci. 36, 1538–1546.
Sheets, H.D., Covino, K.M., Panasiewicz, J.M., Morris, S.R., 2006. Comparison of geometric morphometric outline methods in the discrimination of age-related differences in feather shape. Front. Zool. 3, 15.
Sneath, P.H.A., Sokal, R.R., 1973. Numerical Taxonomy. Freeman, San Francisco.
Sponheimer, M., Reed, K.E., Lee-Thorp, J.A., 1999. Combining isotopic and ecomorphological data to refine bovid palaeodietary reconstruction: a case study from the Makapansgat limeworks hominin locality. J. Hum. Evol. 36, 705–718.
Strand Viðarsdóttir, U., O'Higgins, P., Stringer, C.B., 2002. A geometric morphometric study of regional differences in the ontogeny of the modern human facial skeleton. J. Anat. 201, 211–229.
Strauss, R.E., 2010. Discriminating groups of organisms. In: Elewa, A.M.T. (Ed.), Morphometrics for Nonmorphometricians. Springer-Verlag Publishers, Berlin, pp. 73–91.

Tabachnick, B.G., Fidell, L.S., 2007. Using Multivariate Statistics, fifth ed. Allyn & Bacon, Boston.
Titus, K., Mosher, J.A., Williams, B.K., 1984. Chance-corrected classification for use in discriminant analysis. Am. Midl. Nat. 111, 1–7.
Wallace, S.C., 2006. Differentiating Microtus xanthognathus and Microtus pennsylvanicus lower first molars using discriminant analysis of landmark data. J. Mammal. 87, 1261–1269.
Weinand, D.C., 2007. A study of parametric versus non-parametric methods for predicting paleohabitat from Southeast Asian bovid astragali. J. Archaeol. Sci. 34, 1774–1783.
White, J.W., Ruttenberg, B.I., 2007. Discriminant function analysis in marine ecology: some oversights and their solutions. Mar. Ecol. Prog. Ser. 329, 301–305.
Williams, B.K., 1983. Some observations of the use of discriminant analysis in ecology. Ecology 64, 1283–1291.
Wilson, C.A., Davidson, D.A., Cresser, M.S., 2009. An evaluation of the site specificity of soil elemental signatures for identifying and interpreting former functional areas. J. Archaeol. Sci. 36, 2327–2334.
