Your (Nearest) Neighbor's Party: Prototype Methods for Comparative Voting Behavior

Piero Stanig, Hertie School of Governance, Berlin. [email protected]



Abstract

Political parties should be understood as political expressions of coalitions of social groups, and not only as teams that compete for votes in elections. Describing the social constituencies of political parties is therefore an important empirical objective. I propose to adapt a method developed in the machine learning literature, Learning Vector Quantization, to detect and describe the social constituencies of political parties from post-election and general social survey data. I first discuss why regression models might not be the best choice when analyzing comparative voting behavior data. I then describe the method I propose, and I evaluate its ability to correctly locate the centroids of constituencies in artificial data that mimic voting behavior data. Finally, I show, in a brief application, how the model can be deployed on real data, and the type of questions it can help address empirically.

1 Introduction

The electoral supporters of a political party can be thought of as a coalition of social groups that the party, through programmatic appeals or image management, is able to attract. Given the impossibility of targeting individualized policy promises to each voter, parties need to frame their policy promises with broad social groups as targets, e.g., parties can promise to help the middle class, increase the income of workers, defend the freedom of entrepreneurs, represent Christian values, and so on.1 One natural question, which should be the starting point of any analysis of voting behavior, is then: "what are the social groups that supported this or that party?" Yet the question that the standard practice in voting behavior studies asks and answers is different (albeit clearly related): "what features of an individual voter make her more likely to support this or that party?" or, more accurately, "what is the probability that a voter with these characteristics is going to support this or that party?" The normal practice in the study of voting behavior involves modeling the probability that a given voter chooses a party, conditional on observable covariates of the voter or of the district, state, province, or region of residence. For instance, a logit (either binary or multinomial) regression or an analogous model (e.g., probit) is fit to the data. This type of model allows the analyst to describe patterns in voting behavior and to predict, based on the observable features of a given voter, their most likely vote choice. Yet it does not directly provide one piece of information that is relevant from the substantive political point of view: namely, what are the social groups that are most important for a given party, or, in other words, what is the social base of support of a given party. One can think of a political party as a coalition of constituencies, defined in terms of demographic, occupational, and cultural factors. Describing the social coalition behind a given political party would be a crucial step in understanding political parties, party systems, and electoral behavior. In this paper, I suggest that a change of perspective, from the choice process of a (utility-maximizing) voter to party electorates as social coalitions, can expand and improve the way in which the study of voting behavior, and comparative voting behavior in particular, is carried out.

∗ CJ Yetman, Hertie School of Governance, Berlin, contributed to the development of the C implementation of the GRLVQ estimator presented here. Previous versions of this paper were presented at the 2012 PolMeth Summer Meeting, the 2013 MPSA meeting, and the 2013 EPSA meeting. Marco Steenbergen and conference discussants and participants provided insightful comments. All responsibility for any mistakes is mine only.
1 On the effectiveness of buying votes or making particularistic promises rather than broad programmatic appeals, see Wantchekon 2003, Vicente and Wantchekon 2009.
It would also allow research to re-focus, in an analytical fashion, on political parties as something more than just "carriers" of policy platforms. Parties can and should be considered political expressions of coalitions of social groups. What is suggested here is far from new: recently, Bawn et al. (2012) proposed a theoretical model with this flavor. Yet the study of parties as social coalitions has been hindered by the fact that the most widely used (family of) statistical models in voting behavior analysis channels the focus, by its very nature, towards voting behavior as a choice (and party systems as "supply sides") rather than towards electoral coalitions as outcomes. While it is indubitable that questions about party choice have an important role to play in the study of voting behavior, this should be a decision dictated by substantive considerations, not by methodological limitations. The adaptation of methods devised in other fields (like statistical learning and machine learning) makes it possible to directly answer questions about the social coalitions that support


different political parties. The main task, if one wants to characterize the social constituencies of political parties, is to isolate them, in a rigorous way, from voting behavior data. This amounts to detecting subsets of the space defined by the observable features of voters, and describing them, for instance by summarizing the location of their centroids. I propose a novel implementation of a classifier algorithm, Learning Vector Quantization (LVQ), developed in the statistical learning and machine learning literature with the main aim of predicting the "class" of an observation based on its distances from "prototypes" of the various classes. I show how this approach can be applied in practice to isolate party constituencies or party "bases". I implement and deploy one version of LVQ, Generalized Relevance Learning Vector Quantization (GRLVQ), derived from a proposal by Hammer and Villmann (2002). I show, with extensive experiments on artificial data, the remarkable accuracy with which one can back out the location of the centroids of constituencies in the multi-dimensional space defined by voter-level covariates. I then show, with some simple applications, the type of questions that can be answered from cross-national survey data using this approach. As a data-reduction tool, the main aim of LVQ-type methods is simple: represent a large set of observations by a smaller set. This approach should appear natural to researchers who work on voting behavior, and in particular comparative voting behavior: what is of interest in the literature is not to describe accurately the voting behavior of specific individual voters, but to isolate general patterns that allow one to link party choice to social groups defined by observable characteristics, like income, region of residence, ethnicity, religiosity, etc.
While in the statistical learning and machine learning literature the main objective is prediction itself, the location of the prototypes might be the most important quantity of interest for political science. On the one hand, prototype methods can parsimoniously represent a configuration, based on a few "prototypes". But they can also be seen as estimation and inference methods, if one postulates the existence of "true" values around which party constituencies are centered. One appealing feature of the method I propose is that, unlike related methods, for instance Gaussian mixtures (Hastie and Tibshirani 1996), it does not require any parametric assumption about the distribution of the observable covariates or of the "dependent variable". The approach I propose generalizes a common way of describing the "typical voters" of a party: computing summaries of the distribution of the observable characteristics of the voters of a given party. It seems


intuitively appealing to compute (or estimate) the average income of Democratic or Republican voters, or their modal religious affiliation, or the median level of education of German Green voters. Similarly, it might be interesting to know what percentage of Democratic voters are Latinos, what percentage of Republican voters are Mormons, or what percentage of German Christian Democratic voters are union members. For instance, Karreth et al. (2013) study the Social Democratic parties in Germany and Sweden and Labour in the U.K., and look at the percentages of voters of each party that self-identify ideologically as left, centrist, or right. Similarly, Ezrow et al. (2012) compute averages of the ideological self-placement of the likely voters of political parties to estimate the preferences of partisan constituencies. Prototype methods can be understood with the same intuition, but they are more powerful, and more accurate, than computing simple descriptive statistics of the voters of a given party, for two closely related reasons. First of all, the method can isolate more than one prototype per party, hence it can detect different constituencies of any given party (and, in the limit, arbitrarily many: as many prototypes as there are observations). Data-driven methods can be used to decide what is the "correct" (or the most suitable, based on some criterion) number of prototypes, and therefore the number of distinct social constituencies, of each party. Furthermore, means or medians, dimension by dimension, might be misleading. If a party is supported only by the very rich and the very poor, the mean voter of that party might be a middle-income voter. But with multi-modal distributions, measures of central tendency do not necessarily identify high-probability subsets of the support of the distribution. That same party might have, in the limit, no middle-income voters at all, even if the average income of its voters is in the middle-income range.
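The point can be checked with a toy calculation in plain Python (the income figures are invented for illustration): both the mean and the median land in an income range where this hypothetical party has no supporters at all.

```python
import statistics

# Hypothetical party supported only by the very poor and the very rich
# (incomes in thousands; the numbers are made up for illustration).
incomes = [12, 15, 18, 14, 16, 150, 160, 145, 155, 158]

print(statistics.mean(incomes))    # 84.3: a "middle income" no supporter has
print(statistics.median(incomes))  # 81.5: also in the empty middle
```

Neither summary identifies a high-probability region of the income distribution, which is exactly the failure mode discussed above.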
In addition, interactions between observable features are overlooked when computing dimension-by-dimension summaries. Assume, for instance, that a party draws its support from two constituencies: the very poor and very religious, and the very rich and mostly secular. The averages, dimension by dimension, would tell us that the "typical" voter of this party is middle class and not very religious, while, in fact, middle-class, somewhat religious voters might show no significant support for that party. The plot in figure 1 shows a situation like the one just described. Each party has two distinct constituencies. In the left panel, the position of the mean voter of each party is displayed. Notice that the average voter of party A is located away from actual party A voters, and in the middle of a group of party B voters. The right panel of figure 1 shows, on the other hand, what answer would be accurate from a political scientific point of view: four prototypes (the squares labeled with "A" and "B") are placed close to the center of each one of the four clusters of points. These are what one would consider the four "typical" voters in the electorate.

Figure 1: Position of voters, in a two-dimensional space (defined by income and religiosity), in a hypothetical two-party system. Each party has two distinct constituencies. In the left panel, the position of the mean voter of each party is displayed. Notice that the average voter of party A is located away from actual party A voters, and in the middle of a group of party B voters. In the right panel, four prototypes (the squares labeled with "A" and "B") have been placed close to the centers of each party constituency.

1.1 Estimating the location of prototypes: learning vector quantization

The LVQ is a machine-learning method, originally proposed by Kohonen (1989, 1998), and analogous, to an extent, to neural networks. The classifier was initially developed as a black-box prediction tool, estimated based on heuristic criteria. Hastie et al. (2001, chapter 13) provide a textbook treatment. The LVQ was later recast as an algorithm for the minimization of a function of "residuals" by Sato and Yamada (1995). The version of LVQ I implement and present here is derived from a proposal by Hammer and Villmann (2002). The use of LVQ has been suggested in political science by Honaker (2011) for a different purpose: isolating cases to be studied (in a qualitative/case-study fashion) based on the observable characteristics of the cases themselves. The main intuition behind the LVQ approach is the following. For each observation i, a vector of covariates xi is observed, along with a class membership: a categorical variable yi that specifies to which class the observation belongs. In the applications presented here, the covariates are observable features of survey respondents, and their class is the party they report having voted for, or planning to vote for in an upcoming election. The aim is to isolate a set of prototypes of voters for each party. Each prototype is defined by a class membership ỹ and a vector of features x̃ that locates it in the space of the predictors. Assembled together, the prototypes constitute a codebook that can be used to predict the class (the party choice) of individuals, based on a nearest-neighbor rule: individuals are predicted to belong to the class of the prototype they are closest to. The LVQ searches for "central" voters of a given constituency, and defines their location (in the multidimensional space defined by the covariates) based on their ability to correctly predict the vote choice of the voters that surround them. In other words, the prototypes chosen by LVQ are centroids of party constituencies that are also predictive of voting behavior. In practical terms, this means that the method moves the prototypes away from areas where party constituencies overlap, towards areas that are more densely populated by voters of a given party, so that the centroids one isolates sit right in the "base" of a given party. The LVQ is what the machine learning literature calls a "supervised learning" method. In political science terms, this means that one is trying to predict an observed outcome (in this case party choice) based on observed predictors. Other approaches that are superficially similar to the prototype method are cluster analysis (Hartigan 1975) and latent class analysis (Lazarsfeld 1950; Vermunt 2010). The main difference is that these are "unsupervised learning" methods, in which the "outcome" of interest (the class or cluster membership or, in terms of the application here, the party choice) is not observed, while the predictors are.
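The nearest-neighbor prediction rule based on a codebook can be sketched in a few lines; this is a minimal illustration with made-up prototype locations, not the GRLVQ implementation used later in the paper.

```python
import math

# A toy codebook: each prototype has a location in predictor space
# (here: income, religiosity) and a class label (the party it stands for).
# Locations and labels are invented for illustration.
codebook = [
    ((0.2, 0.9), "A"),  # poor, religious prototype of party A
    ((0.9, 0.1), "A"),  # rich, secular prototype of party A
    ((0.5, 0.5), "B"),  # middle-income, moderately religious party B
]

def predict(x, codebook):
    """Assign x the class of its nearest prototype (Euclidean distance)."""
    nearest = min(codebook, key=lambda p: math.dist(x, p[0]))
    return nearest[1]

print(predict((0.25, 0.85), codebook))  # "A": near the poor-religious prototype
print(predict((0.55, 0.45), codebook))  # "B": near the party B prototype
```

Note that party A here holds two disjoint constituencies; the nearest-prototype rule handles this with no interaction terms or other special treatment.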
The aim is to group observations into clusters or unobserved (hence "latent") classes based on the features (the values of the predictors) of the observations themselves. In its simplest version, LVQ works as follows. One starts with a rough guess of the location of the prototypes. For instance, the starting values for the codebook can be based on a simpler classifier, or on a random selection of a limited number of actual observations. Then, one iterates over each observation. The prototype closest to the observation is found, and its class is compared with that of the observation. If the prediction is correct, the prototype is moved towards the observation by some small amount, along the direction of their distance. If, on the other hand, the class of the observation is not the same as the class of the prototype, the prototype is moved away from the observation. So, for instance, if the observation is a Republican voter, and the closest prototype is Democratic, the prototype is moved away from that voter. The procedure passes through each observation several times. The result is an updated codebook that lists, for every prototype, its location in predictor space. Hence, one learns that a prototypical voter of party A has a given age, a given income, and so on. This means that an observation with those characteristics is a centroid of a cloud of points that belong to that class. It also means that observations that are closer to the location of this particular prototype than to any other prototype in the codebook are predicted to belong to that class or, in terms of the applications here, to vote for the party the prototype stands for. As a byproduct of the LVQ estimation, one learns not only the position, in predictor space, of the prototypes, but also the relative size of each party's constituencies. This allows one to make statements like "the data show the existence of two main constituencies of party A: a smaller one comprising higher-income, relatively secular voters, and a larger one comprising low-income, devoutly religious voters". The plot in figure 2 illustrates how the LVQ updates the positions of the prototypes, and how the prototypes converge to the correct locations regardless of their initial position. There are four observations (the letters A and B indicating their class membership) and one prototype per class (the squares with labels A and B). Initially, the prototype for class A is located closer to the observations of class B, and the prototype for class B is closer to the observations of class A. The algorithm cycles through the observations until convergence to the "correct" locations.
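The simple update scheme just described (plain LVQ1, without the relevance weights of GRLVQ) can be sketched as follows; the data, starting values, learning rate, and number of passes are illustrative assumptions, mirroring the four-observation setup of figure 2.

```python
import math

def lvq1(data, prototypes, rate=0.05, epochs=100):
    """Plain LVQ1: attract the nearest prototype on a correct
    classification, repel it on an incorrect one."""
    protos = [(list(x), y) for x, y in prototypes]  # mutable copies
    for _ in range(epochs):
        for x, y in data:
            # find the prototype closest to this observation
            px, py = min(protos, key=lambda p: math.dist(x, p[0]))
            sign = 1.0 if py == y else -1.0  # attract if correct, else repel
            for j in range(len(px)):
                px[j] += sign * rate * (x[j] - px[j])
    return protos

# Four observations, two per class, and one badly placed prototype per class:
# each prototype starts closer to the other class's observations.
data = [((0.0, 0.0), "A"), ((0.1, 0.1), "A"),
        ((1.0, 1.0), "B"), ((0.9, 0.9), "B")]
start = [((0.8, 0.8), "A"), ((0.2, 0.2), "B")]
final = lvq1(data, start)
# After training, the "A" prototype ends up near (0.05, 0.05) and the
# "B" prototype near (0.95, 0.95): repulsion pushes each prototype across
# the gap, after which attraction pulls it into its own class's cluster.
```

The repeated passes matter: early iterations only repel the mislabeled prototypes, and the attraction phase begins only once each prototype has crossed into its own class's region.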

1.2 Prototypes versus models of conditional probabilities of party choice

The prototype method can be considered an alternative to regression models for the conditional probability of party choice. Both approaches use the same information, a categorical "dependent variable" recording party choice and observed features of respondents (predictors or "independent variables"), to predict the party choice of the respondent. Yet they differ in several ways, some more apparent, others deserving a slightly more detailed explanation. The first of the more apparent differences is that the prediction of a conditional probability model is, in its raw form, a probability of choice. This can then be discretized, for instance by choosing as the predicted party the one with the highest probability. The prototype method, on the other hand, directly yields a categorical prediction (it is a "hard bounds" classifier, in machine learning terms). Another visible difference is that the conditional probability model returns coefficients that can be interpreted as effects of changes in the value of the predictor on the probability of choice.2 The coefficients are expressed in units of the linear prediction (e.g., on the logit scale) per unit change of X. The prototype method, on the other hand, returns locations of the prototypes in units of the explanatory variables themselves: the prototypes "live" in the same space in which the observations are located.

Figure 2: Illustration of how the updating of LVQ works. There are four observations (the letters A and B) and one prototype per class (the squares with labels A and B). Initially, the prototype for class A is located closer to the observations of class B, and the prototype for class B is closer to the observations of class A. At every iteration, the location of one of the prototypes is updated, until both converge to their "correct" locations.

There are two additional important ways in which prototype methods differ from conditional probability models. The first has to do with the fact that regression-like approaches model conditional probabilities of party choice, a quantity that is not always the main object of substantive interest, while prototype methods aim at partitioning the space defined by the predictors into subsets such that any observation falling in a subset is predicted to belong to a given class (i.e., to vote for a given party). The second has to do with assumptions about the relationship between predictors and outcome that are embedded in the conditional probability model, in particular the continuity of the predicted values as a function of each predictor and, as a consequence, the convexity of the constituencies; prototype methods do not make these assumptions by default. I discuss these in turn.

Conditional probabilities might not be what one is interested in

The main aspect to keep in mind is that conditional probability models do not necessarily answer the question posed at the beginning, which has to do with the detection of party constituencies, summarized, for instance, by describing a "typical" member of such constituencies. One can think about this difference by using an analogy that clarifies the (more or less implicit) theoretical model and question associated with each method. Conditional probability models, in a sense, look at voting behavior from the point of view of the voter, who knows her own characteristics and needs to decide which party fits her best: "given that I am in this age group, belong to this ethnic group, and have this income, what party is best for me?" Unsurprisingly, the similarity between the foundations of this type of approach and those of theoretical models of vote choice framed in terms of utility maximization has led it to enjoy considerable success in the study of voting behavior (see, for instance, Adams et al. 2005).

2 Even if interpreting all these as "effects" is not necessarily warranted.


On the other hand, one could say that the prototype approach looks at voting behavior from the point of view of the political party rather than that of the voter. Prototype methods adopt the perspective of a party directorate that looks at election results and post-election surveys, and tries to understand which constituencies it successfully attracted. A party leader might notice, for instance, that the core constituencies are Hispanics, urban high-income liberals, and suburban blue-collar workers. It is important to keep in mind that methods that model individual choice as a utility-maximization problem cannot directly answer questions like "what is the 'base' of party A?" If the questions have to do with what social groups support a given party, or with how different parties "fish" in different ponds, conditional probability models give answers at best indirectly. It is also worth mentioning that, if applied in a mechanistic fashion, they can give wildly misleading answers: the conditional probability of party support given covariates may be (incorrectly) interpreted as an estimate of the importance, for a given party, of voters with a given configuration of covariate values, and voters with the highest predicted probability of supporting a party may be mistaken for "typical" voters of that party. An unemployed African American woman with a Ph.D. might have a very high predicted probability of voting for the Democratic candidate according to a model fit on U.S. survey data, but this voter is far from the "typical" Democratic voter.3 These considerations apply particularly strongly if the relevant substantive question involves comparisons over time in the constituencies of parties: the population distribution of the observable covariates (and, in substantive terms, the relative size of different social groups) can change dramatically over time, and conditional probabilities of party choice are not informative at all about these changes.
For instance, it might be of substantive interest to understand whether a party's vote share wanes because its voters move to a competitor, or because demographic change makes certain social groups less relevant. An example might help clarify the intuition. Let's say that, as of 1975, a social-democratic party receives most of its support from formal-sector blue-collar workers. In addition, most blue-collar workers support the social-democratic party. If one runs a logit regression, the coefficient on an indicator variable for "blue-collar worker" would be large, and both substantively and statistically significant. The inference one would make, from the logistic regression evidence alone, would be that the "typical" social-democratic voter is a blue-collar worker. This inference, albeit unwarranted if based exclusively on the results of such a statistical analysis, would not be substantively incorrect. Notice that the coefficient in the logit model would not say anything about how many of the social-democratic voters are blue-collar workers, only about what fraction of blue-collar workers are social-democratic voters. This is a banal case of confusing the likelihood and the posterior, and it would not be troubling in its substantive implications as long as the blue-collar group in society is large enough. But consider the case in which, 25 years later, the same proportion of blue-collar workers supports the social-democratic party, but, due to changes in the economic structure of the country, there are many fewer blue-collar workers. The logit coefficient would stay the same, but the relative importance of blue-collar workers as a party constituency would be greatly diminished. Clearly, one could combine information about the prevalence of a given social group in society with information about that group's propensity to support a given party. For instance, one could multiply the predicted probability for "typical" members of the social group by the size of the social group, analogously to what is done in post-stratification (Park et al. 2006; Lax and Phillips 2009), and then back out the relative importance of that social group as a party constituency. The method proposed here, on the other hand, directly answers the question. Applied to the same problem, a prototype classifier like LVQ would detect "in one step" that in 1975 a prototypical social-democratic voter is the "blue-collar worker", while in 2000 such a prototype might either not be detected or, if still present, would turn out to be surrounded by a very small number of observations, or, in other words, to represent only a small constituency.

3 This interpretation would clearly be bad data analysis and bad political science, so one could think it is not worth discussing. Yet, a prototype approach might prevent this type of mistake by construction.
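The arithmetic behind the blue-collar example is a one-line application of Bayes' rule. The numbers below are invented for illustration, with the group's conditional propensity and the party's overall vote share held fixed to isolate the compositional change.

```python
def share_of_party_base(p_vote_given_group, group_share, p_vote_overall):
    """P(group | votes SD) = P(votes SD | group) * P(group) / P(votes SD)."""
    return p_vote_given_group * group_share / p_vote_overall

# Assumed: blue-collar workers are 40% of the electorate in 1975 but only
# 15% in 2000. Their propensity to vote social-democratic (the logit-style
# conditional probability) is held fixed at 0.7, and the overall SD vote
# share at 0.35, so only the group's prevalence changes.
print(share_of_party_base(0.7, 0.40, 0.35))  # 0.8: most SD voters are blue-collar
print(share_of_party_base(0.7, 0.15, 0.35))  # 0.3: now a minority of the SD base
```

The logit-style quantity (0.7) never moves, yet the group's weight in the party's base collapses from four-fifths to less than a third, which is exactly the information the conditional probability model leaves implicit.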

Non-convexity of constituencies

It might often be the case that constituencies are not convex in predictor space. This might be true both in one-dimensional and in multi-dimensional settings. Some examples will clarify what this means and what the implications are. Consider the one-dimensional case first. It might be that a political party is supported by the very uneducated (with, say, less than a high-school degree) and the very educated (say, with a master's degree or higher) but not by those with high-school diplomas or a first university degree (e.g., a B.A. in the U.S. system). There might be a "hole", in the education dimension, that separates two distinct constituencies of the same party. A "vanilla" conditional probability model that includes a linear term for education might miss this feature of the electorate: the coefficient on education might be close to zero, or positive (negative) depending on which of the two constituencies is larger. But it would not be able to detect the presence of a "hole". Analogously, if two constituencies of a given party are separated in a two-dimensional setting, so that, as in the example above, it is mostly rich secularists and poor religious voters who support a given party, but not the religious rich, the secular poor, or the middle class regardless of religiosity, there is a "hole" at the center of the two-dimensional space defined by religiosity and income that the inclusion of linear terms for income and religiosity would not detect. These examples show that if constituencies are not convex, simple linear models cannot detect the pattern in the data. To capture potentially interesting patterns in the shape of party constituencies, like the ones in the examples, one needs to include non-linear terms and interactions between predictors. In the one-dimensional case, there is a non-monotonic ("U-shaped") relationship between education level and the propensity to vote for the party, which could be detected by including, for instance, a polynomial in education. In the multi-dimensional case, an interaction between income and religiosity would be able to detect that both poor church-going voters and rich secularists (but not those in between) tend to vote for a given party. The "effect" of income, indeed, would go in two different directions depending on the religiosity of the voter: positive, increasing the probability of supporting the party, for non-religious voters, and negative, decreasing that probability, for religious voters. Yet, this solution is not without problems. First of all, it often requires extra-data knowledge: one needs to know in advance, when setting up the model, which variables might display non-monotonicities and interactions.
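A tiny synthetic check of the two-dimensional example (with labels generated by an assumed rich-secular/poor-religious rule): no threshold on income or religiosity alone beats chance, while a threshold on their (centered) product classifies perfectly.

```python
# Toy check that linear terms miss an XOR-like constituency pattern while
# an interaction (product) term captures it. The data-generating rule is
# assumed purely for illustration.
voters = [(inc, rel) for inc in (0.1, 0.9) for rel in (0.1, 0.9)
          for _ in range(25)]
# Party A: rich secularists (high inc, low rel) and poor religious voters.
labels = [1 if (inc > 0.5) != (rel > 0.5) else 0 for inc, rel in voters]

def accuracy(score):
    """Best single-threshold rule on the given score, over observed cutpoints."""
    best = 0.0
    for cut in sorted({score(v) for v in voters}):
        for sign in (1, -1):
            preds = [1 if sign * (score(v) - cut) > 0 else 0 for v in voters]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            best = max(best, acc)
    return best

print(accuracy(lambda v: v[0]))                         # income alone: 0.5
print(accuracy(lambda v: v[1]))                         # religiosity alone: 0.5
print(accuracy(lambda v: (v[0] - 0.5) * (v[1] - 0.5)))  # interaction: 1.0
```

The product term works here only because the analyst guessed the right interaction in advance, which is precisely the extra-data knowledge the text describes.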
In the context of a single-country study, it is not very complicated to exploit extra-data knowledge (or initial exploratory data analysis) to come up with a reasonable, and manageably small, set of interaction terms to include. But this is not necessarily straightforward in the study of comparative voting behavior, if many countries are included in the analysis. In addition, there are computational and statistical limitations: models with deep interactions are hard to estimate, they are not parsimonious, they run the risk of overfitting, and in many cases there is not enough information in the data (“multicollinearity” or “micronumerosity”) to include all the possible interactions between variables. Solutions to this problem involve some type of regularization, either as penalties in a frequentist framework (like in the lasso family of estimators, Tibshirani 1996; Hastie et al. 2001) or as regularization priors in a Bayesian framework, as suggested (but not yet implemented) by Gelman (e.g., Ghitza and Gelman 2013). A related proposal is that


of using a semi-parametric approach based on kernel smoothers, which allows one to detect non-linearities and interactions; Langrock et al. (2012) deploy the method on German voter data. There might be a lot of promise in a “deep-interactions and regularization” approach: these models should have the ability to correctly detect non-convexities in the supporters of parties, while avoiding wild overfitting. Both regularization and kernel approaches should be able to correctly detect non-convexities, yet they do not help much if one aims at making claims regarding the most important constituencies of a given party. They solve the second problem addressed here, but not the first. Prototype methods, on the other hand, can deal with the issue of isolating constituencies of parties in the electorate, and accommodate by construction any order of interaction between predictors.

2 The generalized learning vector quantization

The interpretation of LVQ in its original formulation relies on simple heuristics, and its success was due to its good performance as a black-box predictive device. Following Sato and Yamada (1995), one can also frame the LVQ algorithm as a minimization algorithm, where the minimand is a loss function based on quantities that can be given the interpretation of residuals. The generalized relevance learning vector quantization that I implement is based on the minimization of a monotonically increasing function of the distances between observations and prototypes. Observations, indexed by i, are defined by a class membership yi and a K-vector xi, with element xik the value of predictor k for observation i. The aim is to minimize

C = Σi F(µ(xi))   (1)

where µ(xi) is a residual for observation i. The choice of how the residual is defined determines which variant of LVQ is fit. First of all, define Dλ(x, y) = Σk λk(xk − yk)², with weights λk ≥ 0 and Σk λk = 1, the weighted squared Euclidean distance. The use of a weighted distance (that is itself updated during the estimation) is what makes this a relevance LVQ. For every observation i, define the distances d1 = Dλ(xi, W1) from the closest prototype of the same class and d2 = Dλ(xi, W2) from the closest prototype of a different class. In terms of the application proposed here, these are, respectively, the distance from the closest typical voter of the party that a given voter

actually chose, and the distance from the closest typical voter of a party that the voter did not choose. The aim is to find locations for the prototypes so that they are close to voters of their parties, and far from voters of the other parties. Then, the residual is chosen to be µ(xi) = (d1 − d2)/(d1 + d2) (when d1 = d2 = 0, the quantity is not defined and is set equal to 0). This leads to the generalized learning vector quantization. In addition, a function F() appears in equation 1. A simple choice for F is the inverse-logit function (also called the “sigmoid” in the machine-learning literature). Notice that, by choosing this sigmoid, which is steepest at 0, the cost function is most sensitive to changes in the distances exactly for those observations that are mid-way between two prototypes.
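To fix ideas, here is a minimal sketch of these quantities (the function names are mine; this is not the paper's R/C implementation):

```python
# Building blocks of the GRLVQ loss: the weighted squared distance D_lambda,
# the relative-distance residual mu, and the sigmoid choice for F.
import numpy as np

def weighted_sq_dist(x, w, lam):
    """D_lambda(x, w) = sum_k lam_k * (x_k - w_k)^2."""
    return float(np.sum(lam * (x - w) ** 2))

def residual(x, w_same, w_other, lam):
    """mu(x) = (d1 - d2) / (d1 + d2); set to 0 when d1 = d2 = 0."""
    d1 = weighted_sq_dist(x, w_same, lam)
    d2 = weighted_sq_dist(x, w_other, lam)
    return 0.0 if d1 + d2 == 0 else (d1 - d2) / (d1 + d2)

def F(z):
    """Inverse-logit ("sigmoid"): steepest at z = 0, i.e., for observations
    mid-way between the closest correct and incorrect prototypes."""
    return 1 / (1 + np.exp(-z))

x = np.array([0.0, 0.0])
lam = np.array([0.5, 0.5])
mu = residual(x, np.array([1.0, 0.0]), np.array([3.0, 0.0]), lam)
print(mu, F(mu))  # mu = -0.8: the voter sits much closer to the correct prototype
```

A negative residual means the observation is classified correctly (closer to a prototype of its own class); summing F(µ) over observations gives the cost C of equation 1.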

2.1 The basic estimator

The basic estimator is obtained by passing through each of the data points repeatedly, and updating the positions of the prototypes, and the weights, according to the update equations below. The codebook is initialized by randomly selecting nj observations from category j in the data. The updates, which move the prototypes based on whether they correctly or incorrectly predict observations, can be interpreted as steps in a stochastic gradient descent minimization of the cost function in (1). In my implementation, a random ordering of the observations is chosen. The algorithm passes ten times through each observation, then the order of the observations is reshuffled, ten more passes through the data take place, and the whole process is repeated 20 times, so that the algorithm passes through each observation 200 times, updating the location of two prototypes at each step.4 The algorithm to estimate the location of the prototypes is the following:

• initialize the codebook by randomly selecting nj observations for party j ∈ {1, . . . , J}
• initialize the weights as λk = 1/K for all k
• iterate over m ∈ {1, 2, . . . , M}, where M is the number of passes through each observation (M = 200 in the implementation described here)
• sort the observations randomly

4 My implementation is partly in R and partly in C. Reshuffling observations in the R environment is much simpler than in C, but iterations are much faster in the latter.


• iterate over observations i ∈ {1, 2, . . . , N}
• identify the closest prototype of the same class as the observation (call it W1(i)) and the closest prototype of a different class (call it W2(i)), and calculate d1 and d2 5
• move the closest prototype of the same class by

∆W1 = α+ F′(µ) [d2 / (d1 + d2)²] (xi − W1)   (2)

• move the closest prototype of a different class by

∆W2 = −α− F′(µ) [d1 / (d1 + d2)²] (xi − W2)   (3)

• update the weight λk for dimension k ∈ {1, 2, . . . , K} by

∆λk = −ε F′(µ) [d2 (xik − W1k)² − d1 (xik − W2k)²] / (d1 + d2)²   (4)

• normalize the weights so that Σk λk = 1

Notice that, by using the inverse-logit function, the amount of updating (which depends on the first derivative of the cost function) is, all else equal, largest when a given observation is located exactly at the same distance from the closest correct and the closest incorrect prototype (there, µ equals 0, where the inverse-logit is steepest). In other words, more updating takes place when the algorithm passes through observations that are hard to classify than through observations that already lie close to the center of a constituency of the correct party and far from the centers of the constituencies of other parties. At every step, the weights λ are also updated. The weight updating implies that dimensions receive more weight when they are more determinant in discriminating between voters of different parties: the weight of a dimension on which an observation is close to the correct prototype and far from the incorrect prototype is increased, and vice versa. If the variables are standardized

5 If d1 = d2 = 0, the quantities d1/(d1 + d2)² and d2/(d1 + d2)² are not defined, and they are therefore set equal to 0. This is equivalent to skipping the update when the positions of two prototypes and an observation coincide exactly.
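A single update step can be sketched as follows, under my reading of equations 2-4, with the sigmoid derivative F′(µ) = F(µ)(1 − F(µ)). The learning-rate values are illustrative only: α− = α+/(C − 1) for C = 3 parties and ε = α−/10, following the rules of thumb discussed below, not the paper's actual numbers.

```python
# Hedged sketch of one stochastic GRLVQ update (equations 2-4).
import numpy as np

def grlvq_step(x, w1, w2, lam, a_plus=0.05, a_minus=0.025, eps=0.0025):
    """One update for observation x; w1/w2 are the closest correct/incorrect
    prototypes, lam the relevance weights."""
    d1 = np.sum(lam * (x - w1) ** 2)         # distance to closest correct prototype
    d2 = np.sum(lam * (x - w2) ** 2)         # distance to closest incorrect prototype
    if d1 + d2 == 0:                         # coincident points: skip (footnote 5)
        return w1, w2, lam
    mu = (d1 - d2) / (d1 + d2)
    F = 1 / (1 + np.exp(-mu))                # inverse-logit F
    Fp = F * (1 - F)                         # F'(mu)
    denom = (d1 + d2) ** 2
    new_w1 = w1 + a_plus * Fp * (d2 / denom) * (x - w1)    # eq. (2): pull in
    new_w2 = w2 - a_minus * Fp * (d1 / denom) * (x - w2)   # eq. (3): push away
    new_lam = lam - eps * Fp * (d2 * (x - w1) ** 2 - d1 * (x - w2) ** 2) / denom  # eq. (4)
    new_lam = np.clip(new_lam, 0.0, None)
    return new_w1, new_w2, new_lam / new_lam.sum()         # keep weights summing to 1

x = np.array([0.0, 0.0])
w1, w2, lam = grlvq_step(x, np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([0.5, 0.5]))
```

In this worked example the correct prototype is pulled slightly toward the observation, the incorrect one is pushed away, and the weights remain normalized.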


so that they are approximately on the same scale, the weights can be considered direct estimates of the “importance” of a given characteristic in a given political system. Otherwise, they perform both the function of scaling factors and of importances.6 The parameters α+, α−, and ε are the learning rates. The location of the solution of the minimization does not depend on the learning rates, but the speed of convergence, and the tendency of the prototypes to settle in some local minimum, do. Preliminary experiments with artificial data showed that, when there are many categories, if α+ = α− the algorithm tends to place some prototypes outside of the range of the data, where they are no longer updated. Prototypes pushed towards the boundary, and then outside of the convex hull of the data, are no longer “active”, in the sense that they never get picked as closest to an observation at the update stage. The algorithm then gets stuck and gives nonsensical results, in that most prototypes lie outside of the range of the data. The intuition here is that any prototype is more likely to be “pushed” when selected as the closest incorrect prototype than to be “pulled” when selected as the closest correct prototype. This happens because, with C > 2 classes (e.g., parties being modeled), there are more ways for a prototype to be selected as incorrect than as correct. Indeed, the phenomenon does not arise when there are two classes of approximately equal size. Many of the machine-learning contributions (e.g., Villmann et al. 2005; Hammer and Villmann, n.d.; Schneider et al., n.d.) evaluate performance in a two-class setting, hence this behavior seems not to have been noticed in the literature. In any case, the behavior of the estimator is much more reliable, in terms of convergence, if one sets α− = α+/(C − 1).7 The learning rate ε for the weights, following the recommendation in Villmann et al. (2005), is set equal to 1/10 of the learning rate α− for the prototypes, so that the prototype updating happens in an almost-stationary setting, in the sense that the distance metric varies much more slowly than the locations of the prototypes. Preliminary exploratory experiments also assessed whether standardizing the data or estimating the prototypes in the space of the principal components of the data improved performance. It turns out that standardization does not affect performance (but involves one additional step to recast the prototypes in the original space of the variables in order to interpret them). On the other hand, the prototype estimator performs

6 Analogously to the function performed by regression coefficients in regression models.
7 One could also adjust the learning rates by the proportions of observations that fall into the categories. What is done here is equivalent to assuming that all parties have equal sizes.


(slightly) worse if estimated on the data transformed via principal components rather than the original data. Hence in the experiments and the empirical application I estimate the GRLVQ on the untransformed data.

Multiple chains and bootstrap There is no simple way to assess whether the algorithm has converged to a global minimum of the cost function in equation 1. In order to address this issue, and to make sure that the initial state of the prototype algorithm (the initialized codebook) plays no role in the results, I run Mconv = 20 parallel chains, using different initial values. For each of the chains, I get estimates of the weights and of the locations of the prototypes. To get a final set of estimates for the locations of the prototypes, the average location of each prototype across the Mconv chains is taken.8 In addition, in order to estimate the sampling variability of the estimates, Mbs = 980 bootstrap resamples of the data are taken, and the algorithm is fit on each of the resamples. This allows one to estimate, under standard bootstrap assumptions, the sampling distribution of the positions of the prototypes from the variation in the location of prototypes over resamples. Due to the computational burden (which is trivial when estimating on actual data, but significant when running MonteCarlo experiments), I do not assess the performance of the standard errors in the experiments reported below.
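The nearest-neighbor matching of prototypes across chains (footnote 8) can be sketched in a few lines. This is a toy helper of mine, not the paper's code: it matches each reference prototype of a class to the closest prototype in every other chain, then averages.

```python
# Sketch: average prototype locations across chains after nearest-neighbor
# matching against a reference chain (chains[0]).
import numpy as np

def match_and_average(chains):
    """chains: list of (n_prototypes, K) arrays for one class."""
    ref = chains[0]
    matched = [ref]
    for c in chains[1:]:
        # for each reference prototype, pick the closest prototype in chain c
        # (matching is done independently per prototype: fine for a sketch,
        # though collisions are possible in principle)
        order = [int(np.argmin(np.sum((c - r) ** 2, axis=1))) for r in ref]
        matched.append(c[order])
    return np.mean(matched, axis=0)

chain_a = np.array([[0.0, 0.0], [5.0, 5.0]])
chain_b = np.array([[5.2, 4.8], [0.2, -0.2]])   # same prototypes, different order
print(match_and_average([chain_a, chain_b]))
```

Because the order of prototypes in the codebook is arbitrary, averaging without matching would mix prototypes from different constituencies; the reordering step avoids that.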

2.2 Evaluating the performance

In order to assess the performance of the LVQ, I run an extensive set of MonteCarlo simulations, varying the dispersion of the observations around their centroids (and hence the overlap both across parties and across constituencies of different parties), the number of constituencies per party, and the number of parties. For each experimental condition, I first draw and record one random configuration of centroids: these are the “true values” of the positions of the centroids of the constituencies. Then, I draw the actual data one thousand times, from multivariate normal distributions with mean equal to the “true value” for a given constituency and variance set according to the experimental condition. It is worth explaining why I do not simulate the data according to the data-generating process implied by the LVQ. The LVQ is a hard-boundary classifier: it classifies observations as belonging to category j if a prototype of category j is closest to the observation. This means that there is no overlap in the predicted

8 Given that the order of the prototypes for each class in the codebook is not necessarily the same across chains, I match the prototypes from different chains using the nearest-neighbor rule.



Figure 3: The left panel displays one draw of the data used in the MonteCarlo experiments. The right panel shows the true (square) and estimated (triangle) locations of the centroids. Top row: three parties, one (party A) with three constituencies, two (parties B and C) with one constituency each, and strong overlap. Bottom row: four parties, with four constituencies each, with moderate overlap.


values that come from LVQ. If one were to try to “estimate the DGP”, one would necessarily have to generate non-overlapping classes: one draw would involve choosing the locations of the prototypes, then randomly scattering observations in the multi-dimensional space of the predictors, and assigning to each observation the class of the closest prototype. In actual data on voting behavior, one almost never encounters non-overlapping classes. Hence, the data-generating process aims at recreating this feature of the actual data on which I propose to use the method. The performance of the LVQ with non-overlapping classes should be even more accurate, so the experiments I perform are a hard case for the LVQ. The experimental parameter that drives the amount of overlap, in my experiments, is the standard deviation of the distribution from which the locations of the observations are drawn. In the plots below, I focus on two variance conditions. In the moderate dispersion condition, the standard deviation of the random variable from which observations are simulated is set equal to one; in the strong dispersion condition, the standard deviation is set equal to three. To put these numbers in perspective, consider the simple unidimensional case with two constituencies. In the simulations, the true locations of the prototypes are chosen (randomly) in a way that ensures that the median distance between any pair of true values is less than 4 units on each dimension. In the strong dispersion condition, approximately 10% of the observations that belong to the constituency with the lower mean lie above the center (the mean) of the other constituency. In the moderate overlap case, only a negligible part of the lower distribution lies above the mean of the upper one, but the tails of the two distributions are heavily overlapping. 
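The data-generating process just described can be sketched as follows. The parameter values are my own: sd = 1 corresponds to the moderate dispersion condition and sd = 3 to the strong one, and the centroid range and sample sizes are illustrative.

```python
# Sketch of the experimental DGP: draw constituency centroids at random,
# then scatter observations around them from multivariate normals.
import numpy as np

def simulate(centroids_per_party, n_per_constituency=200, K=2, sd=1.0, seed=1):
    """centroids_per_party: list with the number of constituencies of each party."""
    rng = np.random.default_rng(seed)
    X, y, true_centroids = [], [], []
    for party, n_c in enumerate(centroids_per_party):
        for _ in range(n_c):
            mu = rng.uniform(-8, 8, K)            # random "true" centroid
            true_centroids.append((party, mu))
            X.append(rng.normal(mu, sd, (n_per_constituency, K)))
            y.append(np.full(n_per_constituency, party))
    return np.vstack(X), np.concatenate(y), true_centroids

# party A with three constituencies, parties B and C with one each
X, y, cents = simulate([3, 1, 1], sd=3.0)
print(X.shape, len(cents))
```

Unlike the hard-boundary DGP implied by the LVQ itself, the normal scatter makes the classes overlap, which is the feature of real voting data the experiments are meant to reproduce.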
The strong overlap case is, potentially, a borderline impossible case for any method, while the moderate overlap case is not a trivial case with separate, non-overlapping clusters. The amount of overlap is evident in the plots in the left column of figure 3. In particular, the bottom left plot displays an experiment with moderately low variance. For every experimental condition, I fit the GRLVQ on each of the 1000 simulated datasets, following the procedure detailed in the previous section. For each simulation, I record the final positions of the prototypes and the final estimates of the weights. The discrepancies between the true values and the estimated locations of the prototypes are summarized in the plots below as estimates of bias. In the appendix, I also report the root mean square error of the estimated locations, and the standard deviation of the estimates of the weights λ. A central tuning parameter for the LVQ is the number of prototypes per class (the number of constituencies per party). In the experiments reported here, I always tune the estimator with the true configuration of constituencies in the DGP. Clearly, the true number of constituencies is known to the analyst in the case of simulations on artificial data, but it is not, in general, known when working with actual data. These simulations, then, are run in order to assess the ability of the method to estimate the positions of the centroids of the constituencies, conditional on the correct number of constituencies. In plain words, these experiments evaluate the performance of the estimator under the assumption that the LVQ model is properly tuned. In subsection 2.3, I discuss the gap statistic, a powerful method to back out the number of prototypes to be fit. This method was originally developed to decide on the number of clusters in unlabeled data (it is an “unsupervised” method). The gap statistic should be used in conjunction with the LVQ in a two-step procedure. First, the gap statistic (which is computationally undemanding) is used to estimate how many clusters per party are present in the data. Then the LVQ, tuned to estimate as many prototypes per party as there are clusters in the data for each party, is fit to estimate the location of the centroid of each constituency. Several conclusions can be drawn from the results of the extensive MonteCarlo simulations. First of all, the GRLVQ displays a remarkable ability to “find” the centroids of the constituencies, even in the presence of strong overlap between constituencies, when tuned with the correct number of prototypes. The plots in figure 3 display one draw of the data, the true centers of the constituencies, and one simulation of the estimated locations, for two experimental conditions. In the top row I display an experiment with three parties, one (party A) with three constituencies, two (parties B and C) with one constituency each, and strong overlap. 
In the bottom row I display an experiment with four parties, with four constituencies each, and moderate overlap. Only the first two dimensions are displayed in the plots. There is considerable overlap between the constituencies. The plots in the left column display what the data look like; the plots in the right column display the true and the estimated locations of the prototypes, with squares for the true values and triangles for the estimated locations. As one can appreciate from the plots, the GRLVQ is able to find the locations of the centroids with remarkable accuracy. The plots in figures 4-7 display, for a subset of the experimental conditions, the bias for each prototype on each dimension, as a percentage of the range of the data. The bias is defined as the average of the difference between the true



Figure 4: Bias as a percentage of the range of the predictors in the data. Bias is defined as the average difference between the estimated coordinate of the prototype and the true coordinate of the centroid in the data-generating process. The average is taken over the 1000 MonteCarlo simulations. Each row represents a prototype, each dot represents bias on one dimension. The crosshair is the average bias per prototype (where the average is taken over the dimensions). Ten-dimensional space, three parties with three constituencies each. Left panel: moderate dispersion of the observations around the center. Right panel: strong dispersion of the observations around the center.

positions of the centroids and the estimated ones, where the average is taken over the 1000 simulations. One can also look at the average bias by prototype (with the average taken over the dimensions in which the prototype is located). There is no obvious way to scale the difference between estimated and true values, unlike in the case of regression coefficients, to make them comparable in absolute terms. Scaling the difference between the estimated and true locations of the prototypes by the range of the data is a reasonable solution. For example, in terms of ideological self-placement on a 10-point scale, a variable almost always used in the study of comparative voting behavior, a 10% deviation would mean that the prototype estimator misses the true location of the typical voter of a party by one ideology point; a 1% deviation implies an error of one-tenth of a point on the ten-point scale.
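The bias summary plotted in figures 4-7 can be computed as in the following sketch (the array shapes and names are my own):

```python
# Bias per prototype coordinate, scaled by the range of each predictor and
# expressed as a percentage, averaged over simulations.
import numpy as np

def bias_percent(estimates, truth, data):
    """estimates: (n_sims, n_prototypes, K); truth: (n_prototypes, K);
    data: (N, K) pooled observations used for the per-dimension ranges."""
    rng_ = data.max(axis=0) - data.min(axis=0)
    return 100 * (estimates - truth).mean(axis=0) / rng_

truth = np.array([[0.0, 0.0]])
est = np.array([[[0.5, -0.5]], [[1.5, 0.5]]])     # two simulated fits
data = np.array([[-10.0, -10.0], [10.0, 10.0]])   # range of 20 on each dimension
print(bias_percent(est, truth, data))             # [[5., 0.]]
```

In the toy example, the two estimation errors on the second dimension cancel out on average, while the first dimension shows a systematic 5% offset.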

Convergence and inference The GRLVQ I implement can be framed as a gradient descent minimization of the sum of the “residuals” described in equation 1. It is desirable to assess whether the algorithm has converged and, if it has, whether it found a global minimum or only a local one.



Figure 5: Bias as a percentage of the range of the predictors in the data. Bias is defined as average difference between estimated coordinate of the prototype and the true coordinate of the centroid in the data-generating process. The average is taken over the 1000 MonteCarlo simulations. Each row represents a prototype, each dot represents bias on one dimension. The crosshair is the average bias per prototype (where the average is taken over the dimensions). Ten-dimensional space, three parties with one constituency each. Left panel: moderate dispersion of the observations around the center. Right panel: strong dispersion of the observations around the center.





Figure 6: Bias as a percentage of the range of the predictors in the data. Bias is defined as average difference between estimated coordinate of the prototype and the true coordinate of the centroid in the data-generating process. The average is taken over the 1000 MonteCarlo simulations. Each row represents a prototype, each dot represents bias on one dimension. The crosshair is the average bias per prototype (where the average is taken over the dimensions). Ten-dimensional space, three parties, one with three constituencies (bottom three rows) and two with one constituency each (top two rows). Left panel: moderate dispersion of the observations around the center. Right panel: strong dispersion of the observations around the center.



Figure 7: Bias as a percentage of the range of the predictors in the data. Bias is defined as average difference between estimated coordinate of the prototype and the true coordinate of the centroid in the data-generating process. The average is taken over the 1000 MonteCarlo simulations. Each row represents a prototype, each dot represents bias on one dimension. The crosshair is the average bias per prototype (where the average is taken over the dimensions). Ten-dimensional space, four parties with one constituency each. Left panel: moderate dispersion of the observations around the center. Right panel: strong dispersion of the observations around the center.


Clearly, if the estimator were to give a different answer depending on the initial values of the algorithm and on how the data are sorted, one would have doubts about the validity of the estimates. There are two perspectives from which one can look at convergence. The first and simpler has to do with the weights: how much do the estimates of the weights vary when the LVQ is fit with different initial starting values? If the estimator has converged to the actual minimum of equation 1, the weights should be the same in all chains. In addition, one can also look at the variation in the locations of the prototypes across the Mconv chains. Given that these are estimated on the same data (unlike the bootstrap resamples), with the only differences across them being the order of the data and the initial conditions, they can be considered to have converged if they are very close to each other. In practical terms, to assess convergence, one can run several parallel chains of the GRLVQ, each using a different initialization (i.e., a different initial codebook to be optimized) and passing through the observations in a different order. Comparing the results across chains does not constitute definitive evidence of convergence to a global minimum, but it is informative as to whether the LVQ has converged to a configuration that is “in the data” rather than an artifact of the initial values or of the specific order in which the algorithm passes through the observations. This does not exclude that many chains might converge to the same local minimum (this cannot be excluded in principle), but, given that the chains start from different points and follow different orders, consistency of the results across chains should be taken as evidence in favor of convergence to the “correct” solution. In any case, if all the chains settle in the same location (at least within a reasonable approximation), the influence of the initial conditions (i.e., the codebook used as starting value) has dissipated. 
The logic is analogous to the one on which the R̂ statistic for convergence of MCMC is based (Gelman and Rubin 1992). It is unclear how one could formally assess the convergence of the algorithm. In the case of MCMC estimation, one can compare the steady-state within-chain variance and the variance across chains; but the LVQ is a simple minimization, hence it does not have a steady state, only a solution. In any case, there are two sets of quantities worth looking at to detect clear departures from convergence: the first has to do with variation in the locations of the prototypes, the other with variation in the λ weights across chains. While the fact that multiple chains have settled in the same location does not guarantee convergence to a global minimum, wild variation in the final state of the algorithm points to clear non-convergence. Similarly, if there is substantial


variation across chains in the weights λ assigned to different dimensions, this constitutes evidence of non-convergence. In the appendix, I include tables that report the root mean squared error of the position of each of the prototypes, and the average by dimension taken over all prototypes, as well as the average (taken over the 1000 simulations) of the standard deviation of the λ weights across chains.
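The cross-chain check on the λ weights can be sketched in a few lines. The example arrays are mine, and the paper reports the raw standard deviations rather than a formal test, so no threshold here is authoritative.

```python
# Sketch: per-dimension standard deviation of the relevance weights across
# chains; large values signal clear non-convergence.
import numpy as np

def weight_spread(lams):
    """lams: (n_chains, K) array of final lambda weights, one row per chain."""
    return np.asarray(lams).std(axis=0)

converged = np.array([[0.5, 0.3, 0.2], [0.49, 0.31, 0.2], [0.51, 0.29, 0.2]])
diverged = np.array([[0.9, 0.05, 0.05], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
print(weight_spread(converged).max())  # small: consistent with convergence
print(weight_spread(diverged).max())   # large: clear non-convergence
```

The same computation applied to the (matched) prototype locations gives the second diagnostic discussed above.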

2.3 Estimating the number of constituencies per party

A crucial parameter of any prototype method, and of the LVQ in particular, is the number of prototypes to be fit for each class. In substantive terms, a preliminary task is to estimate from the data the number of constituencies per party. This is a non-trivial challenge. There is an extensive literature in statistics on finding the number of clusters in unlabeled data (see Milligan and Cooper 1985 for a review of earlier studies). The problem is very similar to that faced when deciding on the number of classes in latent class analysis (Nylund et al. 2007). One relatively recent development is the gap statistic, proposed by Tibshirani et al. (2001) to estimate the number of clusters in (unlabeled) data. The intuition on which the gap statistic is based can be (loosely) summarized as follows. Assume one compares the fit of a k-clusters model and the fit of a (k + 1)-clusters model. If there really are k clusters in the data, the improvement in the fit of the model should be modest, while if there are more than k clusters, the improvement should be larger. In order to evaluate whether the improvement in fit warrants the addition of one cluster to the model, one needs to compare this improvement with the improvement that would obtain under some reference (null) distribution. Given that the reference distribution (the distribution of the fit under the null hypothesis of no clusters) does not have an analytical formulation (beyond one-dimensional problems), Tibshirani et al. (2001) suggest simulating data from a null distribution, estimating the statistic on the simulated data, and then comparing the observed value with the simulated reference distribution. In formal terms, the Tibshirani et al. (2001) method works as follows. First, fit a clustering algorithm to the data, and define W(k) as the within sum of squares for a cluster classification with k clusters. Then, for every k ∈ {1, 2, . . . , K}, compute

Gap(k) = E∗[log(W(k))] − log(W(k))   (5)

as the difference between the expected (log) sum of squares under the reference distribution and the observed (log) sum of squares. Then choose the number of clusters as the smallest k for which Gap(k) ≥ Gap(k + 1) − sk+1, where s is an estimate of the variability of the gap estimate itself. In practice, to estimate E∗[log(W(k))] and s, one follows this procedure:

1. generate B reference data sets, drawing from uniform distributions with the same range as the data (possibly along the principal components of the data)
2. fit the clustering algorithm on each reference data set (indexed by b) and calculate W∗b(k) as the within sum of squares
3. estimate E∗[log(W(k))] with l̂ = (1/B) Σb log(W∗b(k)) and plug this into (5) to compute the gap statistic
4. compute the standard deviation of the estimated expected within sum of squares, sdk = [(1/B) Σb (log(W∗b(k)) − l̂)²]^(1/2)
5. calculate sk = sdk √(1 + 1/B)

The gap statistic is designed to be used on unlabeled data, and therefore relies on “unsupervised” classification methods. This means that the “dependent variable”, i.e., the cluster to which an observation belongs, is not observed. It is convenient, for this reason, to perform the test separately for each party. Hence, the voters of party A are selected, the gap statistic is estimated on these respondents only, and the number of clusters detectable among party A voters is decided. Then one selects the next party, and repeats the procedure. At the end, for a party system with J parties, one gets a J-vector with j-th element the number of constituencies of party j. The gap statistic has been shown to perform remarkably well, especially when there is no overlap in the clusters. In addition, unlike other methods, it is able to test whether there is one single cluster in the data. Given that in the application proposed here there is no good reason to exclude the possibility that a given party has only one constituency (this might be the case especially for smaller parties, for instance ethnic minority parties), a test that cannot evaluate whether there is only one cluster (one constituency) in the data would not be appropriate.
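The five steps above can be sketched end-to-end. The snippet below is my own illustration, not the paper's R implementation: it uses a small k-means routine as the clustering algorithm and a plain uniform bounding box as the reference distribution, without the principal-components rotation.

```python
# Hedged sketch of the gap-statistic procedure (Tibshirani et al. 2001).
import numpy as np

def kmeans_wss(X, k, rng, iters=50):
    """Lloyd's algorithm; returns the within-cluster sum of squares W(k)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        lab = d.argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                centers[j] = X[lab == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def gap_statistic(X, k_max=5, B=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)        # uniform bounding box (step 1)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        logW = np.log(kmeans_wss(X, k, rng))
        ref = [np.log(kmeans_wss(rng.uniform(lo, hi, X.shape), k, rng))
               for _ in range(B)]                # steps 2-3
        gaps.append(np.mean(ref) - logW)         # equation (5)
        s.append(np.std(ref) * np.sqrt(1 + 1 / B))  # steps 4-5
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:        # decision rule
            return k
    return k_max

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-4, 0.5, (150, 2)), rng.normal(4, 0.5, (150, 2))])
print(gap_statistic(X))
```

On clearly separated clusters like these, the rule typically stops at the true number; a real application would use a larger B and k_max, and restarts for the k-means step.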

             c2    c4    c2    c4    c2     c4    c2    c4    c2    c4    c2     c4    c1   c1   c1
             d2    d2    d5    d5    d10    d10   d2    d2    d5    d5    d10    d10   d2   d5   d10
             l     l     l     l     l      l     h     h     h     h     h      h     h    h    h
gap
  median     2.0   3.0   2.0   3.0   2.0    3.0   2.0   1.0   2.0   3.5   2.0    4.0   1    1    1
  mode       2.0   3.0   2.0   4.0   2.0    3.0   2.0   1.0   2.0   4.0   2.0    4.0   1    1    1
  %correct  99.3  32.5 100.0  33.2 100.0   28.5  50.9   0.2  93.1  49.9  99.8   81.7 100  100  100
Hartigan
  median     2.0   1.0   2.0   1.0   2.0    1.0   1.0   1.0   1.0   1.0   1.0    1.0   1    1    1
  mode       2.0   1.0   2.0   1.0   2.0    1.0   1.0   1.0   1.0   1.0   1.0    1.0   1    1    1
  %correct  92.6   0.2  99.6   0.0  99.9    0.0   0.9   0.0   0.0   0.0   0.0    0.0 100  100  100
Silhouette
  median     2.0   3.0   2.0   4.0   2.0    4.0   2.0   2.0   2.0   2.0   2.0    3.0   4    6    7
  mode       2.0   3.5   2.0   4.0   2.0    4.0   2.0   2.0   2.0   2.0   2.0    4.0   3    8    8
  %correct  99.4  34.3 100.0  43.8 100.0   45.5  78.8   1.1  99.0  15.1 100.0   41.0   0    0    0
CH
  median     2.0   4.0   2.0   5.0   2.0    5.0   2.0   2.0   2.0   2.0   2.0    3.0   6    2    2
  mode       2.0   4.0   2.0   4.0   2.0    4.0   2.0   2.0   2.0   2.0   2.0    4.0   6    2    2
  %correct  99.2  50.9 100.0  45.3 100.0   45.4  71.9   6.6  99.9  16.2 100.0   36.7   0    0    0

Table 1: Results of the simulation experiments for the gap statistic and three other methods: the Hartigan method, the silhouette statistic, and the Calinski and Harabasz index. The columns are labeled as follows: cX for number of clusters in the DGP, dX for the dimensionality of the data, and l or h for low or high variance of the data. Hence the column “c2 d2 l” is for experiments with two clusters, two variables, and low variance.

The authors of the gap statistic do not test its performance extensively when there is overlap between clusters, providing only preliminary evidence of satisfactory performance. In the type of data I have in mind (post-election and general social surveys), overlap is an almost ubiquitous feature. The reported results of the simulations, then, constitute an independent contribution to the clustering literature, in addition to showing why the method works with voting behavior data. I implement the gap statistic in the R environment. Following Tibshirani et al. (2001), I allow for the null distribution to be a uniform hypercube along the directions of the principal components of the data. Preliminary simulations show that standardizing the data before performing the test improves the performance, hence I first standardize and center the data to have mean zero and standard deviation one half. I test the performance on a number of artificial conditions, drawing artificial data from multivariate normal distributions and varying the number of constituencies per party (number of clusters) and the amount of overlap between clusters. I also test, on the same data, a handful of other techniques that have been proposed to decide on the number of clusters. In particular, I implement the statistics proposed by Hartigan (1975), by Calinski and Harabasz (1974), and by Kaufman and Rousseeuw (1990). See Tibshirani et al. (2001) for details of these methods. Table 1 displays the summaries of the Monte Carlo experiments for the gap statistic. The column titles report the number of clusters in the data-generating process, the dimensionality of the data, and the variance of the DGP, which controls the amount of overlap between clusters. In the first row I report the median number of prototypes

estimated, in the second row the mode (taken over the 1000 simulations), and in the third row the percentage of simulations in which the gap statistic detects the true number of clusters in the DGP. The rows below show analogous information for the other methods. First of all, the gap statistic, unlike the cross-validation procedure used in a previous version of this manuscript, does not have a tendency to over-estimate the number of prototypes; if anything, it tends to under-estimate it. When there is only one cluster (hence, one constituency per party), the gap statistic correctly detects this fact in all the simulations, regardless of the dimensionality of the space, and even when there is high variance (strong overlap) in the data. It is also quite accurate at detecting the presence of two clusters in most conditions, although it performs poorly with just two variables and high variance. The worst performance is for a large number of clusters (4), with only two variables (a two-dimensional space) and high variance. In that case, the gap statistic systematically detects just one cluster, while there are 4 in the data. Notice the implication for the substantive application of the method: in these conditions, the combination of the gap statistic and LVQ would lead, in practice, to simply summarizing the average characteristics of the voters of a given party. While clearly inaccurate, as discussed above, this would not lead to “false discoveries”. In addition, it is unlikely one would fit a voting behavior model with just two explanatory variables. The performance of the gap statistic is considerably more satisfactory when there are more predictors, even if it tends to underestimate the number of clusters when there are many of them.
The practical message, then, is that one should include rich enough sets of explanatory variables when estimating this type of model, and remember that the gap statistic might underestimate the number of constituencies when there are many of them. In any case, the gap statistic performs better, in terms of accurately detecting the number of clusters, than the other available methods of cluster detection; the details of those methods are not discussed further here.
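The pre-processing step described above (centering to mean zero and rescaling to standard deviation one half) is trivial but easy to get wrong; a minimal sketch in Python, assuming `numpy` (the function name is mine):

```python
import numpy as np

def standardize_half(X):
    # Center each column to mean zero and rescale it to standard deviation 0.5,
    # as done before running the gap statistic on the data.
    return (X - X.mean(axis=0)) / (2.0 * X.std(axis=0))

# Example: columns with arbitrary location and scale.
Z = standardize_half(np.random.default_rng(0).normal(5.0, 3.0, size=(500, 3)))
```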

3  An empirical application: social-democratic and radical right parties in Europe

What are the constituencies on which traditional social-democratic parties in Western Europe rely? Do these change over time? And how does the emergence of radical right parties across Europe change the nature of electoral competition and policy conflict? These are all crucial questions for understanding the past, the present,


and the future of party systems and policy in Europe (Kitschelt 1997; Meguid 2005). In order to show how the prototype approach I propose can be used to answer substantive political science questions, I collect data from the five waves of the European Social Survey. For every respondent, I know the party choice in the most recent election, along with demographic and socio-economic features and basic socio-political attitudes. I then fit a prototype model, according to the methodology described above, for each country-year. First, the gap statistic method is used to decide on the number of prototypes per party; then the GRLVQ is fit with the number of prototypes chosen by the gap statistic. Political parties chosen by less than 3% of the survey respondents are collapsed into an “other” category. As detailed above, the LVQ outputs, for each prototype, a description of its location in the space spanned by the predictor variables. Hence, in practical terms, we learn the age, the education, the income, and the left-right self-positioning of each prototypical voter. These prototypes are the centroids of a “constituency”, or a social group that supports that party. At the same time, it is worth thinking of the LVQ as a data-reduction device: an entire electorate can be summarized, or parsimoniously represented, by a smaller set of prototypes. Given that the prototypes “live” in the same space in which respondents can be located, and this space is common to all countries and all years surveyed, comparisons over time and across space are possible. In this sense, one can look at how, say, the typical German Social Democratic voters differ from the typical UK Labour voters, or how the typical voters change over time in a given country. This means that one can also compare how the “base” of a party changes over time.
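Since the prototypes live in the same space as the respondents, assigning a respondent to a party amounts to a nearest-prototype rule under the relevance-weighted distance used by GRLVQ. A minimal sketch in Python, assuming `numpy`; the toy prototypes, labels, and relevance weights below are hypothetical, not estimates from the paper:

```python
import numpy as np

def nearest_prototype(x, prototypes, labels, relevances):
    # GRLVQ-style weighted squared distance: d(x, w) = sum_i lambda_i * (x_i - w_i)^2.
    # The respondent is assigned the party label of the closest prototype.
    d = ((prototypes - x) ** 2 * relevances).sum(axis=1)
    return labels[int(np.argmin(d))]

# Hypothetical two-dimensional example: (left-right scale, income decile).
protos = np.array([[6.5, 7.0],   # a Conservative prototype
                   [4.6, 5.3],   # a Labour prototype
                   [4.4, 3.1]])  # another Labour prototype
labels = ["Cons", "Lab", "Lab"]
lam = np.array([0.8, 0.2])       # left-right weighted far more than income
party = nearest_prototype(np.array([6.0, 4.0]), protos, labels, lam)
```

Because the relevance weights down-weight income, a slightly right-of-center respondent with modest income is still pulled toward the nearest prototype on the left-right dimension.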
For instance, one can discover if a party, over time, ends up relying on older and older voters, or if a party is able to make breakthroughs in social groups that are not traditionally among its supporters. So, for instance, a conservative party might be able to “add” to its supporters a prototype that is lower income, or a socialist party might be able to gather support from, say, high-education middle-class voters. For every survey respondent, I know the age, sex, education level (coded according to the ISCED classification), income (an ordinal scale constructed by the ESS), union membership, occupational status (retired, student, not in the labor force), type of occupation, and basic attitudes. The occupation type is worth some more detail. I define a respondent as a “traditional worker” if their occupation has an ISCO code between 7000 and 8999. This covers occupations defined as “Craft and related trades workers” and “Plant and machine operators and assemblers.” A respondent is classified as a “clerical worker” if their ISCO code


is between 4000 and 5999. This covers occupations defined as “Clerks” but also “Service workers, shop and market sales workers”. Finally, a worker is classified as “unskilled” if their occupation ISCO code is between 9000 and 9999. These are the occupations classified as “Elementary occupations”. For each type of worker, I create a dummy variable that takes the value of one if the respondent belongs to that category (three dummies in total). Workers whose occupation does not fall into any of the three categories have zero on all three dummies and are treated as the reference category. As for the attitudes recorded and used in the analysis, these are, first of all, the respondent's self-placement on a left-right scale with range [0, 10]. I compute an “interpersonal trust” score as the average of three items, about whether people can be trusted, whether people are fair, and whether people are willing to help. I also compute a TV consumption score (as the average of two questions about TV viewing habits). Information about trust in the police, trust in politicians, trust in political parties, satisfaction with the economy, satisfaction with democracy, support for income redistribution, and gay rights is also included, along with a score for attitudes about immigration obtained from the first principal component of several immigration-related items.9 For every country and every party one gets a full description of the prototypes. These can be reported in tabular format, as I do in table 2 for one specific example, the UK in the fifth wave of the ESS. One can also plot multiple prototypes in the two-dimensional space defined by any pair of predictors, as done in the plots below.

party              Cons    Cons    Cons     Lab     Lab     Lab  Libdem   Other       λ
age              55.500  38.410  72.800  51.010  71.540  39.300  48.110  49.740   0.003
female            0.420   0.480   0.590   0.550   0.370   0.650   0.630   0.310   0.000
union             0.540   0.330   0.410   0.660   0.840   0.550   0.600   0.500   0.000
unemployed        0.030   0.010   0.020   0.030   0.000   0.100   0.020   0.020   0.034
student           0.000   0.080   0.000   0.000   0.000   0.010   0.070   0.000   0.000
retired           0.150   0.000   0.850   0.020   0.890   0.000   0.130   0.180   0.000
nilf              0.240   0.250   0.880   0.130   0.920   0.000   0.260   0.480   0.000
lrscale           6.540   6.680   6.600   4.650   4.410   4.700   4.900   5.010   0.796
freehms           1.990   1.700   2.110   1.760   1.810   1.540   1.630   1.960   0.000
income            7.010   7.910   4.230   5.260   3.070   6.270   5.120   4.680   0.002
tradit.worker     0.150   0.090   0.080   0.050   0.400   0.340   0.120   0.420   0.009

Table 2: Estimated location of the prototypes for UK parties, from the fifth wave of the ESS. Selected dimensions.

9 These are the questions labeled as “Allow many/few immigrants of same race/ethnic group as majority”, “Allow many/few immigrants of different race/ethnic group from majority”, “Allow many/few immigrants from poorer countries outside Europe”, “Immigration bad or good for country’s economy”, “Country’s cultural life undermined or enriched by immigrants”, and “Immigrants make country worse or better place to live”. Higher scores of the index mean that the respondent has a more favorable attitude towards immigration. The principal components model is fit pooling all the data, meaning all countries and all waves.
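The occupational coding described above reduces to a lookup on the code ranges given in the text; a minimal sketch (the function name and dictionary keys are mine):

```python
def occupation_dummies(code):
    # Occupation code ranges as described in the text:
    # 7000-8999 traditional worker, 4000-5999 clerical, 9000-9999 unskilled;
    # anything else is the reference category (all three dummies zero).
    return {"traditional": int(7000 <= code <= 8999),
            "clerical":    int(4000 <= code <= 5999),
            "unskilled":   int(9000 <= code <= 9999)}
```

For example, a welder (code in the 7000s) gets `traditional = 1`, while a teacher (code in the 2000s) falls into the reference category with all three dummies at zero.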


In order to compare, across countries and across time, the social groups that support and play a central role for socialist parties, I first isolate the prototypes of voters of traditional Labour and Social-Democratic parties. This is one party in most countries, with the exception of Belgium, which has two socialist parties (a Flemish and a Francophone one).10

I perform an analogous exercise with the (small) set of radical right parties that achieve sufficiently large support (more than 3% of the survey respondents in a given country-wave combination). The following parties (countries) are classified as “radical right” parties: the FPÖ in Austria, the Vlaams Blok/Vlaams Belang in Belgium, the Swiss People's Party, the True Finns, the Northern League in Italy, the Danish People's Party, the National Front in France, the List Pim Fortuyn and the List Wilders in the Netherlands, the Progress Party in Norway, the Slovene National Party, the Slovak National Party, the Sweden Democrats, the League of Polish Families, the Croatian Party of Rights, and MIEP/Jobbik in Hungary. In some countries a party classifiable as “radical right” runs in elections with scant success. If less than 3% of the survey respondents declare that they voted for such a party, it is collapsed into the “other” category, and hence the party does not appear in the analysis below.

The radical right “poaching” traditionally left voters

Whenever new political parties emerge in a given party system or a given political arena, the first question one might want to ask is where they received support from. Other than parties appealing especially to first-time voters (as, for instance, the Pirate Party in Germany and analogous organizations around north-western Europe might be doing), new parties need to attract the support of voters who previously supported a different party or, less often, of disaffected citizens who did not turn out in past elections. In many cases, they might be able to attract an entire social constituency that used to support some other party. Radical right parties are, superficially, ideologically closer to traditional, mainstream conservative parties, but they might have been able to “poach” voters mostly on the opposite side of the spectrum, and in particular former voters of social-democratic parties. I now show how the LVQ method can be used to directly address questions of this type. 10 These are the Social Democratic parties of Austria, Switzerland, Germany, Denmark, Finland, and Sweden; the Socialist parties of France, Luxembourg, Spain, Greece, Belgium (Francophone), Belgium (Flanders), and Portugal; the Labour party in the UK, Ireland, the Netherlands, and Norway; and the Left Democrats in Italy.



Figure 8: Estimated locations of the prototypical voters of mainstream socialist and radical right parties in the space defined by income and occupation (left plot) and union membership and education level (right plot). Prototypes in grey are for mainstream socialist parties; prototypes in black are for radical right parties.

What emerges from a comparison, country by country, of the educational and occupational profiles of radical right and socialist voters is that both families of parties appeal to superficially similar constituents. For each country, I plot the location of the prototypes in the two-dimensional space defined by income and occupation type (the dummy for “traditional worker”) and by education and union membership. Interestingly, in several cases, the demographic and occupational makeup of the prototypical voters of radical right parties is very similar (when not almost coincident) to that of the prototypical voters of traditional social democratic and labor parties. In many cases, the typical voters of radical right parties are closer to the stated target of socialist parties, or at least to the stereotype of the socialist voter: many prototypes are more unionized, have lower income, and are closer to traditional skilled manual workers than the typical voters of socialist parties. Yet, what emerges from the inspection of the prototypes is that the attitudes, and what could be called a personality trait (interpersonal trust), of the voters are quite different. Figure 9 plots, for Western European countries, the location of the prototypes of radical right and traditional socialist parties in the space defined by support for government redistribution (with higher values meaning less support for redistribution) and attitudes about immigration (the first principal component of several items, with higher values meaning a


more favorable attitude towards immigration). The first pattern that emerges is far from surprising: voters of radical right parties hold much more negative views of immigration than voters of traditional socialist parties. The main exception is Finland, where one of the prototypes for the radical right party (the True Finns) has a more favorable attitude than many prototypical voters of socialist parties in Finland and in other countries. Yet, Finnish voters in general have a more favorable view of immigration than the rest of European voters. In addition, some prototypical socialist voters in Belgium, France, and Denmark have very negative views of immigration. The second pattern worth noticing has to do with the heterogeneity of the redistribution preferences of prototypical voters of radical right parties. In a majority of cases, their preferences for redistribution are not different from those of the voters of mainstream left parties. In some cases, as in Finland, the prototypical voters of radical right parties are more supportive of redistribution. On the other hand, there is a handful of parties that attract voters with relatively strong anti-redistribution preferences; among these are the Danish, the Norwegian, and the Flemish parties. One could conjecture, then, that there are really two types of parties labeled as “radical right”, which calls into question the usefulness of the category itself for analytical purposes: some of these parties attract voters with classical liberal (or “libertarian”) views about the economy, while others (the True Finns, the French National Front, the Northern League) attract supporters of redistribution and the welfare state. The small plots in figure 10 break down the same information country by country.
The Danish and Swiss cases show that the average attitudes of radical right voters are considerably more conservative than those of socialist voters; some of the prototypical voters of radical right parties are considerably more conservative regarding redistribution than socialist voters also in Belgium and Norway, but there is a good amount of overlap. In the rest of the countries, it becomes clear that preferences about redistribution are not what differentiates radical right from mainstream socialist voters. In addition, it is interesting to look at patterns in interpersonal trust. The plot in figure 11 displays the location of the prototypes of traditional socialist and radical right parties in the space defined by income and the interpersonal trust score. While both socialist and radical right parties draw support from all income sections of the electorate, it seems clear that on average the typical voters of radical right parties are less trusting than the voters of traditional socialist parties. The three plots in figure 12 display the pattern for some interesting cases: Norway, Sweden, and France.



Figure 9: Estimated locations of the prototypical voters of radical right parties (black) and traditional socialist parties (grey) regarding income redistribution (higher values meaning opposition to redistribution) and immigration (higher values meaning a more favorable view of immigration). The observations are identified by the country code and the ESS wave from which they are estimated, for instance “BE4” is a prototype for the 4th wave of the ESS in Belgium.



Figure 10: Estimated locations of the prototypical voters of radical right parties (black) and traditional socialist parties (grey) regarding income redistribution (higher values meaning opposition to redistribution) and immigration (higher values meaning a more favorable view of immigration).


Figure 11: Income and interpersonal trust of prototypical voters of radical right (black) and traditional socialist (grey) parties.



Figure 12: Income and interpersonal trust of prototypical voters of radical right (black) and traditional socialist (grey) parties in France, Norway, and Sweden.

What emerges, then, is that the prototypical voters of radical right parties are, in many cases, not different from the prototypical voters of socialist parties when it comes to demographic and socio-economic background, but they are different when it comes not only to attitudes about immigrants but also to interpersonal trust, which could be regarded as a cultural trait or, to an extent, as a direct reflection of a personality trait. Using the interpretation of the prototypes as centroids of constituencies, one can claim that radical right parties draw their support from the same social groups, but from culturally defined sub-groups within those broader social groups. Notice that a pattern like this might have been “masked” in a conditional logit framework if the interpersonal trust score were introduced as a simple linear term. Consider the cases of Denmark, Sweden, and Norway.11 I estimate binary logit models with an indicator variable, equal to one if the respondent voted for the radical right party and zero otherwise, as the response variable. As predictor variables, in a first model I include those used in the estimation of the prototypes. In neither Sweden nor Norway is the coefficient of the interpersonal trust score anywhere near statistical significance. It is statistically significant, on the other hand, in Denmark. In Norway, the importance of interpersonal trust in predicting electoral support for the radical right emerges only when I include the interactions between interpersonal trust and, respectively, ideology (left-right self-placement) and attitudes about immigration.12 When this model is estimated, the main effect 11 These are three cases in which the pattern is quite apparent.
A failure of binary logit to detect the pattern here would imply that a fortiori the issue would surface when the pattern is less marked. 12 The variables are centered to have mean zero to make interpretation of the estimates more straightforward.


of interpersonal trust is negative (and marginally statistically significant, p < .06) and the interaction with ideology is positive and statistically significant. From the substantive point of view, if we compare two voters with average ideological self-placement (call these “centrist voters”), the one with higher interpersonal trust is much less likely to support a radical right party. The effect of interpersonal trust is more pronounced for voters placed to the left of center, and muted for voters who place themselves farther to the right of the ideology spectrum. In Sweden, though, even when the interaction is included, and even if the analysis is restricted to the fifth wave of the ESS (in the other surveys, very few respondents chose the radical right party), no statistically significant effect of interpersonal trust or of its interaction with ideology and immigration attitudes can be detected. This exemplifies how the prototype approach makes it possible for the analyst to detect patterns that would require a more complicated modeling strategy to be detected in a conditional probability framework. While at times (as in the Denmark example above) a conditional probability model and a prototype model provide information that is basically equivalent from the substantive, political science, point of view, this is not necessarily the case: in a conditional probability model, a clear distinction between groups of voters might be masked.
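The interaction logic is easy to make concrete: in a logit with a trust-by-ideology interaction, the log-odds slope of trust at a given (centered) left-right position is the main effect plus the interaction coefficient times that position. The coefficients below are purely illustrative, chosen only to match the signs reported above (negative main effect, positive interaction); they are not the paper's estimates.

```python
# Purely illustrative coefficients (signs as in the Norwegian model described
# above: negative main effect of trust, positive trust-by-ideology interaction).
b_trust, b_trust_x_lr = -0.30, 0.15

def trust_slope(lr_centered):
    # Marginal log-odds effect of interpersonal trust at a given centered
    # left-right self-placement.
    return b_trust + b_trust_x_lr * lr_centered
```

For a centrist voter (lr = 0) the slope is just the negative main effect; two points left of center it is more negative (-0.30 + 0.15 * (-2) = -0.60); to the right it is muted, matching the pattern described in the text.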

4  Conclusion

Political parties should be understood as political expressions of coalitions of social groups, and not only as Schumpeterian/Downsian “teams” that compete in elections. I contend that the off-the-shelf empirical models available to voting behavior researchers, and comparative researchers in particular, do not lend themselves to answering questions about the core constituencies, or the social “bases”, of political parties. For this purpose, I propose to adapt a method, developed in the machine learning literature, that makes it possible to describe the “prototypical” voters of political parties based on post-election or general social surveys. The method not only makes it possible to describe multiple constituencies of political parties, but also accommodates by construction interactions, of any order, between the observable features of voters. I present a new (and fast) implementation of an algorithm that belongs to the generalized relevance learning vector quantization family. The implementation will shortly be released as a publicly-

39

available package in the R environment for statistical computing (R Core Team 2013). Experiments on artificial data show that the method I propose has a remarkably strong ability to locate the “true” locations of the constituency centroids even in the presence of overlap and with noisy data. In addition, I suggest to use the “gap statistic” introduced by Tibshirani et al. (2001) to decide on the number of constituencies that a party displays. The empirical application shows the promise of this approach. As it turns out, voters of radical right parties are quite similar to voters of social democratic parties under certain respects, in particular when it comes to occupation and income, and in some cases they are closer to the stereotypical socialist voter than the prototypical socialist voters estimated from the data. In fact, they tend to be unionized skilled manual workers and often have lower income than socialist voters. At the same time, the policy preferences of radical right voters, especially when it comes to immigration, are very different from those of socialist voters. The analysis also reveals that there are two sub-types of radical right parties: some attract voters that strongly support government redistribution of income, while other attract voters with preferences closer to those of “classical liberal” parties. This might call into question the general usefulness of the “radical right” category as an analytical tool. In addition, the analysis shows that typical voters of radical right parties display lower levels of interpersonal trust than socialist voters. Hence, radical right parties might appeal to subgroups, defined by personality traits, within the same social groups to which traditional social-democratic parties “fish”. 
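To fix ideas, the relevance-learning flavor of vector quantization can be sketched in a few lines. The sketch below is a simplified RLVQ-style scheme (plain attract/repel prototype moves plus a heuristic relevance update), not the exact GRLVQ gradient rules of the implementation described in this paper; the data and all parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two "parties", three features, only feature 0 is informative.
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, 3))
X[:, 0] = np.where(y == 1, 2.0, 0.0) + rng.normal(0, 0.3, n)

# One prototype per class, initialized at the class-conditional means.
W = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
lam = np.full(3, 1 / 3)              # relevance weights, sum to one

alpha, beta = 0.05, 0.01             # prototype / relevance learning rates
for _ in range(30):                  # epochs
    for i in rng.permutation(n):
        d2 = (lam * (X[i] - W) ** 2).sum(axis=1)   # weighted sq. distances
        k = int(np.argmin(d2))                     # winning prototype
        sign = 1.0 if k == y[i] else -1.0
        W[k] += sign * alpha * (X[i] - W[k])       # attract or repel winner
        # RLVQ-style heuristic: on correct wins, shrink the relevance of
        # dimensions that disagree; on incorrect wins, grow it.
        lam -= sign * beta * (X[i] - W[k]) ** 2
        lam = np.clip(lam, 0.0, None)
        lam /= lam.sum()

pred = np.argmin(
    (lam * (X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1
)
accuracy = (pred == y).mean()
print("relevances:", lam.round(3), "accuracy:", accuracy)
```

The estimated relevance vector concentrates on the single informative feature, which is the behavior that lets the method downweight survey items that do not discriminate between constituencies.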
I show how conditional probability models that do not include interactions between the interpersonal trust score and other characteristics of voters (e.g., ideological self-placement) might be unable to detect patterns like this in the data.
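The gap-statistic recipe used to decide on the number of constituencies can be sketched as follows. This is a minimal re-implementation of Tibshirani et al.’s (2001) procedure with a bare-bones k-means (farthest-first seeding, Lloyd iterations) and a uniform reference distribution over the bounding box of the data; it is illustrative only and is not the implementation used in the paper.

```python
import numpy as np

def kmeans_inertia(X, k, iters=25):
    """Bare-bones k-means: farthest-first init + Lloyd iterations.
    Returns the total within-cluster sum of squares (the W_k of
    Tibshirani et al. 2001)."""
    centers = X[[0]]
    for _ in range(k - 1):                       # farthest-first seeding
        d2 = ((X[:, None] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, X[d2.argmax()]])
    for _ in range(iters):                       # Lloyd updates
        lab = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (lab == j).any():
                centers[j] = X[lab == j].mean(0)
    return ((X[:, None] - centers[None]) ** 2).sum(-1).min(1).sum()

def gap_statistic(X, k_max=5, B=10, seed=0):
    """Return (k_hat, gaps), where k_hat is the smallest k with
    Gap(k) >= Gap(k+1) - s_{k+1}."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    log_w = np.array([np.log(kmeans_inertia(X, k))
                      for k in range(1, k_max + 1)])
    ref = []
    for _ in range(B):                           # B uniform reference sets
        Xr = rng.uniform(lo, hi, X.shape)
        ref.append([np.log(kmeans_inertia(Xr, k))
                    for k in range(1, k_max + 1)])
    ref = np.array(ref)
    gaps = ref.mean(0) - log_w
    s = ref.std(0) * np.sqrt(1 + 1 / B)
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps

# Three well-separated artificial "constituencies".
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, (50, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])
k_hat, gaps = gap_statistic(X)
print("chosen number of constituencies:", k_hat)
```

On this easy artificial example the rule recovers the three planted constituencies; with the overlap levels used in the simulations, performance is the empirical question the appendix tables address.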

References

Adams, James F., Samuel Merrill III, and Bernard Grofman. 2005. A Unified Theory of Party Competition: A Cross-national Analysis Integrating Spatial and Behavioral Factors. Cambridge: Cambridge University Press.
Bawn, Kathleen, Martin Cohen, David Karol, Seth Masket, Hans Noel, and John Zaller. 2012. “A Theory of Political Parties: Groups, Policy Demands and Nominations in American Politics.” Perspectives on Politics 10(3): 571-597.
Calinski, Tadeusz, and Jerzy Harabasz. 1974. “A Dendrite Method for Cluster Analysis.” Communications in Statistics – Theory and Methods 3(1): 1-27.
Ezrow, Lawrence, Catherine De Vries, Marco Steenbergen, and Erica Edwards. 2011. “Mean Voter Representation and Partisan Constituency Representation: Do Parties Respond to the Mean Voter Position or to Their Supporters?” Party Politics 17: 275-301.
Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7(4): 457-472.
Ghitza, Yair, and Andrew Gelman. 2013. “Deep Interactions with MRP: Election Turnout and Voting Patterns Among Small Electoral Subgroups.” American Journal of Political Science, forthcoming.
Hammer, Barbara, and Thomas Villmann. 2002. “Generalized Relevance Learning Vector Quantization.” Technical report, Department of Mathematics and Computer Science, University of Osnabrück.
Hammer, Barbara, and Thomas Villmann. n.d. “Estimating Relevant Input Dimensions for Self-organizing Algorithms.” Technical report, Department of Mathematics and Computer Science, University of Osnabrück.
Hartigan, John A. 1975. Clustering Algorithms. New York: John Wiley and Sons, Inc.
Hastie, Trevor, and Robert Tibshirani. 1996. “Discriminant Analysis by Gaussian Mixtures.” Journal of the Royal Statistical Society, Series B (Methodological) 58(1): 155-176.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Honaker, James. 2011. “Learning Vectors for Case Study Analysis.” Paper presented at the Summer Meeting of the Society for Political Methodology, Princeton, NJ.
Karreth, Johannes, Jonathan T. Polk, and Christopher S. Allen. 2013. “Catchall or Catch and Release? The Electoral Consequences of Social Democratic Parties’ March to the Middle in Western Europe.” Comparative Political Studies 46: 791-822.
Kaufman, Leonard, and Peter J. Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley and Sons, Inc.
Kirchheimer, Otto. 1990. “The Catch-all Party.” In The West European Party System, 50-60.
Kitschelt, Herbert. 1997. The Radical Right in Western Europe: A Comparative Analysis. Ann Arbor: University of Michigan Press.
Kohonen, Teuvo. 1988. “Learning Vector Quantization.” Neural Networks 1, Supplement 1.
Kohonen, Teuvo. 1990. “Improved Versions of Learning Vector Quantization.” In Proceedings of the International Joint Conference on Neural Networks.
Lax, Jeffrey R., and Justin H. Phillips. 2009. “How Should We Estimate Public Opinion in the States?” American Journal of Political Science 53(1): 107-121.
Meguid, Bonnie M. 2005. “Competition between Unequals: The Role of Mainstream Party Strategy in Niche Party Success.” American Political Science Review 99(3): 347-359.
Milligan, Glenn W., and Martha C. Cooper. 1985. “An Examination of Procedures for Determining the Number of Clusters in a Data Set.” Psychometrika 50(2): 159-179.
Nylund, Karen L., Tihomir Asparouhov, and Bengt O. Muthén. 2007. “Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study.” Structural Equation Modeling 14(4): 535-569.
Park, David K., Andrew Gelman, and Joseph Bafumi. 2006. “State Level Opinions from National Surveys: Poststratification Using Multilevel Logistic Regression.” In Jeffrey E. Cohen (ed.), Public Opinion in State Politics. Stanford, CA: Stanford University Press, 209-228.
R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Sato, Atsushi, and Keiji Yamada. 1995. “Generalized Learning Vector Quantization.” In G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems 7: 423-429. Cambridge, MA: MIT Press.
Schneider, Petra, Michael Biehl, and Barbara Hammer. n.d. “Adaptive Relevance Matrices in Learning Vector Quantization.” Technical report, Institute for Mathematics and Computing Science, University of Groningen.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B (Methodological) 58(1): 267-288.
Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society, Series B (Statistical Methodology) 63(2): 411-423.
Vermunt, Jeroen K. 2010. “Latent Class Modeling with Covariates: Two Improved Three-Step Approaches.” Political Analysis 18(4): 450-469.
Vicente, Pedro C., and Leonard Wantchekon. 2009. “Clientelism and Vote Buying: Lessons from Field Experiments in African Elections.” Oxford Review of Economic Policy 25(2): 292-305.
Villmann, Thomas, Frank-Michael Schleif, and Barbara Hammer. 2005. “Comparison of Relevance Learning Vector Quantization with Other Metric Adaptive Classification Methods.” Technical report, Clinic for Psychotherapy, University Leipzig.
Wantchekon, Leonard. 2003. “Clientelism and Voting Behavior: Evidence from a Field Experiment in Benin.” World Politics 55(3): 399-422.


A Further evidence from the simulations

    rmse 1  rmse 2  rmse 3  rmse 4  rmse 5  rmse 6  rmse 7  rmse 8  rmse 9  mean.rmse  lambda.sum
1    0.565   0.562   0.542   0.711   0.725   0.728   0.466   0.470   0.472      0.582       0.087
2    0.657   0.648   0.635   0.788   0.767   0.758   0.406   0.414   0.428      0.611       0.089
3    0.431   0.461   0.442   0.591   0.610   0.606   0.148   0.146   0.150      0.398       0.226
4    0.481   0.493   0.479   0.699   0.718   0.717   0.987   0.970   0.985      0.725       0.149
5    0.666   0.659   0.643   0.515   0.526   0.544   0.735   0.764   0.775      0.647       0.029
6    0.655   0.706   0.675   0.214   0.219   0.216   0.524   0.546   0.558      0.479       0.161
7    0.860   0.960   0.920   0.430   0.439   0.425   0.577   0.600   0.609      0.647       0.170
8    0.426   0.445   0.441   1.274   1.262   1.255   0.934   0.906   0.918      0.874       0.022
9    0.912   0.888   0.876   0.639   0.641   0.640   1.045   1.037   1.069      0.861       0.148
10   0.256   0.254   0.252   0.665   0.656   0.642   0.409   0.386   0.398      0.435       0.050

Table 3: Root mean square error, by prototype and dimension, and by dimension, over the 1000 simulations, and average standard deviation of the weights. Experiment with three parties of three constituencies each, moderate overlap.
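Computing the per-prototype RMSE reported in these tables requires first matching each estimated prototype to its “true” centroid, since prototype labels are arbitrary across simulation runs. A hedged sketch of one way to do this (greedy nearest matching, on made-up numbers; the actual simulation code may differ) is:

```python
import numpy as np

def match_and_errors(true_c, est_c):
    """Greedily pair each true centroid with the nearest unused estimate,
    then return the per-centroid, per-dimension absolute errors."""
    true_c = np.asarray(true_c, float)
    est_c = np.asarray(est_c, float)
    unused = list(range(len(est_c)))
    err = np.empty_like(true_c)
    for i, t in enumerate(true_c):
        d2 = [((t - est_c[j]) ** 2).sum() for j in unused]
        j = unused.pop(int(np.argmin(d2)))
        err[i] = np.abs(t - est_c[j])
    return err

# Illustrative check: estimates are shuffled, noisy copies of the truth.
rng = np.random.default_rng(7)
true_c = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 1.0], [0.0, 3.0, 2.0]])
sims = np.stack([
    match_and_errors(
        true_c,
        rng.permutation(true_c + rng.normal(0, 0.2, true_c.shape)))
    for _ in range(1000)
])
rmse = np.sqrt((sims ** 2).mean(axis=0))   # per prototype and dimension
print(rmse.round(3))
```

With noise standard deviation 0.2 and well-separated centroids, every entry of the resulting RMSE matrix sits near 0.2, as one would expect.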

    rmse 1  rmse 2  rmse 3  rmse 4  rmse 5  rmse 6  rmse 7  rmse 8  rmse 9  mean.rmse  lambda.sum
1     0.84    0.83    0.83    0.64    0.61    0.60    0.64    0.65    0.66       0.70        0.00
2     1.11    1.08    1.10    0.65    0.61    0.61    0.72    0.74    0.74       0.82        0.04
3     0.60    0.60    0.60    0.99    0.97    0.95    0.98    1.01    0.96       0.85        0.01
4     1.32    1.32    1.33    1.02    1.01    0.99    0.91    0.95    0.95       1.09        0.07
5     1.01    1.09    1.08    1.27    1.25    1.26    0.82    0.90    0.88       1.06        0.08
6     0.70    0.69    0.70    0.69    0.67    0.69    0.43    0.41    0.39       0.60        0.00
7     0.67    0.68    0.67    0.94    0.93    0.95    0.84    0.87    0.85       0.82        0.04
8     0.48    0.50    0.48    0.81    0.79    0.80    0.56    0.57    0.57       0.62        0.00
9     0.76    0.81    0.79    0.74    0.74    0.72    0.92    0.93    0.93       0.82        0.04
10    0.86    0.87    0.87    1.06    1.06    1.07    0.80    0.81    0.79       0.91        0.00

Table 4: Root mean square error, by prototype and dimension, and by dimension, over the 1000 simulations, and average standard deviation of the weights. Experiment with three parties, each with three constituencies, strong overlap.


    rmse 1  rmse 2  rmse 3  mean.rmse  lambda.sum
1     0.25    0.28    0.21       0.25        0.00
2     0.93    0.66    0.19       0.59        0.00
3     0.33    0.28    0.35       0.32        0.00
4     0.62    0.35    0.31       0.43        0.00
5     0.87    0.65    0.19       0.57        0.00
6     0.27    0.52    0.40       0.39        0.00
7     0.93    0.65    0.19       0.59        0.00
8     0.52    0.21    0.83       0.52        0.00
9     0.69    0.41    0.20       0.43        0.00
10    0.43    0.25    0.22       0.30        0.00

Table 5: Root mean square error, by prototype and dimension, and by dimension, over the 1000 simulations, and average standard deviation of the weights. Experiment with three parties, with one constituency each, strong overlap.

    rmse 1  rmse 2  rmse 3  rmse 4  rmse 5  mean.rmse  lambda.sum
1     0.46    0.46    0.45    0.53    0.60       0.50        0.00
2     1.05    1.06    1.08    0.40    0.44       0.81        0.13
3     0.41    0.43    0.40    0.31    0.30       0.37        0.00
4     1.14    1.24    1.11    0.77    0.34       0.92        0.01
5     0.63    0.65    0.63    1.31    0.34       0.71        0.02
6     0.45    0.48    0.45    0.82    0.52       0.54        0.07
7     0.53    0.55    0.54    0.39    1.15       0.63        0.02
8     0.46    0.47    0.46    0.71    1.02       0.63        0.00
9     0.53    0.55    0.52    0.53    1.09       0.65        0.06
10    0.68    0.73    0.68    0.45    0.41       0.59        0.00

Table 6: Root mean square error, by prototype and dimension, and by dimension, over the 1000 simulations, and average standard deviation of the weights. Experiment with three parties, with respectively three, one and one constituencies, strong overlap.

