Ignoramus, Ignorabimus? On Uncertainty in Ecological Inference
Martin Elff, Thomas Gschwend and Ron J. Johnston
Volume: 16
Language: English
Journal: Political Analysis
DOI: 10.2307/25791917
Date: January, 2008
Political Analysis (2008) 16:70-92, doi:10.1093/pan/mpm030. Advance Access publication December 8, 2007.

Martin Elff, Faculty of Social Sciences, University of Mannheim, A5, 6, 68131 Mannheim, Germany. E-mail: elff@sowi.uni-mannheim.de (corresponding author)

Thomas Gschwend, Center for Doctoral Studies in Social and Behavioral Sciences, University of Mannheim, D7, 27, 68131 Mannheim, Germany. E-mail: gschwend@uni-mannheim.de

Ron J. Johnston, School of Geographical Sciences, University of Bristol, Bristol BS8 1SS, UK. E-mail: r.johnston@bristol.ac.uk

Models of ecological inference (EI) have to rely on crucial assumptions about the individual-level data-generating process, which cannot be tested because these data are unavailable. However, these assumptions may be violated by the unknown data, and this may lead to serious bias of estimates and predictions. The amount of bias, however, cannot be assessed without information that is unavailable in typical applications of EI. We therefore construct a model that at least approximately accounts for the additional, nonsampling error that may result from possible bias incurred by an EI procedure, a model that builds on the Principle of Maximum Entropy. By means of a systematic simulation experiment, we examine the performance of prediction intervals based on this second-stage Maximum Entropy model. The results of this simulation study suggest that these prediction intervals are at least approximately correct if all possible configurations of the unknown data are taken into account. Finally, we apply our method to a real-world example in which we actually know the true values and are able to assess the performance of our method: the prediction of district-level percentages of split-ticket voting in the 1996 General Election of New Zealand. It turns out that in 95.5% of the New Zealand voting districts, the actual percentage of split-ticket votes lies inside the 95% prediction intervals constructed by our method.

Authors' note: We thank three anonymous reviewers for helpful comments and suggestions on earlier versions of this paper. An appendix giving some technical background information concerning our proposed method, as well as data, R code, and C code to replicate the analyses presented in this paper, is available from the Political Analysis Web site. Later versions of the code will be packaged into an R library and made publicly available on CRAN (http://cran.r-project.org) and on the corresponding author's Web site.

1 Introduction

Many students and practitioners of social science have remained skeptical about the feasibility of sound ecological inference (EI). To some, the terms "ecological inference"
and "ecological fallacy" appear almost synonymous: They doubt whether it will be possible at all to draw any conclusions about the behavior of individuals from aggregate data. There seem to be good reasons for such skepticism: EI has to rely on certain assumptions about the data-generating process at the level of individuals that cannot directly be tested, simply because the data on which such tests could be based are unavailable to begin with.

The burden of assumptions becomes especially visible in a recent survey of Bayesian approaches to EI for 2 x 2 tables given by Wakefield (2004), which shows that even Markov chain Monte Carlo (MCMC) approaches are faced with this indeterminacy. Wakefield considers a total of 13 different variants of prior distributions for use in MCMC analysis of 2 x 2 tables, including King's original truncated normal prior, beta exponential, and Student logistic gamma compound prior distributions, and compares them with respect to prediction bias for King's (1997) Louisiana party registration data. Although there is one prior distribution that performs best in this application, one may still ask whether this result can be generalized to all possible EI applications to 2 x 2 tables. The question remains, as Fienberg and Robert aptly remark in their comments on Wakefield's review, "to what extent can we really distinguish between the fit of different models, hierarchical or otherwise, when only aggregate data are available?" (Fienberg and Robert 2004, 432).

Nonetheless, without certain restrictive assumptions about the process generating the unknown data, it is impossible to obtain any estimates for an EI problem, however preliminary these estimates may be. It should be noted that the challenge of EI is only a special case of the wider class of ill-posed inverse problems (King 1997). The task of EI is solving a problem that is inverse insofar as only a set of summaries of the data of interest is given, and ill posed insofar as the information given by these summaries is not sufficient to identify a solution. Problems of this kind abound in the technical and scientific literature, and numerous approaches to their solution have been proposed (see, e.g., Groetsch 1993). Therefore, we think that a general rejection of EI procedures would be premature. Although the assumptions inherent in an EI procedure cannot be tested by means of statistical techniques, it is still possible to delimit the potential error that is associated with predictions from such a procedure. Constructing bounds on this potential error is the aim of the present paper. We derive a method to construct "robust" prediction intervals, that is, intervals that contain the true values of the unknown data with (at least approximately) known probability.
Further, we assess the performance of these prediction intervals by means of simulation and a real-world example, the reconstruction of split-ticket votes in the 1996 General Election of New Zealand. As a point of departure, we take an EI procedure recently presented in this journal, the entropy-maximizing approach of Johnston and Pattie (2000). When recast into a probability model, the Johnston-Pattie model imposes relatively mild and clearly structured restrictions on the unknown data-generating process and thus suits well the exploration of the consequences of model departures. As it turns out, both in the simulation study and in the application to ticket splitting in New Zealand, the prediction intervals that we construct have a coverage that is almost identical to their nominal level.

The paper is organized as follows: In Section 2, we explain the fundamental dilemma of EI, which results from the necessity for EI procedures to employ certain restrictive assumptions that cannot be tested without the aid of the very data that are unavailable in a typical EI application. In Section 3, we propose a second-stage correction of the error distribution of EI estimates based on the Principle of Maximum Entropy and report the results of a simulation study to assess the performance of the proposed method. In Section 4, we discuss how this method can be adapted to cases where some of the data on which EI is based do not come from population-level aggregates but from a survey sample. In Section 5, we illustrate our method with its application to split-ticket voting in the General Election of New Zealand. Section 6 discusses the limits of our proposed method, whereas Section 7 summarizes our results.[1]

2 A Basic Dilemma of EI

The situation of EI can be compared to reconstructing the "inner workings" of a "black box." These inner workings may be, for example, the numbers $x_{ijk}$ of members of various ethnic groups ($i = 1, \ldots, I$) who do or do not turn out to vote ($j = 1, \ldots, J$) in voting districts ($k = 1, \ldots, K$), or the probabilities $\Pr(X_{ijk} = x_{ijk})$ with which these counts occur. If the total sum of the counts is $n$, for example, if the total population of eligible voters is $n$, then the total number of possible configurations of counts $x_{ijk}$ that sum to $n$ can be expressed as the binomial coefficient $\binom{n + IJK - 1}{IJK - 1}$. In typical instances of EI, this is a vast number: Even in a country with only 1 million ($= n$) voters and three ($= I$) ethnic groups, whose members may or may not turn out to vote ($J = 2$) in one of 100 ($= K$) voting districts, there are

$$\binom{n + IJK - 1}{IJK - 1} = \binom{10^6 + 599}{599} \approx 10^{2189} \qquad (1)$$

possible ways to arrange the voters into the black box.

In the absence of any information about the marginal sums $n_{\cdot jk} = \sum_i x_{ijk}$, $n_{i\cdot k} = \sum_j x_{ijk}$, or $n_{ij\cdot} = \sum_k x_{ijk}$, that is, if only the total sum $n$ of the counts $x_{ijk}$ is known and one has to make a point prediction about what configuration of counts is present inside the black box, it seems that one cannot do better than pick any of the $\binom{n + IJK - 1}{IJK - 1}$ possible configurations at random.[2] Such a random pick can be represented by a Uniform distribution on all possible configurations of counts in the $(I \times J \times K)$-array that have a total sum of $n$. The probability of any specific configuration being chosen then is $1/\binom{n + IJK - 1}{IJK - 1}$. Also, the probability of hitting the true configuration of counts
$(x^*_{ijk})$ by accident is also $1/\binom{n + IJK - 1}{IJK - 1}$. Thus, one could also think of this true configuration as the outcome of a random variable $X = (X_{ijk})$ that has arrays $x = (x_{ijk})$ as values and has a Uniform distribution on these values.

Now, information about marginal sums of the cell counts, for example, the number of members of ethnic groups within each voting district, can vastly reduce the number of possible cell counts one has to consider. If we consider the case (for reasons of simplicity if not of plausibility) that the observed turnout rate in all voting districts is 50%, then one has only to consider, given a combination of voting district $k$ and turnout status $j$, the number of possible triples of numbers that sum to $n/(JK) = 10^6/200$, that is,

$$\binom{n/(JK) + I - 1}{I - 1} = \binom{10^6/200 + 3 - 1}{3 - 1} \approx 1.25 \times 10^7, \qquad (2)$$

a number that is several orders of magnitude smaller than the number of possible configurations if no information about the marginal counts were available. This suggests that predictions about the unknown cell counts that take into account the marginal tables can have a greatly improved performance as compared to predictions that do not. EI methods can be considered as attempts to make such improved predictions. Yet, these methods are plagued by a serious problem, which we will expose in the following.

[1] An appendix, available online on the Political Analysis Web site, contains supplemental information: some background on maximum entropy distributions, properties of the Dirichlet and Dirichlet-multinomial distributions relevant for the argument of our paper, details on the computation of Beta-binomial prediction intervals, and some details about two nonparametric alternatives to the method proposed in our paper. The Political Analysis Web site also contains source code in R and C as well as data suitable for the replication of the analyses presented here.

[2] A note on notation: We abbreviate $\sum_{i=1}^{I}$, etc., as $\sum_i$ and $\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}$ as $\sum_{i,j,k}$ if the summation limits are understood.

Statistical inference usually is made with the intention to describe a population based on a sample. For example, one may try to identify a model of voting behavior that connects voters' decisions to their own and the candidates' policy positions. If a correctly specified model is found, it can be used to make out-of-sample predictions, which may be predictions about future states of affairs or counterfactual states of affairs. One can find both types of usage of results of EI as well. As an example of the first type of usage, Burden and Kimball (1998) try to find out how much and why American voters engage in split-ticket voting between Presidential and Congressional elections, based on aggregated Presidential and Congressional votes for American voting districts. An example of the other type of usage is predicting the effects of changing the boundaries of voting districts on the representation of racial groups in these districts and on the chances of Democratic and Republican candidates in these rearranged districts (Cirincione, Darling, and O'Rourke 2000). As with statistical inference, both usages of EI need correctly specified models that describe the population. In contrast to statistical inference, however, EI is confronted with two problems that together pose a dilemma.
The first problem is that of modeling indeterminacy: If only aggregates of the variables of interest are observed, there will always be more than one model of an interrelation between these variables that fits the observed aggregates perfectly. The second problem is that of inferential indeterminacy: If one arrives at identifying a model describing the interrelation of the variables of interest, this model will entail certain restrictive assumptions about the population, which cannot be tested based on aggregate data alone, which usually are the only data available. Hence the dilemma: If the first problem is solved, the second one is inevitably encountered. If one tries to avoid the second problem, one cannot solve the first one.

The challenge of the first problem lies in finding assumptions suitably restrictive for model identification. The criteria for suitability may vary with the application, but in most cases one will use assumptions that are plausible on the one hand and convenient on the other, insofar as they lead to simple models for which estimation is feasible. However, plausibility and simplicity may conflict. Consider the case in which someone has aggregate data on turnout and the proportion of African-Americans at the level of voting districts and wants to find out whether African-Americans differ from other citizens with respect to turnout. In this case, a model that presupposes that in each voting district turnout and race are conditionally independent will fit the aggregate data perfectly and thus cannot be improved upon based on these data alone: Let $n_{1\cdot k}$ denote the number of African-Americans eligible to vote in district $k$, $n_{2\cdot k}$ the number of other citizens eligible to vote, $n_{\cdot 1k}$ the number of citizens eligible who actually turn out to vote, and $n_{\cdot 2k}$ the number of citizens eligible who do not turn out to vote in district $k$. Further, let $p_{ij|k}$ be the probability that an eligible citizen in district $k$ is an African-American ($i = 1$) and turns out to vote ($j = 1$), is an African-American ($i = 1$) and does not turn out to vote ($j = 2$), is not an African-American ($i = 2$) and turns out to vote ($j = 1$), or is not an African-American ($i = 2$) and does not turn out to vote ($j = 2$), and let $x_{ijk}$ be the actual number of African-Americans/others who turn out/do not turn out to vote in district $k$. We thus have $n_{i\cdot k} = \sum_j x_{ijk}$ and $n_{\cdot jk} = \sum_i x_{ijk}$. Then, a model that states

$$p_{ij|k} = p_{i\cdot|k}\, p_{\cdot j|k} \qquad (3)$$

can be fitted to the aggregate data using the maximum likelihood estimates

$$\hat p_{ij|k} = \hat p_{i\cdot|k}\, \hat p_{\cdot j|k} = \frac{n_{i\cdot k}}{\sum_r n_{r\cdot k}} \cdot \frac{n_{\cdot jk}}{\sum_s n_{\cdot sk}}. \qquad (4)$$

This model implies that within each race group $i$ in district $k$, the probability of turning out to vote is the same:

$$p_{j|ik} = \frac{p_{ij|k}}{\sum_s p_{is|k}} = \frac{p_{i\cdot|k}\, p_{\cdot j|k}}{\sum_s p_{i\cdot|k}\, p_{\cdot s|k}} = \frac{p_{\cdot j|k}}{\sum_s p_{\cdot s|k}} = p_{\cdot j|k}, \qquad (5)$$

that is, turnout is unrelated to race. Of course, such a model seems implausible in light of the fact that survey data indicate that turnout and race are statistically associated (Abramson and Claggett 1984). But this independence assumption cannot be refuted based on the available aggregate data alone. On the other hand, any model that poses race and turnout to be related, that is, $p_{ij|k} = c_{ij}\, p_{i\cdot|k}\, p_{\cdot j|k}$ with $c_{ij} \neq 1$, $\sum_j c_{ij}\, p_{i\cdot|k}\, p_{\cdot j|k} = p_{i\cdot|k}$, and $\sum_i c_{ij}\, p_{i\cdot|k}\, p_{\cdot j|k} = p_{\cdot j|k}$, is empirically indistinguishable from the independence model. Thus, it seems that based on aggregate data alone, one cannot decide whether race and turnout are related or not.
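To make this concrete, here is a minimal R sketch of the independence model (3)-(5) for a single district; the margin counts are invented for illustration and are not from the paper.

```r
# Independence model (3)-(5) for one district k, with made-up margins:
ni.k <- c(400, 600)       # eligible citizens: African-American, other
n.jk <- c(450, 550)       # turnout: votes, does not vote
n..k <- sum(ni.k)         # total eligible citizens in district k

p.i  <- ni.k / n..k       # ML estimate of p_{i.|k}
p.j  <- n.jk / n..k       # ML estimate of p_{.j|k}
p.ij <- outer(p.i, p.j)   # equation (4): p_{ij|k} = p_{i.|k} * p_{.j|k}

# The fitted table reproduces both observed margins exactly ...
all.equal(rowSums(p.ij) * n..k, ni.k)
all.equal(colSums(p.ij) * n..k, n.jk)

# ... and, as in equation (5), turnout given race does not depend on race:
p.ij / rowSums(p.ij)      # both rows equal (0.45, 0.55) = p_{.j|k}
```

Any table of the form $c_{ij}\, p_{i\cdot|k}\, p_{\cdot j|k}$ with the same margins would fit these aggregates equally well, which is exactly the modeling indeterminacy at issue.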
Ecological regression (Goodman 1953, 1959) aims to overcome this problem by using a model that is in some ways more, but in other ways less, restrictive. In the case of race and turnout, it requires that the conditional probabilities of turnout may differ across race groups but are the same in all voting districts, that is, $p_{ij|k} = p_{j|i}\, p_{i\cdot|k}$. Some refined ecological regression models even relax this assumption; they allow the conditional probability to vary across districts according to some known properties of the districts, that is, $p_{ij|k} = p_{j|ik}\, p_{i\cdot|k}$ and $p_{j|ik} = f(\beta_{ij0} + \beta_{ij1} z_{1k} + \cdots + \beta_{ijD} z_{Dk})$, where $z_{1k}, \ldots, z_{Dk}$ are the properties of the districts and $\beta_{ij0}, \ldots, \beta_{ijD}$ are the parameters to be estimated. In contrast to the conditional independence model discussed above, ecological regression models can, to some degree, be tested empirically. With maximum likelihood estimates $\hat p_{j|ik} = f(\hat\beta_{ij0} + \hat\beta_{ij1} z_{1k} + \cdots + \hat\beta_{ijD} z_{Dk})$ and $\hat p_{i\cdot|k} = n_{i\cdot k} / \sum_i n_{i\cdot k}$, it is still logically possible that the predicted turnout numbers per district $\hat n_{\cdot jk} = \sum_i \hat p_{j|ik}\, n_{i\cdot k}$ differ from the actually observed turnout numbers $n_{\cdot jk}$. Thus, in case of a poor fit between observed and predicted rates of eligible citizens who turn out in each of the districts (indexed by $k$), one may conclude that the ecological regression model is wrong and some of its assumptions have to be lifted.

To account for departures of observed counts in a marginal table, such as the observed turnout per voting district, from predicted counts, various authors extend ecological regression by a random component: In this extended ecological regression model, the conditional probabilities $p_{j|ik}$ are not fixed parameters but outcomes of a random variable. King (1997) proposes that these conditional probabilities have a truncated normal distribution. Brown and Payne (1986) and Rosen et al. (2001) use the more natural assumption that the conditional probabilities have a Dirichlet distribution with mean

$$E(p_{j|ik}) = \pi_{j|ik} = \frac{\exp(\beta_{ij0} + \beta_{ij1} z_{1k} + \cdots + \beta_{ijD} z_{Dk})}{\sum_s \exp(\beta_{is0} + \beta_{is1} z_{1k} + \cdots + \beta_{isD} z_{Dk})} \qquad (6)$$

and precision parameter $\theta_0$. As Goodman (1953) notes, ecological regression models are suitable only if it is reasonable to assume that there is a causal relation between the properties corresponding to the marginal tables. But based on the marginal tables alone, that is, the district-level aggregates, it is possible neither to establish such a causal relation nor to disprove it: As just noted, in ecological regression models there may be a lack of fit between observed and predicted counts in the marginal table of turnout per district. The independence model (3) will thus almost always seem superior to any ecological regression model. But again, the fact that the marginal tables can be fitted perfectly by the independence model does not imply that, for example, turnout and race are unrelated, since there are infinitely many models that pose a relation between race and turnout which may also fit the marginal tables perfectly.

Statistical relations between the variables represented by the observed marginal tables are often suggested by survey samples. For example, there is evidence from election studies (e.g., Abramson and Claggett 1984) that there is a clear statistical association between race and turnout. Therefore, it is worthwhile to combine evidence from aggregate data with evidence from survey data.
An EI model that allows for this is the entropy-maximizing model proposed by Johnston and Pattie (2000) in an earlier issue of this journal. In contrast to the models just discussed, Johnston and Pattie aim to make predictions directly about the unknown counts $x_{ijk}$, without the need of any statistical model. They show that the cell counts that are most likely subject to the constraints

$$\sum_i x_{ijk} = n_{\cdot jk}, \qquad \sum_j x_{ijk} = n_{i\cdot k}, \qquad \sum_k x_{ijk} = n_{ij\cdot} \qquad (7)$$

have the log-linear structure

$$\log x_{ijk} = \alpha_{ij} + \beta_{ik} + \gamma_{jk} + \lambda. \qquad (8)$$

They also show how such cell counts can be computed by an iterative scaling procedure.

Although Johnston and Pattie (2000, 337) state that their proposed procedure is "mathematical rather than statistical" and that therefore "no error terms" are attached to the predicted cell counts, this is not the case: Readers familiar with log-linear contingency table analysis (Fienberg, Holland, and Bishop 1977; King 1998) will realize that the Johnston-Pattie model is actually a log-linear model for counts without three-way interactions. However, the correct form of this model is

$$\log \mu_{ijk} = \alpha_{ij} + \beta_{ik} + \gamma_{jk} + \lambda, \qquad (9)$$

where $\mu_{ijk}$ is the mean of a Poisson distribution; that is, the counts may indeed vary around this mean. The iterative scaling algorithm that Johnston and Pattie propose is the one usually employed to find maximum likelihood estimates for such log-linear models. That is, the "maximum likelihood solution" of Johnston and Pattie (2000, 335) is in fact the maximum likelihood estimate of the means of Poisson distributions that have the structure of equation (9). If the sum $\sum_{i,j,k} x_{ijk} = \sum_{i,j,k} \mu_{ijk}$ is known in advance to be $n$, which is usually the case in contingency table analysis and in EI, the distribution of the cell counts can also be modeled as a multinomial distribution with cell probabilities

$$p_{ijk} = \frac{\exp(\alpha_{ij} + \beta_{ik} + \gamma_{jk})}{\sum_{r,s,t} \exp(\alpha_{rs} + \beta_{rt} + \gamma_{st})}. \qquad (10)$$

Good (1963) has shown that finding the multinomial distribution that maximizes the entropy $-\sum_{i,j,k} p_{ijk} \ln p_{ijk}$ subject to the constraints

$$\sum_i p_{ijk} = \frac{n_{\cdot jk}}{n}, \qquad \sum_j p_{ijk} = \frac{n_{i\cdot k}}{n}, \qquad \sum_k p_{ijk} = \frac{n_{ij\cdot}}{n} \qquad (11)$$

leads to cell probabilities with exactly this structure.[3] This structure implies that the Johnston-Pattie model imposes milder restrictions on the conditional probabilities $p_{j|ik}$ than ecological regression models do: It allows these conditional probabilities to vary across districts, similarly to equation (6). The restrictions imposed by the various models can be compared easily in log-linear terms: For the independence model, we have

$$p_{ijk} = \frac{\exp(\beta_{ik} + \gamma_{jk})}{\sum_{r,s,t} \exp(\beta_{rt} + \gamma_{st})} \quad \text{with } \exp(\beta_{ik}) \propto n_{i\cdot k} \text{ and } \exp(\gamma_{jk}) \propto n_{\cdot jk}; \qquad (12)$$

for the ecological regression model (without district-level covariates), we have

$$p_{ijk} = \frac{\exp(\alpha_{ij} + \beta_{ik})}{\sum_{r,s,t} \exp(\alpha_{rs} + \beta_{rt})} \quad \text{with } p_{j|i} = \frac{\exp(\alpha_{ij})}{\sum_s \exp(\alpha_{is})} \text{ and } \exp(\beta_{ik}) \propto n_{i\cdot k}; \qquad (13)$$

whereas for the Johnston-Pattie multinomial model, we have

$$p_{ijk} = \frac{\exp(\alpha_{ij} + \beta_{ik} + \gamma_{jk})}{\sum_{r,s,t} \exp(\alpha_{rs} + \beta_{rt} + \gamma_{st})}. \qquad (14)$$

All these are special submodels of the saturated model

$$p_{ijk} = \frac{\exp(\alpha_{ij} + \beta_{ik} + \gamma_{jk} + \delta_{ijk})}{\sum_{r,s,t} \exp(\alpha_{rs} + \beta_{rt} + \gamma_{st} + \delta_{rst})}, \qquad (15)$$

which does not pose any restrictions on the structure of the cell probabilities.

[3] More on the application of information-theoretic concepts to contingency table analysis and to statistics in general can be found in Kullback (1959). The argumentation of Johnston and Pattie (2000), however, follows a nonprobabilistic interpretation of entropy mentioned by Jaynes (1968).
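The iterative scaling procedure itself is not spelled out in the text above, so the following is only a rough R sketch of how the Johnston-Pattie fit can be computed from the three margins: starting from a flat array, each constraint in (7) is matched in turn until convergence. The function name and the flat starting array are our choices; the margins are assumed to be consistent (all summing to the same total) and free of zeros.

```r
# Iterative proportional fitting for the Johnston-Pattie model: scale an
# I x J x K array until it matches the margins n_ij., n_i.k, and n_.jk.
ipf_jp <- function(nij, nik, njk, tol = 1e-10, maxit = 1000) {
  I <- nrow(nij); J <- ncol(nij); K <- ncol(nik)
  x <- array(sum(nij) / (I * J * K), dim = c(I, J, K))         # flat start
  for (iter in 1:maxit) {
    x0 <- x
    x <- sweep(x, c(1, 2), nij / apply(x, c(1, 2), sum), "*")  # fit n_ij.
    x <- sweep(x, c(1, 3), nik / apply(x, c(1, 3), sum), "*")  # fit n_i.k
    x <- sweep(x, c(2, 3), njk / apply(x, c(2, 3), sum), "*")  # fit n_.jk
    if (max(abs(x - x0)) < tol) break
  }
  x   # fitted cell counts; x / sum(x) gives cell probabilities as in (10)
}
```

Because each scaling step multiplies $x_{ijk}$ by factors that depend only on $(i, j)$, $(i, k)$, or $(j, k)$, the fitted array retains the log-linear structure (8)-(9) with no three-way interaction.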
The complete-data log-likelihood of this model is

$$\ell = \sum_{i,j,k} x_{ijk} \log p_{ijk} = \sum_{i,j} \alpha_{ij} n_{ij\cdot} + \sum_{i,k} \beta_{ik} n_{i\cdot k} + \sum_{j,k} \gamma_{jk} n_{\cdot jk} + \sum_{i,j,k} \delta_{ijk} x_{ijk} - n \log \sum_{r,s,t} \exp(\alpha_{rs} + \beta_{rt} + \gamma_{st} + \delta_{rst}). \qquad (16)$$

The implications of this expansion of the complete-data log-likelihood are quite ambivalent: If, for example, the assumptions inherent in the independence model apply, that is, $\alpha_{ij} = 0$ and $\delta_{ijk} = 0$ for all $i$, $j$, and $k$, the complete-data log-likelihood depends only on the observed aggregates $n_{i\cdot k}$ and $n_{\cdot jk}$, and the parameters of the independence model can be estimated by direct maximum likelihood. Also, if the assumptions of the Johnston-Pattie model apply and odds ratios are equal in all districts, that is, the three-way interaction parameters $\delta_{ijk}$ are zero for all $i$, $j$, and $k$, the complete-data log-likelihood depends only on the aggregates $n_{ij\cdot}$, $n_{i\cdot k}$, and $n_{\cdot jk}$. No such conclusion can be drawn with respect to ecological regression: Ecological regression models are used if the aggregate table $n_{ij\cdot}$ is unavailable. However, ecological regression models set the parameters $\gamma_{jk}$ and $\delta_{ijk}$ to zero but not $\alpha_{ij}$, so that the complete-data log-likelihood still depends on the unavailable table $n_{ij\cdot}$; ecological regression thus also requires assumptions that are untestable without access to the complete array of counts $x_{ijk}$. In sum, any of the models discussed so far requires assumptions that can be tested only if the complete data $x_{ijk}$ are available. In EI problems, however, they are not.

This problem, which may be called inferential indeterminacy, is much more serious than the fact that the distributional assumptions of, for example, the ecological regression models of King (King 1997; King, Rosen, and Tanner 1999; Rosen et al. 2001) or Brown and Payne (1986) cannot be checked, or that under certain circumstances an ecological regression model like King's may be susceptible to "aggregation bias" (Openshaw and Taylor 1979, 1981; Cho 1998; Steel, Beh, and Chambers 2004). For example, one may want to make predictions about the number of African-Americans who will turn out to vote in a specific voting district. If one uses one of the models for EI discussed above and the assumptions inherent in this model do not hold, the predictions will be systematically biased, irrespective of the specific estimation procedure one uses: Suppose, for example, that in a specific application of EI, the complete data are generated from a multinomial distribution with cell probabilities $p^*_{ijk}$ as in equation (15) and all structural parameters $\alpha^*_{ij}$, $\beta^*_{ik}$, $\gamma^*_{jk}$, and $\delta^*_{ijk}$ are nonzero. Suppose further that one uses, for example, a procedure based on the Johnston-Pattie model (8) for EI. If $n$ approaches infinity, the cell proportions $x_{ijk}/n$ will converge to $p^*_{ijk}$, and the proportions $n_{ij\cdot}/n$, $n_{i\cdot k}/n$, and $n_{\cdot jk}/n$ in the marginal tables will converge to $p^*_{ij\cdot} = \sum_k p^*_{ijk}$, $p^*_{i\cdot k} = \sum_j p^*_{ijk}$, and $p^*_{\cdot jk} = \sum_i p^*_{ijk}$, respectively. Further, the parameters of equation (8) will converge to values $\tilde\alpha_{ij}$, $\tilde\beta_{ik}$, and $\tilde\gamma_{jk}$ that maximize the scaled expected log-likelihood

$$E(\ell)/n = \sum_{i,j,k} p^*_{ijk} \log \tilde p_{ijk}, \quad \text{where } \tilde p_{ijk} = \frac{\exp(\tilde\alpha_{ij} + \tilde\beta_{ik} + \tilde\gamma_{jk})}{\sum_{r,s,t} \exp(\tilde\alpha_{rs} + \tilde\beta_{rt} + \tilde\gamma_{st})}. \qquad (17)$$

Now even if $\tilde\alpha_{ij}$, $\tilde\beta_{ik}$, and $\tilde\gamma_{jk}$ are equal or very close to the corresponding parameters $\alpha^*_{ij}$, $\beta^*_{ik}$, and $\gamma^*_{jk}$, $\tilde p_{ijk}$ and $p^*_{ijk}$ will in general be different because of the nonzero $\delta^*_{ijk}$. Consequently, estimates of cell probabilities based on the Johnston-Pattie model will have asymptotic bias $\tilde p_{ijk} - p^*_{ijk}$ and thus are not consistent. However, in order to obtain an estimate of this bias, one would need an estimate of $\delta^*_{ijk}$, which is unavailable because only the marginal tables are known.
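This asymptotic bias is easy to exhibit numerically with the ipf_jp sketch above: draw a "true" table from the saturated model (15) with (generically) nonzero $\delta^*_{ijk}$, hand only its margins to the iterative scaling fit, and compare. All values here are arbitrary illustrations, not numbers from the paper.

```r
set.seed(1)
I <- 2; J <- 2; K <- 5
# "True" cell probabilities from the saturated model (15):
eta    <- array(rnorm(I * J * K, sd = 0.5), dim = c(I, J, K))
p.true <- exp(eta) / sum(exp(eta))

# Margins in the n -> infinity limit (proportions instead of counts):
pij <- apply(p.true, c(1, 2), sum)
pik <- apply(p.true, c(1, 3), sum)
pjk <- apply(p.true, c(2, 3), sum)

p.fit <- ipf_jp(pij, pik, pjk)   # Johnston-Pattie fit from margins alone
max(abs(p.fit - p.true))         # nonzero: this bias does not vanish with n
```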
This is what makes the dilemma posed at the beginning of this section so serious: Without making certain identifying assumptions, one will not arrive at any prediction about, for example, the number of African-Americans who turn out to vote, apart from a random guess. The identifying assumptions, however, have crucial behavioral implications, and if these assumptions are wrong, one will incur biased predictions. But these assumptions cannot be tested without access to the complete data, which are unavailable to begin with. Therefore, in any specific instance, one will not know how large this bias actually is.

3 Accounting for Inferential Uncertainty: A Maximum Entropy Approach

Estimating cell probabilities with the help of one of the models discussed in the previous section exhausts the resources for statistical inference. The models used for estimating the cell probabilities all employ certain restrictive assumptions that are necessary for the identification of cell probability estimates. Yet, as long as one is faced with the task of EI, these assumptions cannot be tested. They could be tested as a statistical hypothesis only if the complete data were available. But then, the problem no longer would be one of EI. This dilemma seems to sustain a skepticism with regard to the validity of EI. However, we would argue that a healthy amount of skepticism does not force us to give up the attempt at EI altogether. Doing so would mean ignoring the information contained in the marginal tables; it would mean throwing out the baby with the bathwater. On the other hand, it is clear that one should not put the same confidence in predictions from an EI model as one would in predictions from a well-tested statistical model. In the present section, we propose a model that allows taking into account the uncertainty associated with estimates obtained from an EI procedure. Since the resources for statistical inference are already exhausted, the model we propose cannot be justified in terms of theoretical statistics. All we can do is appeal to some general principles that have some plausibility. The principle on which our proposition is based is the Principle of Maximum Entropy, which is a generalization of Laplace's (in)famous Principle of Indifference. Before we present our proposed model, we need to explain what the Principle of Maximum Entropy entails and in what perspective we hold it to be plausible.

Laplace's Principle of Indifference, of which the Principle of Maximum Entropy is a generalization (Uffink 1995), postulates that one should assign to each elementary outcome $\{x_1\}, \ldots, \{x_n\}$ of a probability experiment, in the absence of any prior information, the same probability $1/n$. For example, if the experiment is throwing a die, one should assign to each of the outcomes one, two, three, four, five, and six the same probability 1/6. The Principle of Maximum Entropy generalizes this principle to cases where some prior information about the probability distribution in question is available and where the probability distribution may be continuous with infinite support.
It postulates that, if only some moments of the probability distribution (i.e., some nonrandom functions of the probability distribution, like, e.g., mean and variance) are given in advance, one should select the probability distribution that has maximal entropy among the set of probability distributions with the same support and the given moments.[4] This principle leads to some common families of probability distributions, such as the family of normal distributions or the family of exponential distributions. For example, the normal distribution with zero mean and variance $\sigma^2$ has maximal entropy among all continuous distributions over the real line with zero mean and variance $\sigma^2$. The exponential distribution with parameter $\lambda$ has maximal entropy among all continuous distributions over the positive half of the real line with mean $\lambda^{-1}$ (Shannon 1948).

The Maximum Entropy Principle has been used to specify "reasonable" null hypotheses for contingency table analysis (Good 1963; Golan, Judge, and Perloff 1996), to specify noninformative priors for Bayesian inference (Jaynes 1968), and to propose solutions to ill-posed inverse problems (Vardi and Lee 1993). But it has also been used for EI (Johnston and Hay 1983; Johnston and Pattie 2000; Judge, Miller, and Cho 2004). Here, we use the Principle of Maximum Entropy to motivate and construct a second-stage probability model to account for inferential uncertainty.

Suppose $(x_{ijk})$ is an $(I \times J \times K)$-array of counts generated by a multinomial distribution with size index $n$ (the "population size") and cell probabilities $p^*_{ijk}$. Suppose, further, that one has knowledge only about the marginal tables of this array and tries to make predictions about the cell counts. Using an EI method as discussed in the previous section, one may arrive at an estimate $\hat p_{ijk}$ of the cell probabilities. The model by which the cell probabilities are estimated may or may not be correctly specified; that is, the "true" cell probabilities $p^*_{ijk}$ may or may not satisfy the constraints inherent in the model. Based on the EI procedure of choice, one makes the prediction $n \hat p_{ijk}$ about the cell count $x_{ijk}$. Then, the error of the prediction of $x_{ijk}$ by $n \hat p_{ijk}$ can be decomposed as follows:

$$x_{ijk} - n \hat p_{ijk} = n \left( \frac{x_{ijk}}{n} - p^*_{ijk} \right) + n \left( p^*_{ijk} - \hat p_{ijk} \right). \qquad (18)$$

The difference $(x_{ijk}/n) - p^*_{ijk}$ will have mean zero and variance $p^*_{ijk}(1 - p^*_{ijk})/n$ and thus will be the smaller the larger $n$ is, while the difference $p^*_{ijk} - \hat p_{ijk}$ will not become smaller unless the model leading to $\hat p_{ijk}$ is correctly specified, that is, unless the assumptions of the EI model employed are satisfied by $p^*_{ijk}$.

In the analysis of bias at the end of the last section, we held $p^*_{ijk}$ fixed and considered the estimator as random. In contrast, we now propose to exchange the roles of $\hat p_{ijk}$ and $p^*_{ijk}$, that is, to treat $\hat p_{ijk}$ as fixed, in virtue of being a function of the known marginal tables, and $p^*_{ijk}$ as the realization of a random variable $P_{ijk}$. If we are completely ignorant about the data array $(x_{ijk})$ except for the total sum $n$, then any possible array of numbers $(p_{ijk})$ with $0 < p_{ijk} < 1$ and $\sum_{i,j,k} p_{ijk} = 1$ would seem equally plausible as having generated the unknown array of counts. We can represent this, according to Laplace's Principle of Indifference, by a probability distribution with a density function that is Uniform over all admissible arrays $(p_{ijk})$.

[4] Although there is some relation between the two, "entropy" refers here to a functional of density or probability mass functions and not to entropy in the sense of statistical mechanics and thermodynamics. For the relation, see Jaynes (1957). There are several attempts at giving this principle a general axiomatic foundation, for example, Jaynes (1957), Vasicek (1980), and Cziszar (1991).
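Such a Uniform distribution over the admissible probability arrays, that is, a Dirichlet distribution with all shape parameters equal to one (see below), is easy to sample from in R by normalizing independent standard-exponential variates; a short sketch with arbitrary dimensions:

```r
# One draw from the Uniform distribution over all (I x J x K) probability
# arrays, i.e., a Dirichlet distribution with all shape parameters = 1:
rflat <- function(I, J, K) {
  g <- rexp(I * J * K)                  # i.i.d. Gamma(1, 1) variates
  array(g / sum(g), dim = c(I, J, K))   # normalizing puts them on the simplex
}
p <- rflat(3, 2, 100)
sum(p)   # 1: a random point on the (IJK - 1)-dimensional simplex
```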
Now, if each of the $n$ individuals in the $(I \times J \times K)$-array falls into the cell $(i, j, k)$ with probability $P_{ijk}$, which is an element of a random array $(P_{ijk})$ with a Uniform distribution, then each possible array $(x_{ijk})$ that may result from such a process has the same chance of occurrence. That is, there is a direct connection between the perspective that focuses on our ignorance about the counts $x_{ijk}$ and the perspective that focuses on our ignorance about the data-generating process.

Taking this as a baseline, we can construct a distribution of plausible arrays $(p_{ijk})$ that reflects our ignorance about the true cell probabilities $(p^*_{ijk})$ that generated the unknown data $(x_{ijk})$, an ignorance that is only reduced by the information contained in the marginal tables $(n_{ij\cdot}) = (\sum_k x_{ijk})$, $(n_{i\cdot k}) = (\sum_j x_{ijk})$, and $(n_{\cdot jk}) = (\sum_i x_{ijk})$. Such a distribution should be as similar as possible to the Uniform distribution described in the previous paragraph, under the restriction that its mean is given by the estimates produced by the EI method used. The Uniform distribution under consideration is a special case of a Dirichlet distribution, that is, a Dirichlet distribution with all shape parameters equal to one. Now, if we select a Dirichlet distribution that maximizes entropy under the constraint that its mean is equal to the estimates obtained from an EI procedure, then we have the distribution that is, under these constraints, the most similar to the Uniform distribution in terms of the Kullback-Leibler criterion for the similarity of distributions.[5] Informally speaking, the maximum entropy criterion leads here to the "flattest," that is, least informative, distribution with the given mean.

The selection of such an entropy-maximizing Dirichlet distribution is possible since a Dirichlet distribution is only partially specified by its mean: If a multidimensional random variable $(P_{ijk})$ has a Dirichlet distribution with parameters $\theta_{ijk}$, its components have expectations $\pi_{ijk} := E(P_{ijk}) = \theta_{ijk}/\theta_0$, where $\theta_0 := \sum_{r,s,t} \theta_{rst}$. Therefore, each parameter $\theta_{ijk}$ of a Dirichlet distribution can be decomposed into a mean parameter $\pi_{ijk}$ and a common scale parameter $\theta_0$ by $\theta_{ijk} = \pi_{ijk} \theta_0$, so that an entropy-maximizing Dirichlet distribution with mean $\pi_{ijk}$ fixed at $\hat p_{ijk}$ can be identified by maximizing

$$H(\theta_0) = \sum_{i,j,k} \ln\Gamma(\theta_0 \hat p_{ijk}) - \ln\Gamma(\theta_0) + (\theta_0 - IJK)\,\psi(\theta_0) - \sum_{i,j,k} (\theta_0 \hat p_{ijk} - 1)\,\psi(\theta_0 \hat p_{ijk}) \qquad (19)$$

over $\theta_0$, where $\psi(\cdot)$ is the digamma function (Abramovitz and Stegun 1964, 258).[6]

[5] For details, see the Appendix (Section A.1) to this paper on the Political Analysis Web site.

[6] For a formal proof of the validity of this formula, see the Appendix (Section B.2) on the Political Analysis Web site.

Since the family of Dirichlet distributions is a multivariate generalization of the family of Beta distributions, we can use the latter family to illustrate what finding a maximum entropy distribution entails. A Beta distribution is usually characterized by its two shape parameters, which we call here $\phi_1$ and $\phi_2$. The mean of this distribution then is $\pi := \phi_1/(\phi_1 + \phi_2)$, so if we define $\theta_0 := \phi_1 + \phi_2$, the shape parameters can be reexpressed as $\phi_1 = \theta_0 \pi$ and $\phi_2 = \theta_0(1 -$
$\pi$) and the variance as $\pi(1 - \pi)/(1 + \theta_0)$. Figures 1 and 2 depict how a Beta distribution looks for $\pi$ fixed at 0.5 and 0.2, respectively, for various values of $\theta_0$ below, above, and equal to $\theta_{\mathrm{MaxEnt}}$, where $\theta_{\mathrm{MaxEnt}}$ denotes the value of $\theta_0$ for which the entropy of the Beta distribution is maximal. As Fig. 1 shows, the Uniform distribution over the interval [0,1] is a special case of a Beta distribution: the Beta distribution with maximal entropy subject to the constraint that the expectation is equal to 0.5. Both figures indicate that, irrespective of the value of the expectation of the distribution, values of $\theta_0$ above $\theta_{\mathrm{MaxEnt}}$ lead to single-peaked densities, where the peak is the higher the larger $\theta_0$ is. However, as $\theta_0$ gets smaller, the density function puts more and more weight on values around zero and one. Since the variance approaches a supremum of $\pi(1 - \pi)$ as $\theta_0$ approaches zero, the variance of a Beta density is not a good measure of uncertainty. Conversely, since the weight of the density is most evenly distributed when $\theta_0$ attains the entropy-maximizing value, the entropy seems to be a much better measure of uncertainty if the expectation $\pi$ is fixed.

Fig. 1 Density of Beta distributions with mean held fixed at 0.5, for values of the precision parameter $\theta_0$ below, above, and equal to $\theta_{\mathrm{MaxEnt}} = 2$.

Fig. 2 Density of Beta distributions with mean held fixed at 0.2, for values of the precision parameter $\theta_0$ below, above, and equal to $\theta_{\mathrm{MaxEnt}} = 4.35$.

If each of the $n$ individuals in the $(I \times J \times K)$-array falls into the cell $(i, j, k)$ with probability $P_{ijk}$, where $P_{ijk}$ itself is part of a random array with a Dirichlet distribution with parameters $\theta_{ijk} = \theta_0 \pi_{ijk}$, then the distribution of the possible resulting arrays is a mixture of a multinomial distribution with a Dirichlet distribution, a compound-multinomial or Dirichlet-multinomial distribution (Mosimann 1962; Hoadley 1969). Just as the family of Dirichlet distributions is a multivariate generalization of the family of Beta distributions, the family of Dirichlet-multinomial distributions is a multivariate generalization of the family of Beta-binomial distributions, which has been used to model overdispersed proportions and success counts (Skellam 1948; Crowder 1978; Prentice 1986). In fact, if an array of random variables $(P_{ijk})$ has a joint Dirichlet distribution with parameter array $(\theta_{ijk})$, each component of the array has a Beta distribution with parameters $\theta_0 \pi_{ijk}$ and $\theta_0(1 - \pi_{ijk})$, and each component of the corresponding array of counts $(x_{ijk})$ with a Dirichlet-multinomial distribution has a Beta-binomial distribution with parameters $\theta_0 \pi_{ijk}$ and $\theta_0(1 - \pi_{ijk})$ (Mosimann 1962), where, as before, $\theta_0 = \sum_{r,s,t} \theta_{rst}$. Now, the counts in each cell have expectation $n \pi_{ijk}$ and variance $n \pi_{ijk}(1 - \pi_{ijk})(n + \theta_0)/(1 + \theta_0)$. That is, the
expectation is the same as if the counts had a multinomial distribution with cell probabilities $\pi_{ijk}$, whereas the variance differs from the variance of such a multinomial distribution by the factor $(n + \theta_0)/(1 + \theta_0)$.

Although proportions $X_{ijk}/n$ of counts $X_{ijk}$ with a binomial distribution are asymptotically normal, and although arrays of proportions $X_{ijk}/n$ of arrays of counts $X_{ijk}$ with a multinomial distribution are asymptotically multivariate normal as $n$ approaches infinity, it would be a mistake to assume asymptotic normality in the case of counts that have a Beta-binomial or Dirichlet-multinomial distribution, respectively: The distribution of the proportions will converge to a Beta distribution or Dirichlet distribution, respectively, which can have a shape quite dissimilar to the normal distribution if it has maximal entropy.

The above considerations make clear that one cannot assume that the asymptotic distribution obtained from an EI procedure is normal. Therefore, one should not use the normality assumption to construct confidence or prediction intervals based on the standard errors of the estimates. Rather, we propose to use the quantile function of the Beta distribution to construct approximate credibility intervals for the cell probabilities $p_{ijk}$ and the quantile function of the Beta-binomial distribution to construct approximate prediction intervals for the cell counts $x_{ijk}$. If the counts in cell $(i, j, k)$ have a Beta-binomial distribution with parameters $\theta_0 \pi_{ijk}$ and $\theta_0(1 - \pi_{ijk})$, 95% prediction intervals would, for example, be delimited by $F^{-1}_{\mathrm{Bb}}(0.025; \theta_0, \pi_{ijk})$ and $F^{-1}_{\mathrm{Bb}}(0.975; \theta_0, \pi_{ijk}) + 1$, where $F^{-1}_{\mathrm{Bb}}(\alpha; \theta_0, \pi_{ijk}) := \sup\{x : F_{\mathrm{Bb}}(x; \theta_0, \pi_{ijk}) < \alpha\}$ is the quantile function and $F_{\mathrm{Bb}}(x; \theta_0, \pi_{ijk}) := \sum_{s=0}^{x} f_{\mathrm{Bb}}(s; \theta_0, \pi_{ijk})$ is the cumulative distribution function of the Beta-binomial distribution with parameters $\theta_0 \pi_{ijk}$ and $\theta_0(1 - \pi_{ijk})$. That is, under the assumption that the counts in cell $(i, j, k)$ have this Beta-binomial distribution, the probability that the counts are in these intervals will be 95%.[7]

Since the reasoning behind this procedure is heuristic rather than justifiable in terms of theoretical statistics or probability theory, it seems necessary to assess its performance by way of a simulation study. Therefore, we conducted a systematic simulation experiment in which we vary the size of the array of counts $(x_{ijk})$ and its total sum $n$. For each considered array size $I \times J \times K$ and each "population size" $n$, we (1) generated 2000 arrays of random counts $(x^{(r)}_{ijk})$ with $r = 1, \ldots, 2000$; (2) computed the marginal tables $(n^{(r)}_{ij\cdot})$, $(n^{(r)}_{i\cdot k})$, and $(n^{(r)}_{\cdot jk})$; (3) used the Johnston-Pattie procedure to generate estimates of the cell probabilities $(\hat p^{(r)}_{ijk})$; (4) constructed prediction intervals for the cell counts based on the assumption of the Johnston-Pattie model that the cell counts have, jointly, a multinomial distribution and, individually, a binomial distribution with success probability $\hat p^{(r)}_{ijk}$; (5) constructed prediction intervals based on the procedure proposed in the present section; and (6) recorded whether the counts $x^{(r)}_{ijk}$ are covered by the respective prediction intervals, that is, whether they are inside the intervals. The random counts are generated such that each array of counts $(x^{(r)}_{ijk})$ that sums to $n$ has the same chance of occurrence. (A sketch of the computations used in step (5) is given below.)

[7] For details about the computation of Beta-binomial cumulative probability and quantile functions, see the Appendix (Section B.4) on the Political Analysis Web site.
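Neither the maximization of equation (19) nor the Beta-binomial quantile function is available in base R, but both are short to implement. The following is a minimal sketch; the function names and the illustrative values at the end are ours, not from the paper.

```r
# Entropy (19) of a Dirichlet distribution with mean vector p and precision
# theta0, i.e., shape parameters theta0 * p; digamma() is the psi function.
dirichlet_entropy <- function(theta0, p) {
  th <- theta0 * p
  sum(lgamma(th)) - lgamma(theta0) +
    (theta0 - length(p)) * digamma(theta0) - sum((th - 1) * digamma(th))
}
theta_maxent <- function(p)   # maximize (19) over theta0 for given means p
  optimize(dirichlet_entropy, c(1e-6, 1e4), p = p, maximum = TRUE)$maximum

# In the two-cell (Beta) case this reproduces the figures above:
theta_maxent(c(0.5, 0.5))   # 2: the Uniform distribution of Fig. 1
theta_maxent(c(0.2, 0.8))   # approx. 4.35, as in Fig. 2

# Beta-binomial pmf and quantile function F^{-1}(a) = sup{x : F(x) < a}:
dbetabinom <- function(x, n, theta0, pi)
  exp(lchoose(n, x) + lbeta(x + theta0 * pi, n - x + theta0 * (1 - pi)) -
        lbeta(theta0 * pi, theta0 * (1 - pi)))
qbetabinom <- function(a, n, theta0, pi) {
  cdf <- cumsum(dbetabinom(0:n, n, theta0, pi))
  max(sum(cdf < a) - 1, 0)    # counts run over x = 0, ..., n
}

# 95% prediction interval for one cell count, as described in the text
# (pi would be one cell's estimate, theta0 the maximizer of (19)):
n <- 10000; pi <- 0.02; theta0 <- 50
c(qbetabinom(0.025, n, theta0, pi), qbetabinom(0.975, n, theta0, pi) + 1)
```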
Thus, the generated arrays of counts are a simple random sample from all such arrays, and the distribution of the generated counts represents the initial ignorance about the interior of the black box before a procedure of EI is applied. The simulation results may therefore be generalized such that they are representative of the average performance with respect to all possible interiors of the black box. By recording the performance of prediction intervals based both on the assumptions of an EI procedure and on the second-stage maximum entropy procedure, we are able, first, to demonstrate the consequences of the indeterminacy that besets EI and, second, to show the degree to which our proposed second-stage maximum entropy procedure improves over a "naive" application of EI and recovers the uncertainty associated with EI procedures.

Table 1 presents the results of our simulation study regarding the effective coverage of 95% prediction intervals of these two types with respect to the counts in cell (1, 1, 1) of the respective arrays. Since the distribution from which the arrays are generated is symmetric with respect to the cells in the array, any cell is as representative of the set of counts in the array as any other cell; for convenience, we chose cell (1, 1, 1). Panel (a) of Table 1 reports the coverage percentage of prediction intervals based on the assumption that the counts have a multinomial distribution with correctly specified cell probabilities, whereas panel (b) reports the effective coverage of prediction intervals constructed based on the second-stage maximum entropy method.

Table 1 Simulation study of coverage of true cell counts and true cell probabilities after 2000 replications (effective coverage of nominal 95% prediction intervals, in percent)

                                                         Population size
                                       Array size      100,000   10,000,000
  (a) Effective coverage by            3 x 3 x 50         18.4          2.1
      prediction intervals based on    7 x 7 x 50         28.0          2.6
      the assumption of a              3 x 3 x 200        33.0          3.3
      multinomial distribution         7 x 7 x 200        54.1          6.0
  (b) Effective coverage by            3 x 3 x 50         95.5         96.3
      prediction intervals based on    7 x 7 x 50         93.1         94.1
      the second-stage maximum         3 x 3 x 200        95.0         96.4
      entropy method                   7 x 7 x 200        88.5         93.8

A comparison of the two panels (a) and (b) of Table 1 shows that our proposed prediction intervals are a large improvement over naive multinomial prediction intervals. Panel (a) shows that the effective coverage of multinomial-based 95% prediction intervals decreases as the population size increases: If the population size $n$ is 100,000, the effective coverage reaches at most 54.1% and thus falls far short of the nominal coverage of 95%. But if the population size is 10,000,000, the undercoverage of the naive prediction intervals is disastrous: In at most 6% of the replications are the true counts inside a 95% prediction interval. This reflects the inconsistency of the estimates of the cell probabilities: The standard deviation of $x_{ijk}$, conditional on the true cell probabilities, increases with the population size $n$ only in proportion to $\sqrt{n}$, so that the length of the multinomial prediction intervals decreases relative to the range of possible counts (delimited by zero and $n$) in proportion to $1/\sqrt{n}$. Due to the design of the simulation experiment, the true cell probabilities will almost surely violate the assumptions of the Johnston-Pattie model, so that the root mean square error of predictions based on the estimated cell probabilities contains a component incurred by misspecification. As $n$ grows large, the effect of this misspecification bias will dominate the random variation of the cell counts $x_{ijk}$.

In contrast, the maximum entropy Dirichlet-multinomial prediction intervals show an effective coverage quite close to their nominal level, as can be seen in panel (b) of Table 1. That the prediction intervals based on the second-stage maximum entropy method still differ from their nominal level may have several reasons. First, the differences may be a consequence of simulation error, in which case they will vanish as the number of replications approaches infinity.
However, these differences may also reflect that our method is just an approximation. Our proposed method so far takes into account only the consequences of inferential indeterminacy but not the consequences of sampling variability. It takes into account that the true cell probabilities and true expected cell counts cannot be known completely even if $n$ approaches infinity. It does not, however, take into account the fact that the cell probabilities are estimated on the basis of a finite $n$. It may thus be possible to improve on the coverage performance of the prediction intervals if a finite-$n$ correction were applied. However, this is beyond the scope of this paper, since it is mainly concerned with the consequences of inferential indeterminacy. Another reason is that exact identity between effective coverage and nominal coverage would only be achievable if the counts were continuous and not discrete. With finite $n$, discrete values may just be too coarse. For example, for a population of size 100,000 and the $7 \times 7 \times 200$ array with 9800 cells, the actual coverage of the cell counts falls clearly short. That may be caused by the fact that in this setting the average count per cell is very small, so that it is almost impossible to obtain exact percentiles for such counts. Nevertheless, even if we admit that the proposed method is only an approximation, it works better than prediction intervals that rely on the identifiability assumptions of the EI procedure of Johnston and Pattie to hold.

4 Accounting for Sampling Variability

In the previous section, we considered the uncertainty about estimates that comes from the unavailability of the complete data. The observed data, however, were assumed to be population-level summaries. Probabilistic methods are employed in the previous sections, first, to take into account that even population-level counts are the outcome of some stochastic data-generating process and, second, to take into account the uncertainty about estimates based on these counts. If, however, at least one of the available marginal tables is not a population-level summary but comes from a sample, for example, from a survey sample with data on ethnic identity and voting behavior, another level of uncertainty is added.

For example, suppose that instead of a population-level cross-tabulation of voting behavior and ethnic group membership, only a cross-tabulation $\mathfrak{m} = (m_{ij})$ from a sample is available, and the true counts in the array of combinations of ethnic group membership (with categories $i = 1, \ldots, I$), voting behavior (with categories $j = 1, \ldots, J$), and voting district (with running numbers $k = 1, \ldots, K$) are $x_{ijk}$. Then (provided that we have a simple random sample) the array of counts $(m_{ij})$ has a multinomial distribution with cell probabilities $q_{ij} = n_{ij\cdot}/n$, where $n_{ij\cdot} = \sum_k x_{ijk}$. For given marginal tables $\mathfrak{n}_1 = (n_{\cdot jk})$, $\mathfrak{n}_2 = (n_{i\cdot k})$, and $\mathfrak{n}_3 = (n_{ij\cdot})$, the method proposed in the previous section leads to one set of parameters $(\theta_{ijk}) = \theta(\mathfrak{n}_1, \mathfrak{n}_2, \mathfrak{n}_3)$ of a Dirichlet-multinomial model of the counts $x_{ijk}$ in the array. But for a given sample cross-tabulation $(m_{ij})$, there is more than one possible multinomial distribution with cell probabilities $q_{ij}$ and thus more than one possible marginal population-level table $(n_{ij\cdot})$ from which this sample may have been drawn.
That is, more than one set of parameters $(\theta_{ijk}) = \theta(\mathfrak{n}_1, \mathfrak{n}_2, \mathfrak{n}_3)$ has to be considered for modeling the distribution of the unknown counts in the array, one for each candidate table $\mathfrak{n}_3$. Of course, given a sample cross-tabulation, not all possible $(I \times J)$ arrays $(n_{ij\cdot})$ with $\sum_{i,j} n_{ij\cdot} = n$ are equally plausible candidates. Rather, the plausibility of these candidates is adequately expressed by the values $f(\mathfrak{n}_3 | \mathfrak{m}) = \Pr(\mathfrak{N}_3 = \mathfrak{n}_3 \mid \mathfrak{M} = \mathfrak{m})$ of the probability mass function of the conditional distribution of $\mathfrak{N}_3$ given the sample table $\mathfrak{M} = (M_{ij})$.

Now, since $(N_{ij\cdot})$ is an unobserved random variable in this perspective, it is no longer possible to simply construct prediction intervals based on a quantile function $F^{-1}_{\mathrm{Bb}}(\alpha; \theta_0, \pi_{ijk})$ whose parameters are computable from fixed, observed marginal tables $\mathfrak{n}_1$, $\mathfrak{n}_2$, and $\mathfrak{n}_3$. Rather, an appropriate quantile function is given by

$$F^{-1}(\alpha; \theta_0, \pi_{ijk} \mid \mathfrak{m}) = \sum_t F^{-1}_{\mathrm{Bb}}\!\left(\alpha; \theta_0, \pi_{ijk} \,\middle|\, \mathfrak{n}_3^{(t)}\right) f\!\left(\mathfrak{n}_3^{(t)} \,\middle|\, \mathfrak{m}\right), \qquad (20)$$

where the sum is over all possible tables $\mathfrak{n}_3^{(t)} = (n_{ij\cdot}^{(t)})$ ($t = 1, \ldots$) that satisfy $\sum_{i,j} n_{ij\cdot}^{(t)} = n$. This sum has no closed form, not least because there is no closed-form function that leads from the marginal tables $\mathfrak{n}_1 = (n_{\cdot jk})$, $\mathfrak{n}_2 = (n_{i\cdot k})$, and $\mathfrak{n}_3 = (n_{ij\cdot})$ to $\theta_0$ and $\pi_{ijk}$ (these parameters have to be computed by numerical methods). Therefore, we propose a bootstrap method to construct the limits of prediction intervals, which involves the following steps:

1. Each replication $r = 1, \ldots, R$ starts by generating random counts from an approximation of the conditional distribution of $(n_{ij\cdot})$ given $(m_{ij})$. This is done by a double-bootstrap procedure: First, random counts $(m_{ij}^{(r)})$ are generated from a multinomial distribution with size index $m$ and cell probabilities $\hat q_{ij} = m_{ij}/m$. From these random counts, a second, random set of cell probabilities $\hat q_{ij}^{(r)} = m_{ij}^{(r)}/m$ is computed. These are the cell probabilities of a multinomial distribution with size index $n$ from which a second set of counts $(n_{ij\cdot}^{(r)})$ is generated. This assures that the (random) marginal tables $(n_{ij\cdot}^{(r)})$ are integers from a multinomial distribution with size index $n$ and reflect the variability of the observed sample $(m_{ij})$.

2. Based on the marginal tables $(n_{ij\cdot}^{(r)})$, $(n_{i\cdot k})$, and $(n_{\cdot jk})$, first-stage cell probability estimates $\hat p_{ijk}^{(r)}$ are obtained from the Johnston-Pattie model for each replication $r = 1, \ldots, R$.

3. After setting $\pi_{ijk}^{(r)} = \hat p_{ijk}^{(r)}$, values $\theta_0^{(r)}$ are determined for each $r$ such that the Dirichlet distribution with parameters $\theta_{ijk}^{(r)} = \theta_0^{(r)} \pi_{ijk}^{(r)}$ has maximal entropy for the given $\pi_{ijk}^{(r)}$.

4. Random numbers $\tilde p_{ijk}^{(r)}$ from the Dirichlet distribution with parameters $\theta_{ijk}^{(r)}$ are generated for each $r$.

5. For each $r$, random counts $\tilde x_{ijk}^{(r)}$ are generated from a multinomial distribution with probability parameters $\tilde p_{ijk}^{(r)}$. For each $r$, the random array $(\tilde x_{ijk}^{(r)})$ thus has a Dirichlet-multinomial distribution, however, with different parameters $\theta_{ijk}^{(r)}$ for each $r$.

After $R$ replications, the predictions of the unknown cell counts $x_{ijk}$ are given by the averages $R^{-1} \sum_r \tilde x_{ijk}^{(r)}$ of the random counts for all $i$, $j$, and $k$, and the limits of the prediction interval for each cell count $x_{ijk}$ are given by the respective quantiles of the random counts $\tilde x_{ijk}^{(r)}$. (A compact sketch of this loop in R follows below.) In the next section, we demonstrate the application of this procedure to the reconstruction of percentages of split-ticket votes in the 1996 General Election of New Zealand.
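In outline, steps 1-5 might be coded as follows. This sketch reuses the ipf_jp and theta_maxent functions from the earlier sketches, assumes all margins are consistent (summing to $n$), and all remaining names are our own.

```r
# m:   I x J sample cross-tabulation (sample size m = sum(m))
# nik: I x K and njk: J x K observed district-level margins
# n:   population size; R: number of bootstrap replications
ei_boot <- function(m, nik, njk, n, R = 2000) {
  I <- nrow(m); J <- ncol(m); K <- ncol(nik)
  out <- array(NA_real_, dim = c(I, J, K, R))
  for (r in 1:R) {
    # Step 1: double bootstrap of the unobserved population table n_ij.
    # (rmultinom rescales prob internally, so raw counts can be passed):
    m1  <- rmultinom(1, sum(m), as.vector(m))            # resample the sample
    nij <- matrix(rmultinom(1, n, as.vector(m1)), I, J)  # scale up to size n
    # Step 2: first-stage cell probabilities via iterative scaling
    p.hat <- ipf_jp(nij, nik, njk) / n
    # Step 3: entropy-maximizing precision theta0 for these means
    th0 <- theta_maxent(as.vector(p.hat))
    # Step 4: one Dirichlet draw with parameters th0 * p.hat
    g <- rgamma(I * J * K, shape = th0 * as.vector(p.hat))
    # Step 5: multinomial counts given the drawn cell probabilities
    out[, , , r] <- rmultinom(1, n, g / sum(g))
  }
  out   # point predictions: apply(out, 1:3, mean); limits: quantiles over r
}
```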
5 A Real-World Example: Split-Ticket Voting in New Zealand

The term "split-ticket voting" refers to a pattern of voting behavior that can emerge when voters have the opportunity to cast several votes on the same occasion. For example, American citizens have the opportunity to cast a vote both for a candidate who runs for the Presidency and for a candidate for the House of Representatives. In parallel and mixed-member electoral systems, voters may cast two votes in general elections: one with which they can choose the candidate to represent the voting district they live in and one with which they can choose which party they want to support in the proportional tier of the electoral system. Variants of mixed-member electoral systems can be found, for example, in legislative elections in Bolivia, Germany, New Zealand, and Venezuela; parallel voting systems can be found in general elections, for example, in Japan, Mexico, and South Korea.

Strategic voting accounts emphasize the effect of "Duverger's Law" on the amount of split-ticket voting: Voters will give their most preferred party their list vote but, if they expect the district candidate of this party to have only little chance of electoral success, restrict their choice to those candidates they deem potentially successful, so as not to "waste their vote." Often, available empirical data do not suffice to test such accounts: Either only aggregate data on list and candidate votes in individual voting districts are available, or the survey data, if they are available, are too sparse with regard to candidate-vote/list-vote combinations in individual voting districts. Therefore, examining split-ticket voting is a typical field of application for EI (e.g., Burden and Kimball 1998; Gschwend, Johnston, and Pattie 2003; Benoit, Laver, and Gianetti 2004).

There are, however, some rare occasions where official district-level data are available not only on list-vote and candidate-vote results but also on the numbers of straight- and split-ticket votes. One of these occasions is the 1996 General Election of New Zealand. This gives us the opportunity to examine the performance of the methods developed in the two preceding sections in a "real-life" setting. Three sorts of data are available on the 1996 General Election of New Zealand: (1) official data on electoral results for party lists and for party candidates and independent candidates for each of the 67 voting districts, (2) official data on total numbers of straight-ticket and split-ticket votes for each of the voting districts, and (3) an 8 x 8 table of combinations of list and candidate votes from a nation-level survey sample (Levine and Roberts 1997; Johnston and Pattie 2000). That is, whereas the district-level aggregates $\mathfrak{n}_1 = (n_{\cdot jk})$ and $\mathfrak{n}_2 = (n_{i\cdot k})$ are observed, the nation-level aggregate $\mathfrak{n}_3 = (n_{ij\cdot})$ is not; only a sample cross-tabulation $\mathfrak{m} = (m_{ij})$ is. To make an approximately valid EI about the level of split-ticket voting in the individual voting districts, we thus need the bootstrap method developed in the preceding section.

In the context of research on split-ticket voting, one is usually not interested in all the counts $x_{ijk}$ in the array made up of candidate votes, list votes, and voting districts.
Rather, one is interested in the proportion of split-ticket and straight-ticket votes in the individual voting districts. This additional complication, however, can be tackled in a straightforward manner. The bootstrap method of the preceding section produces random counts $x_{ijk}^{(r)}$ ($r = 1, \ldots, R$) from which point and interval predictions of the true counts $x_{ijk}$ are computed. Now, based on

$$f_{\mathrm{straight},k}^{(r)} = \frac{\sum_{i=j} x_{ijk}^{(r)}}{\sum_{i,j} x_{ijk}^{(r)}} \qquad\text{and}\qquad f_{\mathrm{split},k}^{(r)} = \frac{\sum_{i\neq j} x_{ijk}^{(r)}}{\sum_{i,j} x_{ijk}^{(r)}},$$

we obtain random proportions of straight-ticket and split-ticket votes.⁸ The averages $R^{-1}\sum_r f_{\mathrm{straight},k}^{(r)}$ and $R^{-1}\sum_r f_{\mathrm{split},k}^{(r)}$ can then serve as point predictions of the proportions of straight-ticket and split-ticket votes, whereas the simulated quantiles can serve as limits of prediction intervals.

Figure 3 shows a comparison of actual against predicted percentages of ticket splitters in the voting districts of New Zealand along with 95% prediction intervals.⁹

[Fig. 3: dot plot of predicted percentages, true percentages, and prediction intervals against voting districts, sorted by predicted percentage of ticket splitters. Caption: Application of the second-stage maximum entropy approach to split-ticket voting in the 1996 General Election of New Zealand: true and predicted split-ticket percentages with prediction intervals based on bootstrap percentiles, after R = 2000 replications. The coverage of the true percentages is 95.5%.]

As can be seen in this plot, there are three instances in which the actual percentage of ticket splitters lies outside the prediction intervals. The total coverage of the actual ticket-splitting percentages by these prediction intervals thus is 95.5%, which is close to their nominal level. This constitutes a slight overcoverage, but with only 67 voting districts, one can hardly expect to achieve coverage exactly at the nominal level.

The application to the case of split-ticket voting in the 1996 General Election of New Zealand is another corroboration of the method developed in the preceding sections. Although, again, the match between the nominal coverage of the prediction intervals and their effective coverage is not perfect, it is close enough to suggest that the methods are, if not a final solution to the problem, a substantial step in the right direction. However, the comparison of predicted and actual percentages of ticket splitting, as well as the length of the prediction intervals, shows that EI cannot produce the miracle of delivering predictions from mere aggregates that are comparable in quality to estimates and predictions obtained from fully observed data.

⁸A further note on notation: The expression $\sum_{i=j} x_{ijk}$ refers to the sum of all elements $x_{ijk}$ in unit (voting district) $k$ for which the first index (e.g., the vote for the candidate of party $i$) equals the second index (e.g., the vote for the list of party $j$). The expression $\sum_{i\neq j} x_{ijk}$ refers to the sum of all elements in which the first and the second index differ.
⁹Rather than showing an ordinary scatter plot of predicted against actual percentages, this plot contains a dot plot of predicted and actual percentages against the individual voting districts (sorted by the predicted percentages) to make it easier to discern the prediction intervals of cases with similar predicted or actual percentages of ticket splitters.
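For concreteness, the proportions defined above and their percentile limits can be computed from the simulated counts along the following lines. This is a minimal Python sketch, assuming the bootstrap draws are stored row-wise as flattened arrays; the function and variable names are ours, not the authors'.

```python
import numpy as np

def split_ticket_summary(draws, shape):
    """draws: array (R, I*J*K) of simulated counts from the bootstrap;
    shape: the tuple (I, J, K). Returns point predictions and 95%
    limits for district-level split-ticket proportions."""
    R = draws.shape[0]
    I, J, K = shape
    x = draws.reshape((R, I, J, K))
    diag = np.eye(I, J, dtype=bool)              # cells with i == j: straight tickets
    straight = x[:, diag, :].sum(axis=1)         # (R, K) straight-ticket counts
    total = x.sum(axis=(1, 2))                   # (R, K) district totals
    f_split = 1.0 - straight / total             # split-ticket proportion per draw
    point = f_split.mean(axis=0)                 # point prediction per district
    lo, hi = np.percentile(f_split, [2.5, 97.5], axis=0)
    return point, lo, hi
```

Because the ratio is formed within each replication and district before taking quantiles, the resulting limits are genuine district-level prediction limits rather than intervals pieced together from cell-level quantiles.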
In the New Zealand application, the predictions of percentages of ticket splitters are in some districts roughly 10 percentage points away from the actual percentages. Also, the length of the prediction intervals amounts to, on average, 15 percentage points. This again makes clear that the basic dilemma of EI should lead scholars to take the greatest care when they interpret results obtained from an EI procedure.

6 Discussion

In Sections 3 and 4, we develop a method of accounting for the extra amount of uncertainty associated with estimates and predictions obtained from an EI procedure that stems from a problem we exposed in Section 2, the problem of inferential indeterminacy. Without the problem of inferential indeterminacy, it would be possible to model the data-generating process of the counts $x_{ijk}$ as a multinomial distribution. To model the consequences of inferential indeterminacy for the uncertainty about cell probabilities, we use the conjugate family of the multinomial, the family of Dirichlet distributions. To model the consequences for uncertainty about cell counts, we use the family of Dirichlet-multinomial mixture distributions.

The first to consider Dirichlet-multinomial mixture distributions for EI are Brown and Payne (1986). Their approach consists of an extension of the ecological regression model (12), which allows the conditional probabilities $p_{j|ik}$ to vary across units $k$ according to a Dirichlet distribution with given mean parameters $E(p_{j|ik})$ and precision parameter $\theta_0$. According to Brown and Payne (1986), both the mean and the dispersion parameters can be estimated from the marginal tables $n_{\cdot jk}$ and $n_{i\cdot k}$: the mean parameters based on generalized least squares and the precision parameter based on the variance of the residuals of the regression of $n_{\cdot jk}$ on $n_{i\cdot k}$. (Rosen et al. [2001] consider estimation of this model via MCMC, and a simpler model without a Dirichlet mixing distribution estimated via nonlinear least squares.) Since the precision parameter $\theta_0$ is estimated from the residuals of this regression, this model may account for model departures with respect to those aspects in which model (12) is more restrictive than the Johnston-Pattie model (8). But since the precision parameter of the model in Brown and Payne (1986) is estimated based on the observed marginal tables $(n_{i\cdot k})$ and $(n_{\cdot jk})$, the resulting Dirichlet model of the conditional probabilities can hardly account for those model departures that can be detected only if the full array $(x_{ijk})$ of counts is available. In contrast to the approach of Brown and Payne (1986), we do not try to estimate the precision parameter $\theta_0$ from the observable marginal tables but rather determine its value according to an a priori criterion, the Principle of Maximum Entropy. As our simulation experiments show, the values of the precision parameter thus determined can capture much of the uncertainty caused by the possibility of undetectable model departures. Therefore, we suggest that any method of constructing prediction intervals for EI should lead to intervals at least as large as those based on our proposed method. Otherwise, they cannot account for those undetectable model departures that haunt the confidence in results of EI.
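For reference, the Dirichlet-multinomial (compound multinomial) family discussed here has a standard closed form (cf. Mosimann 1962); we state it in the notation of the cell counts as a textbook identity, not as a formula reproduced from the paper:

$$\Pr\bigl((X_{ijk}) = (x_{ijk})\bigr) = \frac{n!}{\prod_{i,j,k} x_{ijk}!}\;\frac{\Gamma(\theta_0)}{\Gamma(\theta_0 + n)}\;\prod_{i,j,k}\frac{\Gamma(\theta_{ijk} + x_{ijk})}{\Gamma(\theta_{ijk})}, \qquad \theta_0 = \sum_{i,j,k}\theta_{ijk},$$

which reduces to the plain multinomial as $\theta_0 \to \infty$ with the ratios $\theta_{ijk}/\theta_0$ held fixed; the precision parameter $\theta_0$ thus governs how much extra dispersion the mixture adds beyond multinomial sampling variation.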
It may seem that our method of accounting for the consequences of modeling indeterminacy itself rests on a crucial assumption: that a priori, without taking into account the information contained in the marginal tables, any possible array of counts may occur with the same probability. This, however, would be a misunderstanding. The Uniform distribution reflects the ignorance about the true cell counts that characterizes the point of departure of EI problems. It plays a role similar to a "flat" or noninformative prior in Bayesian inference. Using an informative, non-Uniform distribution as a reference distribution may increase the risk of biased predictions unless this reference distribution is sufficiently close to the true distribution of the cell counts. We use a Uniform distribution as reference specifically to avoid such possible bias. If, however, prior information is available that can be summarized in a distribution of the cell probabilities, a generalization of the maximum entropy method can be used: One can then, instead of maximizing entropy, select a model that minimizes the directed Kullback-Leibler information divergence (also known as the Kullback-Leibler information criterion) relative to this prior distribution.¹⁰

One may argue that, as a method to construct a distribution of the cell counts, the method developed in Section 3 is deficient insofar as it only corrects the consequences of the parametric assumptions of EI methods, which could be bypassed by directly constructing a distribution of the unknown cell counts $x_{ijk}$ that, without any prior parametric assumptions, maximizes entropy subject to those constraints that reflect the information contained in the known marginal tables. In principle, such a construction will be straightforward; in practice, however, finding a solution will be infeasible because of its prohibitive computational complexity.¹¹

The method developed in Section 3 contains some concepts of Bayesian statistical inference, that is, a probability distribution on the parameters (the cell probabilities) of a multinomial distribution, a distribution that is a member of the family conjugate to the family of multinomial distributions. But it does not make use of Bayes's theorem. The combination of the Johnston-Pattie model with the second-stage maximum entropy construction may be viewed, at best, as an approximation to a posterior distribution of the cell counts. However, we are not able to show how good an approximation this is except by means of the simulation study, the results of which we report at the end of Section 3. Therefore, a direct Bayesian approach that makes use of the complete set of tools of this technique of statistical inference seems preferable to the approach proposed in this paper. As can be shown,¹² a posterior constructed by the straightforward application of Bayes's theorem to a noninformative prior distribution of the cell counts would be of surprisingly simple structure.

¹⁰As Kullback (1959) and Good (1963) have pointed out, the Principle of Maximum Entropy is just a special case of the Principle of Minimum Discrimination Information: Choosing the distribution with maximal entropy is equivalent to minimizing the directed Kullback-Leibler information divergence relative to a Uniform distribution. For details see Appendix (Section A) on the Political Analysis Web site.
¹¹For details see Appendix (Section A.4) on the Political Analysis Web site.
¹²See Appendix (Section C) on the Political Analysis Web site.
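The equivalence asserted in footnote 10 is a one-line identity. For a distribution $f$ over a finite set of $N$ candidate tables and the Uniform reference $u(t) = 1/N$,

$$D(f\,\|\,u) = \sum_t f(t)\log\frac{f(t)}{u(t)} = \log N - H(f), \qquad H(f) = -\sum_t f(t)\log f(t),$$

so minimizing the directed Kullback-Leibler divergence relative to $u$ is the same as maximizing the entropy $H(f)$; replacing $u$ by an informative reference distribution $g$ yields the generalization described above, minimizing $D(f\,\|\,g)$.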
However, the computation of the posterior probabilities of the counts will be as computationally demanding as a nonparametric maximum entropy model, making a direct Bayesian approach as infeasible as a nonparametric maximum entropy approach to modeling the distribution of the cell counts.

Finally, we want to emphasize that one should not yield to the temptation of using cell counts or cell probabilities as "data" in second-stage regression models. Apart from the conceptual issues involved in fitting models to data predicted from another model and from the possible bias incurred by using such second-stage regressions (Herron and Shotts 2003, 2004), conventionally estimated standard errors of a naive second-stage regression model will be highly inaccurate. Moreover, this problem cannot be cured by an increase in the number of spatial units considered or in the number of cases within the respective spatial units. Our simulation results reported at the end of Section 3 show how much one will be misled if one relies on asymptotic theory in these cases. Therefore, instead of treating estimates obtained from an EI as data, one should consider them as parameters of the predictive distribution for the unknown data and use this distribution for generating values for multiple imputation (e.g., King et al. 2001). Of course, since the amount of unknown data is quite large relative to the amount of known data, one would need a large number of imputed data sets in order to get reliable estimates of quantities of interest and their variances. The issues connected with such a use of EI results clearly deserve some further research, which is, however, beyond the scope of this paper.

7 Conclusion

The point of departure of our paper is a fundamental dilemma of EI and of inference in ill-posed inverse problems in general: Estimates can be identified only if certain restrictive assumptions are made with respect to the structure of the data-generating process leading to the unknown data one tries to reconstruct. However, these assumptions may not be satisfied by the unknown data, leading to serious bias in the estimates relative to the true values of the data. This may be called the dilemma of fundamental indeterminacy (Cho and Manski forthcoming). We tackle this dilemma by proposing a method of delimiting the error one has to expect when the assumptions needed for the identification of a solution are violated by the unknown data.

We focus on a special model for EI, a version of the model proposed in Johnston and Pattie (2000), because this model requires only relatively mild assumptions as compared to those required by ecological regression models. We examine the consequences of possible departures from the model assumptions. We find that prediction intervals based on the model assumptions are far too narrow and show a very serious undercoverage of the unknown data. The main contribution of our paper, however, is the development of a method for the construction of prediction intervals that are at least approximately correct. The method we propose consists in combining two stages.
In the first stage, point estimates for the unknown data are constructed based on a model that contains certain assumptions inevitable for the identification of the solution. In the second stage, we consider the set of all possible solutions, whether or not they satisfy these identification assumptions. We construct a probability distribution that meets the requirement that its expectation is the solution from the first stage but is otherwise as neutral as possible with respect to other possible first-stage solutions. This distribution assigns the weights of possible solutions, in terms of the values of its density function, as equally as possible, that is, it has maximal entropy, subject to the constraints on the expectation of this distribution. Results from a simulation experiment show that if the population for which the EI is to be made is large enough, prediction intervals based on our proposed two-stage method are approximately correct.

We supplement this simulation experiment with a real-world application: the prediction of district-level percentages of ticket splitting in the 1996 General Election of New Zealand based on aggregate data about candidate votes and list votes at the voting-district level and an 8 x 8 table obtained from a survey sample. This election is a rare opportunity to check the performance of an EI procedure against real data: Not only are district-level data on candidate votes and list votes and a national-level sample of candidate vote-list vote combinations available, but also district-level data on the percentages of straight-ticket and split-ticket votes. Therefore, we are able to compare predicted with actual percentages of split-ticket votes for each voting district. Within the context of this application, we address another issue that scholars dealing with aggregate data may face: these aggregate data are not always exact summaries of the population but may instead be a sample that summarizes the population. As a solution to this problem, we propose a combination of our two-stage maximum entropy method with bootstrapping from the empirical distribution of the sample. Our application of this combination to the New Zealand data shows that it results in prediction intervals with a coverage performance roughly equal to the nominal level: In 95.5% of the New Zealand voting districts, the actual percentage of split-ticket votes lies inside the 95% prediction intervals constructed by our proposed method.

The main conclusion of our paper thus is that standard errors and confidence intervals for EI problems that do not take into account the fundamental uncertainty associated with any solution to ill-posed inverse problems may be grossly misleading. However, the consequences of this fundamental uncertainty can be delimited, if not exactly, then at least approximately. Therefore, despite the problems discussed in this paper, to reject the idea of EI altogether on these grounds means throwing out the baby with the bathwater.

References

Abramowitz, Milton, and Irene A. Stegun, eds. 1964. Handbook of mathematical functions with formulas, graphs, and mathematical tables. Washington, DC: National Bureau of Standards.
Abramson, Paul R., and William Claggett. 1984. Race-related differences in self-reported and validated turnout. Journal of Politics 46:719-38.
Benoit, Kenneth, Michael Laver, and Daniela Gianetti. 2004.
Multiparty split-ticket voting estimation as an ecological inference problem. In Ecological inference: New methodological strategies, ed. Gary King, Ori Rosen, and Martin Tanner, 333-50. Cambridge, UK: Cambridge University Press.
Brown, Philip J., and Clive D. Payne. 1986. Aggregate data, ecological regression, and voting transitions. Journal of the American Statistical Association 81:452-60.
Burden, Barry C., and David C. Kimball. 1998. A new approach to the study of ticket splitting. American Political Science Review 92:533-44.
Cho, Wendy K. Tam. 1998. Iff the assumption fits ...: A comment on the King ecological inference solution. Political Analysis 7:143-63.
Cho, Wendy K. Tam, and Charles F. Manski. Forthcoming. Cross-level/ecological inference. In Oxford handbook of political methodology, ed. Janet Box-Steffensmeier, Henry Brady, and David Collier. Oxford, UK: Oxford University Press.
Cirincione, C., T. A. Darling, and T. G. O'Rourke. 2000. Assessing South Carolina's congressional districting. Political Geography 19:189-211.
Crowder, Martin J. 1978. Beta-binomial ANOVA for proportions. Applied Statistics 27:34-7.
Csiszár, Imre. 1991. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Annals of Statistics 19:2032-66.
Fienberg, Stephen E., Paul W. Holland, and Yvonne Bishop. 1977. Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Fienberg, Stephen E., and Christian P. Robert. 2004. Comment to 'Ecological inference for 2 x 2 tables' by Jon Wakefield. Journal of the Royal Statistical Society: Series A (Statistics in Society) 167:432-4.
Golan, Amos, George Judge, and Jeffrey M. Perloff. 1996. A maximum entropy approach to recovering information from multinomial response data. Journal of the American Statistical Association 91:841-53.
Good, I. J. 1963. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Annals of Mathematical Statistics 34:911-34.
Goodman, Leo A. 1953. Ecological regressions and the behavior of individuals. American Sociological Review 18:663-4.
-. 1959. Some alternatives to ecological correlation. American Journal of Sociology 64:610-25.
Groetsch, Charles W. 1993. Inverse problems in the mathematical sciences. Braunschweig and Wiesbaden: Vieweg.
Gschwend, Thomas, Ron Johnston, and Charles Pattie. 2003. Split-ticket patterns in mixed-member proportional election systems: Estimates and analyses of their spatial variation at the German federal election, 1998. British Journal of Political Science 33:109-27.
Herron, Michael C., and Kenneth W. Shotts. 2003. Using ecological inference point estimates as dependent variables in second-stage linear regressions. Political Analysis 11:44-64.
-. 2004. Logical inconsistency in EI-based second-stage regressions. American Journal of Political Science 48:172-83.
Hoadley, Bruce. 1969. The compound multinomial distribution and Bayesian analysis of categorical data from finite populations. Journal of the American Statistical Association 64:216-29.
Jaynes, Edwin T. 1957. Information theory and statistical mechanics. Physical Review 106:620-30.
-. 1968. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics 4:227-41.
Johnston, Ron J., and A. M. Hay. 1983. Voter transition probability estimates: An entropy-maximizing approach. European Journal of Political Research 11:405-22.
Johnston, Ron J., and Charles Pattie. 2000.
Ecological inference and entropy-maximizing: An alternative estimation procedure for split-ticket voting. Political Analysis 8:333-45.
Judge, George G., Douglas J. Miller, and Wendy K. Tam Cho. 2004. An information theoretic approach to ecological estimation and inference. In Ecological inference: New methodological strategies, ed. Gary King, Ori Rosen, and Martin Tanner, 162-87. Cambridge, UK: Cambridge University Press.
King, Gary. 1997. A solution to the ecological inference problem: Reconstructing individual behavior from aggregate data. Princeton: Princeton University Press.
-. 1998. Unifying political methodology: The likelihood theory of statistical inference. Ann Arbor: University of Michigan Press.
King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review 95:49-69.
King, Gary, Ori Rosen, and Martin A. Tanner. 1999. Binomial-beta hierarchical models for ecological inference. Sociological Methods and Research 28:61-90.
Kullback, Solomon. 1959. Information theory and statistics. New York: Wiley.
Levine, Stephen, and Nigel S. Roberts. 1997. Surveying the snark: Voting behaviour in the 1996 New Zealand general election. In From campaign to coalition: New Zealand's first general election under proportional representation, ed. Jonathan Boston, Stephen Levine, Elizabeth McLeay, and Nigel Roberts, 183-97. Palmerston North, NZ: Dunmore Press.
Mosimann, James E. 1962. On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49:65-82.
Openshaw, S., and P. J. Taylor. 1979. A million or so correlation coefficients: Three experiments on the modifiable areal unit problem. In Statistical methods in the spatial sciences, ed. N. Wrigley, 127-44. London: Pion.
-. 1981. The modifiable areal unit problem. In Quantitative geography: A British view, ed. N. Wrigley and R. J. Bennett, 60-70. London: Routledge and Kegan Paul.
Prentice, R. L. 1986. Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association 81:321-7.
Rosen, Ori, Wenxin Jiang, Gary King, and Martin A. Tanner. 2001. Bayesian and frequentist inference for ecological inference: The r x c case. Statistica Neerlandica 55:134-56.
Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379-423, 623-56.
Skellam, J. G. 1948. A probability distribution derived from the binomial distribution by regarding the probability as variable between the sets of trials. Journal of the Royal Statistical Society. Series B (Methodological) 10:257-61.
Steel, David G., Eric J. Beh, and Ray L. Chambers. 2004. The information in aggregate data. In Ecological inference: New methodological strategies, ed. Gary King, Ori Rosen, and Martin Tanner, 51-68. Cambridge, UK: Cambridge University Press.
Uffink, Jos. 1995. Can the maximum entropy principle be explained as a consistency requirement? Studies in History and Philosophy of Modern Physics 26B:223-61.
Vardi, Y., and D. Lee. 1993. From image deblurring to optimal investments: Maximum likelihood solutions for positive linear inverse problems.
Journal of the Royal Statistical Society. Series B (Methodological) 55:569-612.
Vasicek, Oldrich Alfonso. 1980. A conditional law of large numbers. Annals of Probability 8:142-7.
Wakefield, Jon. 2004. Ecological inference for 2 x 2 tables. Journal of the Royal Statistical Society: Series A (Statistics in Society) 167:385-426.