होम Political Analysis Ignoramus, Ignorabimus? On Uncertainty in Ecological Inference

Ignoramus, Ignorabimus? On Uncertainty in Ecological Inference

यह पुस्तक आपको कितनी अच्छी लगी?
फ़ाइल की गुणवत्ता क्या है?
पुस्तक की गुणवत्ता का मूल्यांकन करने के लिए यह पुस्तक डाउनलोड करें
डाउनलोड की गई फ़ाइलों की गुणवत्ता क्या है?
Political Analysis
January, 2008
PDF, 2.32 MB
Conversion to is in progress
Conversion to is failed

अपनी समीक्षा पोस्ट करने के लिए साइन इन करें या साइन अप करें
आप पुस्तक समीक्षा लिख सकते हैं और अपना अनुभव साझा कर सकते हैं. पढ़ूी हुई पुस्तकों के बारे में आपकी राय जानने में अन्य पाठकों को दिलचस्पी होगी. भले ही आपको किताब पसंद हो या न हो, अगर आप इसके बारे में ईमानदारी से और विस्तार से बताएँगे, तो लोग अपने लिए नई रुचिकर पुस्तकें खोज पाएँगे.

Advances in the Archaeology of the Pampa and Patagonia

PDF, 10.32 MB

Front Matter

PDF, 887 KB
Society for Political Methodology
Ignoramus, Ignorabimus? On Uncertainty in Ecological Inference
Author(s): Martin Elff, Thomas Gschwend and Ron J. Johnston
Source: Political Analysis, Vol. 16, No. 1 (Winter 2008), pp. 70-92
Published by: Oxford University Press on behalf of the Society for Political Methodology
Stable URL: http://www.jstor.org/stable/25791917
Accessed: 21-09-2016 01:54 UTC
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted
digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about
JSTOR, please contact support@jstor.org.

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at

Society for Political Methodology, Oxford University Press are collaborating with JSTOR to
digitize, preserve and extend access to Political Analysis

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Political Analysis (2008) 16:70-92

Advance Access publication December 8, 2007

Ignoramus, Ignorabimus? On Uncertainty
in Ecological Inference
Martin Elff
Faculty of Social Sciences, University of Mannheim, A5, 6, 68131 Mannheim, Germany
e-mail: eljf@sowi.uni-mannheim.de (corresponding author)

Thomas Gschwend
Center for Doctoral Studies in Social and Behavioral Sciences,
University of Mannheim, D7, 27, 68131 Mannheim, Germany

e-mail: gschwend@ uni-mannheim.de

Ron J. Johnston
School of Geographical Sciences, University of Bristol, Bristol BS8 1SS, UK
e-mail: r.johnston @ bristol. ac. uk

Models of ecological inference (El) have to rely on crucial assumptions about the individual
level data-generating process, which cannot be tested because of the unavailability of these

data. However, these assumptions may be violated by the unknown data and this may lea; d

to serious bias of estimates and predictions. The amount of bias, however, cannot be
assessed without information that is unavailable in typical applications of El. We therefore
construct a model that at least approximately accounts for the additional, nonsampling error
that may result from possible bias incurred by an El procedure, a model that builds on the
Principle of Maximum Entropy. By means of a systematic simulation experiment, we exam

ine the performance of prediction intervals based on this second-stage Maximum Entropy
model. The results of this simulation study suggest that these prediction intervals are at least

approximately correct if all possible configurations of the unknown data are taken into
account. Finally, we apply our method to a real-world example, where we actually know
the true values and are able to assess the performance of our method: the prediction of
district-level percentages of split-ticket voting in the 1996 General Election of New Zealand.
It turns out that in 95.5% of the New Zealand voting districts, the actual percentage of split

ticket votes lies inside the 95% prediction intervals constructed by our method.

1 Introduction
Many students and practitioners of social science have remained skeptical about the
feasibility of sound ecological inference (El). To some, the terms "ecological inference"
Authors' note: We thank three anonymous reviewers for helpful comments and suggestions on earlier versions of
this paper. An appendix giving some technical background information concerning our proposed method, as well
as data, R code, and C code to replicate analyses presented in this paper are available from the Political Analysis

Web site. Later versions of the code will be packaged into an R library and made publicly available on CRAN
(http://cran.r-project.org) and on the corresponding author's Web site.
? The Author 2007. Published by Oxford University Press on behalf of the Society for Political Methodology.
All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference


and "ecological fallacy" appear almost synonymous: They doubt whether it will be pos
sible at all to draw any conclusions about the behavior of individuals from aggregate data.
There seem to be good reasons for such skepticism: El has to rely on certain assumptions
about the data-generating process at the level of individuals that cannot directly be tested,
simply because the data on which such tests could be based are unavailable to begin with.

The burden of assumptions becomes especially visible in a recent survey of Bayesian
approaches to El for 2 x 2 tables given by Wakefield (2004), which shows that even
Markov chain Monte Carlo (MCMC) approaches are faced with this indeterminacy. Wake
field considers a total of 13 different variants of prior distributions for use in MCMC
analysis of 2 x 2 tables including King's original truncated normal prior, beta exponential,
and Student logistic gamma compound prior distributions and compares them with respect
to prediction bias for King's (1997) Louisiana party registration data. Although there is
one prior distribution that performs best in this application, one may still ask whether this

result can be generalized to all possible El applications to 2 x 2 tables. The question
remains, as Fienberg and Robert aptly remark in their comments to Wakefield's review, "to

what extent can we really distinguish between the fit of different models, hierarchical or

otherwise, when only aggregate data are available?" (Fienberg and Robert 2004, 432).
Nonetheless, without certain restrictive assumptions about the process generating the
unknown data, it is impossible to obtain any estimates for an El problem, however pre
liminary these estimates may be. It should be noted that the challenge of El is only a special
case of the wider class of ill-posed inverse problems (King 1997). The task of El is solving
a problem that is inverse insofar as only a set of summaries of the data of interest are given
and ill posed insofar as the information given by these summaries is not sufficient to
identify a solution. Problems of this kind abound in the technical and scientific literature
and numerous approaches to their solution have been proposed (see e.g., Groetsch 1993).
Therefore, we think that a general rejection of El procedures would be premature. Al
though the assumptions inherent in an El procedure cannot be tested by means of statistical
techniques, it is still possible to delimit the potential error that is associated with predic
tions from such a procedure. Constructing bounds to this potential error is the aim of the
present paper. We derive a method to construct "robust" prediction intervals, that is,
intervals that contain the true values of the unknown data with (at least approximately)
known probability. Further, we assess the performance of these prediction intervals by
means of simulation and a real-world example, the reconstruction of split-ticket votes in
the 1996 General Election of New Zealand. As point of departure, we take an El procedure
recently presented in this journal, the entropy-maximizing approach of Johnston and Pattie
(2000). When recast into a probability model, the Johnston-Pattie model imposes rela
tively mild and clearly structured restrictions on the unknown data-generating process and
thus suits well the exploration of the consequences of model departures. As it will turn out,
both in the simulation study and in the application to ticket splitting in New Zealand, the
prediction intervals that we construct have a coverage that is almost identical to their

nominal level.

The paper is organized as follows: In Section 2, we explain the fundamental dilemma of
El, which results from the necessity of El procedures that employ certain restrictive
assumptions that cannot be tested without the aid of the very data which are unavailable
in a typical El application. In Section 3, we propose a second-stage correction of the error
distribution of El estimates based on the Principle of Maximum Entropy and report the
results of a simulation study to assess the performance of the proposed method. In Section
4, we discuss how this method can be adapted to cases where some of the data on which El

is based do not come from population-level aggregates but from a survey sample. In
This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

Section 5, we illustrate our method with its application to split-ticket voting in the
General Election of New Zealand. Section 6 discusses the limits of our proposed meth
whereas Section 7 summarizes our results.1

2 A Basic Dilemma of El

The situation of El can be compared to reconstructing the "inner workings" of a "bl

box." These inner workings may be, for example, the numbers xijk of members
various ethnic groups (i = 1, ..., I) who do or do not turn out to vote (j = 1, ...,

voting districts (k = 1,..., K) or the probabilities Pv(Xijk = xijk) in which these counts
occur. If the total sum of the counts is n, for example, if the total population of eli

voters is n, then the total number of possible configurations of counts xijk that sum to

be expressed as the binomial coefficient (^^j^K_ ~\ - In typical instances of El, this is

a vast number: Even in a country with only 1 million (=ri) voters and three (=/) ethnic
groups, whose members may or may not turn out to vote (J = 2) in one of 100 (=K) voting
districts, there are

n + UK - 1 ^ = ( 106 + 599 ^ ^m (1)

UK-I J \ 599 J

possible ways to arrange the voters into the black box.
In the absence of any information about the marginal sums njk = Ylixijk> ?u ?
or rijj. = J2k xijk>mat is? ^ onty me totalsum n ?f me counts xijk is known and if one has to
make a point prediction about what configuration of counts is present inside the black box,

it seems that one cannot do better than pick any of the {^^j^K_~[ 1 ^ possible configura

tions at random.2 Such a random pick can be represented by a Uniform distribution on all

possible configurations of counts in the (I x J x #)-array that have a total sum of

n. The probability of any specific configuration of being chosen then is l j ^ n ^j^K_ 7 1)

Also, the probability of hitting the true configuration of counts ? =(x?jk) by accident is

also l j ~^jgK_ 7 1) Thus, one could also think of this true configuration as the outcome

of a random variable X=(XiJk) that has arrays T=(Xijk) as values and has a Uniform

distribution on its values.

Now, information about marginal sums of the cell counts, for example, the number of
members of ethnic groups within each voting district, can vastly reduce the number of
possible cell counts one has to consider. If we consider the case (for reasons of simplicity if
not of plausibility) that the observed turnout rate in all voting districts is 50%, then one has

An appendix, available online on the Political Analysis Web site, contains supplemental information: some
background on maximum entropy distributions, properties of the Dirichlet and Dirichlet-multinomial distribu
tions relevant for the argument of our paper, details on the computation of Beta-binomial prediction intervals,

and some details about two nonparametric alternatives to the method proposed in our paper. The Political
Analysis Web site also contains source code in R and C as well as data suitable for the replication of the analyses

presented here.

2A note on notation: We abbreviate , etc., as J2i anc* S/=i Z]y=i Ylk=i as X^y,* ^tne summation limits are

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference


only to consider, given a combination of voting district k and turnout status y, the numb
of possible triples of numbers that sum to nlJK = 106/200, that is,

(jkhj -1 \ = ^ iovmo.+3 -iy 107.if (2)
a number that is several orders of magnitude smaller than the number of possible config
urations if no information about the marginal counts was available. This suggests that
predictions about the unknown cell counts that take into account the marginal tables can
have a greatly improved performance as compared to predictions that do not. El methods
can be considered as attempts to make such improved predictions. Yet, these methods are
plagued by a serious problem, which we will expose in the following.
Statistical inference usually is made with the intention to describe a population based
on a sample. For example, one may try to identify a model of voting behavior that connects
voters' decisions to their own and the candidates' policy positions. If a correctly specified

model is found, it can be used to make out-of-sample predictions, which may be pre
dictions about future states of affairs or contrafactual states of affairs. One can also find

both types of usage of results of EL As an example of the first type of usage, Burden and

Kimball (1998) try to find how much and why American voters engage in split-ticket
voting between Presidential and Congress elections based on aggregated Presidential and
Congressional votes for American voting districts. An example of the other type of usage
is predicting the effects of changing the boundaries of voting districts on the representation

of racial groups in these districts and on the chances of Democratic and Republican
candidates in these rearranged districts (Cirincione, Darling, and O'Rourke 2000). As
with statistical inference, both usages of El need correctly specified models that describe
the population. In contrast to statistical inference, El is confronted with two problems that
together pose a dilemma. The first problem is that of modeling indeterminacy: If only
aggregates of the variables of interest are observed, there will always be more than one
model of an interrelation between these variables that even fits perfectly to the observed
aggregates. The second problem is that of inferential indeterminacy: If one arrives at
identifying a model describing the interrelation of the variables of interest, this model
will entail certain restrictive assumptions about the population, which cannot be tested

based on aggregate data alone, which usually are the only data available. Hence the
dilemma: If the first problem is solved, the second one is inevitably encountered. If one
tries to avoid the second problem, one cannot solve the first one.
The challenge of the first problem lies in finding assumptions suitably restrictive for
model identification. The criteria for suitability may vary with the application, but in most
cases one will use assumptions that are plausible on the one hand and convenient on the
other insofar as they lead to simple models for which estimation is feasible. However,
plausibility and simplicity may conflict.
Consider the case in which someone has aggregate data on turnout and the proportion of
African-Americans at the level of voting districts and wants to find out whether African
Americans differ from other citizens with respect to turnout. In this case, a model that
presupposes that in each voting district turnout and race are conditionally independent will

perfectly fit to the aggregate data and thus cannot be improved based on these data alone:
Let rii.k denote the number of African-Americans eligible to vote in district k, n2.k the
number of other citizens eligible to vote, n.ik the number of citizens eligible who actually
turn out to vote, and n.2k the number of citizens eligible who do not turn out to vote in

district k. Further, let p^ be the probability that an eligible citizen in district k is an

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

African-American (/ = 1) and turns out to vote (j = 1), is an African-American (/
does not turn out to vote (j = 2), is not an African-American (i = 2) and turns ou

(j = 1), or is not an African-American (/ = 2) and does not out to vote (j = I) and le
the actual number of African-Americans/others who turn out/do not turn out to

district k. We thus have = Yljtyk and n.jk = ]C/ *(/*
Then, a model that states
Pij\k =Pv\kP-j\k (3)
can be fitted to the aggregate data using the maximum likelihood estimates

PiM =Pi\kPj\k = \r\n.t Y,jn-jk (4)
This model implies that within each race group i in district k, the probability to turn out to

vote is equal:

_ Pij\k _ Pi-\kP.j\k _ Pj\k

Pj\ik ? ^-^-^-7 PJ

L<jPij\k 2^jPi\kPj\k 2^jPj\k

that is, turnout is unrelated to race. Of course, such a model s
fact that survey data indicate that turnout and race are statistic

and Claggett 1984). But this independence assumption cann
available aggregate data alone. On the other hand, any mode

turnout to be related, that is, p?j\k = CijPi.\kp.j\k with ctj ^ 1, YlicijPij\k= Pi\k, and
YlicijPij\k ~ P j\k, is empirically indistinguishable from the independence model. Thus, it
seems that based on aggregate data alone, one cannot decide whether race and turnout are
related or not.

Ecological regression (Goodman 1953, 1959) aims to overcome this problem by using
a model that is in some ways more but in other ways less restrictive. In the case of race and

turnout, it requires that the conditional probabilities of turnout are different across race

groups but are the same in all voting districts, that is, p^ = Pj\iPj\h Some refined eco
logical regression models even relax this assumption; they allow the conditional proba
bility to vary across districts according to some known properties of the districts, that is,

Pij\k - Pj\ikPi \k and Pj\ik = /(Pi/o + Pc/iZi* + + P(/dZd*)? where zu, ..., zDk are the
properties of the districts and pl5..., pD are the parameters to be estimated. In contrast to

the conditional independence model discussed above, ecological regression models can,

to some degree, empirically be tested. With maximum likelihood estimates

Pj\ik =/(Pyo + Piyi^u + + %DzDk) and pL\k = m.k/ ]T\ ni.k, it is still logically possible
that predicted turnout numbers per district n.jk = Y^jPj\ikni k differ from actually observed

turnout numbers n.Jk. Thus, in case of a poor fit between observed and predicted rates of

eligible citizens who turn out in each of the districts (indicated by k), one may conclude

that the ecological regression model is wrong and some of its assumptions have to be
lifted. To account for departures of observed counts in a marginal table, such as the
observed turnout per voting district, from predicted turnout, various authors extend this

ecological regression by a random component: In this extended ecological regression
model, conditional probabilities p^ik are not fixed parameters but outcomes of a random

variable. King proposes that these conditional probabilities have a truncated normal
This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Uncertainty in Ecological Inference

distribution. Brown and Payne (1986) and Rosen et al. (2001) use the more natural
assumption that the conditional probabilities have a Dirichlet distribution with mean

E(Pjlik)' = Kj]ik = E,?p(^ + R,izu + --- + R^) (6)
and precision parameter 9.

As Goodman (1953) notes, ecological regression models are suitable only if it is
reasonable to assume that there is a causal relation between the properties corresponding
to the marginal tables. But based on the marginal tables alone, that is, the district-level
aggregates, it is possible neither to establish such a causal relation nor to disprove it: As
just noted, in ecological regression models there may be a lack of fit between observed and
predicted counts in the marginal table of turnout per district. The independence model (3)
will thus almost always seem superior to any ecological regression model. But again, the
fact that the marginal tables can perfectly be fitted by the independence model does not
imply that, for example, turnout and race are unrelated since there are infinitely many
models that pose a relation between race and turnout which may also perfectly fit to the

marginal tables.
Statistical relations between variables represented by the observed marginal tables are
often suggested by survey samples. For example, there is evidence from election studies
(e.g., Abramson and Claggett 1984) that there is a clear statistical association between race
and turnout. Therefore, it is worthwhile to combine evidence from aggregate data with
evidence from survey data. An El model that allows for this is the Entropy-Maximizing
model proposed by Johnston and Pattie (2000) in an earlier issue of this journal. In contrast
to the models just discussed, Johnston and Pattie aim to directly make predictions about
the unknown counts xtjk without the need of any statistical model. They show that a model
of cell counts that are the most likely subject to the constraints









Although Johnston and Pattie (2000, 337) state that their proposed procedure is "math
ematical rather than statistical" and that therefore "no error terms" are attached to the
predicted cell counts, this is not the case: Readers familiar with log-linear contingency

table analysis (Fienberg, Holland, and Bishop 1977; King 1998) will realize that the
Johnston-Pattie model is actually a log-linear model for counts without three-way inter
actions. However, the correct form of this model is

logH,;/* = <*ij + Pi* + yjk + x, (9)
where \iijk is the mean of a Poisson distribution, that is, the counts may indeed vary around
this mean. The iterative scaling algorithm, which Johnston and Pattie propose, is the one

usually employed to find maximum likelihood estimates for such log-linear models. That
is, the "maximum likelihood solution" of Johnston and Pattie (2000, 335) is in fact the

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms



Martin Elff, Thomas Gschwend, and Ron J. Johnston

maximum likelihood estimates of the means of Poisson distributions that have the struc

ture of equation (9). If the sum J2ijkxijk = Eijk ^ijk *s known in advance as n, which is

usually the case in contingency table analysis and in El, the distribution of the cell counts
can also be modeled as a multinomial distribution with cell probabilities

exp(otff + pa + yjk)

Pijk Ew?p(*. + P^ + yJ" 1 }
Good (1963) has shown that finding the multinomial distribution that maximizes the

entropy ? Y^ij,kPij^nPijk subject to the constraints

















pi>k 2sr,s,t
= ^^iwR
^ 1st)
with exP(P<*) a ni k and exP(fyO a n.jk] (12)
eXP\Prf ~r
for the ecological regression model (without district-level covariates), we have

expK- + Jjk) exp(oty)

m = ErA,exP(a? + yw) wlth ** = E^K) ^ K "* (13)
whereas for the Johnston-Pattie multinomial model, we have

+ ptt + yjk)
Ptik =exp(oey
All these are special submodels of the saturated model

exp(a,7 + pa + yjk + 8fft)

Pm E^exp(ars + Prt + Y,f + 5r,)'
3More on the application of information theoretic concepts to contingency table analysis and to statistics in
general can be found in Kullback (1959). The argumentation of Johnston and Pattie (2000), however, follows
a nonprobabilistic interpretation of entropy mentioned by Jaynes (1968).

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference


which does not pose any restrictions on the structure of the cell probabilities. T
complete-data log-likelihood of this model is

i= ^2xijk^omjk=
ij,k ij

i,k j,k ij,k


The implications of this expansion of the complete-data log-likelihood are quite ambiv
alent: If, for example, the assumptions inherent in the independence model apply, that

oiij = 0 and bijk = 0 for all ij, and k9 the complete-data log-likelihood depends only on t
observed aggregates rii.k and tij.k and the parameters of the independence model can
estimated by direct maximum likelihood. Also, if the assumptions of the Johnston-Patt
model apply and odds ratios are equal in all districts, that is, the three-way interactio

parameters hijk are zero for all ij, and the complete-data log-likelihood depends only
the aggregates n,y., ntand rij.k. No such conclusion can be drawn with respect to eco

ical regression: Ecological regression models are used if the aggregate table n^. is

available. However, ecological regression models set parameters (3/.* and 8^ to zero bu
not OLij. Therefore, the complete-data log-likelihood for ecological regression is not a
special case of equation (16), but still it requires assumptions that are untestable with

access to the complete array of counts xijk. Thus, any of the models so far discu

requires assumptions that can be tested only if the complete data xijk are available. In
problems, however, they are not.
This problem, which may be called inferential indeterminacy, is much more seriou
than the fact that distributional assumptions of, for example, ecological regression mo

of King (King 1997; King, Rosen, and Tanner 1999; Rosen et al 2001) or Brown
Payne (1986) cannot be checked or that under certain circumstances an ecologica
gression model like King's may be susceptible to "aggregation bias" (Openshaw
Taylor 1979, 1981; Cho 1998; Steel, Beh, and Chambers 2004). For example, one m
want to make predictions about the number of African-Americans who will turn out to

in a specific voting district. If one uses one of the models for El discussed above and
assumptions inherent in this model do not hold, the predictions will be systematicall
biased, irrespective of the specific estimation procedure one uses: Suppose, for examp
that in a specific application of El, the complete data are generated from a multinom

distribution with cell probabilities p% as in equation (15) and all structural parameters a*,
PL yjh and are nonzero. Suppose further that one uses, for example, a procedure based
on the Johnston-Pattie model (8) for EL If n approaches infinity, the cell proportions xijkln

will converge to pfjt and the proportions n^Jn, n^n, and n.jk/n in the marginal tables will

converge to p% = Y<kP% P*k = HjP% d P% = respectively. Further, the

parameters of equation (8) will converge to values 6t,y, and yjk that maximize the
scaled expected log-likelihood


This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

Now even if 6ty, pljb and yjk are equal or very close to the corresponding parameters a*,
Pa, and yfk, ptjk and ptjk will in general be different because of the nonzero 8**. Conse
quently, estimates of cell probabilities based on the Johnston-Pattie model will have

asymptotic bias pyk ? p% and thus are not consistent. However, in order to obtain an
estimate of this bias, one will need an estimate of 8^, which is unavailable because only
the marginal tables are known.

This is what makes the dilemma posed at the beginning of this section so serious:
Without making certain identifying assumptions, one will not arrive, for example, at
a prediction about the number of African-Americans who turn out to vote at all, apart
from a random guess. The identifying assumptions, however, have crucial behavioral
implications and if these assumptions are wrong, one will incur biased predictions. But
these assumptions cannot be tested without access to the complete data, which are un
available to begin with. Therefore, in any specific instance, one will not know how large
this bias actually is.

3 Accounting for Inferential Uncertainty: A Maximum Entropy Approach
Estimating cell probabilities with the help of one of the models discussed in the previous
section exhausts the resources for statistical inference. The models used for estimating the
cell probabilities all employ certain restrictive assumptions that are necessary for the
identification of cell probability estimates. Yet, as long as one is faced with the task of
El, these assumptions cannot be tested. They could be tested as a statistical hypothesis
only if the complete data were available. But then, the problem no longer would be one of
El. This dilemma seems to sustain a skepticism with regard to the validity of El. However,

we would argue that a healthy amount of skepticism does not force us to give up the
attempt at El altogether. Doing so would mean ignoring the information contained in the
marginal tables; it would mean throwing out the baby with the bathwater. On the other
hand, it is clear that one should not put the same confidence in predictions from an El
model as one would in predictions from a well-tested statistical model.

In the present section, we propose a model that allows taking into account the
uncertainty associated with estimates obtained from an El. Since the resources for
statistical inference are already exhausted, the model we propose cannot be justified
in terms of theoretical statistics. All we can do is to appeal to some general principles

that have some plausibility. The principle on which our proposition is based is the
Principle of Maximum Entropy, which is a generalization of the (in)famous Laplace's
Principle of Indifference. Before we present our proposed model, we need to explain
what the Principle of Maximum Entropy entails and in what perspective we hold it to be

Laplace's Principle of Indifference, of which the Principle of Maximum Entropy is
a generalization (Uffink 1995), postulates that one should assign to each elementary out

come ..., {xn} of a probability experiment, in absence of any prior information, the
same probability l/n. For example, if the experiment is throwing a dice, one should assign
to the outcome of each number one, two, three, four, five, or six the same probability 1/6.

The Principle of Maximum Entropy generalizes this principle to cases where some prior
information of the probability distribution in question is available and where the probability

distribution may be continuous with infinite support. It postulates that, if only some
moments of the probability distribution (i.e., some nonrandom functions of the probability

distribution, like, e.g., mean and variance) are given in advance, one should select the
probability distribution that has maximal entropy among a set of probability distributions

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference


with the same support with the given moments. This principle leads to some common
families of probability distributions such as the family of normal distributions or the family

of exponential distributions. For example, the normal distribution with zero mean and
variance ex2 has maximal entropy of all continuous distributions over the real line with zero
mean and variance a2. The exponential distribution with parameter X has maximal entropy
of all continuous distributions over the positive half real line with mean X~1 (Shannon 1948

The Maximum Entropy Principle has been used to specify "reasonable" null hypoth
eses for contingency table analysis (Good 1963; Golan, Judge, and Perloff 1996), to
specify noninformative priors for Bayesian inference (Jaynes 1968), and for proposing
solutions to ill-posed inverse problems (Vardi and Lee 1993). But it has also been used for
El (Johnston and Hay 1983; Johnston and Pattie 2000; Judge, Miller, and Cho 2004). Here,

we use the Principle of Maximum Entropy to motivate and construct a second-stage
probability model to account for inferential uncertainty.
Suppose (Xijk) is an (/ x J x ?)-array of counts generated by a multinomial distribution

with size index n (the "population size") and cell probabilities pfjk. Suppose, further, that
one has knowledge only about the marginal tables of this array and tries to make pre
dictions about the cell counts. Using an El method as discussed in the previous section, one

may arrive at an estimate p^k for the cell probabilities. The model by which the cell
probabilities are estimated may or may not be correctly specified, that is, the "true" cell

probabilities pfjk may or may not satisfy the constraints inherent in the model. Based on the
El procedure of choice, one makes the prediction npijk about the cell count xijk. Then, the
error of prediction of xijk by pijk can be decomposed as follows:

Xijk - npijk = n - pfjk^j + n(pfjk- pijk). (18)
The difference (xyk/n) ? pfjk will have mean zero and variance p|*(l ? pfjk) In and thus
will be the smaller the larger n is, while the difference pfjk ? Pijk will not become smaller
unless the model leading to p^ is correctly specified, that is, if the assumptions of the El

model employed are satisfied by pfjk. In the analysis of bias at the end of the last section,
we held pfjk fixed and considered the estimator as random. In contrast, we now propose
to change the roles of p^k and pfjk, that is, to treat p^ as fixed, in virtue of being a function
of the known marginal tables, and pfjk as the realization of a random variable Pijk.
If we are completely ignorant about the data array (xijk) except for the total sum n, then

any possible array of numbers (pijk) with 0 < pijk < 1 and J2ij kPijk = 1 would seem

equally plausible as having generated the unknown array of counts. We can represent this,
according to Laplace's Principle of Indifference, by a probability distribution with a den

sity function that is Uniform for all admissible arrays (p^). Now, if each of the n indi
viduals in the (I x J x /Q-array falls into the cell (ij, k) with probability Pijk, which is an
element of a random array (Pijk) with a Uniform distribution, then each possible array (xijk)
that may result from such a process has the same chance of occurrence. That is, there is

a direct connection between the perspective that focuses on our ignorance about the counts
xiJk and the perspective that focuses on our ignorance about the data-generating process.
Taking this as a baseline, we can construct a distribution of plausible arrays (pijk) that

reflects our ignorance about the true cell probabilities (pfjk) that generated the unknown
4Although there is some relation between the two, "entropy" refers here to a functional of density or probability
mass functions and not to entropy in the sense of statistical mechanics and thermodynamics. For the relation, see

Jaynes (1957). There are several attempts at giving this principle a general axiomatic foundation, for example,

Jaynes (1957), Vasicek (1980), and Cziszar (1991).

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

data (Xijk)?an ignorance that is only reduced by the information contained in the margina

(riij.) =should
= (j2j distribution
xijk), and
(n.jk)in the
= and
a distribution
be as similar(/!/.*)
to the Uniform

recovered by the EL

previous paragraph as possible under the restriction that its mean is given by the estimates
produced by the El method used. The Uniform distribution under consideration is a special
case of a Dirichlet distribution, that is, a Dirichlet distribution with all shape parameters

equal to one. Now, if we select a Dirichlet distribution that maximizes entropy under the
constraint that its mean is equal to the estimates obtained from an El procedure, then we
have the distribution that is, under these constraints, the most similar to the Uniform
distribution in terms of the Kullback-Leibler criterion for the similarity of distributions.5

Informally speaking, the maximum entropy criterion leads here to the "flattest," that is
least informative distribution with the given mean.
The selection of such an entropy-maximizing Dirichlet distribution is possible since by
its mean a Dirichlet distribution is only partially specified: If a multidimensional random
variable (Pijk) has a Dirichlet distribution with parameters 0^, its components have ex

pectations Kjjk : = E(Pijk) ? where 0o : = Ylr,s,t ? t- Therefore, each parameter 0^ of
a Dirichlet distribution can be decomposed into a mean parameter nijk and a common scale

parameter 0O by 0^ = nijk%, so that an entropy-maximizing Dirichlet distribution with
mean niJk fixed at ptjk can be identified by maximizing

Hv(9) = ^mr(0o/%)
- lnr(0o) + (Onij,k
- IJK)ty(%) - ^ (%pijk ~ lM%Pijk)
for 0O, where v|/( ) is the digamma function (Abramovitz and Stegun 1964, 258).6
Since the family of Dirichlet distributions is a multivariate generalization of the family
of Beta distributions, we can use this family of distributions to illustrate what finding
a maximum entropy distribution entails. A Beta distribution is usually characterized by its
two shape parameters, which we call here (\>i and <$>2. The mean of this distribution then is

71: = ^iA^i + ^2)5 so if we define 0O : = cj>t H- fy2, the shape parameters can be reex
pressed as = Qon and (J)2 = 0o(l ? n) and the variance as n(l - n)/(l + 0O). Figures 1
and 2 depict how a Beta distribution will look like for n fixed at 0.5 and 0.2, respectively,
for various values of 0O below, above, and equal to 0MaxEnt> where 0MaxEnt denotes the
value of 0O for which the entropy of the Beta distribution is maximal. As Fig. 1 shows,
the Uniform distribution over the interval [0,1] is a special case of a Beta distribution: the

Beta distribution with maximal entropy subject to the constraint that the expectation is
equal to 0.5. Both figures indicate that, irrespective of the value of the expectation of the
distribution, values of 0o above 0MaxEnt lead to single-peaked densities, where the peak is
the higher the larger 0O is. However, as 0O gets smaller, the density function puts more and
more weight on values around zero and one. Since the variance approaches a supremum of
7i(l - 71) as 0O approaches zero, the variance of a Beta density is not a good measure of
uncertainty. Conversely, since the weight of the density is most evenly distributed if the
0O attains the entropy-maximizing value, the entropy seems to be a much better measure of
uncertainty if the expectation n is fixed.

If each of the n individuals in the (I x J x ?)-array falls into the cell (/, j, k) with
probability Pijk, where Pijk itself is part of a random array with a Dirichlet distribution with
5For details see Appendix (Section A.l) to this paper on the Political Analysis Web site.
^or a formal proof of the validity of this formula, see Appendix (Section B.2) at the Political Analysis Web site.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Uncertainty in Ecological Inference

E(P) = 0.5, e > eM
- " _ "MaxEnt
e = 21" 0MaxE

-e = 24- eMay.

E(P) = o.5, e<eM
- 9 - 9MaxEnt
-- - e = 2"1-eMi

-e = 2-4- eM.


o h ---,-1-1-1
0.0 0.2 0.4 0.6 0.8 1.0


0.0 0.2 0.4 0.6 0.8 1.0

Fig. 1 Density of Beta distributions with mean held fixed at
parameter 0O.

parameters 0^ = %iZijk, then the distribution of the possible resulting arrays is a mix
ture of a multinomial distribution with a Dirichlet distribution, a compound-multinomial
or Dirichlet-multinomial distribution (Mosimann 1962; Hoadley 1969). Just as the family
of Dirichlet distributions is a multivariate generalization of the family of Beta distri
butions, the family of Dirichlet-multinomial distributions is a multivariate generalization
of the family of Beta-binomial distributions, which has been used to model over

dispersed proportions and success counts (Skellam 1948; Crowder 1978; Prentice
1986). In fact, if an array of random variables (Pijk) has a joint Dirichlet distribution with
parameter array (6^), each component of the array has a Beta distribution with parameters

0o fyijk and 0O(1 - nyk) and each component of the corresponding array of counts (x^) with

a Dirichlet-multinomial distribution has a Beta-binomial distribution with parameters

Qofyjjk and 0O(1 - %ijk) (Mosimann 1962), where, as before, 0o = k 0//*. Now, the counts
in each cell have expectation mZyk and variance wi^Q - TC^Xn -f 0o)/(l + 0o)- That is, the
E(P) = 0.2, 6>0MaxEnt=4.35

? e = 21

E(P) = 0.2, G<eMaxEnt=4.35
- ? _ ?MaxEnt

-- - e = 2-1eMa

-e = 2^ eMa






0.0 0.2 0.4 0.6 0.8 1.0

Fig. 2 Density of Beta distributions with mean held fixed at
parameter 0O.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

expectation is the same as if the counts had a multinomial distribution with cell probabil
ities nijk, whereas the variance differs from the variance of such a multinomial distribution

by the factor (n + 60)/(l + 60). Although proportions XiJk/n of counts Xijk with binomial
distribution are asymptotically normal and although arrays of proportions X^n of arrays
of counts Xijk with multinomial distribution are asymptotically multivariate normal as n
approaches infinity, it will be a mistake to assume asymptotic normality in the case of

counts that have a Beta-binomial or Dirichlet-multinomial distribution, respectively: The
distribution of the proportions will converge to a Beta distribution or Dirichlet distribution,

respectively, which can have a shape quite dissimilar to the normal distribution if it has

maximal entropy.
The above considerations make clear that one cannot assume that the asymptotic dis
tribution obtained from an El procedure is normal. Therefore, one should not use the
normality assumption to construct confidence or prediction intervals based on the standard
errors of the estimates. Rather, we propose to use the quantile function of the Beta distri
bution to construct approximate credibility intervals for the cell probabilities piJk and the
quantile function of the Beta-binomial distribution to construct approximate prediction

intervals for the cell counts xiJk. If the counts in cell (i, j, k) have a Beta-binomial
distribution with parameters 0orc,y* and 0O(1 - tc,^), 95% prediction intervals would
be, for example, delimited by F^(0.025; 0O, ft/,*) and Fj^J(0.975; 0O, n^) + 1, where

^Bb(a;9o,7r.//fc) : =sup{jc: FBb;%,nijk(x) < ol} is the quantile function and

FBb(x;%, nijk) : = X)*=o/Bb(*; 9o,rc,yj0 *s tne cumulative distribution function of the
Beta-binomial distribution with parameters Oo^y* and 0O(1 - niJk). That is, under the
assumption that the counts in cell (/, j, k) have this Beta-binomial distribution, the proba
bility that the counts are in these intervals will be 95%.7
Since the reasoning behind this procedure is rather heuristic than justifiable in terms of
theoretical statistics or probability theory, it seems necessary to assess its performance by
way of a simulation study. Therefore, we conducted a systematic simulation experiment in
which we vary the size of the array of counts (x^) and its total sum n. For each considered
array size / x J x AT and each "population size" n, we (1) generated 2000 arrays of random

numbers (x$) with r - 1, ..., 2000, (2) computed the marginal tables (njj)), (? *)> and
(nS), and (3) used the Johnston-Pattie procedure to generate estimates of cell probabilities

(Pijk)- We then (4) constructed prediction intervals for the cell counts based on the as
sumption of the Johnston-Pattie model that the cell counts have, jointly, a multinomial

distribution and, individually, a binomial distribution with success probability pjj? and (5)

prediction intervals based on the procedure proposed in the present section. We then (6)

recorded whether the counts x^ are covered by the respective prediction intervals, that is,

whether they are inside the intervals. The random counts are generated such that each array

of counts (x^k) that sums to n has the same chance of occurrence. Thus, the generated
arrays of counts are a simple random sample from all such arrays and the distribution of the
generated counts represents the initial ignorance about the interior of the black box before
a procedure of El is applied. The simulation results may also be generalized such that they
are representative for the average performance with respect to all possible interiors of the
black box. By recording the performance of prediction intervals both based on assump

tions of an El procedure and based on the second-stage maximum entropy procedure, we

are able, first, to demonstrate the consequences of the indeterminacy that besets El
and, second, to show the degree to which our proposed second-stage maximum entropy
7For details about the computation of Beta-binomial cumulative probability and quantile functions see Appendix
(Section B.4) on the Political Analysis Web site.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Uncertainty in Ecological Inference

Table 1 Simulation study of coverage of true cell counts and true cell probabilities after

2000 replications

Population size

Array size 100,000 10,000,000
(a) Effective coverage by prediction 3 x 3 x 50 18.4 2.1

intervals based on the assumption of 7 x 7 x 50 28.0 2.6
a multinomial distribution 3 x 3 x 200 33.0 3.3
7 x 7 x 50 54.1 6.0

(b) Effective coverage by prediction 3 x 3 x 50 95.5 96.3
intervals based on the second-stage 7 x 7 x 50 93.1 94.1

maximum entropy method 3 x 3 x 200 95.0 96.4
7 x 7 x 200 88.5 93.8

procedure improves over a "naive" application of El and r
uncertainty associated with El procedures.
Table 1 presents the results of our simulation study rega
of 95% prediction intervals of these two types with respe

the respective arrays. Since the distribution from wh

symmetric with respect to the cells in the array, any cell
set of counts in the array as any other cell. For convenien

(1, 1, 1). Panel (a) of Table 1 reports the coverage per

intervals based on the assumption that the counts have a
correctly specified cell probabilities, whereas panel (b) re
of prediction intervals constructed based on the second-st

A comparison of the two panels (a) and (b) in Table

intervals are a large improvement in comparison to naiv
vals. Panel (a) shows that the effective coverage of multi

diction intervals that cover the true counts decreases with
population size n is 10,000, the effective coverage may r
falls short from the nominal coverage of 95%. But if the
undercoverage of the naive prediction intervals is disastr
are the true counts inside a 95% prediction interval. This
inconsistency of the estimates of the cell probabilities: Th
suppose that the standard deviation of xijk, conditional on
creases with the population size n only proportional to y/n
prediction intervals will decrease relative to the range of
delimited by zero and n) proportional to 1 /yfn. Due to th
simulation experiment the cell probabilities will almost su
root mean square error of predictions based on the cell es

incurred by misspecification. As n grows large, the effect of

will dominate the random variation of the cell counts xijk

In contrast, the maximum entropy Dirichlet-multinom

show an effective coverage quite close to their nominal l

Table 1. That the prediction intervals based on the secondstill differ from their nominal level may have several rea

be a consequence of simulation error, in which case these

number of replications approaches infinity. However, thes

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

our method is just an approximation. Our proposed method so far only takes into account

the consequences of inferential indeterminacy but not the consequences of sampli

variability. It takes into account that the true cell probabilities and true expected cell
counts cannot be known completely even if n approaches infinity. It does not, however

take into account the fact that the cell probabilities are estimated on the basis of a finite n. I

may thus be possible to improve on the coverage performance of the prediction intervals
a finite-n correction could be applied. However, this is beyond the scope of this paper sin
it is mainly concerned with the consequences of inferential indeterminacy. Another reas

is that exact identity between effective coverage and nominal coverage would only be
achievable if the counts were continuous and not discrete. With finite n, discrete values m

just be too coarse. For example, for a population of size 100,000 and for 7 x 7 x 200

9800 cells, the actual coverage of the cell counts falls clearly short. That may be caused b

the fact that in this setting, the average counts per cell are less than four which is clear

below 100. Therefore, it will be almost impossible to obtain exact percentiles for su

counts. Nevertheless, even if we admit that the proposed method is only an approximation

it works better than prediction intervals that rely on the identifiability assumptions of t
El procedure of Johnston and Pattie to hold.

4 Accounting for Sampling Variability

In the previous section, we considered the uncertainty about estimates that comes from th

unavailability of the complete data. The observed data, however, were assumed to b

population-level summaries. Probabilistic methods are employed in the previous sections

first, to take into account that even population-level counts are the outcome of som

stochastic data-generating process and, second, to take into account the uncertainty abou
estimates based on these counts. If, however, at least one of the available marginal tables
not a population-level summary but comes from a sample, for example, from a survey
sample with data on ethnic identity and voting behavior, another level of uncertainty


For example, if instead of a population-level cross-tabulation of voting behavior and
ethnic group membership only a cross-tabulation xtl=(mij) from a sample is available and
the true counts in the array of combinations of ethnic group membership (with categories

/ = 1, ..., /), voting behavior (with categories j = 1, ..., J), and voting district (with
running numbers k = 1, ..., K) are xijk, then (provided that we have a simple random
sample) the array of counts m,y has a multinomial distribution with cell probabilities qtj =

njjJn, where = Ylkxijk> Eor given marginal tables ni=(n.jk), *l2=(ni.k), and 1*3=(n^-.),
the method proposed in the previous section leads to one set of parameters (0^)=
0(tti,ri2,n3) of a Dirichlet-multinomial model of the counts xijk in the array. But for a given

sample cross-tabulation m^, there is more than one possible multinomial distribution with

cell probabilities qtj and thus more than one possible marginal population-level table ntj.

from which this sample may have been drawn. That is, there is more than one set of
parameters to be considered for modeling the distribution of the unknown counts in the

array (0,y*)=<Knl.n2,n3).
Of course, given a sample cross-tabulation, not all possible (/ x J) arrays n^. with
5^. ny. = n are equally plausible candidates. Rather, the plausibility of these candidates is
adequately expressed by the values fi^3\m)=?r(^=n^\^l=Vti) of the probability mass
function of the conditional distribution of 9t3given the sample table (Afly)=SDt.
Now, since (Ny.) is an unobserved random variable in this perspective, it is no longer

possible to simply construct prediction intervals based on a quantile function
This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference


FBlJ (a; Go, n^) whose parameters are computable from fixed, observed marginal tables
tt2, and tt3. Rather, an appropriate quantile function is given by

F~l (a; 0O, nijk \fm) = ^ (a; 0O, nijk \ nf)f(n^ |m ), (20)
where the sum is over all possible tables n^=(jiij(t)) (t = 1,...) that satisfy ?V= ?.

This sum has no closed form, not the least because there is no function in closed form that

leads from the marginal tables t*i=(?.,*), n2=(?/.*), and 1*3=(/i,y.) to 0O and nijk (these
parameters have to be computed by numerical methods). Therefore, we propose a bootstrap
method to construct the limits of prediction intervals, which involves the following steps:

1. Each replication r = 1, R starts by generating random counts from an
approximation of the conditional distribution of (riij) given (m,y). This is done by

a double-bootstrap procedure. First, random counts (m\p) from a multinomial

distribution with size index m and cell probabilities = m^/m are generated.

From these random counts, a second, random set of cell probabilities p^ = m^/m
is computed. These are the cell probabilities of a multinomial distribution with

size index n from which a second set (n^) of counts is generated. This assures
that the (random) marginal tables (n^;) are integers from a multinomial

distribution with size index n and reflect the variability of the observed sample (m,y).

2. Based on the marginal tables in]?}), (rii.k), and (n.jk), first-stage cell probability

estimates pj^ are obtained based on the Johnston-Pattie model for each

replication r = 1, ..., R.

3. After setting = p^ values, 0^ are determined for each r such that the Dirichlet
distribution with parameters 0-^ = ^n\jk nas maximal entropy for given tt-^.
4. Random numbers p~k from the Dirichlet distribution with parameters 0-^ are
generated for each r.

5. For each r, random counts jtj^ from a multinomial distribution with probability

parameters p^k are generated. For each r, the random array (jcj^) has thus
a Dirichlet-multinomial distribution, however, with different parameters 0^ for
each r.

After R replications, the predictions of the unknown cell counts xijk are given by the

averages R~l Y2r xljk ?ftne random counts for all ij, and k, and the limits of the prediction
interval for each cell count xtjk are given by the respective quantiles of the random counts

x[jk. In the next section, we demonstrate the application of this procedure to the recon
struction of percentages of split-ticket votes in the 1996 General Election of New Zealand.

5 A Real-World Example: Split-Ticket Voting in New Zealand
The term "split-ticket voting" is a pattern of voting behavior that can emerge when voters

have the opportunity to cast several votes on the same occasion. For example, American
citizens have the opportunity to cast a vote both for a candidate who runs for the Presi
dency and for a candidate for the House of Representatives. In parallel and mixed-member
electoral systems, voters may cast two votes in general elections, one with which they can
choose the candidate to represent the voting district they live in and one with which they
can choose which party they want to support for the proportional tier of the electoral
system. Variants of mixed-member electoral systems can be found, for example, in leg
islative elections of Bolivia, Germany, New Zealand, and Venezuela; parallel voting sys
tems can be found in general elections, for example, in Japan, Mexico, and South Korea.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

Strategic voting accounts emphasize the effect of "Duverger's Law" on the amount of
split-ticket voting: Voters will give their most preferred party their list vote but, if they

expect that the district candidate of this party to have only little chances of electoral
success, restrict their choice on those candidates they deem to be potentially successful,
thus not to "waste their vote." Often, available empirical data do not suffice to test such
accounts. Either there are only aggregate data on list and candidate votes in individual
voting districts available or the survey data, if they are available, are too sparse with regard
to candidate vote-list vote combinations in individual voting districts. Therefore, exam
ining split-ticket voting is a typical field of application for El (e.g., Burden and Kimball
1998; Gschwend, Johnston, and Pattie 2003; Benoit, Laver, and Gianetti 2004).
There are, however, some rare occasions where official district-level data are available

not only on list vote and candidate vote results but also on the numbers of straight- and
split-ticket votes. One of these occasions is the 1996 General Election of New Zealand.

This gives us the opportunity to examine the performance of the methods developed in the
two preceding sections in a "real-life" setting.

Three sorts of data are available on the 1996 General Election of New Zealand: (1)

official data on electoral results for party lists and for party candidates and independent
candidates for each of the 67 voting districts, (2) official data on total numbers of straight
ticket and split-ticket votes for each of the voting districts, and (3) an 8 x 8 table o
combinations of list and candidate votes from a nation-level survey sample (Levine and
Roberts 1997; Johnston and Pattie 2000). That is, whereas the district-level aggregates

tii=(n.Jk) and tt2=(/i,.*) are observed, the nation-level aggregate tt3=(H,;/.) is not but only
a sample cross-tabulation m=(m#). To make an approximately valid El about the level of
split-ticket voting in the individual voting districts, we thus need the bootstrap method
developed in the preceding section.
In the context of research on split-ticket voting, one is usually not interested in all
counts xijk in the array made of candidate votes, list votes, and voting districts. Rather, one
is interested in the proportion of split-ticket and straight-ticket votes in the individual
voting districts. This additional complication, however, can be tackled in a straightforward

manner. The bootstrap method of the preceding section produces random counts x^ (r =
1, ...,/?) from which point and interval predictions of the true counts xijk are computed.

Now, based on

Jr) _ 2^i=jXijk * Ar) _ LiftXjjk r?n

/straight,* - ^ (r) dnu /split,* ~~ ^ (r) >
Z^iJ Xijk Xijk

we obtain random proportions of straight-ticket and split-ticket votes.8 The averages

r Sr/straight,* an<* h Sr/spiit,* can men serve as a Pomt prediction of the proportions of

straight-ticket and split-ticket votes, whereas the simulated quantiles can serve as limits of
prediction intervals.

Figure 3 shows a comparison of actual against predicted percentages of ticket splitters
in the voting districts of New Zealand along with 95% prediction intervals.9 As can be seen
8A further note on notation: The expression =y xijk refers to the sum of all elements xijk in unit (voting district) k
for which the first index (e.g., the vote for the candidate of party 0 equals the second index (e.g., the vote for the list
of party;'). The expression xijk refers to the sum of all elements in which the first and the second index differ.

9Rather than showing an ordinary scatter plot of predicted against actual percentages, this plot contains a dot plot
of predicted and actual percentages against the individual voting districts (sorted by the predicted percentages) to
make it easier to discern the prediction intervals of cases with similar predicted or actual percentages of ticket

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference


o Predicted percentages
True percentages

-Prediction intervals

Voting district, sorted by predicted percentage of ticket-splitters

Fig. 3 Application of the second-stage maximum entropy approach to split-ticket voting in the 1996

General Election of New Zealand: true and predicted split-ticket percentages with prediction
intervals based on bootstrap percentiles, after R = 2000 replications. The coverage of the tru
percentages is 95.5%.

in this plot, there are three instances in which the actual percentage of ticket splitters lie
outside prediction intervals. The total coverage of the actual ticket-splitting percentages b

these prediction intervals thus is 95.5%, which is close to their nominal level. This con
stitutes a slight overcoverage, but with only 67 voting districts, one can hardly expect to

achieve a coverage exactly at nominal level.
The application to the case of split-ticket voting in the 1996 General Election of New
Zealand is another corroboration of the method developed in the preceding sections
Although, again, the match between the nominal coverage of the prediction interva

and their effective coverage is not perfect, it is close enough to suggest that the methods
are, if not a final solution to the problem, a substantial step in the right direction. However,

the comparison of predicted and actual percentages of ticket splitting as well as the lengt
of the prediction intervals shows that El cannot produce the miracle of delivering pre
dictions from mere aggregates that are comparable in quality to estimates and predictions

obtained from fully observed data. In the New Zealand application, the predictions o

percentages of ticket splitters are in some districts roughly 10 percentage points away from

the actual percentages. Also, the length of the prediction intervals amounts to, on average,

15 percentage points. This makes again clear that the basic dilemma of El should lea

scholars to take greatest care when they interpret results obtained from an El procedure.

6 Discussion

In Sections 3 and 4, we develop a method of accounting for the extra amount of uncertaint

associated with estimates and predictions obtained from an El procedure that stems from

a problem that we exposed in Section 2, the problem of inferential indeterminacy. Without

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

the problem of inferential indeterminacy, it would be possible to model the data-generatin

process of the counts xijk as a multinomial distribution. To model the consequences of
inferential indeterminacy for the uncertainty about cell probabilities, we use the conjuga
family of the multinomial, the family of Dirichlet distributions. To model the consequence
for uncertainty about cell counts, we use the family of mixtures of multinomial Dirichlet
The first to consider Dirichlet-multinomial mixture distributions for El are Brown and
Payne (1986). Their approach consists of an extension of the ecological regression model
(12), which allows the conditional probabilities pj\ik to vary in different units k according

a Dirichlet distribution with mean parameters E(pj\ik) = and precision parameter 60
According to Brown and Payne (1986), both the mean and the dispersion parameters can

be estimated from the marginal tables n.jk and n^, the mean parameters based on gener
alized least squares and the precision parameter based on the variance of the residuals of

the regression of n.jk on (Rosen et al [2001] consider estimation of this model v
MCMC and a simpler model without a Dirichlet mixing distribution estimated via no

linear least squares). Since the precision parameter 00 is estimated from the residuals of

this regression, this model may account for model departures with respect to those aspec
in which model (12) is more restrictive than the Johnston-Pattie model (8). But since th

precision parameter of the model in Brown and Payne (1986) is estimated based on the

observed marginal tables (n,-.*) and (n.Jk), the resulting Dirichlet model of the conditiona
probabilities can hardly account for those model departures that can be detected only if th

full array (jc^) of counts is available. In contrast to the approach of Brown and Payn

(1986), we do not try to estimate the precision parameter 0O from the observable margin
tables but rather determine its value according to an a priori criterion, the Principle of

Maximum Entropy. As our simulation experiments show, the values of the precisio

parameter thus determined can capture much of the uncertainty caused by the possibility

of undetectable model departures. Therefore, we suggest that any method of constructing
prediction intervals for El should lead to intervals at least as large as those based on our
proposed method. Otherwise, they cannot account for those undetectable model departures
that haunt the confidence in results of El.

It seems that our method of accounting for the consequences of modeling indetermi
nacy itself rests on a crucial assumption that a priori, without taking into account the
information contained in the marginal tables, any possible array of counts may occur wit
the same probability. This, however, would be a misunderstanding. The Uniform distri
bution reflects the ignorance about the true cell counts that characterizes the point of
departure of El problems. It plays a role similar to a "flat" or noninformative prior in
Bayesian inference. Using an informative, non-Uniform distribution as a reference distr

bution may increase the risk of biased predictions unless this reference distribution is
sufficiently close to the true distribution of the cell counts. We use a Uniform distribution

as reference specifically to avoid such possible bias. If, however, prior information

available that can be summarized in a distribution of the cell probabilities, a generalization

of the maximum entropy method can be used: One can then, instead of maximizing entropy

select a model that minimizes the directed Kullback-Leibler information divergence, also
known as Kullback-Leibler information criterion, relative to this prior distribution.10

10As Kullback (1959) and Good (1963) have pointed out, the Principle of Maximum Entropy is just a specia
of the Principle of Minimum Discriminating Information: Choosing the distribution with maximal entropy is
equivalent to minimizing the directed Kullback-Leibler information divergence relative to a Uniform distribu
tion. For details see Appendix (Section A) on the Political Analysis Web site.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference


As a method to construct a distribution of the cell counts, one may argue that th
method developed in Section 3 is deficient insofar as it only corrects consequences of
parametric assumptions of El methods that could be bypassed by directly constructing

a distribution of the unknown cell counts xijk that, without any prior parametric assump
tions, maximizes entropy subject to those constraints that reflect the information contained

in the known marginal tables. In principle, such a construction will be straightforward; in
practice, however, finding a solution will be infeasible because of its prohibitive compu

tational complexity.11 The method developed in Section 3 contains some concepts of

Bayesian statistical inference, that is, a probability distribution on the parameters (the cell
probabilities) of a multinomial distribution, which is a member of the conjugate to the

family of multinomial distributions. But it does not make use of Bayes's theorem. The
combination of the Johnston-Pattie model with the second-stage maximum entropy con
struction may be viewed, at best, as an approximation to a posterior distribution of the cel
counts. However, we are not able to show how good an approximation this is except by
means of the simulation study, the results of which we report at the end of Section 3.
Therefore, a direct Bayesian approach that makes use of the complete set of tools of this
technique of statistical inference seems preferable to the approach proposed in this paper.
As can be shown,12 a posterior constructed by the straightforward application of Bayes's
theorem to a noninformative prior distribution of the cell counts would be of surprisingly
simple structure. However, the computation of the posterior probabilities of the counts will
be as computationally demanding as a nonparametric maximum entropy model, making
a direct Bayesian approach as unfeasible as a nonparametric maximum entropy approach
to modeling the distribution of the cell counts.
Finally, we want to emphasize that one should not yield to the temptation of using cell
counts or cell probabilities as "data" in second-stage regression models. Apart from the
conceptual issues involved in fitting models to data predicted from another model and from
possible bias incurred by using such second-stage regressions (Herron and Shotts 2003,
2004), conventionally estimated standard errors of a naive second-stage regression model
will be highly inaccurate. Also, this problem cannot be cured by an increase in the number
of spatial units considered or by the number of cases within the respective spatial units.
Our simulation results reported at the end of Section 3 show how much one will be misled
if one relies on asymptotic theory in these cases. Therefore, instead of treating estimates
obtained from an El as data, one should consider them as parameters of the predictive

distribution for unknown data and use this distribution for generating values for multiple

imputation (e.g., King et al. 2001). Of course, since the amount of unknown data is quite
large relative to the amount of known data, one would need a large number of imputed dat

sets in order to get reliable estimates of quantities of interest and their variances. The
issues connected with such a use of El results clearly deserve some further research, which
is, however, beyond the scope of this paper.

7 Conclusion

The point of departure of our paper is a fundamental dilemma of El and of inference in ill

posed inverse problems in general: Estimates can be identified only if certain restrictive
assumptions are made with respect to the structure of the data-generating process leading

nFor details see Appendix (Section A.4) on the Political Analysis Web site.
12See Appendix (Section C) on the Political Analysis Web site.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms


Martin Elff, Thomas Gschwend, and Ron J. Johnston

to the unknown data one tries to reconstruct. However, these assumptions may not be
satisfied by the unknown data, leading to serious bias in the estimates relative to the tr
values of the data. This may be called the dilemma of fundamental indeterminacy (Cho
and Mansky forthcoming). We tackle this dilemma by proposing a method of delimitin
the error one has to expect when the assumptions needed for the identification of a solut
are violated by the unknown data.
We focus on a special model for El, a version of the model proposed in Johnston and
Pattie (2000), because this model requires only relatively mild assumptions as compared
those required by ecological regression models. We examine the consequences of possib
departures of the model assumptions. We find that prediction intervals based on the mo
assumptions are far too narrow and show a very serious undercoverage of the unknow
data. The main contribution of our paper, however, is the development of a method for t
construction of prediction intervals that are at least approximately correct.

The method we propose consists in combining two stages. In the first stage, poi

estimates for the unknown data are constructed based on a model that contains certain

assumptions inevitable for the identification of the solution. In the second stage,

consider the set of all possible solutions that satisfy or do not satisfy these identificatio

assumptions. We construct a probability distribution that meets the requirement that
expectation is the solution from the first stage but is otherwise as neutral as possible wi
respect to other possible first-stage solutions. This distribution assigns the weights of

possible solutions in terms of values of its density functions as equal as possible, th

is, has maximal entropy, subject to the constraints on the expectation of this distributio
Results from a simulation experiment show that if the population for which the El is to

made is large enough, prediction intervals based on our proposed two-stage method are
approximately correct.
We supplement this simulation experiment with a real-world application: the predic
tion of district-level percentages of ticket splitting in the 1996 General Election of Ne
Zealand based on aggregate data about candidate votes and list votes at the voting-distr
level and an 8 x 8 table obtained from a survey sample. This election is a rare opportunit
to check the performance of an El procedure against real data: Not only are discrete-leve
data available on candidate votes and list votes and a sample for candidate vote-list vote
combinations at the national level but also district-level data on the percentages of straig
ticket and split-ticket votes. Therefore, we are able to compare predicted with actual
percentages of split-ticket votes for each voting district. Within the context of this app
cation, we address another issue that scholars dealing with aggregate data may face: No
always are these aggregate data exact summaries of the population but rather a sample th

summarizes the population. As a solution for this problem, we propose a combination o

our two-stage maximum entropy method with bootstrapping from the empirical distrib
tion of the sample. Our application of this combination to the New Zealand data shows th

it results in prediction intervals with a coverage performance roughly equal to th
nominal level: In 95.5% of the New Zealand voting districts lies the actual percent
of split-ticket votes inside 95% prediction intervals constructed based on our proposed


The main conclusion of our paper thus is that standard errors and confidence intervals
for El problems that do not take into account the fundamental uncertainty associated with

any solution to ill-posed inverse problems may be grossly misleading. However, the
consequences of this fundamental uncertainty can be delimited, if not exactly, though
approximately. Therefore, despite the problems discussed in this paper, to reject altogether
the idea of El on these grounds means throwing out the baby with the bathwater.
This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

Uncertainty in Ecological Inference 91

Abramovitz, Milton, and Irene A. Stegun, eds. 1964. Handbook of mathematical functions with formulas,
graphs, and mathematical tables. Washington, DC: National Bureau of Standards.
Abramson, Paul R., and William Claggett. 1984. Race-related differences in self-reported and validated turnout.

Journal of Politics 46:719-38.
Benoit, Kenneth, Michael Laver, and Daniela Gianetti. 2004. Multiparty split-ticket voting estimation as an
ecological inference problem. In Ecological inference: New methodological strategies, ed. Gary King, Ori
Rosen, and Martin Tanner, 333-50. Cambridge, UK: Cambridge University Press.
Brown, Philip J., and Clive D. Payne. 1986. Aggregate data, ecological regression, and voting transitions.
Journal of the American Statistical Association 81:452-60.

Burden, Barry C, and David C. Kimball. 1998. A new approach to the study of ticket splitting. American
Political Science Review 92:533^14.
Cho, Wendy K. Tarn. 1998. Iff the assumption fits ...: A comment on the King ecological inference solution.

Political Analysis 7:143-63.
Cho, Wendy K. Tarn, and Charles F. Manski. Forthcoming. Cross-level/ecological inference. In Oxford hand
book of political methodology, ed. Janet Box-Steffensmeier, Henry Brady, and David Collier. Oxford, UK:
Oxford University Press.
Cirincione, C, T. A. Darling, and T. G. O'Rourke. 2000. Assessing South Carolina's congressional districting.

Political Geography 19:189-211.
Crowder, Martin J. 1978. Beta-binomial ANOVA for proportions. Applied Statistics 27:34-7.
Cziszar, Imre. 1991. Why least squares and maximum entropy? An axiomatic approach to inference for linear

inverse problems. Annals of Statistics 19:2032-66.
Fienberg, Stephen E., Paul W. Holland, and Yvonne Bishop. 1977. Discrete multivariate analysis: Theory and

practice. Cambridge, MA: MIT Press.
Fienberg, Stephen E., and Christian P. Robert. 2004. Comment to 'Ecological inference for 2 x 2 tables' by Jon

Wakefield. Journal of the Royal Statistical Society: Series A (Statistics in Society) 167:432-4.
Golan, Amos, George Judge, and Jeffrey M. Perloff. 1996. A maximum entropy approach to recovering in
formation from multinomial response data. Journal of the American Statistical Association 91:841-53.
Good, I. J. 1963. Maximum entropy for hypothesis formulation, especially for multidimensional contingency
tables. Annals of Mathematical Statistics 34:911-34.
Goodman, Leo A. 1953. Ecological regressions and the behavior of individuals. American Sociological Review


-. 1959. Some alternatives to ecological correlation. American Journal of Sociology 64:610-25.
Groetsch, Charles W. 1993. Inverse problems in the mathematical sciences. Braunschweig and Wiesbaden:

Gschwend, Thomas, Ron Johnston, and Charles Pattie. 2003. Split-ticket patterns in mixed-member proportional

election systems: Estimates and analyses of their spatial variation at the German federal election, 1998.
British Journal of Political Science 33:109-27.
Herron, Michael C, and Kenneth W. Shotts. 2003. Using ecological inference point estimates as dependent
variables in second-stage linear regressions. Political Analysis 11:44-64.

-. 2004. Logical inconsistency in El-based second-stage regressions. American Journal of Political
Science 48:172-83.
Hoadley, Bruce. 1969. The compound multinomial distribution and Bayesian analysis of categorical data from
finite populations. Journal of the American Statistical Association 64:216-29.
Jaynes, Edwin T. 1957. Information theory and statistical mechanics. Physical Review 106:620-30.

-. 1968. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics 4:227-41.
Johnston, Ron J., and A. M. Hay. 1983. Voter transition probability estimates: An entropy-maximizing approach.

European Journal of Political Research 11:405-22.
Johnston, Ron J., and Charles Pattie. 2000. Ecological inference and entropy-maximizing: An alternative
estimation procedure for split-ticket voting. Political Analysis 8:333-45.

Judge, George G., Douglas J. Miller, and Wendy K. Tarn Cho. 2004. An information theoretic approach to
ecological estimation and inference. In Ecological inference: New methodological strategies, ed. Gary King,

Ori Rosen, and Martin Tanner, 162-87. Cambridge, UK: Cambridge University Press.
King, Gary. 1997. A solution to the ecological inference problem: Reconstructing individual behavior from
aggregate data. Princeton: Princeton University Press.

-. 1998. Unifying political methodology: The likelihood theory of statistical inference. Ann Arbor, MI:
Michigan University Press.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms

92 Martin Elff, Thomas Gschwend, and Ron J. Johnston
King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. Analyzing incomplete political science
data: An alternative algorithm for multiple imputation. American Political Science Review 95:49-69.
King, Gary, Ori Rosen, and Martin A. Tanner. 1999. Binomial-beta hierarchical models for ecological inference.

Sociological Methods and Research 28:61-90.
Kullback, Solomon. 1959. Information theory and statistics. New York: Wiley.
Levine, Stephen, and Nigel S. Roberts. 1997. Surveying the snark: Voting behaviour in the 1996 New Zealand
general election. In From campaign to coalition: New Zealand's first general election under proportional

representation, ed. Jonathan Boston, Stephen Levine, Elizabeth McLeay, and Nigel Roberts, 183-97.
Palmerston North, NZ: Dunmore Press.
Mosimann, James E. 1962. On the compound multinomial distribution, the multivariate P-distribution, and
correlations among proportions. Biometrika 49:65-82.
Openshaw, S., and P. J. Taylor. 1979. A million or so correlation coefficients: Three experiments on the modifi

able areal unit problem. In Statistical methods in the spatial sciences, ed. N. Wrigley, 127-44. London: Pion.

-. 1981. The modifiable areal unit problem. In Quantitative geography: A British view, ed. N. Wrigley and
R. J. Bennett, 60-70. London: Routledge and Kegan Paul.
Prentice, R. L. 1986. Binary regression using an extended beta-binomial distribution, with discussion of corre
lation induced by covariate measurement errors. Journal of the American Statistical Association 81:321-7.

Rosen, Ori, Wenxing Jiang, Gary King, and Martin A. Tanner. 2001. Bayesian and frequentist inference for
ecological inference: The r x c case. Statistica Neerlandica 55:134-56.
Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379

423, 623-56.

Skellam, J. G. 1948. A probability distribution derived from the binomial distribution by regarding the prob

ability as variable between the sets of trials. Journal of the Royal Statistical Society. Series B (Methodolog

ical) 10:257-61.
Steel, David G., Eric J. Beh, and Ray L. Chambers. 2004. The information in aggregate data. In Ecological
inference: New methodological strategies, ed. Gary King, Ori Rosen, and Martin Tanner, 51-68. Cambridge,

UK: Cambridge University Press.
Uffink, Jos. 1995. Can the maximum entropy principle be explained as a consistency requirement? Studies in
History and Philosophy of Modern Physics 26B:223-61.
Vardi, Y., and D. Lee. 1993. From image deblurring to optimal investments: Maximum likelihood solutions
for positive linear inverse problems. Journal of the Royal Statistical Society. Series B (Methodological) 55:


Vasicek, Oldrich Alfonso. 1980. A conditional law of large numbers. Annals of Probability 8:142-7.
Wakefield, Jon. 2004. Ecological inference for 2 x 2 tables. Journal of the Royal Statistical Society: Series A
(Statistics in Society) 167:385-426.

This content downloaded from on Wed, 21 Sep 2016 01:54:41 UTC
All use subject to http://about.jstor.org/terms