(note: this piece can get rather technical; I originally intended for it to be released after a series explaining how polling worked. If it’s confusing, I suggest waiting for the series; it’ll build up an understanding of polling and how it does/doesn’t work so hopefully this piece will be easier to understand)
When dealing with polls, you’ll often hear advice to look at an average of all polls, instead of simply relying on a single poll or pollster. Polling averages often outperform individual polls because they effectively combine the samples of multiple polls, reducing the effects of outlier samples/methods and producing an overall lower error than any single pollster could hope to achieve on their own (without prohibitively large and hence expensive sample sizes).
As a result, polling averages have significantly lower sampling error 1 – i.e. random error which is unavoidable due to the fact that pollsters (and by extension poll aggregators) only interview a random sample of the voting population. You can think of sampling error as being similar to what happens when you flip a fair coin, say, 2000 times – while on average you would expect to see 1000 heads, getting between 956 and 1044 heads in that 2000-flip sample (i.e. a proportion of 47.8 – 52.2% heads where the "true" proportion is 50%) shouldn't surprise you due to the random chance inherent in the process. Averaging polls is very similar to increasing the number of coin flips; while the theoretical margin of error on the proportion of heads in 2000 coin-flips is ± 2.2%, the margin of error on 4000 coin-flips is just ± 1.6%.

[Figure: Theoretical average error of a polling average where the sample proportion = 0.5 and sample size = 2000 in each poll]
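These coin-flip figures can be reproduced from the standard error formula for a sample proportion, using two standard errors as the margin (a quick Python sketch; the modelling later in this piece is in R):

```python
import math

def margin_of_error(p, n, z=2.0):
    """Margin of error (in percentage points) on a sample proportion,
    taken as roughly two standard errors, matching the figures above."""
    return z * math.sqrt(p * (1 - p) / n) * 100

print(round(margin_of_error(0.5, 2000), 1))  # 2.2
print(round(margin_of_error(0.5, 4000), 1))  # 1.6
```

Doubling the effective sample size only shrinks the margin by a factor of √2, which is why averaging a handful of polls helps but cannot eliminate random error entirely.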
However, for a polling average to perform effectively, a key assumption is that the polls in the average have to be relatively uncorrelated – i.e. whether Essential polls overestimate Labor should give you little information about whether Morgan polls are going to overestimate Labor as well. If this assumption is violated, then polling averages are less useful (though still somewhat better than individual polls), as polling errors do not "cancel out" – instead of having overestimates of Labor's vote cancelled out by underestimates, the pollsters' overestimates of Labor's vote end up being averaged into a polling average which also overestimates Labor's vote.

Sampling bias
In real life, this assumption (as with many other assumptions) does not always hold. Unlike with coin flips, when dealing with human respondents, pollsters have to both find and get people to respond to their surveys (whereas coins don’t have the option of flipping you off instead), meaning that the methods used to do so can produce samples of respondents unrepresentative of the broader population. I’m not going to go through them all, but some fairly well-documented ones:
- Live interviews may be more prone to social-desirability bias, where respondents are likely to over-report things viewed as socially desirable (e.g. donating to charity) and under-report things viewed as socially undesirable (e.g. drug use).
- Phone interviews tend to sample a different subset of the population depending on the method and time of day. For example, calls at 10 am on a weekday are more likely to garner responses from non-working people (e.g. housewives, retirees), while robocalls are likely to reach older people who don't have some form of call-blocking enabled.
- Online polling is likely to under-sample sections of the population who are less tech-savvy (as well as areas that don't have reliable Internet access, e.g. rural areas).
To minimise these issues, pollsters are often able to weight their samples by certain characteristics to improve their representativeness of the general population – e.g. weighting by age to reduce a bias towards older respondents. However, weighting is not a panacea for sampling bias – in some cases, the sample is unrepresentative of the population in a way which was not accounted for by weighting (e.g. failure to weight by education in the 2016 US Presidential election), or, worse, in a way which cannot be accounted for by weighting.
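As an illustration of how weighting works, here's a hypothetical sketch (in Python; the age groups, population shares and vote figures are made up for illustration):

```python
# Hypothetical sample: older respondents are over-represented and lean Coalition.
sample = {          # group: (respondents, Coalition 2pp among them)
    "65+":      (600, 55.0),
    "under 65": (400, 49.0),
}
population_share = {"65+": 0.30, "under 65": 0.70}  # assumed population shares

total = sum(n for n, _ in sample.values())
unweighted = sum(n * v for n, v in sample.values()) / total
weighted = sum(population_share[g] * v for g, (_, v) in sample.items())

print(round(unweighted, 1))  # 52.6 – skewed by the excess of older respondents
print(round(weighted, 1))    # 50.8 after weighting to the population age mix
```

Weighting only helps when the characteristic driving the skew is known and measured – which is exactly the limitation described above.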
Examples of bias which cannot be accounted for by weighting include a hypothesis that people with low social trust were both less likely to respond to polls and more likely to vote Republican in 2020; if people who don't respond to polls have a significantly different opinion on an issue as compared to survey respondents, and the key characteristic driving said difference is not something the pollster can account for, there can be a significant sampling bias which affects all pollsters in a similar fashion. Note that both factors are required for there to be sampling bias; for example, if 800 men and 400 women respond to a poll, but both men and women intend to vote 51-49 for the Coalition, then there will not be any sampling bias in the results even though the sample was not representative of the population. On the other hand, if men intend to vote 53-47 for the Coalition while women intend to vote 51-49 for Labor, then failing to weight (or inability to weight) for gender would produce a 0.7% bias to the Coalition in that poll.

Pollster herding
Another potential source of correlation is what is known as pollster herding, which refers to when pollsters produce results which are much closer to each other than we should expect by chance. Pollster herding is not necessarily data fraud – there are many decisions which go into the modelling and weighting of data where there is no clear answer, 2 and many assumptions which are legitimate given the available data. However, there are many ways to pick a different set of assumptions when your initial model comes out with a result which seems implausible or wildly out of line with other polls, and doing so results in polls which are regularly closer to each other than should happen by chance.
Pollster herding was a significant concern in both the 2016 and 2019 Australian federal elections due to how under-dispersed the polls were, although only the latter produced a systematic polling error. In 2019, the reported 2pp varied within just a 1% range (51% to 52% for Labor), which is very unlikely to occur with random sampling. 3 Under-dispersal (i.e. the polls being much closer to each other, or less dispersed, than expected from random sampling) makes polls more correlated with each other, which reduces the accuracy of polling averages.

How these sources of error relate to the 2019 polling failure
In 2019, the polls missed on the 2pp by a significantly larger margin than usual (I estimate the average error on individual polls since 1983 is about 2%; the 2019 misses were between 2.5% and 3.5%), especially when one takes into account the various improvements in polling (for example, the 4 – 5.5% misses by Morgan in 2004 would have been smaller if they had used last-election preference flow modelling, as all pollsters now do). More crucially, unlike in past elections where polls tended to err in random directions (e.g. in 2013, Morgan, Nielsen and Newspoll overestimated the Coalition while ReachTEL, YouGov and Essential underestimated them, producing a highly accurate average overall), the polls all overestimated Labor in 2019 by nearly the same amount (2.5 – 3.5%), which is highly unlikely through sampling error alone. 4
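As an aside on the last-election preference flow modelling mentioned above: it estimates the 2pp by distributing each minor party's primary vote to the majors at the rates observed at the previous election, rather than asking respondents directly. A minimal sketch, with made-up primaries and flow rates (none of these figures are from an actual election):

```python
# Hypothetical primary votes (%) and assumed preference flows to the Coalition.
primaries = {"Coalition": 39.0, "Labor": 34.0, "Greens": 12.0, "Others": 15.0}
flow_to_coalition = {"Greens": 0.18, "Others": 0.60}  # illustrative flow rates

# Coalition 2pp = Coalition primary + minor-party primaries times their flow rates.
coalition_2pp = primaries["Coalition"] + sum(
    primaries[party] * flow for party, flow in flow_to_coalition.items()
)
print(round(coalition_2pp, 1))  # 50.2 Coalition under these assumed flows
```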
Instead, it is very likely that either the samples acquired by the pollsters were all skewed towards Labor (sampling bias), and/or that pollsters suppressed the release of/adjusted the models for polls which showed the Coalition ahead (pollster herding). Both hypotheses have evidence for them – the inquiry into the 2019 polls by AMSRO suggests that failure to weight by education might be responsible for about 23% of the error on the margin between the two major parties (which if true would have reduced the error by about 1.2% in 2019), while there have been reports of pollsters shoving polls showing the Coalition ahead in the drawer for fear of being the embarrassing outlier.
To test both hypotheses, I've been building and refining a model of pollster herding which includes a variety of factors such as sample bias, models of changing voting intention throughout a campaign and, of course, pollster herding. I've uploaded it here in R for those who want to look into how it works (warning: the code is very unpolished and I haven't had the time to insert comments yet), but broadly speaking:

- With both models, there's a time series model (i.e. it "starts" at a certain number of weeks before the election, then changes the vote towards the final result every week according to a randomly generated model). The "true value" in each simulation always reaches the pre-set actual vote margin by the final week (so no last-minute errors).
- With the unherded model, it basically generates a random sample bias for each simulation, then uses sampling error to generate a random set of polls from the "sample".
- With the herded model, it does everything the unherded model does, but a few other factors also come in. Firstly, outliers have some chance of being junked by the "pollster" (though this chance reduces to 0 by election day, so I don't have simulations with no final polls). Of the remainder, most are adjusted such that they are no longer outliers (e.g. if the poll showed Coalition 51%, the most recent polling average showed Labor 52%, and an outlier was defined as being more than 2% out from the average, the "pollster" would adjust the result to 50-50), though a small number of outliers get released anyway.
- What counts as an outlier changes as the campaign goes on – taking my cues from Nate Silver's graph of deviation from polling average versus time to election, the model starts out defining a deviation of 2% or more from the last polling average as an outlier (so if the average is 52% Coalition, anything >54% Coalition or with Labor ahead would be an outlier), then uses a logistic function to reduce this limit to 1% by election week (with the reduction more rapid the closer to the election it is).
This model is already quite generous given the extreme level of herding seen in the 2019 election – if I were to accurately model that degree of under-dispersion, the outlier limits would have to be halved in both cases.
With all polls, I’ve followed our Australian pollsters’ conventions and rounded to the nearest 0.5% before “release”. Polling averages and other statistics are calculated from these rounded outputs instead of the raw data to more closely simulate how the polls shift in Australian elections; this may result in some rather odd-looking distributions in our graphs.
Testing hypotheses of the 2019 polling failure

Hypothesis 1: Random chance; no sample bias or pollster herding
This is a fairly simple check I perform to see how likely it is that our pollsters got hit by “bad luck” – i.e. what’s the chance that we would have gotten polling errors like the ones we saw if the polls were unbiased and there was no herding?
Data generated under the assumption of sample size = 2500. If the sample size was higher, the chance of getting polls this far out from the true result would be even lower.
So it’s fairly unlikely. More importantly, it’s even less likely that we would have gotten a 5-poll average as wrong as it was in 2019 if the samples were unbiased and there was no herding:
Furthermore, it’s fairly unlikely we would have gotten polls as under-dispersed as they did if there was no herding:
More specifically, there’s a 2.9% chance we would have gotten a standard deviation less than what we found in our sample (0.374) if the polls weren’t herding. Also, note the irregular shape of the distribution – in my modelling I rounded all poll outputs to the nearest 0.5% as our pollsters do.
With the possibility of random sampling error being a cause out of the way, let’s look at our other two hypotheses:
Hypothesis 2: Samples were biased to Labor, no herding
Given the final error in the polls was about 3% to the Coalition, one possible cause could be that the samples used in constructing the polls were systematically biased to Labor by about 3% throughout the campaign.
To simulate this, I’ve set the sampling bias to be 3% to Labor (-3 in my model), with a further 0 set as the initial value – the model “starts” at 10 weeks prior to the election, and at 10 weeks before the election Labor was polling at about 53-47. If I don’t set the initial value to be equal to 0 (no side is favoured), the model will basically assume that the “average” initial value is 51.5 to Labor instead (final result is 48.5 Coalition, 3% bias to Labor means Labor up 51.5), which would not match up with the actual events in 2019; setting it to zero means the initial value (or 2pp in polling, 10 weeks out from election) would be Labor 53-47.
A quick caveat: this is a simulation done knowing the results – we won’t know what the sample bias will be for future elections (if we did we would basically know the results of the election). These simulations are solely for the purpose of seeing which models best explain the pattern of errors seen at the 2019 Australian federal election, knowing what we already do about what happened.
With these settings in place, we can see that a sample bias can explain the errors seen in the 2019 Australian federal election fairly well:
Sample bias also explains the error in an average of the polls:
However, sample bias only explains why the polls were as far off as they were, not why they were so closely clustered together by the end of the campaign. This is reflected in a graph of the standard deviation in the polling sample:
Hence, it’s pretty unlikely that the pattern of polling errors we saw in 2019 was solely due to an unrepresentative sample. Even if, say, weighting by education would have fixed the systematic bias to Labor (which it’s not clear it would fully do so – going off the estimates by AMSRO the polls would still have come out around 50.5 to 51 to Labor), the pattern of highly clustered polling is something we should be concerned about. Although highly clustered polls won’t always fail like they did in 2019 (see 2016 for an example of similarly under-dispersed polls which performed very well), when they do they will systematically err in the same direction, making polling averages useless and damaging public confidence in opinion polling as a means to gauge public opinion.
Hypothesis 3: No systematic sample bias, polls herd according to model
To simulate this, we first set our sample bias to the default (i.e. no net sample bias in either direction), then we change a few things to match how polling shifted over the course of the 2019 campaign. Firstly, we set the initial value, or the average 2pp at the start of the campaign, to 53% Labor (-3 in the model) – in other words, we assume that the polls were broadly correct as of March 2019, when they reported Labor up around 53-47. Note that we don't actually have to make this assumption per se; I just do it because it's a convenient way to make the model output polls around 53% Labor for the "start" of the campaign. Pollster herding refers to pollsters producing results which are abnormally in line with other polls, not with the true result (which is unknown ahead of the election) – even if the actual share of people intending to vote/preference Labor was about e.g. 48%, the patterns of pollster herding would be roughly similar with the same set of initial polls.
Secondly, we set the time series model – the model of how voting intention shifted throughout the campaign – to type 4, which basically produces less change at the start and more change in the final weeks, though the underlying "true" value always reaches Coalition 51.53% by the end of the election. We did this as it seems there was more swing to the Coalition at the end of the campaign, although this choice usually doesn't have a large impact in the model compared to the linear time series model.
In other words, we broadly assume that the polls were fairly accurate as of two months out, and that the pattern of the shift in voting intention picked up in the polls is broadly correct, though too small in magnitude. Doing so, we can see that pollster herding also serves to explain the polling error in 2019:
Although the herded distribution is slightly less representative of the polls we got (in particular it’s slightly closer to the “true” result than the sample-biased distribution), it still does a pretty good job of explaining the polling error. Ditto with the herded polling average:
Note how much flatter, or spread out, the distribution of herded polling averages is compared to the non-herded averages. As noted above, herding means that polling errors don’t cancel out; this means that when the polls are significantly “off” (e.g. when the polls start out around 54-46 Labor) all the polls will be off in the same direction and by similar magnitudes.
Most importantly, pollster herding also explains the under-dispersal in the polls seen in 2019:
Note how much “tighter” the distribution of possible deviations from polling average is for herded polling.
Hence, it appears very likely that pollster herding was the largest source of the polling error seen at the 2019 federal election. Based on our simulations, the primary driver appears to have been Labor's early polling strength after Scott Morrison replaced Malcolm Turnbull, which resulted in pollsters herding towards a Labor win. In contrast, sampling bias does not explain the severe degree of under-dispersal seen in the polls – a fact which we actually understate here, considering we only looked at the under-dispersal of the final polls prior to the election and not the high degree of under-dispersal throughout the campaign.
From the data above, it seems that sampling bias played a relatively small role in the 2019 polling failure once herding has been accounted for. The average polling average in the herding simulations was 50.9% to Labor, suggesting that, if our model of herding is broadly correct, the various forms of sampling bias (e.g. failure to weight for education) would have had an effect of at most about 0.5% on the 2pp. Interestingly, this lines up rather well with the estimated effect of failing to weight by education at the 2019 election: if we use AMSRO's estimate of a 23% reduction in the error on the margin between Labor and the Coalition from weighting by education, the error is reduced from 5.2% to 4%, or about a 1.2% reduction in the (Coalition primary – Labor primary) statistic. This would translate into an approximately 0.6% reduction in the 2pp error, which is very close to the figure we derived here, suggesting that pollster herding and not sampling bias was primarily to blame for the 2019 Australian polling error.
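The back-of-the-envelope behind that comparison (the halving reflects that a 1-point move in the 2pp corresponds to roughly a 2-point move in the margin between the parties):

```python
margin_error = 5.2                    # error on the (Coalition primary - Labor primary) margin
reduced = margin_error * (1 - 0.23)   # AMSRO estimate: education weighting cuts it ~23%

print(round(reduced, 1))                       # 4.0
print(round(margin_error - reduced, 1))        # 1.2 point reduction on the margin
print(round((margin_error - reduced) / 2, 1))  # ~0.6 point reduction on the 2pp
```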
Edit: I’ve written an addendum where I analyse what the impacts would be if pollsters had weighted by education but continued to herd in 2019; that’s up