If you haven’t had a look at our Western Australian 2021 forecast yet, it’s up and running here.

Forecasts are useful little devils – they help to translate certain pieces of evidence (e.g. a poll saying Labor is ahead 68-32) into useful information (e.g. that means that given historical accuracy, Labor has a greater than 99% chance of winning, with X % probability of winning seat Y).

But like devils, they can sometimes lead us astray. Like all models, Meridiem is built off certain assumptions which may or may not hold. Hence, it’s important to regularly check how our forecasts perform on actual predictions instead of simply ensuring they backtest well.

Having said that, I personally don’t agree with most of the common ways of judging a forecast. A lot of times, you’ll see discussion about how many outcomes a forecast “got right” (e.g. the gushing over Nate Silver for getting all 50 states “right” in 2012), or, more recently, discussion about how far “off” the forecasts were. By the logic of the first, for a given forecast which estimates that a hundred candidates each have a 51% chance of winning, if all 100 candidates win, then the forecast “got all 100 right” even though there’s clearly something wrong with a forecast which keeps predicting a 51% chance for events that happen at a ~ 100% rate. Of course, in an election forecast, all 100 candidates’ elections could be correlated – e.g. maybe all 100 candidates were from the Liberal Party and the Liberals outperform their polls everywhere.

However, that shouldn’t keep happening over many, uncorrelated elections. If it does, it suggests that the forecast is either really under-confident (maybe the probability it should be giving to such candidates is 99% instead of 51%) or it keeps stuffing up its expected value in some way (e.g. the forecast systematically underestimates Liberals, over a long-term basis). By the logic of the second, if a forecast of Labor’s vote share claims to have a margin of error of ± 1% and ends up being wrong by 1.5%, that would be a better result than a forecast which claims to have a margin of error of ± 4% and ends up being wrong by 3%. For those who aren’t sure why this is a problem: if a model keeps promising a high level of accuracy but fails to deliver, it’s very likely that some underlying assumption in the model is wrong.
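To make this concrete: a claimed margin of error is a promise about coverage, and over many elections you can check it directly – a ±MOE that claims to cover 95% of outcomes should contain the actual result about 95% of the time. A minimal sketch with invented numbers (the forecasts and results below are purely illustrative):

```python
# Coverage check: does a claimed 95% margin of error actually contain
# the result ~95% of the time? (All numbers here are made up.)

def coverage(forecasts, results, moe):
    """Fraction of elections where the result fell within forecast ± moe."""
    hits = sum(abs(f - r) <= moe for f, r in zip(forecasts, results))
    return hits / len(forecasts)

# Hypothetical forecast vote shares vs actual results (percentage points)
forecasts = [52.0, 48.5, 51.0, 53.5, 47.0, 50.5, 49.0, 55.0]
results   = [53.4, 47.1, 52.6, 51.8, 48.9, 50.1, 47.2, 56.3]

# A model claiming ±1% covers far fewer than 95% of these outcomes...
print(coverage(forecasts, results, 1.0))
# ...while a more cautious ±4% claim delivers on its promise.
print(coverage(forecasts, results, 4.0))
```

Note that the ±1% model's raw errors here are all under 2%, yet it still breaks its coverage promise – which is exactly why "smaller claimed margin of error" is not the same thing as "better forecast".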

Even if the forecast outperforms its competitors for one election, it’s very likely that it will mess up somewhere in the future.

For example, maybe the forecast noticed a 2% skew to Labor in historical polls, and once corrected for that, the average error on polls is just 0.5% (the margin of error is roughly twice the average error). If a forecaster simply used these statistics to build a forecast without adjustment, they would ignore the possibility of e.g. pollsters changing their methods to correct the skew, changes in the electorate which change how polls skew, simple regression to the mean, or even the possibility that the period they built their forecast off was an anomaly.

As a result, their model would be really confident (the average error on polls they backtested on is just ± 0.5%!), it *might* be more accurate than its competitors (because there might really be some factor which skews polls to Labor), and it will also get much bigger errors than its model says “should” happen. Furthermore, once pollsters and/or the electorate inevitably change, their model will explode when compared to more cautious forecasts which incorporate the possibility of change.

Finally, such forecasts will output overconfident probabilities – for example a Labor candidate with a “true” probability of winning of 80% might be estimated to be a 99% favourite instead.

So how do we intend to judge a forecast? Well, to do so, we first need to familiarise ourselves with what a forecast primarily outputs (and no, I’m not talking about probabilities).

## The value the devils bring to the table

While pretty much everyone focuses on the probabilities output by an election forecast (one of the reasons why we placed them as far down on the page as possible), probabilities are like the diamonds found in dead volcanoes: they’re what everyone’s after but usually aren’t the primary outputs of the process. Most election forecasts produce two main things of value:

**Expected value**: This is basically what we would expect a certain party’s vote share to be like given a massive sample of elections with the same data leading up to them. For example, if we were given lots of elections where the Coalition was ahead 52% 2pp in the final polling average, what would we expect the average 2pp result to look like? What about if we were given a large sample of state elections where the state government was of opposite party to the federal government?

Expected value is really important – if e.g. Labor keeps winning 51% when your forecast expects them to get 48%, your forecast will be skewed against Labor (and you’re almost certainly going to be less accurate). However, people do tend to focus a little too much on expected value and not enough on the other product of value…

**Variance**: Also known as uncertainty, precision, or confidence. This is how confident we are that the expected value above is the actual outcome; usually you hear this cited in the media as “margin of error”. To produce accurate forecasts, a good model needs to be able to recognise when there is a high degree of variance (e.g. when there are few polls, or when there’s a lot of time till the election [hence more possibility of a last-minute shift]) and when there’s less (e.g. when we have a lot of polls which don’t appear herded, or when it’s election day).
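In practice, both quantities fall straight out of a forecast’s simulations: the expected value is just the average simulated vote share, and the uncertainty is its spread. A minimal sketch (the simulated 2pp values here are invented for illustration, not Meridiem output):

```python
# Expected value and uncertainty from a set of simulations.
import statistics

# Hypothetical simulated Labor 2pp results (%) from a forecast's simulations
sims = [52.1, 49.8, 53.4, 51.0, 50.2, 54.1, 48.9, 52.6, 51.7, 50.9]

expected_value = statistics.mean(sims)   # the centre of the forecast
spread = statistics.stdev(sims)          # how uncertain the forecast is
margin_of_error = 1.96 * spread          # ~95% interval, assuming normality

print(f"Expected 2pp: {expected_value:.1f}% ± {margin_of_error:.1f}%")
```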

Getting the variance right is really important – it’s the sort of thing which can help your forecast predict that a candidate who’s ahead by just 1% is actually favoured to win 90% of the time, whereas a candidate who’s ahead by 4–5% is only tipped to win 70% of the time. (The Electoral College played a role in both cases, of course, but given the forecast estimated the possibility of an Electoral College/popular vote split at 10%, this suggests that its popular vote model would at most estimate that both candidates had an 80% chance of winning, even with different polling leads.)

Both affect what the probability estimate is; shifting the expected value towards the Liberals can increase their chance of winning, but so can increasing the variance in an election where Labor is ahead. The probabilities output by our forecast are more of a side-effect of asking the model, for example, “what is the chance of Labor winning more than 50% of the 2cp in Kalgoorlie?”. Probabilities can be easier to understand than vote share estimates (e.g. “Labor has a 65% chance of winning Kalgoorlie” versus “Labor is expected to win 52.2% of the two-party-preferred in Kalgoorlie, ± 10%”), even if people do tend to misinterpret probable outcomes as sure things (thinking, for example, that a 90% forecast means that the event is certain to occur).

(Vote share estimates also come with a few other complications – for example, the standard margin of error only covers 95% of all outcomes, and it’s usually calculated assuming no skew in the distribution of outcomes. That basically means it assumes that any outliers are equally likely to overshoot as undershoot the vote; usually this is a workable assumption, but it can break in a landslide election or with minor party vote shares.)
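To illustrate the “side-effect” point: given an expected value and an uncertainty estimate, the win probability is simply the share of the distribution that sits above 50%. A sketch assuming a normal distribution and treating the ± 10% from the Kalgoorlie example as a 95% interval (both assumptions are mine for illustration, not necessarily how Meridiem computes it):

```python
import math

def win_probability(mean_2pp, moe_95):
    """P(2pp > 50%) for a normal distribution with the given mean and
    95% margin of error (moe_95 is treated as 1.96 standard deviations)."""
    sd = moe_95 / 1.96
    z = (50.0 - mean_2pp) / sd
    # P(X > 50) = 1 - Phi(z), with the normal CDF via the error function
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical Kalgoorlie figures from the text: 52.2% expected, ± 10%
print(f"{win_probability(52.2, 10.0):.0%}")
```

This lands in the vicinity of the ~65% quoted in the text; nudging the expected value up, or shrinking the margin of error, both push the probability higher.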

Understanding that forecasts, at their core, produce an expected value and an estimate of uncertainty (variance) around that expected value can help with understanding the more complex forms of probabilistic forecast scoring we discuss below. For example, if our forecast scores badly, was it because our estimate of the e.g. Labor vote was off, or did we get the Labor vote right but underestimate how much each district would deviate from the statewide swing? These are important questions in modelling – incorrect expected values are usually harder to reliably correct in election forecasting (especially if they’re mostly based off a polling average – pollsters usually try to correct for their historical errors) whereas over/under-confidence is easier to correct (but harder to detect, excepting massive over/under-confidence).

Breaking forecasts into their expected value and uncertainty also helps you to think about whether the underlying issue is expected value or uncertainty (especially when you hear about some of the model feuds going on in political forecasting); while for both events the FiveThirtyEight forecast estimated a higher probability of a Trump victory than its competitors, this discussion between Ryan Grim and Nate Silver is pretty clearly a disagreement over expected value whereas this discussion between Elliot Morris and Nate Silver is more of a disagreement about estimated variance.

(There are also discussions about between-election correlations, but I don’t intend to get into those as they have very little impact on our forecast right now.)

With that out of the way, let’s get into how we intend to score our forecasts!

## How many diamonds can the devil pull out of the volcano? Absolute performance of our forecast

A brief introduction to forecast scoring rules is in order. Firstly, the rules we use here are what is known as **proper**, meaning that the best score comes from reporting the true probability for an event (i.e. you will lose points if you report 90% or 70% for an event which occurs 80% of the time). Secondly, we’ve normalised (scaled and shifted) the rules we use so that the highest score a forecast can achieve is the same under both rules, and so that both rules give the same score to a forecast of 50%.

**Log scoring rule**: This score basically takes all the events that happened (e.g. Labor winning in Rockingham), and scores them by the probability the forecast gave them ahead of time, using a logarithm (more specifically, the natural logarithm of the probability). This produces a curve which gives much more negative scores the lower the probability you estimated for that event.
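As a rough sketch, here is one normalised log score consistent with the description above: the natural log rescaled so a 100% forecast for an event that happens scores +1 and a 50% forecast scores 0 (equivalent to 1 + log₂ p). The exact constants in Meridiem’s implementation may differ.

```python
import math

def log_score(p_event):
    """Normalised log score for an event that happened, given the
    probability p_event the forecast assigned to it beforehand.
    The normalisation (1 + log2 p) is one choice consistent with the
    text: +1 for a 100% forecast, 0 for a 50% forecast."""
    return 1 + math.log2(p_event)

# Scores fall away faster and faster as the forecast probability shrinks
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"forecast {p:>5.0%} -> score {log_score(p):+.2f}")
```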

Log-scoring has several advantages; it only scores based on events which have happened (something known as locality) and, more colloquially, it corresponds better to how evidence and odds work. Broadly speaking, it should take more evidence to move from e.g. 4% to 1% than it should take to move from 50% to 47% even though the change in probability is 3% in both cases; in the first case you’re going from saying something is a 1-in-25 event to a 1-in-100 event whereas in the second you’re going from saying something is a 1-in-2 to a 1-in-2.13 event.

I find it can be helpful to talk about this in terms of coin tosses. Let’s say you want to prove that a supposedly fair coin is actually biased to land heads 6 in 10 times, instead of 1 in 2. How many coin tosses do you need? There’s actual statistics to calculate this, but I think most people would be fairly happy to say that you probably need a hundred coin-tosses to be sure that it’s not 1 in 2, and maybe a couple of thousand to be sure that the coin is biased to give 6 heads for every 10 tosses instead of say 5.8 heads.
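Those intuitions line up with a textbook power calculation (normal approximation to the binomial, 5% significance and roughly 80% power – standard statistical defaults, not anything specific to this post):

```python
import math

def flips_needed(p0, p1, z_alpha=1.96, z_beta=0.84):
    """Approximate number of coin flips needed to distinguish a bias of
    p1 from p0 (5% significance, ~80% power, normal approximation)."""
    d = abs(p1 - p0)
    se0 = math.sqrt(p0 * (1 - p0))  # spread under the null bias
    se1 = math.sqrt(p1 * (1 - p1))  # spread under the alternative bias
    return math.ceil(((z_alpha * se0 + z_beta * se1) / d) ** 2)

print(flips_needed(0.5, 0.6))    # on the order of a couple of hundred flips
print(flips_needed(0.58, 0.6))   # on the order of several thousand flips
```

Distinguishing 6-in-10 from 5-in-2... sorry, from 1-in-2 takes a couple of hundred flips; distinguishing 6-in-10 from 5.8-in-10 takes thousands, because the gap to detect is five times smaller and the required sample grows with the square of that.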

What about if you wanted to show that the coin is actually biased to *always* give heads? In that case, you have to keep flipping the coin forever, as there’s no way to conclusively prove that you will never get a single tails; even with trillions of heads, all you can say is that the probability of a tails is less than one in x trillion. Conversely, a single tails will break your hypothesis; not only do you need more evidence to demonstrate your hypothesis, less evidence is also required to disprove it (contrast with the 6-in-10 bias hypothesis – for a sample of 1000 flips, anywhere between about 570 and 630 heads would be roughly consistent with a “true” 6-in-10 bias). Log-scoring captures this well by noting that an event happening which was estimated to be a 1-in-100 event is a lot stronger evidence against the model than one which was estimated to be a 1-in-50 event.

(of course, 1-in-100 events do happen, but with a sufficiently large sample size of events, the log-score will be able to penalise models which estimate a 1-in-50 probability for 1-in-100 events)

Hence, log-scoring basically measures how unexpected, or surprising, a given set of events is to a forecast, and penalises forecasts which find a lot of events surprising (a forecast which fails to anticipate a lot of events is not a very good forecast).

However, log-scoring has one issue when it comes to being implemented on our forecast: the nature of logarithms means that they output negative infinity when used on 0 (or in other words: the log-score basically notes that your model found this so surprising that the score freezes and explodes, like the worst form of Medusa known to humanity). Our model doesn’t technically assign any outcome a probability of 0%, but since it only runs 50 000 simulations, very-low-probability outcomes will be unlikely to show up in those 50 000 simulations (and the probability of those that do won’t be estimated very well). This shouldn’t be an issue (by definition, very low probability outcomes don’t happen very often), but it is something to note in the event that one does occur.

**Brier score**: This score basically measures the distance between the forecast probability and the “ideal” probability we would want from a perfect forecast. In an ideal world where we had perfect information on how everyone was going to vote, a perfect forecast would simply issue probabilities of 100% for every candidate which was going to win, and probabilities of 0% for every candidate which was going to lose. Hence, the Brier score takes all the probabilities we assess for candidates, regardless of whether they win or lose, and compares them to whether the candidate won (in which case the perfect forecast is 100%) or lost (in which case the perfect forecast is 0%).

For example, let’s say that we estimate Labor has a 6 in 10 chance of winning Kalgoorlie, while we estimate the Liberals have a 3 in 10 chance and the Nationals 1 in 10. If Labor does win Kalgoorlie, the Brier score for that forecast is:

Brier = (1 − 0.6)² + (0 − 0.3)² + (0 − 0.1)²

= 0.4² + (−0.3)² + (−0.1)²

= 0.16 + 0.09 + 0.01

= 0.26
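The arithmetic above can be checked with a few lines (this is the raw multi-category Brier score, before the normalisation mentioned earlier):

```python
def brier(probs, outcomes):
    """Multi-category Brier score: squared distance between the forecast
    probabilities and the observed outcome (1 for the winner, 0 for the
    rest). Lower is better; this is the raw, un-normalised version."""
    return sum((o - p) ** 2 for p, o in zip(probs, outcomes))

# Kalgoorlie example from the text: Labor 0.6, Liberal 0.3, National 0.1,
# and Labor wins.
print(round(brier([0.6, 0.3, 0.1], [1, 0, 0]), 2))  # 0.26
```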

Similar to the log-score, the Brier score penalises highly confident but incorrect forecasts harshly. Unlike the log-score, the Brier score violates locality – in other words, it scores forecasts based on events which did not happen. Additionally, because it is bounded (it can only go from -3 to +1, whereas the log-score goes from -infinity to +1), it can sometimes give odd results.

For example, let’s say we had two events which did happen, and two forecasts. Forecast A says that the probability of Event 1 happening is 0% (not “we don’t have the sample size to estimate this”, an actual 0% forecast) and the probability of Event 2 happening is 70%, while Forecast B says that the probability of each happening is 25%. Even though Forecast A is pretty clearly “wrong” – something which it said should *never happen* did happen! – the normalised Brier score gives Forecast A a score of -1.18 and Forecast B a score of -1.25.

(Although most models use continuous distributions which in theory produce a probability above 0% for any outcome, the assumptions used in modelling can change that. For example, the beta distribution – commonly used in modelling proportions, making it theoretically appropriate for modelling vote shares – can only go from 0 to 100% (or 0.0 to 1.0). A modeller might (reasonably) assume that the Labor and Liberal primary vote will not go below 10% (in seats without a National running) or above 90%, and scale their distribution appropriately. Such a model would assign no probability at all to, for example, the Labor candidate in Warringah only winning 6.6% of the first-preference vote in 2019.)

In contrast, log-scoring would note that Forecast A has been MEDUSA-ed (**M**assive **E**rror **D**etected, resulting in **U**nbounded **S**coring **A**nomaly), and (correctly, in my opinion) assign Forecast A a negative-infinity score which it cannot recover from by being a little better on its other forecasts.
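The post doesn’t spell out the normalisation formulas, but one scaling consistent with the numbers quoted above (both rules score +1 for a 100% forecast that comes true and 0 for a 50% forecast, with the Brier score bounded below at −3) reproduces the -1.18 / -1.25 scores and the log score’s MEDUSA behaviour:

```python
import math

def norm_brier(p_event):
    """Normalised Brier score for an event that happened:
    1 - 4*(1 - p)^2, giving +1 at p=1, 0 at p=0.5 and -3 at p=0.
    (A scaling consistent with the scores quoted in the text.)"""
    return 1 - 4 * (1 - p_event) ** 2

def norm_log(p_event):
    """Normalised log score: 1 + log2(p), also +1 at p=1 and 0 at p=0.5,
    but unbounded below - a 0% forecast for an event that happens
    scores negative infinity."""
    return 1 + math.log2(p_event) if p_event > 0 else -math.inf

# Both events happened. Forecast A said 0% and 70%; Forecast B said 25% each.
a_brier = (norm_brier(0.0) + norm_brier(0.7)) / 2
b_brier = (norm_brier(0.25) + norm_brier(0.25)) / 2
print(round(a_brier, 2), round(b_brier, 2))  # -1.18 -1.25
print((norm_log(0.0) + norm_log(0.7)) / 2)   # -inf: Forecast A is MEDUSA-ed
```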

At the same time, the Brier score doesn’t explode when attempting to score events which didn’t occur in our simulations (i.e. something which has a probability > 0 but which is basically impossible for us to estimate due to sample size). Additionally, for multi-category forecasts (for example, instead of “Will Labor or the Liberals/Nationals win Kalgoorlie?”, a multi-category forecast would ask “Will Labor, or the Liberals, or the Nationals, or the Greens, or One Nation, or another party/independent win Kalgoorlie?”), the Brier score produces a different result which captures more information. A forecast of Labor 60%, Liberal 30% and National 10% is different from a forecast of Labor 60%, Liberal 40%, and the Brier score does take that into account.

For both scoring rules, we’ll be releasing a total of four scores (you can calculate them yourself from our final simulations package) calculated from the results. The first set will be calculated from **which party wins the 2pp** in each seat (Labor, or the Liberals/Nationals), while the second will be calculated from **which party wins the seat**. The first score is a better measure of how well our vote modelling does (since 2pp is always calculated between Labor and the Liberals/Nationals), whereas the second is a better measure of our seat modelling (basically how well our vote predictions translate into probabilities for each party in each seat) and our model for handling minor parties.

### Other factors we intend to judge our forecast on

In addition to scoring our probabilistic forecast, we also intend to take a multi-faceted look at individual parts of our model. These parts aren’t easily captured in the forecast scoring, and include but are not limited to:

**Our fundamentals model**: We’ve brought this up in the methodology piece, but broadly speaking, there’s two ways of weighting the fundamentals we use in our model. One is much more theoretically appropriate, but the second performs a little better in backtesting. We picked the first, for a variety of reasons including the small sample size we tested on (n = 9) and the fact that the second performed worse in recent elections; but I will be testing both to see how the model performs.

**Our electorate lean model**: There might be a case to be made that the 2017 WA state election should be weighted more heavily in our lean calculation, partly because of the reduced accuracy of attempting to redistribute votes from two elections back, and partly because 2017 is much more similar to how the current election is shaping up (a landslide Labor win) than 2013 (which was a landslide Liberal win). There could be seats or areas which don’t usually swing much, but which shift massively when a large swing is on.

**Our electorate elasticity model**: In particular, I do think that a case can be made that the elasticity model isn’t appropriate in this election, due to how massive Labor’s leads are (see above about the possibility of different voter behaviour in a landslide environment). I’ll discuss this a little more in the next section, but I intend to compare the elastic swing model to a uniform swing to see which model better predicted the outcomes in each seat (on a 2pp basis).

**Our candidate effects model**: This is fairly simple – we just compare how candidates with a certain effect (e.g. “incumbent”) perform relative to candidates without said effect, and compare that to how our model estimated them to perform. The most interesting one to look at will probably be Geraldton; Liberal/National switchers are fairly rare, so this will provide valuable information for future forecasts.

**Our final-two candidates model**: Since we don’t have the computing resources to redistribute votes from every candidate in every district in every simulation, our model takes a shortcut by modelling the probability of each candidate getting into the final two, and simulates final-two pairings partly based on that. This is a very important thing to be able to model; getting the final-two pairing wrong can completely stuff up a model’s predictions, and thus we’ll be comparing our model’s predictions for final-two candidates to the actual final two candidates in each electorate.

If we get seat polling, we’ll also have a closer look at how our district correlation model performs.

## Clash of the Titans: Relative performance

If you’ve been reading along, you might notice a snag – all of the above metrics only spit out a number, without explaining what that number means in terms of forecast performance. While we can probably guess that, say, a positive Brier score is good, a score of e.g. 0.76 doesn’t really tell us whether our model needs improvement or not. Hence, to truly understand how well our model is performing, we need a baseline to compare it against. Usually, this baseline is what is known as an **unskilled forecast**, or broadly speaking a forecast which is a simple guess based on historical data (FiveThirtyEight). In elections, this usually takes the form of assuming every candidate has an equal shot at winning (for example assuming that the Democrat and the Republican each have a 50% chance of winning).

Personally, I don’t think the unskilled forecast is a very useful baseline to compare against in electoral politics, not to mention the fact that it doesn’t work well in a system with two major parties but also some minor parties which do win seats. For example, what is a true unskilled forecast for a seat where Labor, the Liberals, the Nationals, the Greens, and One Nation all field a candidate?

Giving all five candidates 20% is very clearly not in line with historical results, but assuming that the probability is split 50-50 between Labor and the Liberals/Nationals is also not in line with historical data – the Greens and One Nation have won seats at both the state and federal level before.

The unskilled forecast may be a little more appropriate for comparisons to scores calculated from which party wins the two-party-preferred (2pp) vote, however. For the average seat, assuming that Labor and the Liberals/Nationals have a 50-50 chance each of winning the 2pp there (even if they don’t win the seat due to minor parties/independents) is a reasonable assumption. Furthermore, in most elections, there’s usually a large number of seats which are very safe for some party, making it very easy to outperform a 50-50 across the board prediction. It might be useful on an overall electoral outcome basis (usually, Labor and the Coalition will have a roughly even chance at forming government over a lot of elections), but we don’t have many of those and thus it would be hard to judge a forecast on that basis (not to mention the possibility that a forecast might correctly predict the overall winner, but completely bungle the seat-by-seat forecast).

So, what we’ve done instead is build a much simpler forecast to compare Meridiem to. (You can download and run it yourself here) Unlike Meridiem, this forecast (which we’re terming Basic) takes the polling average at face value, and simply assumes that historical patterns in polling (i.e. polling skews, polling accuracy) will remain constant. It does adjust the polling accuracy for changes in the vote share of parties in this election versus previous elections.

The margin of error on a party which is polling at 15%, for example, is nearly three times that of a party which is polling at 5%. To model seat results, it runs lots of simulations of the statewide vote, then applies a uniform swing in each party’s primary vote with some random deviation in each seat. It then assumes that the top-two candidates in each electorate will end up in the final two-candidate-preferred, and uses the 2pp swing for classic (Labor vs Liberal/National) matchups while using some simple methods of determining a “winner” in non-classic matchups (detailed in the comments in our code). For the purposes of comparison with our Meridiem forecast, it also calculates the 2pp estimate for each district using a uniform swing, with a standard deviation (the spread around the expected value) of ± 4% (the average difference between each district’s swing and a uniform statewide swing in historical WA state elections). In other words, Basic is pretty much as close as you can get to a pure polls-only, what-does-historical-precedent-say model of the WA state election.
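As a toy sketch of the flavour of simulation Basic runs – this is not the actual Basic code (which works off primary votes, models pollster error more carefully, and handles non-classic matchups), just a 2pp-only version with invented seats and numbers:

```python
import random

random.seed(1)

# Hypothetical Labor 2pp margins by seat at the last election (%)
seat_2pp = {"Seat A": 55.0, "Seat B": 51.0, "Seat C": 46.0}
last_statewide_2pp = 52.0

poll_average = 58.0   # hypothetical Labor statewide 2pp polling average
poll_error_sd = 2.0   # assumed historical accuracy of final polling averages
seat_swing_sd = 4.0   # average deviation of seat swings from uniform swing

n_sims = 50_000
wins = {seat: 0 for seat in seat_2pp}

for _ in range(n_sims):
    # Draw a statewide result around the polling average...
    statewide = random.gauss(poll_average, poll_error_sd)
    swing = statewide - last_statewide_2pp
    for seat, last in seat_2pp.items():
        # ...then apply a uniform swing plus per-seat random deviation.
        result = last + swing + random.gauss(0, seat_swing_sd)
        if result > 50:
            wins[seat] += 1

for seat, w in wins.items():
    print(f"{seat}: Labor win probability {w / n_sims:.0%}")
```

Even this toy version shows the two levers discussed earlier: shifting `poll_average` moves the expected value, while widening `poll_error_sd` or `seat_swing_sd` increases the variance, and both change the win probabilities that fall out.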

When the final results are in, we’ll be comparing the performance of the Basic forecast on our two scoring rules (log-scoring and Brier) to Meridiem, as well as analysing how certain components of Meridiem perform. For the latter, most of the components we’re interested in are things which aren’t implemented in Basic (district elasticity, fundamentals, candidate effects) and which mostly affect the expected vote share. Hence, for those we’ll simply be comparing the expected value of both models (e.g. comparing the actual swing to the swing predicted by the elasticity model and that of a uniform swing), or, in the case of the candidate effects model, comparing how candidates with certain attributes fare compared to candidates without said attributes (e.g. comparing the swing in Labor’s vote in electorates with a Labor incumbent to electorates with no incumbent).