One of the main purposes of statistics is to help inform
decisions. Cricket statistics are often used when deciding on the selection of
players, or (more often) in arguments about who is the best at a particular
aspect of the game. They can help decide which strategies are best, what an
equivalent score is in a reduced match (a particular case being the
Duckworth-Lewis-Stern method) or which teams should automatically qualify for
the World Cup (David Kendix's ranking system). They are also often used by
bookmakers (both the reputable, legal variety and the more dubious underworld
version) to set odds on who is going to win.

I decided to attempt to build a model to calculate the probability
of each team winning, based on their previous form. This would (hopefully)
allow me to predict the probability of each outcome of the World Cup by
simulation. It didn't prove to be as easy as I had hoped.

My first thought was to look at each team's net run rate in
each match, adjust for home advantage, and then average it out. That seemed
sensible, and the first attempt at doing that looked like it would be perfect.
Most teams (all except Zimbabwe) had roughly symmetrical net run rates, and
they fitted a normal curve really well. The only problem was that Afghanistan
was miles ahead of everyone else: having mostly played lower-quality opponents
in the past four years, they had recorded far more convincing wins than anyone
else.
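For reference, net run rate itself is a standard calculation: run rate scored minus run rate conceded. A minimal sketch (the function name is mine, not from the original analysis):

```python
def net_run_rate(runs_scored, overs_faced, runs_conceded, overs_bowled):
    """Net run rate: runs scored per over minus runs conceded per over.
    (For an all-out innings the full allotted overs are counted, a detail
    omitted here for simplicity.)"""
    return runs_scored / overs_faced - runs_conceded / overs_bowled

# A team scoring 300 in 50 overs and conceding 250 in 50 has an NRR of +1.0.
```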

This was clearly a problem. India and England both had
negative average net run rates, while Afghanistan, Bangladesh and West Indies
were all expected to win most of their matches.

I then tried a different approach, modelled on David Kendix's
method of using each result to adjust a ranking. But rather than basing the
ranking on wins, I based it on net run rate. So if one team had an expected
net run rate of 0.5, and their opponent had an expected net run rate of 0.6,
the first team would have an expected net run rate of -0.1 for the match (the
difference between the two). If they did better than that, their rating went
up, and if they did worse than that, it went down.
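The update rule described above can be sketched as follows; the function names and the sensitivity constant `k` are my own illustration, not the actual code used:

```python
def expected_nrr(rating_a, rating_b):
    """Expected net run rate for team A against team B is simply the
    difference between their ratings."""
    return rating_a - rating_b

def update_ratings(rating_a, rating_b, actual_nrr_a, k=0.1):
    """Nudge both ratings toward the observed result. The sensitivity k
    is the tuning knob: set it too high and a single blowout swings the
    rating wildly, which is exactly the problem described below."""
    error = actual_nrr_a - expected_nrr(rating_a, rating_b)
    return rating_a + k * error, rating_b - k * error

# Ratings of 0.5 and 0.6 give the first team an expected NRR of -0.1,
# as in the example above.
```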

However, I found that some results ended up having too much
bearing. If I made the rating sensitive enough to respond to changes in form,
it swung far too much on a single big win or loss. England dropped by almost a
whole run per over on the back of one series in the West Indies. So this was
clearly not a good option.

Next, I decided to try logistic regression, and see
how that turned out. Logistic regression is a way of estimating the
probability of an event when there are only two possible outcomes. To satisfy
that condition, I removed every tie or no-result match, and set to work
building the models.
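At the heart of a logistic model is the logistic (sigmoid) function, which maps any real-valued score onto a probability between 0 and 1. A sketch of the model form, with illustrative (not fitted) coefficients:

```python
import math

def win_probability(team_coef, opp_coef, home_coef=0.0):
    """Logistic model: P(win) = 1 / (1 + e^-x), where x combines the
    team's coefficient, the opposition's coefficient and home advantage."""
    x = team_coef - opp_coef + home_coef
    return 1.0 / (1.0 + math.exp(-x))

# Evenly matched teams on neutral ground come out at exactly 50%.
```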

My initial results were exciting. Using just the team, the
opposition and home/away status, I was able to predict the results of the
previous three World Cups quite accurately using the data from the preceding
four years. (I could not go back further than that, as earlier tournaments
included teams making their ODI debut, and there was accordingly no data to
use to build the model.)

The results were really pleasing. I graphed them here, grouped
to the nearest 0.2 (i.e. the point at 0.6 represents all matches where the
model gave the team between a 0.5 and 0.7 chance of winning), compared to the
actual results for those matches. The model seems to slightly overstate the
chance of an upset (possibly because upsets are more common outside World Cups,
where players tend to be rested against smaller nations), but overall the
probabilities were fairly reliable, and (most importantly) the team that the
model predicted would win generally won.
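The grouping to the nearest 0.2 can be sketched like this (a hypothetical helper, not the actual code behind the graph):

```python
def calibration_bins(predicted, actual, width=0.2):
    """Snap each predicted win probability to the nearest multiple of
    `width` and return the observed win rate in each bin, so predictions
    can be compared against what actually happened."""
    bins = {}
    for p, won in zip(predicted, actual):
        # Round to the grid, then round again to tidy floating-point noise.
        centre = round(round(p / width) * width, 10)
        bins.setdefault(centre, []).append(won)
    return {c: sum(v) / len(v) for c, v in sorted(bins.items())}
```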

I could then use this to give a ranking of each team that
directly related to their likelihood of winning against each other. The model
expressed everything relative to Afghanistan, whose coefficient was set to 0;
any number above 0 measured how much more likely a team was to win against the
same opponent than Afghanistan. (Afghanistan was the reference simply because
they were first in the alphabet.)

This turned out to be fairly close to the ICC rankings, which
was encouraging.

I tried adding a number of things to the model (ground
types, continents, interactions, weighting the more recent matches more
highly), but the added complexity did not result in better predictions when I
tested them, so I stuck with a fairly simple model, really only controlling
for home advantage.

Next I applied the probabilities to every match and found
the probabilities of each team making the semi-finals.

The next step was to then extend the simulation past the
group stage, and find the winner.
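A Monte Carlo simulation of the group stage might look like the sketch below; `win_prob(a, b)` is assumed to return the modelled probability of `a` beating `b`, and tie-breaks between teams on equal wins are ignored for simplicity:

```python
import random

def simulate_semifinalists(teams, win_prob, n_sims=10_000, seed=None):
    """Play every group-stage pairing n_sims times and tally how often
    each team finishes in the top four (the semi-final places)."""
    rng = random.Random(seed)
    semis = {t: 0 for t in teams}
    for _ in range(n_sims):
        wins = {t: 0 for t in teams}
        for i, a in enumerate(teams):
            for b in teams[i + 1:]:
                if rng.random() < win_prob(a, b):
                    wins[a] += 1
                else:
                    wins[b] += 1
        for t in sorted(teams, key=wins.get, reverse=True)[:4]:
            semis[t] += 1
    return {t: semis[t] / n_sims for t in teams}
```

Extending this past the group stage just means sampling each knockout match with the same `win_prob` and carrying the winners forward.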

After running through the simulation a few more times, I
came out with this:

A couple of points to remember here: every simulation is an
estimate. The model will almost certainly estimate the individual
probabilities imperfectly, but they should be close enough to give a good
estimate of the actual final probabilities. The model is also likely to
overstate Bangladesh's ability, due to their incredible home record; overstate
Pakistan's ability, because many of their nominally neutral matches in the UAE
carried a degree of home advantage; and understate West Indies, who have often
not fielded their best players in the past four years. But none of these is
likely to make a massive difference to the semi-finalist predictions.

Given this, I'd suggest that if you want to bet on the winner
of the World Cup, these are the odds that I would consider fair for each team:
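For reference, fair decimal odds are simply the reciprocal of the win probability: the payout multiplier at which a bet breaks even in expectation.

```python
def fair_decimal_odds(win_probability):
    """Fair decimal odds: the stake multiplier at which a bet has zero
    expected value, i.e. 1 divided by the win probability."""
    return 1.0 / win_probability

# A team with a 25% chance of winning is fair value at decimal odds of 4.0.
```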

I will try to update these probabilities periodically throughout the World Cup, and report on their accuracy.