Friday, 31 May 2019

Preview - World Cup group match 2 - West Indies vs Pakistan

Today's match is at Trent Bridge, Nottingham.

If any ground has taken over the mantle of "most batting-friendly ground in the world" from the Antigua Recreation Ground in St John's, it's Trent Bridge. The groundsman seems to have taken WG Grace's famous statement "they came to see me bat, not you bowl" to the next level. The pitch seems to have been designed to make batting as easy as possible.

As a result the par score here is quite high.

Score 290 here, and you're on the wrong side of recent history. In order to have a 75% chance of winning after batting first, your team needs to score 349.

If anywhere is going to see 500 achieved, it is likely to be either Nottingham or Southampton (which also has a bowler-hating groundsman).

The regression model that I used in the previous article gives Pakistan a 73% chance of coming out on top in this match. However, the West Indies have been looking better recently than they were a couple of years ago, and Pakistan (conversely) have been looking like they're at a low ebb. As a result, this match feels more like it could go either way.

Pakistan have a habit of lifting their game significantly when they get momentum, and, as a result, have had a very good record recently against all the teams who are not currently ranked in the top five. The West Indies will need to start well to avoid Pakistan getting on a roll. 

This match is an important fixture for both sides, as a loss here will mean that the losing team will need to beat at least two of the top five ranked teams if they are to progress to the semi-finals.

Thursday, 30 May 2019

A simulation to see who will win the World Cup


One of the main purposes of statistics is to help inform decisions. Cricket statistics are often used when deciding on the selection of players, or (more often) in arguments about who is the best at a particular aspect of the game. They can help decide which strategies are best, what an equivalent score is in a reduced match (the particular domain of the Duckworth-Lewis-Stern method) or which teams should automatically qualify for the World Cup (David Kendix's rating system). They are also often used by bookmakers (both the reputable, legal variety and the more dubious underworld version) to set odds on who is going to win.

I decided to attempt to build a model to calculate the probability of each team winning, based on their previous form. This was going to allow me (hopefully) to predict the probabilities of each outcome of the world cup, by using a simulation. It didn’t prove to be as easy as I had hoped.

My first thought was to look at each team’s net run rate in each match, adjust for home advantage, and then average it out. That seemed sensible, and the first attempt at doing that looked like it would be perfect. Most teams (all except Zimbabwe) had roughly symmetrical net run rates, and they fitted a normal curve really well. The only problem was that Afghanistan was miles ahead of everyone else. The fact that they had mostly played lower quality opponents in the past 4 years meant that they had recorded a lot more convincing wins than anyone else.

This was clearly a problem. India and England both had negative net run rates, while Afghanistan, Bangladesh and West Indies were all expected to win most of their matches.

I then tried a different approach, based on David Kendix's method of using each result to adjust a ranking. But rather than basing the ranking on wins, I based it on net run rate. So if a team had an expected net run rate of 0.5, and their opponent had an expected net run rate of 0.6, the first team would have an expected net run rate of -0.1 for that match. If they did better than that, their rating went up, and if they did worse, it went down.
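As a sketch, that update rule can be written in a few lines. The constant K and all the numbers below are illustrative only, not values used for the actual rankings.

```python
# A sketch of the ranking update described above: ratings are expected
# net run rates, and each result moves both teams toward what actually
# happened. The constant K and all numbers are illustrative only.
K = 0.1  # sensitivity: how far one result moves a rating

def update_ratings(rating_a, rating_b, actual_nrr_a):
    """Adjust two ratings after a match; actual_nrr_a is team A's
    net run rate for that match (team B's is the negative of it)."""
    expected_nrr_a = rating_a - rating_b      # e.g. 0.5 - 0.6 = -0.1
    surprise = actual_nrr_a - expected_nrr_a  # did better -> positive
    return rating_a + K * surprise, rating_b - K * surprise

# A team rated 0.5 beats a team rated 0.6 by 0.4 runs per over:
new_a, new_b = update_ratings(0.5, 0.6, 0.4)
```

Tuning K is the trade-off described below: a large K tracks form quickly, but lurches too far on a single result.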

However, I found that some results ended up carrying too much weight. If I made the ratings sensitive to new results, they changed far too much on the back of a single big loss or win: England dropped almost a whole run per over of expected net run rate based on one series in the West Indies. So this was clearly not a good option.

Next, I decided to try logistic regression, and see how that turned out. Logistic regression is a way of estimating the probability of an event that has only two possible outcomes. To meet that requirement, I removed every tie or no-result match, and set to work building the models.
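As an illustration of the model structure (team, opposition and home/away as predictors), here is a from-scratch logistic regression on a few invented results. The teams, match outcomes and training scheme are all made up; a real analysis would use a statistics package rather than hand-rolled gradient ascent.

```python
import math

# A from-scratch sketch of a logistic regression with team, opposition
# and home advantage as predictors. Teams, results and the training
# scheme are all invented.
teams = ["Afghanistan", "Australia", "England"]

def features(team, opposition, home):
    # +1 for the team, -1 for the opposition, plus a home indicator
    x = [0.0] * len(teams) + [1.0 if home else 0.0]
    x[teams.index(team)] += 1.0
    x[teams.index(opposition)] -= 1.0
    return x

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))            # logistic function

# Invented results: (team, opposition, team_at_home, team_won)
matches = [
    ("England", "Afghanistan", True, 1),
    ("England", "Australia", False, 0),
    ("Australia", "Afghanistan", True, 1),
    ("Afghanistan", "England", True, 0),
]

# Plain stochastic gradient ascent on the log-likelihood
w = [0.0] * (len(teams) + 1)
for _ in range(2000):
    for team, opp, home, won in matches:
        x = features(team, opp, home)
        err = won - predict(w, x)
        w = [wi + 0.1 * err * xi for wi, xi in zip(w, x)]
```

With this +1/-1 coding only differences between team coefficients matter, which is why a model like this can report everything relative to one reference team's coefficient of 0.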

My initial results were exciting. By just using the team, opposition and home/away status, I was able to predict the results of the previous three World Cups quite accurately using the data from the preceding 4 years. (I could not go back further than that, as earlier tournaments included teams making their ODI debut, and there was accordingly no data to use to build the model.)

The results were really pleasing. I graphed them here, grouped to the nearest 0.2 (ie the point at 0.6 represents all matches that the model gave between 0.5 and 0.7 as the chance for a team to win), compared to the actual result for that match. It seems that they slightly overstate the chance of an upset (possibly due to upsets being more common outside world cups, where players tend to be rested against smaller nations), but overall they were fairly reliable, and (most importantly) the team that the model predicted would win, generally won.
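The grouping-to-the-nearest-0.2 check can be sketched like this, with invented predictions and outcomes standing in for the real model's output:

```python
# Binning predicted probabilities to the nearest 0.2 and comparing each
# bin's predictions with the actual win fraction. The predictions and
# outcomes below are invented, standing in for the real model's output.
predictions = [0.55, 0.62, 0.58, 0.91, 0.88, 0.31, 0.28, 0.70]
outcomes    = [1,    1,    0,    1,    1,    0,    1,    1]

bins = {}
for p, won in zip(predictions, outcomes):
    centre = round(p * 5) / 5            # nearest 0.2: 0.0, 0.2, ..., 1.0
    bins.setdefault(centre, []).append(won)

# Average actual result in each probability bin
calibration = {c: sum(v) / len(v) for c, v in sorted(bins.items())}
```

A well-calibrated model has each bin's actual win fraction close to the bin's centre.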

I could then use this to give a ranking of each team that directly related to their likelihood of winning against each other. The model gave everything in relation to Afghanistan, with Afghanistan set at 0, and any number higher than 0 indicating how much more likely a team was to win against the same opponent than Afghanistan was. (Afghanistan was the reference simply because they come first alphabetically.)







This turns out to be fairly close to the ICC rankings. So that was encouraging.

I tried adding a number of things to the model (ground types, continents, interactions, weighting the more recent matches more highly) but the added complexity did not result in better predictions when I tested them, so I stuck to a fairly simple model, only really controlling for home advantage.

Next I applied the probabilities to every match and found the probabilities of each team making the semi-finals.
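A Monte Carlo version of that group-stage simulation might look like the sketch below. The four teams, the strength numbers and the top-two-qualify rule are invented stand-ins for the real ten-team group and the regression model's probabilities.

```python
import random

# Monte Carlo sketch of a group-stage simulation. Teams, strengths and
# the qualification rule are invented stand-ins for the real tournament.
random.seed(1)
teams = ["India", "England", "Australia", "Bangladesh"]
strength = {"India": 1.2, "England": 1.1, "Australia": 1.0, "Bangladesh": 0.4}

def win_prob(a, b):
    # placeholder for the logistic model's output
    return 1 / (1 + 10 ** (strength[b] - strength[a]))

runs = 10000
semi_counts = {t: 0 for t in teams}
for _ in range(runs):
    points = {t: 0 for t in teams}
    for i, a in enumerate(teams):                 # single round robin
        for b in teams[i + 1:]:
            winner = a if random.random() < win_prob(a, b) else b
            points[winner] += 2
    # rank by points, breaking ties randomly; top two qualify
    table = sorted(teams, key=lambda t: (points[t], random.random()),
                   reverse=True)
    for t in table[:2]:
        semi_counts[t] += 1

semi_prob = {t: semi_counts[t] / runs for t in teams}
```

Counting how often each team finishes in a qualifying position over many simulated tournaments gives the semi-final probabilities directly.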


The next step was to then extend the simulation past the group stage, and find the winner.

After running through the simulation a few more times, I came out with this:


A couple of points to remember here: every simulation is an estimate. The model is almost certainly going to estimate the probabilities imperfectly, but it will get them close enough to give a good estimate of the actual final probabilities. It is also likely to overstate Bangladesh's ability, due to their incredible home record; overstate Pakistan's ability, because in many of their nominally neutral matches in the UAE they have had a degree of home advantage; and understate the West Indies, who have not played their best players in many matches in the past 4 years. But these are not likely to make a massive difference to the semi-finalist predictions.



Given this, I’d suggest that if you are wanting to bet on the winner of the world cup, these are the odds that I would consider fair for each team:


I will try to update these probabilities periodically throughout the world cup, and report on their accuracy.
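For anyone wanting to reproduce the conversion from probabilities to odds, fair decimal odds are just the reciprocal of the win probability. The probabilities below are invented examples, not the article's figures:

```python
# Fair decimal odds (stake included) are the reciprocal of the win
# probability. These probabilities are invented examples.
win_probability = {"England": 0.25, "India": 0.22, "Bangladesh": 0.02}
fair_odds = {team: round(1 / p, 2) for team, p in win_probability.items()}
# A team with a 25% chance is fairly priced at decimal odds of 4.0
```

Any bookmaker price longer than the fair odds represents value, at least according to the model.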

Saturday, 23 March 2019

A new way to look at bowling economy rates for the IPL

Sunrisers Hyderabad had made a great start, but their innings had started to plateau. At 161/7 off 18 overs they had the opportunity to get a score of 190+, or, if things went really poorly, 175. Andre Russell was running in to bowl...

He bowled a very good over, removing Brathwaite and then conceding only 7 from his final 5 balls. All thoughts of a big finish were gone.

A week later, the Sunrisers were in the qualification final, and things were not going well. After 8 overs they were on 54/4, going at less than 7 an over, and at serious risk of scoring less than 100.

Dwayne Bravo was the bowler this time. He bowled a wide, then a couple of deliveries that Yusuf Pathan managed to hit for 2 each, and ended with a couple of easy singles. It was an over where almost no pressure was put onto the batsmen. And yet, it only went for 7 runs, the same as Andre Russell's excellent over a week earlier.

There's something wrong with any statistic that rates those overs as being of the same value to the team, and yet that's exactly what the traditional Economy Rate does. 7 runs is 7 runs.

Wednesday, 2 January 2019

Paine vs Pant

The Instagram photo. 
I wanted to quickly share my thoughts about the Paine - Pant sledge.

I've been an outspoken critic of "mental disintegration" -- the tactic of using personal abuse and insults to get under a player's skin and put them off their game -- but I really liked what I heard from Paine, and I think it's the sort of sledging that is totally appropriate.

Saturday, 24 November 2018

Historical statistical preview of the Second Test, Pakistan vs NZ

I've decided to put together a short summary of some of the historical trends at Dubai, before this match.

First, the probability of different results based on first innings scores. This suggests that a score of 300 is roughly the point where a team is more likely to win than lose, while the 50% winning score is roughly 370.


Thursday, 22 November 2018

Can we determine a batsman's ability based on how he gets out?

I saw an interesting discussion online recently, suggesting that we could tell that some players had a better technique than others, based on how often they got out to different types of dismissals.

The theory was that players who get bowled or lbw have technical issues, while players who get out caught more often don't have those same issues.

This immediately struck me as a multivariate statistics problem. Can we tell how good a batsman is based on the proportions of dismissals?

So I gathered together a sample of 160 players, all of whom had been dismissed in the 5 most common ways at least once each, and looked at what we could tell based off those players.

I grouped them based on batting average into 5 roughly equally sized groups. The groups were: under 27.5, 27.5 to 37.5, 37.5 to 43.5, 43.5 to 48.5 and over 48.5.

Once I filtered out players who hadn't played enough innings, I ended up with 29 from group 1, 30 from group 2, 28 from group 3, 36 from group 4 and 37 from group 5.

Their distributions were as follows:

There are some differences between the groups, but there seems to be more variation within the groups than between them.

I also looked at the raw numbers, without grouping, and added in trend lines.

A pattern emerges - batsmen who have a very low average or a very high average tend to get bowled and run out more often than players who have an average between 20 and 50. Those middle players instead tend to get out caught more often.

This made me wonder if I could find some technique to group them effectively. I wasn't hopeful, because again the variation within the different groups seemed to be greater than the differences between them, other than from the very edges to the middle.

The methods that I chose to try were Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Random Forest (categorical) and Random Forest (regression then rounding). I'm aware that most readers won't have studied multivariate statistics, so I'll briefly explain how these methods work. If you don't care how they work, click here to skip.

Linear and Quadratic Discriminant Analysis (LDA and QDA) can be imagined as plotting all the information on a giant multi-dimensional graph, then rotating the axes until the different groups are separated as much as possible. The linear version assumes that there are straight lines that separate the different groups, while the quadratic version allows the groups to be separated by a curve. The quadratic version is a more powerful technique, but it needs more data to be able to get an answer.

An example of how LDA works is in the two graphs below. There's a set of data that has two variables and is in two groups, and can't be easily distinguished by splitting any variable alone (displayed on the graph on the left). By making a decision based just on Variable 1, the best split gets 20 out of the 30 points classified correctly, and the best split on Variable 2 gets 22 out of the 30 points classified correctly. But if we put in a set of axes that are rotated, (the green lines) and redraw the graph (on the right) the two groups are able to be split quite well by being greater than or less than -0.45 on the new rotated x axis. By splitting on the rotated axis, 27 out of the 30 points are now classified correctly.
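That rotated-axis idea can also be shown numerically. The sketch below finds Fisher's discriminant direction for two invented groups of two-variable points, then classifies by a single threshold on the projected axis:

```python
# Fisher's linear discriminant on two invented groups of 2D points:
# find the direction (the "rotated axis") that best separates the
# groups, then classify with one threshold on that axis.
group0 = [(1.0, 2.0), (1.5, 2.4), (2.0, 3.1), (2.5, 3.4), (3.0, 4.2)]
group1 = [(1.5, 1.0), (2.0, 1.6), (2.5, 2.1), (3.0, 2.7), (3.5, 3.2)]

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

m0, m1 = mean(group0), mean(group1)

# Within-class scatter matrix (2x2), summed over both groups
s = [[0.0, 0.0], [0.0, 0.0]]
for pts, m in ((group0, m0), (group1, m1)):
    for x, y in pts:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy

# Fisher's direction: w = S^-1 (m0 - m1), via the 2x2 inverse
det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
diff = (m0[0] - m1[0], m0[1] - m1[1])
w = ((s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
     (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det)

def project(p):
    return w[0] * p[0] + w[1] * p[1]

# Threshold halfway between the projected class means
threshold = (project(m0) + project(m1)) / 2
correct = sum(project(p) > threshold for p in group0) + \
          sum(project(p) <= threshold for p in group1)
```

On this toy data the projected axis separates the two groups perfectly, even though neither original variable does on its own.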


Random Forest is a technique that creates lots of decision trees, each built on a small sample of the data and using only some of the variables, then averages them out, giving the trees that performed best a higher weighting. The process itself is reasonably simple, and it's often remarkably effective for making predictions, but it can be hard to explain how the final model actually works.
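To make the averaging idea concrete, here is a toy "forest of one-split trees" on invented data, where one variable carries the signal and the other is pure noise. Real random forests grow much deeper trees, but the bootstrap-sample-plus-random-variable structure is the same:

```python
import random

# A toy "forest" of one-split trees (stumps): each stump is fitted to a
# bootstrap sample using one randomly chosen variable, and the forest
# averages their votes. Variable 0 carries the signal; variable 1 is
# pure noise. All data here is invented.
random.seed(0)
data = [((i / 20, random.random()), int(i / 20 > 0.5)) for i in range(20)]

def fit_stump(sample):
    var = random.randrange(2)                   # random variable choice
    best = None
    for threshold in [i / 10 for i in range(1, 10)]:
        left = [label for (f, label) in sample if f[var] <= threshold]
        right = [label for (f, label) in sample if f[var] > threshold]
        if not left or not right:
            continue
        l_lbl = round(sum(left) / len(left))    # majority label each side
        r_lbl = round(sum(right) / len(right))
        acc = (sum(1 for v in left if v == l_lbl) +
               sum(1 for v in right if v == r_lbl)) / len(sample)
        if best is None or acc > best[0]:
            best = (acc, var, threshold, l_lbl, r_lbl)
    return best

forest = [fit_stump([random.choice(data) for _ in data]) for _ in range(200)]
forest = [stump for stump in forest if stump is not None]

def predict(features):
    votes = [(l if features[var] <= t else r)
             for (_, var, t, l, r) in forest]
    return 1 if sum(votes) / len(votes) > 0.5 else 0
```

The stumps that happened to pick the noise variable vote essentially at random, but averaging across the whole forest lets the informative stumps dominate.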

To test the four methods, I first ran leave-one-out cross validation to see how they performed. This is where I use all but one data point to build the model, then test that model on the remaining data point and see whether it is allocated correctly - I do this for all 160 batsmen.
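Leave-one-out cross validation itself is simple to write down. In this sketch a 1-nearest-neighbour classifier stands in for the four real methods, and the data points are invented:

```python
# Leave-one-out cross validation: build the model on all but one point,
# test on the held-out point, and repeat for every point. A simple
# 1-nearest-neighbour classifier stands in for the real methods, and
# the data points are invented.
data = [((1.0, 2.0), "A"), ((1.2, 1.9), "A"), ((1.1, 2.2), "A"),
        ((3.0, 0.5), "B"), ((3.2, 0.7), "B"), ((2.9, 0.6), "B")]

def nearest_label(point, training):
    # label of the closest training point (squared Euclidean distance)
    return min(training,
               key=lambda d: (d[0][0] - point[0]) ** 2 +
                             (d[0][1] - point[1]) ** 2)[1]

hits = 0
for i in range(len(data)):
    held_out = data[i]
    rest = data[:i] + data[i + 1:]        # everything except one point
    if nearest_label(held_out[0], rest) == held_out[1]:
        hits += 1

success_rate = hits / len(data)           # here: 6 of 6 correct
```

The attraction of leave-one-out is that every point gets tested on a model that never saw it, without having to sacrifice a chunk of an already small dataset.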

The groups are roughly evenly split, so randomly allocating the different batsmen to different groups saw about 20% get put in the right place. The different methods saw these results:

Method                     Success
LDA                        25.00%
QDA                        28.13%
Random Forest (groups)     18.75%
Random Forest (average)    23.75%

I wondered if these were within the range of what I would expect from just randomly allocating batsmen to groups, so I randomly allocated the batsmen to groups 10000 times, and saw what the distribution of the number correct looked like. Below is a graph of that, with the 4 methods on it.
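That random-allocation baseline can be sketched as a permutation test: shuffle the group labels many times and see what success rates chance alone produces. The group sizes here are idealised (32 in each of 5 groups) rather than the actual 29/30/28/36/37 split:

```python
import random

# The random-allocation baseline: shuffle the group labels many times
# and record the success rate of each random assignment. Group sizes
# here are idealised (32 in each of 5 groups), not the actual split.
random.seed(42)
true_groups = [i % 5 + 1 for i in range(160)]   # 160 batsmen, 5 groups

rates = []
for _ in range(10000):
    shuffled = true_groups[:]
    random.shuffle(shuffled)
    correct = sum(a == b for a, b in zip(true_groups, shuffled))
    rates.append(correct / 160)

mean_rate = sum(rates) / len(rates)             # close to 20%
# fraction of random trials strictly beating the best method (23.75%)
beat_best = sum(rate > 0.2375 for rate in rates) / len(rates)
```

Comparing each method's actual success rate against this distribution shows whether it is doing anything better than guessing.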


The method that worked the best here (Random Forest to predict averages) was still beaten by about 1% of the randomly allocated trials.

One of the issues with just looking at the proportion correct is that it doesn't tell us about how many are close to being right. I wondered how they would go if I plotted the actual group against the expected group, and found a measure of goodness of fit for it. The values for the goodness of fit here go from 1 down, where 1 is a perfect fit, 0 is the result of just giving every point the average, and negative values are even worse than that.
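I can't be certain this is the exact statistic used, but a measure with exactly those properties is the familiar R-squared form: 1 minus the ratio of squared prediction errors to the squared errors from always guessing the mean. A minimal sketch with invented numbers:

```python
# A goodness-of-fit measure with the stated properties (the R-squared
# form): 1 is perfect, 0 matches always guessing the mean, negative
# values are worse than that. The numbers below are invented.
actual    = [1, 2, 3, 4, 5, 3, 2, 4]
predicted = [2, 2, 4, 3, 4, 3, 3, 5]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model errors
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # mean-only errors
r_squared = 1 - ss_res / ss_tot
```

Here the model's squared errors are half those of the mean-only guess, so the fit comes out at 0.5.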

Here's how the 4 methods stacked up:

Method                     Goodness-of-fit
LDA                        -0.710
QDA                        -0.505
Random Forest (groups)     -0.624
Random Forest (average)    -0.064

The Random Forest (average) method was the best, but was still not as good a fit as just allocating every batsman to group 3. (That had a goodness of fit value of -0.009). 

I put these methods against a random allocation, to see how well they actually did:

Even though the Random Forest average method was better than random arrangement, it was not as good as just classifying everybody as group 3. 

What this tells us is that this data really is not very useful at all for making a classification. The methods would all have done a reasonable job of distinguishing between the different groups if there was actually a difference between them that could be found. However, there is no real way to distinguish how good a batsman is by the way that they get out.

I also tried some other methods that theoretically shouldn't have been as good, just to see how they performed. The best of them managed a goodness-of-fit of -0.235 with 26% correct classification. That seems reasonable, but it was still not as good as just classifying every batsman as group 3. Any method that isn't as good as that is really worthless for making a classification.

In conclusion, looking at the proportion of ways that a batsman has been dismissed is not particularly helpful in deciding how good they are. The differences within groups are much larger than the differences between groups. What that means is that if you're having an argument with someone on the internet, and they say that a player should be dropped because they get out LBW so often that it shows that they have a bad technique, you can smile smugly to yourself, knowing that they are speaking nonsense. You could even send them a link to this article if you want.

Thursday, 1 November 2018

Comedy Run Out Set to Music 5

In a field of strong contenders, this might be the best yet.

Will Somerville got hit by Ben Horne, dived, probably made his ground (based on fairly low-quality video), then was given out.

It was clearly not his day.