Saturday 24 November 2018

Historical statistical preview of the Second Test, Pakistan vs NZ

I've decided to put together a short summary of some of the historical trends at Dubai, before this match.

First, the probability of different results based on first innings scores. This suggests that a score of 300 is roughly the point where a team is more likely to win than lose, while the 50% winning score is roughly 370.
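The analysis behind those numbers isn't shown here, but one simple way to get estimates like them is to fit curves for the chance of winning and of losing as functions of the first innings score. Below is a rough sketch using logistic regression; the file name and column names are made up for illustration, and this is not necessarily how the figures above were produced.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical file: one row per Test at the ground, with the team batting
# first's score and result ("win", "draw" or "loss").
matches = pd.read_csv("dubai_tests.csv")

X = matches[["first_innings"]].to_numpy()
win_model = LogisticRegression().fit(X, (matches["result"] == "win").astype(int))
loss_model = LogisticRegression().fit(X, (matches["result"] == "loss").astype(int))

scores = np.arange(150, 601, 10).reshape(-1, 1)
p_win = win_model.predict_proba(scores)[:, 1]
p_loss = loss_model.predict_proba(scores)[:, 1]

# The score where a win first becomes more likely than a loss, and the score
# where the chance of winning first passes 50%.
print("win more likely than loss from about:", scores[np.argmax(p_win > p_loss)][0])
print("50% winning score of about:", scores[np.argmax(p_win > 0.5)][0])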


Thursday 22 November 2018

Can we determine a batsman's ability based on how he gets out?

I saw an interesting discussion online recently, suggesting that we could tell that some players had a better technique than others, based on how often they got out to different types of dismissals.

The theory was that players who get bowled or lbw have technical issues, while players who get out caught more often don't have those same issues.

This immediately struck me as a multivariate statistics problem. Can we tell how good a batsman is based on the proportions of their dismissals?

So I gathered a sample of 160 players, all of whom had been dismissed in each of the 5 most common ways at least once, and looked at what we could tell from those players.

I grouped them based on batting average into 5 roughly equally sized groups. The groups were: under 27.5, 27.5 to 37.5, 37.5 to 43.5, 43.5 to 48.5 and over 48.5.

Once I filtered out players who hadn't played enough innings, I ended up with 29 from group 1, 30 from group 2, 28 from group 3, 36 from group 4 and 37 from group 5.

Their distributions were as follows:

There are some differences between the groups, but there seems to be more variation within the groups than between them.

I also looked at the raw numbers, without grouping, and added trend lines.

A pattern emerges: batsmen with a very low or a very high average tend to get bowled and run out more often than players who average between 20 and 50. Instead of getting out bowled, the players averaging between 20 and 50 tend to get out caught more often.

This made me wonder if I could find a technique to classify them effectively. I wasn't hopeful, because again the variation within the groups seemed to be greater than the variation between them, other than from the very extremes to the middle.

The methods that I chose to try were Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Random Forest (categorical) and Random Forest (regression then rounding). I'm aware that most readers won't have studied multivariate statistics, so I'll briefly explain how these methods work. If you don't care how they work, click here to skip.

Linear and Quadratic Discriminant Analysis (LDA and QDA) can be imagined as plotting all the information on a giant multi-dimensional graph, then rotating the axes until the different groups are separated as much as possible. The Linear version assumes that straight lines separate the different groups, while the Quadratic version allows the groups to be separated by a curve. The Quadratic version is more powerful, but it needs more data to get an answer.

An example of how LDA works is in the two graphs below. There's a set of data that has two variables and two groups, and that can't be split well using either variable alone (displayed on the graph on the left). Making a decision based just on Variable 1, the best split gets 20 of the 30 points classified correctly, and the best split on Variable 2 gets 22 of the 30 correct. But if we put in a set of rotated axes (the green lines) and redraw the graph (on the right), the two groups can be split quite well by whether they are greater or less than -0.45 on the new rotated x axis. By splitting on the rotated axis, 27 of the 30 points are now classified correctly.
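To make the idea concrete, here's a minimal sketch of the same kind of example in Python with scikit-learn, on made-up data (two variables, two groups of 15 points each). The numbers won't reproduce the 20/22/27 figures above, which come from the specific data in the graphs; the point is just that classifying on the rotated LDA axis does better than either raw variable.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up data: two overlapping groups that neither variable separates cleanly on its own.
rng = np.random.default_rng(1)
cov = [[1.0, 0.8], [0.8, 1.0]]
group_a = rng.multivariate_normal([0.0, 0.0], cov, size=15)
group_b = rng.multivariate_normal([1.0, -1.0], cov, size=15)
X = np.vstack([group_a, group_b])
y = np.array([0] * 15 + [1] * 15)

# Fit LDA: the single discriminant direction is the "rotated x axis" in the graphs.
lda = LinearDiscriminantAnalysis(n_components=1)
rotated = lda.fit_transform(X, y)

# Classify each point by which side of the best split on that rotated axis it falls.
print("correct out of 30:", (lda.predict(X) == y).sum())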


Random Forest is a technique that creates lots of decision trees, each built from a random sample of the data and using only some of the variables, and then combines their predictions, by majority vote when predicting groups or by averaging when predicting numbers. It's a harder method to explain easily, but the process itself is actually reasonably simple. It's often remarkably effective for making predictions, but it can be hard to explain how the final model actually works.
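As a rough sketch (not the code used for this article), the two Random Forest variants might look like the following, with made-up dismissal proportions standing in for the real data. The "groups" version predicts the group label directly, and the "regression then rounding" version predicts a number that is then rounded back to a group.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Made-up data: one row per batsman, columns = proportions of dismissals that
# were bowled, caught, lbw, run out and stumped (each row sums to 1), plus a
# group label from 1 (lowest averages) to 5 (highest averages).
rng = np.random.default_rng(0)
proportions = rng.dirichlet(np.ones(5), size=160)
groups = rng.integers(1, 6, size=160)

# "Random Forest (groups)": predict the group label directly.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(proportions, groups)

# "Random Forest (regression then rounding)": predict a number, then round it
# back to the nearest group (the real analysis may instead have regressed the
# batting average itself and mapped it back through the group boundaries).
reg = RandomForestRegressor(n_estimators=500, random_state=0)
reg.fit(proportions, groups)
predicted_groups = np.clip(np.round(reg.predict(proportions)), 1, 5).astype(int)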

To test the four methods, I first tried leave-one-out cross validation to see how they performed. This is where I use all but one data point to build the model, then test that model on the remaining data point and see whether it was allocated correctly, repeating this for all 160 batsmen.
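Leave-one-out cross validation is straightforward to sketch in code. This is not the original script, just an illustration with LDA and the same kind of made-up data as above; the real inputs would be the 160 batsmen's dismissal proportions and group labels.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
proportions = rng.dirichlet(np.ones(5), size=160)
groups = rng.integers(1, 6, size=160)

# The five proportions sum to 1, so drop one column to avoid collinearity in LDA.
X = proportions[:, :4]

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # Fit on all batsmen except one, then check the prediction for the one left out.
    model = LinearDiscriminantAnalysis().fit(X[train_idx], groups[train_idx])
    correct += int(model.predict(X[test_idx])[0] == groups[test_idx][0])

print("proportion classified correctly:", correct / len(groups))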

The groups are roughly evenly split, so randomly allocating the batsmen to groups would put about 20% in the right place. The different methods gave these results:

Method                     Success
LDA                        25.00%
QDA                        28.13%
Random Forest (groups)     18.75%
Random Forest (average)    23.75%

I wondered if these were within the range of what I would expect from just randomly allocating batsmen to groups, so I randomly allocated the batsmen to groups 10000 times, and saw what the distribution of the number correct looked like. Below is a graph of that, with the 4 methods on it.
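The random allocation baseline is easy to simulate. Here's a rough sketch, assuming "randomly allocating the batsmen to groups" means shuffling the group labels among the 160 batsmen while keeping the group sizes fixed.

import numpy as np

rng = np.random.default_rng(0)
# The true group labels: 29, 30, 28, 36 and 37 batsmen in groups 1 to 5.
true_groups = np.repeat([1, 2, 3, 4, 5], [29, 30, 28, 36, 37])

# Shuffle the labels 10,000 times and count how many land in the right group each time.
n_correct = np.array([(rng.permutation(true_groups) == true_groups).sum()
                      for _ in range(10_000)])

# Summarise the distribution; each method's count of correct allocations can then
# be compared against it.
print("mean correct:", n_correct.mean())
print("99th percentile:", np.percentile(n_correct, 99))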


The method that worked the best here (Random Forest to predict averages) was still beaten by about 1% of the randomly allocated trials.

One of the issues with just looking at the proportion correct is that it doesn't tell us how many predictions were close to being right. I wondered how the methods would go if I plotted the actual group against the predicted group and found a measure of goodness of fit for it. The goodness of fit values here go from 1 down, where 1 is a perfect fit, 0 is the result of just predicting the average for every point, and negative values are even worse than that.
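That description matches the usual R-squared-style measure; assuming that's what is meant, it can be computed as below, treating the group numbers 1 to 5 as values. Reassuringly, predicting group 3 for everyone gives roughly the -0.009 figure quoted further down, because the mean group number is slightly above 3 with these group sizes.

import numpy as np

def goodness_of_fit(actual, predicted):
    # 1 minus the squared error of the predictions, relative to the squared
    # error of just predicting the mean of the actual values for everyone.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

true_groups = np.repeat([1, 2, 3, 4, 5], [29, 30, 28, 36, 37])
print(goodness_of_fit(true_groups, np.full(160, 3)))  # about -0.009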

Here's how the 4 methods stacked up:

Method                     Goodness-of-fit
LDA                        -0.710
QDA                        -0.505
Random Forest (groups)     -0.624
Random Forest (average)    -0.064

The Random Forest (average) method was the best, but was still not as good a fit as just allocating every batsman to group 3. (That had a goodness of fit value of -0.009). 

I put these methods against a random allocation, to see how well they actually did:

Even though the Random Forest average method was better than random arrangement, it was not as good as just classifying everybody as group 3. 

What this tells us is that this data really is not very useful for making a classification. The methods would all have done a reasonable job of distinguishing between the groups if there were actually a difference between them to find. However, there is no real way to tell how good a batsman is from the way that they get out.

I also tried some other methods that theoretically shouldn't have been as good, just to see how they went. I managed to get a couple with a goodness-of-fit as high as -0.235 and 26% correct classification. These seem reasonable, but they were still not as good as just classifying every batsman as group 3. Any method that isn't as good as that is really worthless for making a classification.

In conclusion, looking at the proportions of the ways that a batsman has been dismissed is not particularly helpful in deciding how good they are. The differences within groups are much larger than the differences between groups. What that means is that if you're having an argument with someone on the internet, and they say that a player should be dropped because getting out lbw so often shows that they have a bad technique, you can smile smugly to yourself, knowing that they are speaking nonsense. You could even send them a link to this article if you want.

Thursday 1 November 2018

Comedy Run Out Set to Music 5

In a field of strong contenders, this might be the best yet.

Will Somerville got hit by Ben Horne, dived, probably made his ground (based on fairly low quality video), then was given out.

It was clearly not his day.