Wednesday, 2 January 2019

Paine vs Pant

The Instagram photo. 
I wanted to quickly share my thoughts about the Paine - Pant sledge.

I've been an outspoken critic of "mental disintegration" -- the tactic of using personal abuse and insults to get under a player's skin and put them off their game, but I really liked what I heard from Paine, and think it's the sort of sledging that is totally appropriate.

Saturday, 24 November 2018

Historical statistical preview of the Second Test, Pakistan vs NZ

I've decided to put together a short summary of some of the historical trends at Dubai, before this match.

First, the probability of different results based on first innings scores. This suggests that a score of 300 is roughly the point where a team is more likely to win than lose, while the 50% winning score is roughly 370.

Thursday, 22 November 2018

Can we determine a batsman's ability based on how he gets out?

I saw an interesting discussion online recently, suggesting that we could tell that some players had a better technique than others, based on how often they got out to different types of dismissals.

The theory was that players who get bowled or lbw have technical issues, while players who get out caught more often don't have those same issues.

This immediately struck me as a multivariate statistics problem. Can we tell how good a batsman is based on the proportions of dismissals?

So I gathered together a sample of 160 players, all of whom had been dismissed in the 5 most common ways at least once each, and looked at what we could tell based off those players.

I grouped them based on batting average into 5 roughly equally sized groups. The groups were: under 27.5, 27.5 to 37.5, 37.5 to 43.5, 43.5 to 48.5 and over 48.5.

Once I filtered out players who hadn't played enough innings, I ended up with 29 from group 1, 30 from group 2, 28 from group 3, 36 from group 4 and 37 from group 5.

Their distributions were as follows:

There are some differences between the groups, but there seems to be more variation within the groups than between them.

I also looked at the raw numbers, without grouping, and adding in trend lines.

A pattern emerges: batsmen with a very low or a very high average tend to get bowled and run out more often than players who average between 20 and 50. Instead of getting out bowled, the players who average between 20 and 50 tend to get out caught more often.

This made me wonder if I could find some technique to group them effectively. I wasn't hopeful, because again the variation within the different groups seemed to be greater than the difference between them, except when comparing the extremes with the middle.

The methods that I chose to try were Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Random Forest (categorical) and Random Forest (regression then rounding). I'm aware that most readers won't have studied multivariate statistics, so I'll briefly explain how these methods work. If you don't care how they work, click here to skip.

Linear and Quadratic Discriminant Analysis (LDA and QDA) can be imagined like plotting all the information on a giant multi-dimensional graph, then rotating the axes until the different groups are separated as much as possible. The Linear version assumes that there are straight lines that separate the different groups, while the Quadratic version allows the groups to be separated by a curve. The Quadratic version is the more powerful technique, but it needs more data to be able to get an answer.

An example of how LDA works is in the two graphs below. There's a set of data that has two variables and is in two groups, and the groups can't be easily distinguished by splitting on either variable alone (displayed on the graph on the left). By making a decision based just on Variable 1, the best split gets 20 out of the 30 points classified correctly, and the best split on Variable 2 gets 22 out of the 30 points classified correctly. But if we put in a set of rotated axes (the green lines) and redraw the graph (on the right), the two groups can be split quite well by being greater than or less than -0.45 on the new rotated x axis. By splitting on the rotated axis, 27 out of the 30 points are now classified correctly.
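The rotation idea can be sketched in a few lines of Python. This is only an illustration: the eight points are made up (they are not the 30 points from the graphs), and the 135 degree angle is chosen by hand, whereas LDA would estimate the best direction from the data. The helper names `best_split` and `rotate_x` are hypothetical.

```python
import math

# Made-up points: group A sits just above the diagonal, group B just below it.
points = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 1), (3, 2), (4, 3), (5, 4)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

def best_split(values, labels):
    """Most points any single threshold on `values` classifies correctly."""
    cuts = sorted(set(values))
    best = 0
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        # Count points that are right if "below t" means group A.
        below_is_a = sum((v < t) == (lab == "A") for v, lab in zip(values, labels))
        # The opposite rule ("below t" means group B) gets the rest right.
        best = max(best, below_is_a, len(values) - below_is_a)
    return best

def rotate_x(points, angle_degrees):
    """Coordinate of each point along an x axis rotated by the given angle."""
    a = math.radians(angle_degrees)
    return [x * math.cos(a) + y * math.sin(a) for x, y in points]

on_variable_1 = best_split([x for x, _ in points], labels)
on_variable_2 = best_split([y for _, y in points], labels)
on_rotated = best_split(rotate_x(points, 135), labels)
print(on_variable_1, on_variable_2, on_rotated)  # 5 5 8
```

Neither variable on its own gets more than 5 of the 8 toy points right, but a single cut on the rotated axis separates all 8.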

Random Forest is a technique that creates lots of decision trees, each based on a random sample of the data and only some of the variables. Then it combines them, taking the most common answer when predicting a group, or the average when predicting a number. This is a harder method to explain easily, but the process itself is actually reasonably simple. It's often remarkably effective for making predictions, but it can be hard to explain how the final model actually works.

To test the four methods, I first of all tried doing leave-one-out cross validation to see how they performed. This is where I use all but one data point to build the model, then test that model on the remaining data point and see how well it was allocated - I do this for all 160 batsmen.
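Here is a minimal sketch of the leave-one-out loop. To keep it self-contained it uses a simple nearest-centroid classifier as a stand-in, not LDA, QDA or Random Forest, and the six rows of dismissal proportions are invented purely to show the mechanics. Both function names are hypothetical.

```python
def nearest_centroid_predict(train, point):
    """Predict the label whose training centroid is closest to `point`."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1

    def dist2(label):
        # Squared distance from the point to this label's centroid.
        return sum((s / counts[label] - x) ** 2 for s, x in zip(sums[label], point))

    return min(sums, key=dist2)

def leave_one_out_accuracy(data, predict):
    """Hold out each row in turn, fit on the rest, and score the held-out row."""
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if predict(train, features) == label:
            correct += 1
    return correct / len(data)

# Invented (bowled, caught, lbw) proportions with a group label for each row.
toy = [
    ((0.30, 0.50, 0.20), 1), ((0.32, 0.48, 0.20), 1), ((0.28, 0.52, 0.20), 1),
    ((0.10, 0.70, 0.20), 2), ((0.12, 0.68, 0.20), 2), ((0.08, 0.72, 0.20), 2),
]
print(leave_one_out_accuracy(toy, nearest_centroid_predict))  # 1.0
```

The toy groups are deliberately far apart, so every held-out row lands in the right group. Any function with the same `predict(train, features)` shape could be swapped in.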

The groups are roughly evenly split, so randomly allocating the different batsmen to different groups saw about 20% get put in the right place. The different methods saw these results:

Random Forest (groups): 18.75%
Random Forest (average): 23.75%

I wondered if these were within the range of what I would expect from just randomly allocating batsmen to groups, so I randomly allocated the batsmen to groups 10000 times, and saw what the distribution of the number correct looked like. Below is a graph of that, with the 4 methods on it.
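A random-allocation baseline along these lines can be sketched as follows. It assumes each batsman is independently given one of the 5 groups with equal probability, which may not be exactly how the allocation above was done. The group sizes come from the post, while the function names and the seed are hypothetical.

```python
import random

GROUP_SIZES = [29, 30, 28, 36, 37]  # batsmen in groups 1 to 5, from the post

def random_allocation_counts(group_sizes, n_trials=10000, seed=0):
    """Number of correctly placed batsmen in each random allocation trial."""
    rng = random.Random(seed)
    # One entry per batsman, holding that batsman's true group number.
    actual = [g for g, size in enumerate(group_sizes, start=1) for _ in range(size)]
    n_groups = len(group_sizes)
    counts = []
    for _ in range(n_trials):
        # Assumption: every batsman gets a uniformly random group.
        counts.append(sum(rng.randint(1, n_groups) == g for g in actual))
    return counts

def share_at_least(counts, threshold):
    """Fraction of trials that placed at least `threshold` batsmen correctly."""
    return sum(c >= threshold for c in counts) / len(counts)

counts = random_allocation_counts(GROUP_SIZES)
print(sum(counts) / len(counts))   # average number correct, close to 32 of 160
print(share_at_least(counts, 38))  # share of trials matching 23.75% correct
```

The threshold of 38 is 23.75% of 160 batsmen. The exact share reported depends on the seed and on how the allocation is defined.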

The method that worked the best here (Random Forest to predict averages) was still beaten by about 1% of the randomly allocated trials.

One of the issues with just looking at the proportion correct is that it doesn't tell us about how many are close to being right. I wondered how they would go if I plotted the actual group against the expected group, and found a measure of goodness of fit for it. The values for the goodness of fit here go from 1 down, where 1 is a perfect fit, 0 is the result of just giving every point the average, and negative values are even worse than that.

Here's how the 4 methods stacked up:

Random Forest (groups): -0.624
Random Forest (average): -0.064

The Random Forest (average) method was the best, but was still not as good a fit as just allocating every batsman to group 3. (That had a goodness of fit value of -0.009). 
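The description of the goodness of fit matches the usual coefficient of determination (R squared), so here is a short sketch under that assumption. Using the group sizes given earlier and putting every batsman in group 3, it gives a value that rounds to the -0.009 quoted above. The function name is hypothetical.

```python
def goodness_of_fit(actual, predicted):
    """R squared: 1 is perfect, 0 matches predicting the mean, negative is worse."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

GROUP_SIZES = [29, 30, 28, 36, 37]  # batsmen in groups 1 to 5, from the post
actual = [g for g, size in enumerate(GROUP_SIZES, start=1) for _ in range(size)]
all_group_3 = [3] * len(actual)

print(round(goodness_of_fit(actual, all_group_3), 3))  # -0.009
```

The value is slightly negative because the average group number is a little above 3, so predicting 3 for everyone is marginally worse than predicting the mean.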

I put these methods against a random allocation, to see how well they actually did:

Even though the Random Forest average method was better than random arrangement, it was not as good as just classifying everybody as group 3. 

What this tells us is that this data really is not very useful at all for making a classification. We know that the methods would all have done a reasonable job of distinguishing between the different groups if there was actually a difference between them that could be found. However, there is not any real way to distinguish how good a batsman is by the way that they got out.

I also tried some other methods that theoretically shouldn't have been as good, just to see how well they went. I managed to get a couple that had a goodness-of-fit as high as -0.235 with 26% correct classification. These seem to be reasonable, but they were still not as good as just classifying every batsman as group 3. Any method that isn't as good as that is really worthless for making a classification.

In conclusion, looking at the proportion of ways that a batsman has been dismissed is not particularly helpful in deciding how good they are. The differences within groups are much larger than the differences between groups. What that means is that if you're having an argument with someone on the internet, and they say that a player should be dropped because they get out LBW so often that it shows that they have a bad technique, you can smile smugly to yourself, knowing that they are speaking nonsense. You could even send them a link to this article if you want.

Thursday, 1 November 2018

Comedy Run Out Set to Music 5

In a field of strong contenders, this might be the best yet.

Will Somerville got hit by Ben Horne, dived, probably made his ground (based on fairly low-quality video), then was given out.

It was clearly not his day.

Wednesday, 31 October 2018

Comedy run out set to music 4

Sean Solia has had a bit of a golden run recently. His List A averages are scarcely believable, averaging 56 with the bat and 18 with the ball. But he learned today that it's not a good idea to go out walking when the ball isn't dead.

Also watch for Henry Nicholls getting high fived in the face by Tom Latham.

Monday, 29 October 2018

Comedy run out 3

Time for the 3rd instalment of comedy run outs set to music.

The context here was that it was the start of the 46th over, Auckland vs Wellington in a 50 over match. The pair at the wicket had added 66 runs in just over 8 overs, but this started a collapse where Auckland lost 4 wickets for 21 over 3.4 overs.

I hope you enjoy.

Sunday, 21 October 2018

Plunket Shield update - Round 2

At the end of round 2, I thought it would be good to do an update on the progress of the tournament, and look at some trends that have emerged.

One thing that I thought I would focus on is how the runs have been scored, rather than just how many.

For each innings, I looked at the total runs from boundaries, and the total other runs (I called them run runs, but they include no balls and wides, as they were too hard to separate).

I plotted them on a graph, to see if any interesting patterns emerged.

There were a couple of things that I noticed. Auckland, Otago, Wellington and Canterbury have all had similar rates across the different innings that they've batted, while Northern Districts and Central Districts have had more variety in how they've accumulated their runs.

The triangles seemed to be higher up the chart on average, with all of them being above the median boundary rate, so I thought that I'd see if there was a correlation between the rates and the total competition points gathered in a match.

There is a reasonably strong relationship between the boundary rate and the points earned in a match. However, there's almost no relationship at all between the speed of accumulation of non-boundary runs and the points earned.

There is a theory that regularly rotating the strike makes it easier to survive a match, as it doesn't allow the bowlers to settle. I certainly know that I hated batsmen hitting singles off my bowling, and I remember Dale Steyn saying in a press conference something to the effect of "I don't mind dropped catches that much. Dropped catches happen. But I get really upset when a fielder lets a batsman get off strike when I had him under pressure."

Of the 7 innings played by a losing side, 5 of them had a run-runs rate below the median. That made me wonder if there was a pattern there. I looked at the final innings by teams that batted out a draw or lost, and looked to see if there was a difference in the rates for the teams that lost vs the teams that drew.

This graph isn't particularly meaningful at the moment, with only 5 innings to look at, but I intend on building this up as the season goes on.

Looking at it as individual points makes it more clear:

I've circled the point at the bottom, because that was an innings where Canterbury lost their last wicket with only 6 balls remaining, and so it was very close to being a saved match. Interestingly the teams that have scored a lot of boundaries have lost, but it is a very small sample to be drawing too many conclusions from.

The final table, with other information, looks like this:

This made me wonder which correlation was stronger, scoring rate with total points, or the traditional value of Net Average Runs Per Wicket (batting average minus bowling average).
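For reference, Net Average Runs Per Wicket as described here is a one-line calculation. The sketch below uses made-up totals, and the function and parameter names are hypothetical.

```python
def net_average_runs_per_wicket(runs_scored, wickets_lost, runs_conceded, wickets_taken):
    """Batting average minus bowling average for a team."""
    batting_average = runs_scored / wickets_lost
    bowling_average = runs_conceded / wickets_taken
    return batting_average - bowling_average

# Made-up totals: 600 runs for 20 wickets batting, 500 runs for 20 wickets bowling.
print(net_average_runs_per_wicket(600, 20, 500, 20))  # 5.0
```

A positive value means the side scores more runs per wicket than it concedes.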

The Net Average Runs Per Wicket seems to be a better predictor of success, but there is a clear relationship with the scoring rate also.

I'll be interested to see how these develop as the season progresses, but for now we seem to have a separation between the sides, with Auckland, Canterbury and Otago all needing to find another gear for the next round.