Tuesday 6 August 2019

Second only to Bradman?

Steven Smith has just celebrated his test comeback by scoring a century in each innings at Edgbaston in Birmingham. Not content with just scoring a "come-from-behind fighting century" when the bowlers were on top, he also added a "rub the salt in" century when the batsmen were on top.

It was such a match-defining performance that the question has been asked again: is he the best since Bradman?

I won't attempt a complete statistical breakdown here, but I will focus on a couple of statistics that suggest either "yes" or "not quite."

One thing that I've become more and more interested in is the performance of a batsman at their peak. It is hard to deny that a batsman's skill level changes throughout their career. Some start off as amazing players but then fade; others start slowly, then blossom into better players. Most start off slowly, have a strong middle period of their career, then fade again at the end.

The graph below illustrates three players that had quite different career trajectories, but were all very good players.

Denis Compton started off with an amazing run of scores; only Don Bradman averaged more in his first 30 test matches. His career never really reached those heights again, however, and he had a period where he really struggled, before modifying his game and ending his career on a (less dramatic) high.

Martin Crowe was picked as a teenager, and sent on a difficult tour, before he was really ready. He struggled and was in and out of the side at first. It took him a while to really own his position. After a while, he developed into one of the best batsmen in the world. Later on he struggled with injuries and his career petered out to a shadow of what he had previously been.

Marvan Atapattu scored only one run in his first 6 innings. That start was not an easy one to recover from. Throughout his career he tended to have a mixture of exceptionally large scores and regular ducks, which meant that it looked like he had patchy form. But after his horrific early period, he tended to average above 40 in any given 30 match sequence for the majority of his career.

The story is clear, however, that an overall career average does not necessarily tell us about how good a player actually was. Looking at a player's peak is actually a better idea than looking at their overall career. That's especially true when comparing former players with current ones, or comparing players who retired at their peak with ones who continued on because even though they were no longer at their best, they were still better than the alternatives.

To compare players at their peak requires finding a way to define their peak. It's difficult to know how many matches to choose as a player's peak. It will certainly differ from player to player. Some will maintain their peak form for a number of years, while others may get injured, banned for ball tampering or retire just as they are starting to hit it. Added to that, the number of tests played has greatly increased for most nations, so while an old player like Jack Cowie never missed a test for 12 years and yet never made it to 30, someone playing for England now could potentially reach 30 tests after only playing test cricket for 20 months.

There's also the issue of sampling variability in small samples. If we look at 30 tests as defining a player's peak, that makes a maximum sample size of 60 innings (more likely to be closer to 55). 50 tests would make a maximum sample size of 100 innings (more likely to be close to 90).

If we simulate innings based on a player with a batting average of 45, we can find the range of likely 30 match and 50 match averages if the results are distributed randomly. For this, I've used a geometric distribution to create random scores, and then found the average of them. This has been shown to be a reasonably useful way of simulating cricket scores, so it will give some indication of the expected variance in the averages.

The red and green lines here are the 95% bands for the simulated data. With the 30 match averages, the player who should have averaged 45 tended to average somewhere between 33 and 58. With 50 matches, the player tended to average between 36 and 54.
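A minimal sketch of this kind of simulation, for anyone who wants to try it. It is not the code behind the graph: the function names, the seed, the number of trials and the choice of 55 and 90 innings are all assumptions, and it treats every innings as ending in a dismissal.

```python
import math
import random

def simulate_average(mean_score, innings, rng):
    """Average of `innings` scores drawn from a geometric distribution on 0, 1, 2, ..."""
    p = 1.0 / (mean_score + 1.0)           # per-run chance of dismissal that gives this mean
    log_q = math.log(1.0 - p)
    total = 0
    for _ in range(innings):
        u = 1.0 - rng.random()             # uniform on (0, 1], so log(u) is always defined
        total += int(math.log(u) / log_q)  # inverse-transform sample of the score
    return total / innings                 # assumes the batsman is out in every innings

def band_95(mean_score, innings, trials=5_000, seed=1):
    """Central 95% band of the simulated averages."""
    rng = random.Random(seed)
    averages = sorted(simulate_average(mean_score, innings, rng) for _ in range(trials))
    return averages[int(0.025 * trials)], averages[int(0.975 * trials) - 1]

low_30, high_30 = band_95(45, 55)   # roughly 55 innings in 30 tests
low_50, high_50 = band_95(45, 90)   # roughly 90 innings in 50 tests
print(round(low_30), round(high_30), round(low_50), round(high_50))
```

The bands it prints should land close to the ones quoted above, and the 50 match band should always be the narrower of the two.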

This needs to be remembered whenever comparing averages. A peak can be a player's skill improving, or it can be just random variation. Someone who averages 52 is not necessarily a better player than another who averaged 49. It is just not possible to be confident statistically that there's a difference between these two players' abilities. That's just based on sampling variability, and not accounting for non-sampling factors such as the opposition that they faced or the conditions that they played in.

Given that, is there any point in comparing at all? Well, it's not going to definitively say who was the best, but it can tell us who played the best.

For this analysis I am only including matches for players where they actually batted. As a result Don Bradman only has 50 tests, as there are two where he got injured fielding/bowling and did not end up batting. I am also not including the WSC Supertests or any matches played for the ICC World XI.

The top 21 instances of the best 30 matches by either average or total runs are the 21 combinations of 30 in a row out of Bradman's 50 matches.

He is so far ahead of the rest of the players in history that in his worst ever 30 matches he still scored 14% more runs than the best 30 matches by any other player.
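Finding the best run of 30 consecutive matches is a sliding-window problem. The sketch below is one way to do it, assuming each match is stored as a hypothetical (runs, dismissals) pair in career order; the function name and the sample data are made up for illustration.

```python
def best_window(matches, window=30):
    """matches: list of (runs, dismissals) per match, in career order.
    Returns (best_average, start_index) over every run of `window` consecutive matches,
    or None if the career is shorter than the window."""
    if len(matches) < window:
        return None
    runs = sum(r for r, _ in matches[:window])
    outs = sum(d for _, d in matches[:window])
    best = (runs / outs if outs else float("inf"), 0)
    for start in range(1, len(matches) - window + 1):
        old_runs, old_outs = matches[start - 1]           # match leaving the window
        new_runs, new_outs = matches[start + window - 1]  # match entering the window
        runs += new_runs - old_runs
        outs += new_outs - old_outs
        average = runs / outs if outs else float("inf")
        if average > best[0]:
            best = (average, start)
    return best

# A 50 match career has 50 - 30 + 1 = 21 different 30 match windows.
n_windows = 50 - 30 + 1

# Tiny made-up career, using a window of 2 so the answer is easy to check by hand.
example = best_window([(10, 2), (100, 2), (200, 1), (0, 2)], window=2)
```

Ranking by total runs instead of average only needs the comparison changed from `average` to `runs`.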

Here are the tables of the top 10.


The top name is consistent, but the other names in the tables are much less consistent. 18 players appear at least once, with Bradman, Ponting, Sangakkara and Smith being in all 4 tables, while Sobers, Kallis and Yousuf all appear 3 times.

This does not tell us definitively who is second. There is enough sampling variation alone that there's not enough evidence to say that Waugh was better in his best 30 innings than Hayden was, just that he performed better. But that's really all we can hope for.

Steven Smith may not be the best since Bradman, but he may well be.

Sunday 14 July 2019

Statistical preview, World Cup final, New Zealand vs England

Here is a brief statistical preview.

Recent head to head:

In the past 5 years England lead 8-5.
In the past 2 years England lead 6-3.

At Lord's the ball tends to bounce a bit more. As a result it tends to not suit England as much as their other home grounds. It is the only ground at which England have a losing record over the past few years, with 3 wins and 4 losses in the last 6 years.

It is also a ground where scores have been defended quite regularly.

The slope, large straight boundaries and the bounce combine to make a more bowler-friendly ground than most in England, but grounds in this World Cup have not exactly gone to type.

Adding in times where New Zealand bat first, and where England bowl first, gives the following result:

New Zealand had a clear plan to use the pressure of the situation as a weapon to help them defeat India, and the pressure from playing at home may do the same against England.

The model that I used to build my simulation has England at 69.8%, while New Zealand are at 30.2%. That feels about right too: New Zealand have a realistic chance, but England are certainly favourites.

The bookies have England at 73%, CricViz have England at 68%, New Zealand at 30% and a tie at a fairly high 2%.

The two teams are close enough that nobody can say exactly who will win, but it is a World Cup final - that's exactly how it should be.

A better World Cup format

As this world cup draws to a close, I've been thinking about the positives and negatives of the format.

There are quite a few of both.

Firstly the positives:

  1. Everyone plays each other.
  2. Not too many matches that seem like a mismatch on paper.
  3. Teams that lose a couple of games early still have the chance to compete.
  4. Guaranteed 9 matches for India, so the ICC get enough money to keep growing international cricket.
  5. There was a match or two every day through the majority of the tournament, so that the momentum built towards the finals.

Then the negatives:

  1. Not enough representation from lower level teams. The qualification was too difficult, and so the goal of making the world cup became unrealistic for most teams.
  2. There were only 3 matches in the final week, meaning that the momentum was lost.
  3. Dead rubbers, or similar - 3 teams were effectively eliminated with 2 weeks to go.
  4. Incomplete rounds - India being 2 matches behind made the narratives and changes in fortune less obvious.
  5. Pitches were too different from how they've played over the past 4 years, meaning that luck played too large a role in the event.

In my opinion, the negatives are too great for it to be a good idea to continue with the same format. But the positives are things worth keeping.

So, using those positives as constraints as much as possible, and also keeping the tournament to the same length, I have come up with a format that I believe will make for a better event.

Friday 5 July 2019

World Cup Simulation update - 5 July

Here's the latest update for the World Cup simulation. I have New Zealand at 100%, but that's simply due to the probability of Pakistan getting the required run-rate being so low that that possibility never eventuated in the 50000 trials that I used. The probability of Pakistan going through is slightly lower than the probability of someone being shot accidentally by a dog running along a beach while holding a handgun in its mouth during the next week.

The next graph is the expected points. The simulation has had the correct top 4 from the second match on; however, the expected points and the order of the teams have changed considerably.
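An event that never shows up in 50000 trials is not proven impossible; the trials only put a ceiling on how likely it can be. The sketch below works out that ceiling, assuming the trials are independent. The function name is an assumption, and the 95% confidence level is simply a conventional choice.

```python
def upper_bound_zero_hits(trials, confidence=0.95):
    """Largest event probability still consistent, at the given confidence,
    with seeing the event 0 times in `trials` independent trials."""
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)

bound = upper_bound_zero_hits(50_000)
rule_of_three = 3 / 50_000   # the usual quick approximation to the same bound
print(f"{bound:.6f} {rule_of_three:.6f}")
```

Both come out at roughly 6 in 100,000, so a simulated 100% really means "at least about 99.994%".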

The top 4 was looking fairly likely from about match number 6 on. There was some excitement from the two upset losses by England, but Pakistan never got beyond 40% on the simulation.

The complete make up of the semi-finalists has not yet been decided, nor has the team in 5th place. Pakistan, Bangladesh and Sri Lanka could all end up 5th. 

Next I looked at the winning probability. This is getting close to the point where it can be calculated analytically without much trouble.

The next thing to look at is the rankings. A thing to remember here is that it is all relative to Afghanistan, so everybody going up is more an indication that Afghanistan has gone down.

The order that the teams are in here is the same as David Kendix' official rankings order, with one exception - I have India ahead of England, rather than the other way round.

Finally, a little graph to show what Pakistan needs to do to make the semi-finals. They need to keep Bangladesh below the green line.

Monday 1 July 2019

World Cup simulation update - 1 July

Here's the latest update to the simulation. The first two graphs disagree slightly, and that's because I have two different methods to calculate the expected net run rate. The first one seemed to be slightly more accurate than the second, but there was not a big difference when I tested them. (The margin of victory in cricket matches is actually really difficult to estimate - teams batting second tend to cruise to victory rather than try to win by as big a margin as possible.) I decided to use both when doing the calculations. With the first method, New Zealand and India both have a higher than 99.98% probability of going through, while it's 99% for India and 97.7% for New Zealand with the second method. The latter figures seem more realistic.

The big thing to notice is the change to England's probability, and how England beating India damaged the chances of both Pakistan and Bangladesh. Pakistan's probability went down by slightly more than Bangladesh's probability because the ranking of India dropped slightly, and Bangladesh need to beat India to get through.

This graph shows the expected value, not the most likely value. Those are actually different things. The expected value is the mean of all the possible outcomes, weighted by their probabilities. As a result, none of the teams will actually end up with exactly the points that this shows, but they should mostly get close to it.
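A small worked example of the difference, using a made-up team on 8 points with two matches left and win probabilities of 0.7 and 0.6. The numbers and the function name are assumptions for illustration, and washouts and ties are ignored.

```python
from itertools import product

def points_distribution(current_points, win_probs, points_per_win=2):
    """Probability of each possible final points total, ignoring ties and washouts."""
    dist = {}
    for results in product([0, 1], repeat=len(win_probs)):
        prob = 1.0
        for won, p in zip(results, win_probs):
            prob *= p if won else 1.0 - p
        total = current_points + points_per_win * sum(results)
        dist[total] = dist.get(total, 0.0) + prob
    return dist

dist = points_distribution(8, [0.7, 0.6])
expected = sum(points * prob for points, prob in dist.items())
most_likely = max(dist, key=dist.get)
print(round(expected, 1), most_likely)
```

Here the expected value is 10.6 points, which the team cannot actually finish on, while the single most likely total is 10.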

It's now looking like there's a roughly 45% chance that net run rate will be a deciding factor in who goes through to the semi-finals.

If Bangladesh beat India (which is admittedly a fairly unlikely outcome), we could then see a situation where Pakistan and Bangladesh are playing for the opportunity to be level on points with New Zealand and India on 11 points. If that is the case, then (in all likelihood) the rained out match between New Zealand and India will have allowed both to progress at the expense of the winner of Pakistan vs Bangladesh.

The most likely semi-finals at this point are Australia vs New Zealand and England vs India, but these are by no means confirmed yet.

In individual matches, England effectively has a higher ranking than that, because teams playing at home get a ranking boost of 0.86 over their opponent. That's why I have England back on top in the next graph:
This one is quite different to what the bookmakers have. I have England as favourites, while they have India and Australia both tied for favourite on roughly 30%. They also have Pakistan and Bangladesh at about double the probability that I do.
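To give a feel for what a home boost of 0.86 could mean in a single match, here is a sketch that assumes the rankings sit on a logistic (log-odds) scale. That scale is an assumption made for this example, not something stated about the model, and the function name is made up.

```python
import math

HOME_BOOST = 0.86  # the home ranking boost quoted above

def win_probability(rating_a, rating_b, a_at_home=False):
    """Chance that team A beats team B, ASSUMING ratings are log-odds on a logistic scale."""
    diff = rating_a - rating_b + (HOME_BOOST if a_at_home else 0.0)
    return 1.0 / (1.0 + math.exp(-diff))

neutral = win_probability(1.0, 1.0)                  # two equally ranked teams
at_home = win_probability(1.0, 1.0, a_at_home=True)  # same teams, A playing at home
print(round(neutral, 2), round(at_home, 2))
```

Under that assumption, two equally ranked teams go from an even contest at a neutral ground to roughly 70/30 in favour of the home side.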

I used the first net run rate model for the winning probability, but the difference in numbers suggests that the bookies are possibly using a model that is more similar to the second one.

Wednesday 26 June 2019

World Cup simulation update - 26 June

Are the wheels falling off?

England have now got a 4 win, 3 loss record, and, with 2 difficult matches coming up, have a genuine chance of not going through to the semi-finals. They are still not relying on other results, but they're getting close to the point where they are.

There's been a significant change, with Australia going up, and England going down. England are now expected to get to 10 points. That might still be enough. But it also might not be.

England's ranking has now dropped well below India's, to the point where the expected probability of England winning against India has dropped by almost 10%. They're still ahead due to home advantage, but the difference is decreasing.

There's about a 15% chance that a tie-breaker (total wins or net run rate) will be required. This may count out Sri Lanka, who have had two rain-affected matches, and so will probably be on fewer wins than anyone else with the same number of points.

We see a huge drop in the semi-final probability of England, and a resultant increase for Bangladesh, Pakistan and Sri Lanka. Australia have qualified now, and there are now fewer ways for New Zealand to be knocked out as well (only 35 out of 50000 trials saw New Zealand miss the semi-finals).

The decrease in England's probability, and the increase in the probability of lower ranked teams making the semi-finals, has meant that there are a lot more semi-final combinations with more than a 0.5% chance of happening. West Indies vs New Zealand was an epic match in the pool play, and that's now a reasonable possibility for a semi-final. The ICC and Star Sports will be licking their lips at the prospect of the 8th most likely outcome - an India vs Pakistan semi-final would be absolute ratings gold.

This is the first time that England has dipped below India on the winning probability graph, but it's hard to win the final if you don't get out of the group stage.

Monday 24 June 2019

World Cup Simulation update 24th June

Here's the update after the South Africa vs Pakistan match.

Firstly, this pushed Pakistan's ranking back above Bangladesh's ranking, although they are both so close that the match between them is now predicted as 50.2% to 49.8%.

Looking at the expected points, Pakistan have now jumped ahead of Sri Lanka and Bangladesh.

It's looking fairly likely that 5th place will be on 9 or 10 points, while 4th will be on 10, 11 or 12 points.

My simulation only uses net run rate as the tie-breaker. Accordingly, there's actually a slightly higher probability of Sri Lanka and Pakistan getting through than this shows, and a slightly lower chance of England and Bangladesh.
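The difference between the two tie-break approaches can be sketched with a sort key. The example assumes the order is points, then total wins, then net run rate (the tie-breakers mentioned in the 26 June post), and the two teams in it are made up.

```python
def rank_table(rows, use_wins=True):
    """Sort table rows best-first by points, then (optionally) wins, then net run rate."""
    def key(row):
        wins = row["wins"] if use_wins else 0
        return (row["points"], wins, row["nrr"])
    return sorted(rows, key=key, reverse=True)

# Two hypothetical teams level on points: A has more wins, B has the better net run rate.
table = [
    {"team": "A", "points": 11, "wins": 5, "nrr": -0.10},
    {"team": "B", "points": 11, "wins": 4, "nrr": 0.40},
]

with_wins = [row["team"] for row in rank_table(table)]
nrr_only = [row["team"] for row in rank_table(table, use_wins=False)]
print(with_wins, nrr_only)
```

With wins counted first, A finishes ahead; with net run rate alone, B does. A simulation that skips the wins step will get that particular kind of tie the wrong way round.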

It takes a lot of processor time to improve the simulation, and it's likely to be less than 1% difference, but I might have a go at improving it once we get to the last 5 matches.

England are still the overwhelming favourite to be the 4th team to go through. There were still 41 out of the 50000 trials where New Zealand hadn't made it. So nobody is guaranteed through just yet.

If you have semi-final tickets - this is who you're likely to see.

The probabilities for Bangladesh and Pakistan being so low here are understandable. They both have about a 5% chance of making the semi-final, but, given that they both have about a 1/3 chance of winning each match against the top teams, it gives them a roughly 0.5% chance of winning the tournament from here. However, if Bangladesh, Australia and Pakistan win the next 3 matches, that number will rise.
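The arithmetic behind that figure, written out as a tiny sketch. It treats the semi-final and the final as independent matches, each won with probability 1/3; the variable names are made up.

```python
p_reach_semi = 0.05    # chance of making the semi-finals
p_win_match = 1 / 3    # chance of beating a top team in any one match

# Win the tournament = reach the semi-final, then win the semi-final and the final.
p_win_tournament = p_reach_semi * p_win_match ** 2
print(f"{p_win_tournament:.2%}")
```

That comes to a little over half a percent, which is where the roughly 0.5% comes from.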

It's starting to look like England's style, which is so effective in series, may not be so effective in one-off matches. It will be interesting to see if that trend continues.

Sunday 23 June 2019

World Cup Simulation Update, 23 June

Here are the latest outputs from the simulation.

England's loss to Sri Lanka opened the door somewhat, but we can still be fairly confident in who the semi-finalists are.

England's ranking has gone down, after two losses to fairly ordinary sides.

It's looking like 10 points will be the magic number. There is roughly a 10% chance that we'll rely on a tie-breaker.

The expected points certainly favour England to finish in fourth.

Accordingly, they have a much higher chance of making it through.

These are the likely match-ups. (Teams are in alphabetical order, rather than by placing.)

England are still firm favourites by my model. Home advantage is massive.

Monday 17 June 2019

World Cup simulation update

The group stage of the World Cup is now roughly half way through, and there are 4 clear favourites to be the semi-finalists.

Afghanistan is the first team to be eliminated (they may have a mathematical possibility, but they don't have a statistical one). At this point, Sri Lanka are not far behind.

The rankings of the teams have remained fairly consistent, suggesting that the extra weighting for world cup matches is about right.

The fact that almost all the teams seem to have gone up is due to them all being relative to Afghanistan. Afghanistan do not seem to be quite as good as they had seemed, and so they have dropped, but as they are set to 0, it's pushed everyone else up slightly.
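A quick sketch of why that happens. The team names other than Afghanistan and all of the numbers are made up; the point is only what subtracting a reference team's rating does to everyone else.

```python
def relative_to(ratings, reference="Afghanistan"):
    """Shift every rating so that the reference team sits on exactly 0."""
    base = ratings[reference]
    return {team: rating - base for team, rating in ratings.items()}

# Hypothetical underlying ratings before and after the reference team has a bad week.
before = relative_to({"Afghanistan": 0.5, "Team X": 1.5, "Team Y": 1.0})
after = relative_to({"Afghanistan": 0.25, "Team X": 1.5, "Team Y": 1.0})
print(before, after)
```

Team X and Team Y have not changed at all underneath, but both appear to rise by 0.25 because the zero point moved down.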

The semi-final probability is the most interesting. 

I personally feel that this is underestimating the chances of South Africa, but we will see as the tournament progresses.

The key point on this graph is match 5, where Bangladesh overcame South Africa. If South Africa had won that match, they would be on about 40% and New Zealand and Australia would both be a lot lower.

The simulation also puts out the points for 4th, 5th and the difference between them. This suggests at the moment that there's only a fairly low chance that net run rate will come into play. However, one more rained out match, or a Bangladesh upset of Australia, and this could change dramatically. This makes the expected lines to be 9 points for 5th place, and 11 points for 4th place.
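The idea of tracking the gap between 4th and 5th can be sketched in a few lines. This is not the simulation used here: the six-team table, the fixtures, their win probabilities, the seed and the function name are all invented, and ties and washouts are ignored.

```python
import random
from collections import Counter

def simulate_gap(points, fixtures, trials=10_000, seed=7):
    """points: {team: current points}; fixtures: [(team_a, team_b, P(team_a wins))].
    Returns a Counter of (4th place points - 5th place points) across all trials."""
    rng = random.Random(seed)
    gaps = Counter()
    for _ in range(trials):
        table = dict(points)
        for team_a, team_b, p_a in fixtures:
            winner = team_a if rng.random() < p_a else team_b
            table[winner] += 2   # 2 points per win
        ranked = sorted(table.values(), reverse=True)
        gaps[ranked[3] - ranked[4]] += 1
    return gaps

# Made-up table and remaining fixtures, purely for illustration.
points = {"A": 10, "B": 9, "C": 8, "D": 7, "E": 6, "F": 5}
fixtures = [("D", "E", 0.6), ("E", "F", 0.5), ("C", "F", 0.7), ("D", "A", 0.3)]

gaps = simulate_gap(points, fixtures)
tie_break_share = gaps[0] / sum(gaps.values())  # share of trials with 4th and 5th level on points
print(sorted(gaps.items()), round(tie_break_share, 3))
```

A gap of 0 is the case where a tie-breaker decides the last semi-final place, so the share of trials with a gap of 0 is the number to watch.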

So far, of the teams that I've had as favourite to win, 14 out of 17 have won. Given the probabilities that the model assigned them, that's slightly higher than I would have expected - I would have expected there to have been 4 upsets rather than 3 - but it's still telling me that my model is working quite well. The difference may be due to teams not always playing their best combinations in every match between World Cups, which adds more uncertainty to those results than exists inside a World Cup.
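The expected number of upsets is just the sum of each favourite's chance of losing. The sketch below uses a made-up stand-in for the real list: 17 favourites all given a 76.5% chance, chosen only so that the expected upsets come out close to 4.

```python
import math

def upset_summary(favourite_probs):
    """Expected number of upsets and its standard deviation, treating matches as independent."""
    expected = sum(1.0 - p for p in favourite_probs)
    variance = sum(p * (1.0 - p) for p in favourite_probs)
    return expected, math.sqrt(variance)

# Hypothetical probabilities, not the ones the model actually assigned.
expected, spread = upset_summary([0.765] * 17)
print(round(expected, 2), round(spread, 2))
```

With a spread of about 1.7 upsets either way, seeing 3 instead of 4 is well within what chance alone would produce.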

It will be interesting to see if it continues to have the same success rate after the cup is finished.

Finally, applying the same system to find the probable winner gets the following results:
England are still favourites, but India are not far behind them.

Sunday 16 June 2019

India vs Pakistan statistical preview.

Here are a couple of little charts for today's match-up.

This suggests that 250 would be a quite defendable total. The par score here is much, much lower than on most grounds in England.

The ground is actually fairly well balanced between both pace and bat and spin and bat - but still favouring both types of bowler slightly.

Old Trafford is the black spot, the grey points are other grounds around the world. Spin and pace friendliness are calculated based on the success of different types of bowlers on those grounds, taking into account runs conceded, balls bowled and wickets taken.

Adding to the ground data all the matches where India has batted first and all the matches where Pakistan has batted second gives this graph:

This suggests that, taking into account the teams, a more normal curve applies. If India score under 200 they're unlikely to win, 250 is the 50/50 point and 300 is more like a 75% chance of defending.
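For a rough feel of the shape of such a curve, here is a sketch of a logistic curve pinned to the two figures quoted above: 50% at 250 and 75% at 300. The logistic form is an assumption made for this example and is not the model behind the graph; the function and parameter names are made up.

```python
import math

def defend_probability(score, even_score=250.0, ref_score=300.0, ref_prob=0.75):
    """Logistic curve through (even_score, 0.5) and (ref_score, ref_prob). Illustrative only."""
    slope = math.log(ref_prob / (1.0 - ref_prob)) / (ref_score - even_score)
    return 1.0 / (1.0 + math.exp(-slope * (score - even_score)))

for score in (200, 250, 300):
    print(score, round(defend_probability(score), 2))
```

Because a logistic curve is symmetric about its midpoint, pinning it at those two points puts a total of 200 at about a 25% chance of being defended.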

Thursday 6 June 2019

World cup simulation update

Just before the World Cup started I wrote a post about a simulation that I had written to find each team's chances of making the semi-finals and chances of winning.

I've spent quite a bit of time improving it over the past week or so, learning some new machine learning techniques to improve my rankings etc.

Below are 3 graphs that show the change in rankings, semi-final probability and win likelihood.

These rankings are all relative to Afghanistan (who are first in the alphabet). Afghanistan will always be on 0. Every other team will change around them. Any team ranked lower than them will get a negative rating.

The new model gave New Zealand and South Africa lower chances of making it, and Bangladesh and Australia higher chances of making it. South Africa have dropped lower still, while West Indies and Bangladesh have made up ground.

New Zealand has moved ahead of South Africa to become the 4th most likely to win, but both teams are still at fairly long odds.