this article about Sachin Tendulkar having trouble after having a break from cricket. I get sent quite a few Tendulkar stats articles and normally I see them as a bit of a waste of time. There are a lot of articles, and most of them are poorly written and make dubious statistical claims. There are some exceptions (Nicholas Rohde's controversial piece springs to mind), but most are either Indian fans wanting to say that their man is the greatest ever, or anti-Indian fans trying to get a reaction.
Jigar's piece, however, read well and seemed well researched. Still, my statistical antennae went up as I read through his research, and I wondered how significant his findings really were.
Here is a brief summary, for people who haven't read the article: Tendulkar has scored a large number of low scores immediately after returning from an injury or other extended time away from the game.
Jigar's theory was that SRT needed some time to get back to top form, and the troubles against New Zealand were simply because he hadn't been playing enough.
But I wanted to know if the numbers that he had provided were significant. The more I looked at them, the less certain I was. I needed a statistical tool to assess the significance.
In a non-cricketing environment, the first port of call would be a confidence interval: find a 95% confidence interval for the scores where Tendulkar was coming off a break and, if his actual mean lay outside that, we could say there was a significant difference. The only problem is that means and standard deviations don't really work with cricket. Not outs make things complicated. I've tried to develop a cricketing equivalent to standard deviation, but have yet to be successful.
My next port of call statistically would be an informal confidence interval based on the median and inter-quartile range. The problem with this is that, again, not outs become very tricky to deal with fairly. If a batsman comes in with 8 runs to win and scores 2*, how do you count that?
So I had to use a technique that I had heard about, but hadn't actually used myself previously: randomisation.
The idea with this technique is to see if a particular sample fits within the normal range for something. I attended a lecture where it was explained how it was used to test the distribution of berries in a forest and various other conservation related topics. The way that I approached it was this:
Jigar had picked out 47 innings that he described as being shortly after a break. So I looked at those 47 innings and found their batting average. Then I took a random sample of 47 out of the 773 international innings that Tendulkar had played (he has played more since I started working on this) and found the batting average of the sample. I took 1002 samples (I intended to take 1000, but forgot to clear my first two test samples before running the program) and looked at how they were distributed.
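For anyone wanting to try something similar, here is a rough sketch of the sampling described above. It is not the program I actually ran: the innings list is synthetic placeholder data (random scores with roughly one not out in ten), the function names are made up for illustration, and I have assumed sampling without replacement. The batting average is taken as runs divided by dismissals, so not-out runs count but not-out innings do not.

```python
import random

def batting_average(innings):
    """Total runs divided by number of dismissals.

    Each innings is a (runs, dismissed) pair.
    """
    runs = sum(r for r, _ in innings)
    dismissals = sum(1 for _, dismissed in innings if dismissed)
    if dismissals == 0:
        raise ValueError("average is undefined with no dismissals")
    return runs / dismissals

def resampled_averages(all_innings, sample_size, n_samples, rng):
    """Batting averages of random samples drawn without replacement."""
    return [
        batting_average(rng.sample(all_innings, sample_size))
        for _ in range(n_samples)
    ]

def fraction_at_or_below(averages, observed):
    """Share of resampled averages no higher than the observed one."""
    return sum(1 for a in averages if a <= observed) / len(averages)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic placeholder for the 773 career innings (NOT real data).
career = [
    (int(rng.expovariate(1 / 40)), rng.random() > 0.1)
    for _ in range(773)
]

averages = resampled_averages(career, sample_size=47, n_samples=1000, rng=rng)
share = fraction_at_or_below(averages, observed=41.8)
print(f"{share:.1%} of random 47-innings samples averaged 41.8 or less")
```

With the real innings list in place of the placeholder, the printed share says how often a random set of 47 innings does at least as badly as the post-break set.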
The 47 innings that came after a break had an average of 41.8.
Here is a graph of my results (I grouped them by the nearest 0.25 of a run).
We can see that the idea of a set of 47 innings having an average of 41.8 isn't particularly unusual. There are a number of samples that were in that region.
We can also see that the distribution looks normal. I calculated the mean and standard deviation of the averages and looked to see how it fitted. Here's the graph of the distribution:
Given that it is so well modeled by a normal distribution, we can use the mean and standard deviation to decide if we can make a call.
The mean of the averages is 49.41 (this is slightly higher than Tendulkar's career average across all formats of 48.68, but close enough to use). The standard deviation is 8.24.
If we were to make the call that Tendulkar is badly affected by a break, we would expect his average in innings after a break to be at least two standard deviations lower than the mean. This would require the average to be lower than about 32.9 (two standard deviations below 49.41; measured from the career average of 48.68, the cutoff would be 32.2). Instead, his average in the 47 innings that Jigar mentioned was 41.8. It is lower, but we can't really claim that it is significantly lower.
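As a quick check on that conclusion, here is a small sketch that works out how many standard deviations below the mean 41.8 sits, and the matching one-sided tail probability. It relies on the normal approximation discussed above; the variable names are just for illustration.

```python
import math

mean_of_averages = 49.41  # mean of the resampled averages (from the article)
sd_of_averages = 8.24     # their standard deviation (from the article)
observed = 41.8           # average of the 47 post-break innings

# Distance from the mean in standard deviations.
z = (observed - mean_of_averages) / sd_of_averages

# One-sided normal tail: chance of an average this low or lower.
tail = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"z = {z:.2f}, P(average <= {observed}) is about {tail:.2f}")
```

The post-break average comes out a little under one standard deviation below the mean, which under the normal approximation would happen by chance in roughly 18% of random samples.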