Independence Day, 4 July 1977 is a date I remember well. Besides being one of the hottest days in England for many years, it was the day of my doctoral thesis examination in Oxford. Independence, albeit of a slightly different sort, turned out to be of some importance because the first question the examiners asked me wasn’t about cosmology, the subject of the thesis, at all. It was about statistics. One of the examiners had found 32 typographical errors in the thesis (these were the days before word-processors and schpel-chequers). The other had found 23. The question was: how many more might there be which neither of them had found? After a bit of checking pieces of paper, it turned out that 16 of the mistakes had been found by both of the examiners.
Surprisingly, this information is enough to give an answer, provided you assume that the two examiners work independently of each other, so that the chance of one finding a mistake is not affected by whether or not the other examiner finds it.
Let’s suppose the two examiners found A and B errors respectively and that they found C of them in common. Now assume that the first examiner has a probability a of detecting a mistake while the other has a probability b of detecting a mistake.
If the total number of typographical errors in the thesis was T, then A = aT and B = bT. But if the two examiners are proofreading independently then we also know the key fact that C = abT. So AB = abT² = CT and so the total number of mistakes is T = AB/C, irrespective of the values of a and b. Since the total number of mistakes that the examiners found (noting that we mustn’t double-count the C mistakes that they both found) was A + B – C, this means that the total number that they didn’t spot is just T – (A + B – C) and this is (A – C)(B – C)/C. In other words, it’s the product of the number that each found that the other didn’t divided by the number they both found. This makes good sense. If both found lots of errors but none in common then they are not very good proofreaders and there are likely to be many more that neither of them found.
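As a rough illustration, the two formulas above, T = AB/C and (A – C)(B – C)/C, can be sketched in a few lines of Python. The function name estimate_unfound and its argument names are made up for this example, and the estimate only makes sense under the independence assumption and when at least one error was found by both readers. The counts in the usage line are the 32, 23 and 16 reported by the two examiners.

```python
def estimate_unfound(found_a, found_b, found_both):
    """Estimate total and unfound errors from two independent proofreaders.

    found_a, found_b: number of errors each reader found (A and B).
    found_both: number of errors found by both readers (C).
    Hypothetical helper for illustration; assumes the readers work independently.
    """
    if found_both <= 0:
        # With C = 0 the formula T = AB/C is undefined.
        raise ValueError("at least one error must be found by both readers")
    if found_both > min(found_a, found_b):
        raise ValueError("the overlap cannot exceed either reader's count")

    total = found_a * found_b / found_both  # T = AB/C
    unfound = (found_a - found_both) * (found_b - found_both) / found_both
    return total, unfound


# Counts from the thesis examination: A = 32, B = 23, C = 16.
total, unfound = estimate_unfound(32, 23, 16)
print(total, unfound)  # 46.0 7.0
```

The two results agree with each other: the examiners found 32 + 23 – 16 = 39 distinct errors between them, and 46 – 39 leaves 7 unfound.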
In my thesis we had A = 32, B = 23, and C = 16, so the number of unfound errors was expected to be (16 × 7)/16 = 7.
This type of argument can be used in many situations. Suppose different oil prospectors search independently for oil pockets: how many might lie unfound? Or suppose ecologists want to know how many animal or bird species there might be in a region of forest after several observers have carried out a 24-hour census. A similar type of problem arose in literary analysis.
In 1976 two Stanford statisticians used the same approach to estimate the size of William Shakespeare’s vocabulary by investigating the number of different words used in his works, taking into account multiple usages. Shakespeare wrote about 900,000 words in total. Of these, he used 31,534 different words, of which 14,376 appear only once, 4,343 appear only twice and 2,292 appear only three times. They estimated that Shakespeare knew at least 35,000 words that are not used in his works: he probably had a total vocabulary of about 66,500 words. Surprisingly, you know about the same number.