Central Limit Theorem

The core concept of the Central Limit Theorem is that a large, properly drawn sample will resemble the population from which it is drawn. This is a powerful concept in statistics. The book Naked Statistics puts it this way:

At times, statistics seems like magic. We are able to draw sweeping and powerful conclusions from relatively little data. Somehow we can gain meaningful insight into a presidential election by calling a mere one thousand American voters. We can test a hundred chicken breasts for salmonella at a poultry processing plant and conclude from that sample alone that the entire plant is safe or unsafe. Where does this extraordinary power to generalize come from? Much of it comes from the central limit theorem.

The central limit theorem enables us to:

  1. Generalize conclusions about an entire population based on the results from a sample.
  2. Rely on a sample size of 30 or more being enough for the theorem to take effect.
  3. Treat the distribution of sample means as approximately normal even when the underlying population is not normally distributed.

What does it mean to generalize conclusions to the entire population?

Suppose you are assigned the task of finding the mean weight of all the residents in your city. If you try to weigh every resident and then calculate the mean, you are never going to complete the task.

Instead, you can compute the mean weight of 70 randomly selected residents (any number >= 30 will do) and repeat this with fresh random samples of 70 residents. Each of these averages is called a sample mean. The mean of the sample means will be very close to the population (entire city) mean, as the sketch below illustrates. Thus, with little effort, we can generalize the result to the entire population.
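Here is a minimal simulation sketch of this idea, assuming a hypothetical, skewed population of body weights (the distribution and the numbers are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical city: one million residents with right-skewed weights (kg), mean ~72.
population = rng.gamma(shape=9.0, scale=8.0, size=1_000_000)

# Take 1,000 random samples of 70 residents each and record each sample mean.
sample_means = rng.choice(population, size=(1_000, 70)).mean(axis=1)

print("Population mean     :", round(population.mean(), 2))
print("Mean of sample means:", round(sample_means.mean(), 2))  # very close to the above
```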

Power of Normal Distribution

The sample means are distributed roughly as a normal curve. This enables us to apply the normal distribution rule (verified in a short sketch after the figure below):

  1. 68% of the data fall within 1 standard deviation of the mean.
  2. 95% of the data fall within 2 standard deviations of the mean.
  3. 99.7% of the data fall within 3 standard deviations of the mean.

[Figure: the normal curve with the 68-95-99.7 rule]
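A quick check of the rule using the standard normal CDF (this assumes SciPy is available):

```python
from scipy.stats import norm

# Area of the standard normal curve within k standard deviations of the mean.
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {coverage:.1%}")  # 68.3%, 95.4%, 99.7%
```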

Understanding the theorem by using an online tool

I used the tool from Online Stat Book to understand this concept.

Given below is a normal distribution for a population. The population mean is 16.

[Figure: parent population with a normal distribution, mean = 16]

Using the above data, random samples of 25 items are selected and their means are computed. This is repeated 10,000 times, and the sample means and their frequencies are plotted in a graph. You can clearly see that the sample means are distributed normally. The mean of the sample means is also 16.

[Figure: distribution of 10,000 sample means (n = 25) drawn from the normal population]

Given below is a skewed distribution for another population. The population mean is 8.08.

[Figure: parent population with a skewed distribution, mean = 8.08]

Using the above data, random samples of 25 items are selected and their means are computed. This is repeated 10,000 times, and the sample means and their frequencies are plotted in a graph. The sample means are distributed normally even though the population distribution is not. This is the power of the central limit theorem. The mean of the sample means is also 8.08. I strongly recommend playing with the tool to see how it works in action; a short simulation after the figure below reproduces the same behaviour.

[Figure: distribution of 10,000 sample means (n = 25) drawn from the skewed population]
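Here is a rough recreation of the demo, assuming an exponential parent population as a stand-in for the tool's skewed distribution (the exact shape differs, but the behaviour of the sample means is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Right-skewed parent population with mean ~8.08 (exponential, chosen for illustration).
population = rng.exponential(scale=8.08, size=1_000_000)

# 10,000 random samples of 25 items each; record each sample mean.
sample_means = rng.choice(population, size=(10_000, 25)).mean(axis=1)

print("Population mean         :", round(population.mean(), 2))    # ~8.08
print("Mean of sample means    :", round(sample_means.mean(), 2))  # ~8.08
print("Std dev of sample means :", round(sample_means.std(), 2))   # ~population sd / sqrt(25)
```

The standard deviation of the sample means printed last is what the problem below calls the standard error.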

Unravel the concept with a problem

Here is a problem on mean family income in the United States. [Source]

According to a 1995 study, the mean family income in the US was $38,000 with a standard deviation of $21,000. If a consulting agency surveys 49 families at random, what is the probability that it finds a mean family income of more than $41,500?

From the central limit theorem we know that the mean of the sample means will be the same as the population mean.

Hence the average of the sample means will also be $38,000.
Let n be the sample size. In this case it is 49 families.

Standard Error = Standard Deviation / sqrt(n)
               = 21,000 / sqrt(49)
               = 21,000 / 7
               = 3,000

$41,500 is $3,500 more than the expected sample mean of $38,000.

Z Score = $3,500 / $3,000
         = 1.17

Using the table of normal curves:

A Z score of 1.17 corresponds to 37.9% of the area between the mean and that point.
Hence this value lies at 50% (the lower half) + 37.9% = 87.9% of the distribution.

Hence the probability of finding a mean income of more than $41,500 is

Probability of mean income more than $41,500 = 1 - 0.879
                                             = 0.121

It is 12.1%
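The same calculation as a small sketch, again assuming SciPy is available:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n, x = 38_000, 21_000, 49, 41_500

standard_error = sigma / sqrt(n)       # 21,000 / 7 = 3,000
z = (x - mu) / standard_error          # 3,500 / 3,000 ~= 1.17

# Probability that a random sample of 49 families has a mean income above $41,500.
p = norm.sf(z)                         # survival function, i.e. 1 - CDF
print(f"z = {z:.2f}, probability = {p:.3f}")   # ~0.12, i.e. about 12%
```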

Use this tool to understand the table of normal curves.

[Figure: Z score lookup on the table of normal curves]

While solving the problem we came across two new concepts:

  1. Standard Error
  2. Z Score

Standard Error is the standard deviation of the sample means.

Standard Error = Standard Deviation / sqrt (sample size)

Here is the explanation from the book Naked Statistics:

Don’t let the appearance of letters mess up the basic intuition. The standard error will be large when the standard deviation of the underlying distribution is large. A large sample drawn from a highly dispersed population is also likely to be highly dispersed; a large sample from a population clustered tightly around the mean is also likely to be clustered tightly around the mean.

Z Score is defined as a standardized score that indicates how many standard deviations a data point is from the mean:

Z Score = (X - Mean) / Standard Deviation

X is the raw data point. In our example X is $41,500. If the Z score is 0, then the value is the same as the mean.
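A compact sketch tying the two definitions together (the helper names here are hypothetical, used only for illustration):

```python
from math import sqrt

def standard_error(sd: float, n: int) -> float:
    """Standard deviation of the sample means."""
    return sd / sqrt(n)

def z_score(x: float, mean: float, sd: float) -> float:
    """How many standard deviations x lies from the mean."""
    return (x - mean) / sd

se = standard_error(21_000, 49)      # 3,000
print(z_score(41_500, 38_000, se))   # ~1.17
```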

Inference

You are the principal of an elementary school. There are 500 fifth-grade students in the school. The mean mathematics mark of the 500 students is 70 out of 100, and the standard deviation is 10. One day a local authority comes to the school, randomly picks 50 students, and gives them a mathematics test. The average score comes out to 82 out of 100. The local authority is not convinced by the results and suspects that the students copied. You are assigned the task of investigating this issue. How would you proceed?

There are several possible reasons why the mean score is higher:

  1. The students did the work genuinely and did not copy.
  2. The question paper could have been super easy.
  3. The students copied in the exam.
  4. The teachers helped the students to answer the questions.
  5. The 50 randomly selected students happened to be the best students (the selection was not truly random).

Statistics alone cannot prove anything. Instead, it is used to accept or reject explanations on the basis of their relative likelihood. Your job as the principal is to form a hypothesis. A hypothesis is defined as:

A supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.

You go ahead and form two hypotheses:

  1. Null hypothesis
  2. Alternative hypothesis

The null hypothesis in this case will be:

Students are innocent and genuinely scored high marks (Option 1).

The alternative hypothesis in this case will be:

Students did not genuinely score high marks (Options 2 to 5).

Note that the null hypothesis and the alternative hypothesis are logical complements: if one is true, then the other is not. If we reject the null hypothesis, we should accept the alternative hypothesis. Let us solve the problem using the central limit theorem.

Let n be the sample size. In this case it is 50.

Standard Error = Standard Deviation / sqrt(n)
               = 10 / sqrt(50)
               = 1.41

According to the central limit theorem, the mean of the sample should be the same as the population mean.
In this case the sample mean is 12 more than the population mean (82 - 70 = 12).

Z Score = (X - Mean) / Standard Error
        = 12 / 1.41
        = 8.51

At a Z score of 3 you are already covering about 99.9% of the distribution. Take a look at the graph below. For a Z score of 8.51 the probability of seeing this result by chance is practically zero; the sketch after the figure confirms it.

[Figure: normal curve showing the area up to a Z score of 3]
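The same hypothesis check as a small sketch (assuming SciPy is available; the 0.05 significance level is simply the common convention discussed below):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n, observed = 70, 10, 50, 82

standard_error = sigma / sqrt(n)          # 10 / sqrt(50) ~= 1.41
z = (observed - mu) / standard_error      # 12 / 1.41 ~= 8.5

# Probability of a sample mean of 82 or more if the null hypothesis were true.
p_value = norm.sf(z)
print(f"z = {z:.2f}, p-value = {p_value:.1e}")   # effectively zero

alpha = 0.05                              # significance level chosen before the test
print("Reject the null hypothesis:", p_value < alpha)
```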

From this you can reject the null hypothesis, accept the alternative hypothesis, and go on to find out what actually happened in the exam. From the book Naked Statistics:

One of the most common thresholds that researchers use for rejecting a null hypothesis is 5 percent, which is often written in decimal form: .05. This probability is known as significance level, and it represents the upper bound for the likelihood of observing some pattern of data if the null hypothesis were true.

In general, you need to decide on the significance level before conducting the test. Why? Otherwise you will be biased by the outcomes. Significance levels of 1% and 5% are commonly used to reject the null hypothesis. From the book Naked Statistics:

Obviously rejecting the null hypothesis at the .01 level (meaning that there is less than a 1 in 100 chance of observing a result in this range if the null hypothesis were true) carries more statistical heft than rejecting the null hypothesis at the .1 level (meaning that there is less than a 1 in 10 chance of observing a result in this range if the null hypothesis were true)

Type 1 error

A Type 1 error occurs when we reject a null hypothesis that is actually true. If the burden of proof for rejecting the null hypothesis is too low, for example 0.1 (1 in 10), we are going to find ourselves periodically rejecting null hypotheses that are true. This is also called a false positive. I always used to get confused by the term false positive. Here is an explanation from the book Naked Statistics:

When you go to the doctor to get tested for some disease, the null hypothesis is that you do not have that disease. If the lab results can be used to reject the null hypothesis, then you are said to test positive. And if you test positive but are not really sick, then it’s a false positive.

Type 2 error

A Type 2 error occurs when we accept a null hypothesis that is actually false. If the burden of proof for rejecting the null hypothesis is very high, for example 0.001 (1 in 1,000), we are going to find ourselves periodically accepting null hypotheses even though they are false. This is also called a false negative.

Which error is worse?

It depends on the situation. For example, suppose a patient is being tested for cancer. The null hypothesis is that the patient does not have cancer. A Type 1 error is the lesser evil in this case. Why?

Doctors and patients are willing to tolerate a fair number of Type 1 errors (false positives) in order to avoid the possibility of a Type 2 error (missing a cancer diagnosis).

Remember the saying

A stitch in time saves nine

Read the news about the cheating scandal in standardized tests. From the book Naked Statistics:

Some classrooms had answer sheets on which the number of wrong-to-right erasures were twenty to fifty standard deviations above the state norm. (To put this in perspective, remember that most observations in a distribution typically fall within two standard deviations of the mean.) So how likely was it that Atlanta students happened to erase massive numbers of wrong answers and replace them with correct answers just a matter of chance? The official who analyzed the data described the probability of the Atlanta pattern occurring without cheating as roughly equal to the chance of having 70,000 people show up for a football game at the Georgia Dome who all happen to be over seven feet tall. Could it happen? Yes. Is it likely? Not so much.