Finding the middle of a set of data is the job of measures of central tendency. They are descriptive statistics, which summarize huge sets of data in a single number.
Sachin Tendulkar played 463 ODI cricket matches. If I told you the runs he scored in every match, could you make sense of it? It is very hard. Instead, if I tell you his average is 44.83, it is easy to understand. The average, or mean, is one type of central tendency. Given below are the Math marks for 7 students. The average mark is 66.43.
Student | Mark | Alice | Mary | Thomas | Peter | Mike | George |
Marks | 55 | 66 | 52 | 89 | 90 | 35 | 78 |
Average or Mean = Sum of all the marks/Total No of Students = (55 + 66 + 52 + 89 + 90 + 35 + 78)/7 = 465/7 = 66.43
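The same calculation, sketched in Python:

```python
# Mean of the 7 students' Math marks from the table above
marks = [55, 66, 52, 89, 90, 35, 78]
mean = sum(marks) / len(marks)
print(round(mean, 2))  # 66.43
```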
The mean is prone to distortion by outliers. Excerpt from the book ‘Naked Statistics‘:
Imagine that ten guys are sitting on bar stools in a middle-class drinking establishment in Seattle; each of these guys earns $35,000 a year, which makes the mean annual income for the group $35,000. Bill Gates walks into the bar with a talking parrot perched on his shoulder. (The parrot has nothing to do with the example, but it kind of spices things up.) Let’s assume for the sake of the example that Bill Gates has an annual income of $1 billion. When Bill Gates sits down on the eleventh bar stool, the mean annual income for the bar patrons rises to about $91 million.
Clearly the mean of $91 million does not make any sense; it got distorted by the outlier of $1 billion. The median, another type of central tendency, can rescue us from outlier distortions. To calculate the median you need to
- Arrange the items in the ascending order
- Pick the middle number; for an even number of items there will be two middle numbers, and you need to take their average.
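The two steps above can be sketched as a small Python function (applied here to the bar example):

```python
# Median: sort the items, then take the middle one
# (or the average of the two middle values when the count is even).
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

incomes = [35_000] * 10 + [1_000_000_000]  # ten patrons plus Bill Gates
print(median(incomes))  # 35000
```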
Person 1 | Person 2 | … | Person 6 | … | Person 10 | Bill Gates |
$35,000 | $35,000 | … | $35,000 | … | $35,000 | $1 billion |
There are 11 persons. The middle person is the 6th one, and his income is $35,000. Using the median we have avoided the distortion caused by Bill Gates entering the bar. Now you should understand why the median home price is used to compare real estate prices: it is less biased than the mean price, since it is not as heavily influenced by a small number of very highly priced homes. Take a look at the following excerpt from the book ‘Naked Statistics‘:
Suppose I collected data on the weights of 250 people on an airplane headed for Boston, and I also collected the weights of a sample of 250 qualifiers for the Boston Marathon. Now assume that the mean weight for both the groups is roughly the same, say 155 pounds. Anyone who has been squeezed into a row on a crowded flight, fighting for the armrest, knows that many people on a typical commercial flight weigh more than 155 pounds. But you may recall from those same unpleasant, overcrowded flights that there were lots of crying babies and poorly behaved children, all of whom have enormous lung capacity but not much mass. When it comes to calculating the average weight on the flight, the heft of the 320-pound football players on either side of your middle seat is likely offset by the tiny screaming infant across the row and the six-year-old kicking the back of your seat from the row behind.
To understand this I created 2 datasets with 10 members each. All weights are in pounds. Both the mean and the median are roughly the same in the two cases. Is there a way to differentiate the weights of the marathon runners from those of the airplane passengers?
Person | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean | Median |
Runners | 155.00 | 160.00 | 154.00 | 150.00 | 159.00 | 158.00 | 157.00 | 153.00 | 161.00 | 151.00 | 155.80 | 156.00 |
Passengers | 24.00 | 20.00 | 120.00 | 140.00 | 159.00 | 158.00 | 170.00 | 175.00 | 280.00 | 300.00 | 154.60 | 158.50 |
Standard deviation (SD) measures how dispersed the data is from the mean. For the marathon runners the standard deviation is 3.6, and for the airplane passengers it is 86.10. A very large standard deviation indicates a wide range of values. Since 86.10 is very big compared to 3.6, the airplane passengers have a wide range of weights, from 20 pounds to 300 pounds.
How to calculate Standard Deviation
- Calculate the Mean
- Subtract the mean from every data item
- Square the values from step 2. If you are curious, find out why we need to square.
- Sum the squares
- Average the squares – this is called the variance
- Take the square root – This is the Standard Deviation. Since we squared at step 3 we are taking a square root at this step.
Given below is the SD calculation for airplane passengers
# | Weight (A) | Mean (B) | A – B | Square(A – B) |
1 | 24.00 | 154.60 | -130.60 | 17056.36 |
2 | 20.00 | 154.60 | -134.60 | 18117.16 |
3 | 120.00 | 154.60 | -34.60 | 1197.16 |
4 | 140.00 | 154.60 | -14.60 | 213.16 |
5 | 159.00 | 154.60 | 4.40 | 19.36 |
6 | 158.00 | 154.60 | 3.40 | 11.56 |
7 | 170.00 | 154.60 | 15.40 | 237.16 |
8 | 175.00 | 154.60 | 20.40 | 416.16 |
9 | 280.00 | 154.60 | 125.40 | 15725.16 |
10 | 300.00 | 154.60 | 145.40 | 21141.16 |
Variance | | | | 7413.44 |
SD | | | | 86.10 |
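The six steps above, applied to the passenger weights in Python:

```python
import math

# Population standard deviation of the passenger weights, step by step
weights = [24, 20, 120, 140, 159, 158, 170, 175, 280, 300]
mean = sum(weights) / len(weights)                  # step 1: mean
squared_diffs = [(w - mean) ** 2 for w in weights]  # steps 2-3: subtract, square
variance = sum(squared_diffs) / len(weights)        # steps 4-5: sum, average
sd = math.sqrt(variance)                            # step 6: square root
print(round(variance, 2), round(sd, 2))  # 7413.44 86.1
# One-SD band around the mean (68.27% of values for normal data):
print(round(mean - sd, 1), round(mean + sd, 1))  # 68.5 240.7
```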
About 68.27% of the values lie within 1 SD of the mean. For the airplane passengers the SD is 86.10 and the mean is 154.60, so 68.27% of the values would lie between 68.5 (154.60 – 86.10) and 240.7 (154.60 + 86.10). Similarly, about 95.45% of the values lie within 2 SD of the mean, and nearly all (99.73%) lie within 3 SD of the mean. This rule is only applicable if the distribution of the data is normal.
The math behind the calculations is not very important. If you understand that SD measures dispersion from the mean, that is more than enough.
I am a Software Engineer and I monitor the health of production servers at work. One metric I always look at is the average latency. Latency is defined as the total time taken to service a single call. My colleague pointed out that it is better to look at the 99th percentile instead of the average. What is a percentile? Divide the distribution of data into hundredths, or percentiles; each percentile represents 1 percent of the distribution. To calculate the 99th percentile:
- Arrange the numbers in ascending order
- Compute (99/100) * N, where 99 is the percentile being calculated and N is the total number of items.
- Round up.
- Pick the number at the position calculated in step 3. All the numbers to the left of it (99%) are less than it.
Given below are the latencies in milliseconds (ms) for 10 calls. The average is 5.7 ms, which suggests that things are fine. But if you look at the 99th percentile, then (99/100) * 10 = 9.9. Rounding it up we get 10. Looking at the 10th value we see a latency of 16 ms. This indicates that the top 1% of the calls are taking a long time. It helped me probe further into why they were slow. Clearly the average missed it and the percentile helped me catch it. My colleague was correct. Did you know that the median is also called the 50th percentile? Why? Remember that for the median we divided the number of items in half: (50/100) * N = (1/2) * N. Both are the same!
Latency(ms) | 1 | 1 | 1 | 2 | 2 | 2 | 10 | 11 | 11 | 16 |
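The four percentile steps above (the nearest-rank method) can be sketched in Python and checked against the latency data:

```python
import math

# Nearest-rank percentile, following the four steps above
def percentile(values, p):
    ordered = sorted(values)                  # step 1: ascending order
    rank = math.ceil(p / 100 * len(ordered))  # steps 2-3: (p/100) * N, rounded up
    return ordered[rank - 1]                  # step 4: pick that (1-based) position

latencies = [1, 1, 1, 2, 2, 2, 10, 11, 11, 16]
print(percentile(latencies, 99))        # 16 (the slow tail the average hides)
print(percentile(latencies, 50))        # 2  (the median)
print(sum(latencies) / len(latencies))  # 5.7
```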
I have one more central tendency to discuss: the mode. The mode is the value that is repeated more often than any other. Given below are the sales data for T-shirt sizes purchased. The XL size is purchased most often (3 times), so it is the mode. How is this useful in real life? A retailer may want to know the mode of the T-shirt sizes purchased. Why? It helps him determine the stocking levels.
L | S | S | XL | XL | XL | XXL | M | M | L |
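Counting which size is purchased most often is a one-liner in Python:

```python
from collections import Counter

# Mode: the most frequently purchased T-shirt size
sizes = ["L", "S", "S", "XL", "XL", "XL", "XXL", "M", "M", "L"]
mode, frequency = Counter(sizes).most_common(1)[0]
print(mode, frequency)  # XL 3
```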
Which one to use?
Depending on the situation you need to use your judgment. Sometimes the mean alone is sufficient. In certain cases you need all of them to make sense of the data. If you are not careful, you can easily be fooled. Excerpt from the book ‘Naked Statistics‘:
The George W. Bush tax cuts were touted by the Bush administration as something good for most American families. While pushing the plan, the administration pointed out that 92 million Americans would receive an average tax reduction of over $1,000 ($1,083 to be precise). But was that summary of the tax cut accurate? According to the New York Times, “The data don’t lie, but some of them are mum.” Would 92 million Americans be getting a tax cut? Yes. Would most of those people be getting a tax cut of around $1,000? No. The median tax cut was less than $100.
If you think that the median is the solution, then read the article “The Median Isn’t the Message” by Stephen Jay Gould. Excerpt from the article:
When I learned about the eight-month median, my first intellectual reaction was: fine, half the people will live longer; now what are my chances of being in that half. I read for a furious and nervous hour and concluded, with relief: damned good. I possessed every one of the characteristics conferring a probability of longer life: I was young; my disease had been recognized in a relatively early stage; I would receive the nation’s best medical treatment; I had the world to live for; I knew how to read the data properly and not despair.
Tony Wagner, the Harvard education specialist, says:
Knowledge is available on every Internet-connected device; what you know matters far less than what you can do with what you know.
Hi Jana,
Thanks for the posts.
I would recommend the Head Start to Statistics book as a reference for all newcomers to statistics.
I would also like to know how you manage to master things so fast; I am amazed by that. Kindly share with me.
On a personal note, I am your brother’s schoolmate in West Mambalam.
Hi Ganesh,
Thanks for recommending that book. I will take a look to catch up on some unfamiliar concepts.
I am a slow learner, but I spend a lot of time understanding a concept. I keep going back to it again and again until I get it.
Nice to know that you are my brother’s schoolmate. What is your full name? Have I met you in person before?
Regards,
Jana