# Regression Analysis

Given below are the Math and Statistics scores of five randomly selected tenth grade students in 2012. You are the Math teacher in the school. In 2013 the school principal asks you the following question.

If a student scores 90 in math how much would he score in statistics?

How would you solve this problem?

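| Student | Math score (out of 100) | Statistics score (out of 100) |
|---------|-------------------------|-------------------------------|
| 1       | 60                      | 68                            |
| 2       | 70                      | 68                            |
| 3       | 80                      | 72                            |
| 4       | 85                      | 82                            |
| 5       | 95                      | 92                            |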

You are a smart math teacher and you come up with a scatter plot for this data.

From the above you see that Student 1 scored more marks in statistics than in math, while the other students scored less in statistics than in math. Unable to answer the question, you go to the statistics teacher and ask for help. The statistics teacher tells you that it is easy to solve this using Linear Regression. He tells you that

Regression analysis allows us to quantify the relationship between a particular variable and an outcome we care about while controlling for other factors. You need to draw a line that minimizes the sum of squared errors between the line and the points.

After thanking the statistics teacher you come back to your office and start to decipher what he told you. There are two variables involved in this problem: an explanatory variable and a dependent variable.

1. Explanatory – Math score is the explanatory variable. It is used to explain the statistics score.
2. Dependent   – Statistics score is the dependent variable.

You are trying to find out the statistics score (dependent) given the math score (explanatory). On the scatter plot you draw a straight line. I used an online tool to produce this graph. The math score is on the X axis and the statistics score is on the Y axis.

## Why a straight line?

A straight line can be represented by an equation. So what? If you have an equation then you can find the value of the statistics score (Y) given the Math score (X). This is what we are trying to solve.

Y = m * X + b

For the data given below the graph of the line is attached.

| X | Y  |
|---|----|
| 0 | 2  |
| 1 | 4  |
| 2 | 6  |
| 3 | 8  |
| 4 | 10 |

In the equation of the line, X is the input and Y is the output. What do m and b stand for?

b is called the Y intercept. It is the value of Y when X is zero. In this case the value is 2. Look at the graph above.

m is called the slope. It tells you how much Y increases for a given increase in X. In this example, if X goes up by 1 then Y goes up by 2. Hence the slope is 2 / 1 = 2.

Slope = Rise / Run
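As a quick check of the slope and intercept read off the example table, here is a minimal Python sketch. The function name `line_through` is made up for illustration, and it assumes the two points have different X values.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # rise / run
    intercept = y1 - slope * x1     # value of Y when X is zero
    return slope, intercept

# The example table from this section
points = [(0, 2), (1, 4), (2, 6), (3, 8), (4, 10)]
m, b = line_through(points[0], points[-1])
print(m, b)  # 2.0 2.0

# Every point in the table satisfies Y = m * X + b
assert all(y == m * x + b for x, y in points)
```

Any two points from the table give the same slope and intercept, because all five points lie exactly on one line.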

## What does minimize the squared error mean?

Look at the annotated graph given below.

Your goal is to draw a line that minimizes the sum of squared errors (residuals). Why should this be a minimum? The closer the points are to the line, the closer the values produced by the equation of the line are to the actual values. The technical term for this method is Ordinary Least Squares.

From the above graph the squared error of the line can be defined by

```
Squared Error of Line = (y1 - (m * x1 + b)) ^ 2 +
                        (y2 - (m * x2 + b)) ^ 2 +
                        ... +
                        (yn - (m * xn + b)) ^ 2
```
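The definition above translates almost directly into code. The following is a minimal Python sketch using the five students' scores; the function name `squared_error` and the two candidate lines are hypothetical and chosen only to show that different lines give different totals.

```python
math_scores = [60, 70, 80, 85, 95]
stat_scores = [68, 68, 72, 82, 92]

def squared_error(m, b, xs, ys):
    """Sum of squared vertical distances between the points and y = m * x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Candidate line 1: statistics score equals math score (m = 1, b = 0)
error_1 = squared_error(1.0, 0.0, math_scores, stat_scores)
print(error_1)  # 150.0

# Candidate line 2: a flatter line with a positive intercept
error_2 = squared_error(0.7, 21.5, math_scores, stat_scores)
print(error_2 < error_1)  # True
```

The second line sits closer to the points, so its squared error is smaller. Ordinary Least Squares looks for the m and b that make this number as small as possible.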

The goal is to solve this equation and find the values of m (slope) and b (y intercept) that minimize the squared error. To solve this equation you need to use algebra and partial derivatives. Watch the explanation from Salman Khan on how to solve this equation.
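For readers who want to see the steps, here is a sketch of that derivation in LaTeX notation. The symbols $\bar{x}$, $\bar{y}$, $\overline{xy}$ and $\overline{x^2}$ are shorthand introduced here for Mean(x), Mean(y), Mean(x * y) and Mean(x ^ 2). Setting both partial derivatives to zero and dividing by n gives:

$$
\begin{aligned}
SE(m, b) &= \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2 \\
\frac{\partial SE}{\partial b} &= -2 \sum_{i=1}^{n} \bigl(y_i - m x_i - b\bigr) = 0
  \;\Rightarrow\; b = \bar{y} - m \bar{x} \\
\frac{\partial SE}{\partial m} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - m x_i - b\bigr) = 0
  \;\Rightarrow\; \overline{xy} - m\,\overline{x^2} - b\,\bar{x} = 0 \\
\text{substituting } b: \quad m &= \frac{\bar{x}\,\bar{y} - \overline{xy}}{\bar{x}^2 - \overline{x^2}}
\end{aligned}
$$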

Solving this equation you will get

```
m (slope) = (Mean(x) * Mean(y) - Mean(x * y)) / (Mean(x) ^ 2 - Mean(x ^ 2))
b (intercept) = Mean(y) - m * Mean(x)
```

Let us substitute the math and statistics scores into the above equations.

```
Mean(Math Score)              = 78
Mean(Statistics Score)        = 76.4
Mean(Math * Statistics Score) = 6062
Mean(Math Score) ^ 2          = 6084
Mean(Math Score ^ 2)          = 6230

m (slope) = (78 * 76.4 - 6062) / (6084 - 6230)
m (slope) = (5959.2 - 6062) / (6084 - 6230)
m (slope) = -102.8 / -146
m (slope) = 0.70410958904

b (intercept) = 76.4 - 0.70410958904 * 78
b (intercept) = 76.4 - 54.92054794512
b (intercept) = 21.47945205488
```
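The same substitution can be checked with a few lines of Python. This is a minimal sketch of the mean-based formulas, not a general-purpose regression routine; the helper names `mean` and `fit_line` are introduced here for illustration.

```python
math_scores = [60, 70, 80, 85, 95]
stat_scores = [68, 68, 72, 82, 92]

def mean(values):
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Least-squares slope and intercept using the mean-based formulas."""
    mean_x, mean_y = mean(xs), mean(ys)
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    mean_x_sq = mean([x * x for x in xs])
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x_sq)
    b = mean_y - m * mean_x
    return m, b

m, b = fit_line(math_scores, stat_scores)
print(round(m, 4), round(b, 4))  # 0.7041 21.4795
```

Note the parentheses around the numerator and the denominator of the slope: without them the division would be applied to only part of the expression.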

The equation of the line is

y = 0.7041 * x + 21.4795

If the math score is 90 then using the above equation we get a statistics score of about 84.85. You go to the principal and give him the predicted statistics score. The principal asks

How confident are you?

You once again run to the statistics teacher for help. He tells you to find out the R-squared. What does it mean?

R-squared is a statistic that will give some information about the goodness of fit of a model. In regression, the R-squared coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R-squared of 1 indicates that the regression line perfectly fits the data.

It tells you how much of the variation in the statistics score is explained by the variation in the math score.

```
Statistics score not explained by Math score =
    Sum(Squared Error of Line) / Total variation in statistics score
```
| Statistics Score (A) | Statistics Score using the equation (y = 0.70410958904 * x + 21.47945205488) (B) | C = A - B | D = C ^ 2 |
|----------------------|------------|---------|---------|
| 68                   | 63.7260    | 4.2740  | 18.2668 |
| 68                   | 70.7671    | -2.7671 | 7.6570  |
| 72                   | 77.8082    | -5.8082 | 33.7354 |
| 82                   | 81.3288    | 0.6712  | 0.4506  |
| 92                   | 88.3699    | 3.6301  | 13.1779 |

Summing column D (Sum(Squared Error of Line)) you get 73.2877.

| Statistics Score (A) | B = A - 76.4 (mean statistics score) | C = B ^ 2 |
|----------------------|--------------------------------------|-----------|
| 68                   | -8.4                                 | 70.56     |
| 68                   | -8.4                                 | 70.56     |
| 72                   | -4.4                                 | 19.36     |
| 82                   | 5.6                                  | 31.36     |
| 92                   | 15.6                                 | 243.36    |

Summing column C (Total variation in statistics score) you get 435.2.

```
Statistics score not explained by Math score = 73.2877 / 435.2
Statistics score not explained by Math score = 0.168

R-squared = Statistics score explained by Math score

R-squared = 1 - Statistics score not explained by Math score
R-squared = 1 - 0.168
R-squared = 0.832
```
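The two tables and the final subtraction can be reproduced with a short Python sketch. The function name `r_squared` is introduced here for illustration. It uses the unrounded slope and intercept, so the third decimal can differ slightly from a hand calculation done with rounded values.

```python
math_scores = [60, 70, 80, 85, 95]
stat_scores = [68, 68, 72, 82, 92]

def r_squared(xs, ys, m, b):
    """1 - (sum of squared errors of the line) / (total variation in ys)."""
    mean_y = sum(ys) / len(ys)
    squared_error = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    total_variation = sum((y - mean_y) ** 2 for y in ys)
    return 1 - squared_error / total_variation

# Slope and intercept as computed earlier in the article
m, b = 0.70410958904, 21.47945205488
print(round(r_squared(math_scores, stat_scores, m, b), 3))  # 0.832
```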

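In other words, about 83% of the variation in the statistics scores is explained by the math scores. Strictly speaking, R-squared measures how well the line fits the data, not the probability that a single prediction is right, but as a rough answer you can tell the principal that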

I am 83% confident.

Watch the explanation from Salman Khan on how to calculate R-squared.

## Closing thoughts

Regression analysis can demonstrate an association between two variables. But with that information alone you cannot prove that one variable is causing a change in the other. Here is an excerpt from Naked Statistics:

Suppose we were searching for potential causes for the rising rate of autism in the United States over the last two decades. Our dependent variable – the outcome we are seeking to explain – would be some measure of the incidence of autism by year, such as the number of diagnosed cases for every 1,000 children of a certain age. If we were to include annual per capita income in China as an explanatory variable, we would almost certainly find a positive and statistically significant association between rising incomes in China and rising autism rates in the U.S. over the past twenty years. Why? Because they both have been rising sharply over the same period. Yet I highly doubt that a sharp recession in China would reduce the autism rate in the United States.

Anyone with a computer can do regression analysis and provide precise answers. But you might come to incorrect conclusions if you do not use your common sense.