LAB 9: Regression Analysis

In the lab, we will examine important concepts of regression analysis, including:

Relationships and causality
Independent vs dependent variables
Fitting a line
Evaluating the line of best fit
Examining the residuals
Calculating confidence intervals for predicted values
Limitations to regression
Using the 7 Steps to Regression Analysis in an example (the steps are summarized at the end of the lab).

Recommended Web Links:
Regression applets from the University of Illinois and David Howell

Regression is a complex but extremely useful statistical procedure that has applications in many areas of geography: social, economic, physical, geomatics, etc. As with any tool, regression analysis has several limitations, which a researcher must understand in order to apply it appropriately. In this lab, we will explore the fundamental concepts of regression, and then work through the application of regression analysis using an example.

Concepts in regression

From correlation to causality

In lab 8, we used correlation to assess the degree of linear association between two variables, in terms of direction and strength of the relationship. It was stressed, however, that a strong correlation did not imply the existence of a causal relationship between the variables.

Causal relationships do exist in nature. For example, the amount of runoff is related to the amount of rainfall in a watershed. Heavy rainfall causes more runoff; low rainfall causes less runoff. A geographer could use this information to predict runoff based on their knowledge of rainfall in the watershed. In order to apply regression analysis, it is necessary to assume a causal (or 'functional') relationship between two variables. A researcher can then develop an equation useful for prediction.

A. Independent vs dependent variables

Regression analysis requires the specification of the independent and dependent variables.

The independent variable creates an effect or has an influence.
The dependent variable receives the effect or is influenced by the independent variable

This distinction very important as the results of the analysis depend on how you specify your variables. In our watershed example, rainfall is the independent variable and runoff is the dependent variable. Rainfall will influence the amount of runoff, but runoff will have no effect on the amount of rainfall.

The convention in regression is to label your independent variable as X and your dependent variable as Y. This convention extends to scatter plots where the independent variable is plotted along the X-axis and the dependent along the Y-axis.

A functional relationship is often stated as "Y is a function of X". Using the rainfall example, we would say that "runoff is a function of rainfall". The short hand version of this statement uses the symbol 'f()' which stands for "function of". Therefore, we would write: runoff = f(rainfall). This format is very useful when you have multiple variables that may be causing an effect. For example: runoff = f(rainfall, soil storage, plant intake). In this course, we will only consider simple linear (or bivariate) regression with one independent variable.

B. Fitting a line

To estimate runoff based on rainfall, we need an equation that relates the values of one variable (rainfall) to values of the other (runoff). Examples of equations may include:

runoff = rainfall / 2
runoff = 1.5*(rainfall) - 30
runoff = (rainfall)² - 4*(rainfall)

How would you develop an equation between these two variables? You could develop an approximate relationship by looking at the data (see the table below).

The data for rainfall and runoff are presenting in the table below. For example, we could use the data in the first row (rainfall = 1842, runoff = 1156) to develop the equation: runoff = 0.628*rainfall.

We can test this equation by applying it to all other rows (Prediction) and calculating the difference between the observed and estimated value (Error). We find our equation gives small error for three cases and large errors for the rest. We conclude that this equation does not accurately predict runoff based on rainfall.

Rainfall	Runoff	Prediction	Error
1842	1156	1155	1
2202	1378	1381	-3
2036	1274	1277	-3
1868	947	1171	-224
1520	356	953	-597
1350	454	846	-392
1710	610	1072	-462
1510	473	947	-474
1285	127	806	-679
1522	651	954	-303
1585	380	994	-614
1682	750	1055	-305
1984	820	1244	-424
1594	641	999	-358
1693	850	1062	-212

You could try developing an equation from data in the 5th row or the 10th row or... However, the data in each individual row may not accurately represent the 'overall' relationship between the two variables, and you end up with large errors. What you really want is an equation that minimizes the overall error (the differences between observed and predicted value). This equation may not exactly predict each observed value, but it will give you results that are close for most cases.

One way to develop this equation is to prepare a scatter plot and draw a straight line through the points. This straight line should be as close as possible to each point (but you can't play 'connect the dots'!). There are many lines you could draw (in blue), but only one line will minimize the overall error for all points. This is called the Line of Best Fit (in red).

This line will have a specific equation, based on:

the slope of the line: defined by the vertical rise over the horizontal run

a positive slope (> 0) indicates a positive relationship between the variables
a negative slope (< 0) indicates a negative relationship.

the y-intercept: the point where the line of best fit meets ('intercepts') the y-axis.

The slope and y-intercept are expressed in algebra in the regression model:

where:
Y = the dependent variable (runoff)

= the y-intercept

= the slope
X = the independent variable (rainfall)

= the overall error

SPSS will calculate the values for and for you. Finding the line of best fit by hand would be an extremely long and slow process, which is why regression was not used in research until recently. Today, computers can do the necessary calculations in a few seconds.

How does the computer find the line of best fit?

Most statistical software packages use a procedure called Ordinary Least Squares (OLS) to find the line of best fit. This procedure evaluates the difference (or error) between each observed value and the corresponding value on the line of best fit. We use vertical distance as our measure of error because we are interested in how close our predicted answer (

) is compared to the observed values (

) (see diagram). This procedure finds a line that minimizes the sum of the squared errors.

Note: The errors are squared to remove the + or - sign. If the + and - signs are not removed, the positive errors can cancel out the negative errors, which could give you a value of zero.

C. Evaluating the line of best fit

Once we have obtained the value of and for the line of best fit, we need to assess how well the line fits the original data. In particular, we want to know how well the independent variable (X) explains or accounts for the variation in the dependent variable (Y). If X explains most of the variation in Y, we infer that the line fits the data well. Another consideration is that we need to have a measure that does not depend on the scale or units of the variables; this would allow us to compare the results of different regression analyses.

To evaluate how much of the variation in Y that X explains, we need a baseline for comparison. If X had no effect on Y, our best estimate for Y is the mean (). Therefore, the difference between (observed) and (mean) is our baseline for comparison. This difference is called the total variation. Then we can compare the variation explained by X to the total variation to quantify how much effect X has on Y.

The three types of variation are examined below:

Total variation = observed - mean (

).
Explained = predicted - mean (

).
Unexplained variation is the remaining variation or observed - predicted (

). The unexplained variation is the error we discussed in the section above.

The variations at each point are squared and then summed to obtain the overall total, explained and unexplained variation (variation is also called "sum of squares").

In the diagram above, we see that explained + unexplained = total. Therefore, we can calculate the proportion of explained variation to total variation . This proportion is called the coefficient of determination (). This coefficient ranges from 0 to 1; a high means the line is a good fit, a low means a poor fit. We interpret as the percentage of total variation in Y explained by X.

Note: the explained variation is also called the regression sum of squares (i.e. this is the variation explained by the regression model). The unexplained variation is also called the residual sum of squares. These terms are used in the SPSS output.

D. Testing the significance of

The final step in evaluating our model is to test significance of (using the 9 steps for hypothesis testing). The test statistic for the hypothesis test is F* which compares the explained variation to the unexplained variation. If a large portion of total variation is explained, F* will be large and we infer that a significant portion of the variation in Y is explained by X.

The following diagram illustrates the concept of this hypothesis test. A full bucket of water represents the total variation in Y. The sponge is the X variable. When the sponge is dipped in the bucket, it absorbs a certain amount of water. The water held by the sponge is the explained variation. The water remaining in the bucket is the unexplained variation.

In model A, the sponge absorbs 20% of the water ( = 0.20 or 20%). The portion of water absorbed by Sponge A is not significant (F* < )
In model B, the sponge absorbs 70% of the water ( = 0.70). This sponge absorbs a significant portion of the water from the bucket (F* > )

Regression Assumptions

The data are interval or ratio
The X's are measured without error
Because regression focuses on the explained and unexplained variation in Y, the procedure must assume that there is no error in X. It would be extremely difficult to account for (and calculate) errors in the independent variable.
The relationship between X and Y is linear
The errors terms (residuals) are pairwise uncorrelated
This means that all residuals must be independent of each other. For example, assume that you are studying food intake (calories) and a person's height. You measure the mean food intake and height of a person over 15 years. Problem? A person's height this year tends to be highly correlated with their height last year. Each pair of heights is not independent. Pairwise correlation is also called autocorrelation ("self-correlation")
The error terms (residuals) have a mean of 0
If you average the standardized residuals, the mean is assumed to be 0. For example, the average of 4 residuals; 1.1, -0.9, -0.4, 0.2 is 0. If we assume the mean of errors = 0, we can remove the E (error) term from the regression model. Removing this term makes the calculation of the regression terms ( and ) much easier.
The error terms (residuals) are homoscedastic
This means the errors have a constant/equal variance along the regression line. See the residual plots for more details
The errors (residuals) are normally distributed
Regression assume that more residuals are close to the regression line and fewer residuals fall away from the regression line.

Verifying the assumptions
Some assumptions can be easily verified. For example, you can verify assumption 3 by examining a scatter plot. Other assumptions must be verified by examining the residual plots (assumptions 4 and 6 - #4 are not covered in this lab). The remaining assumptions (2, 5 and 7) are required to reduce the computational load of the regression procedure. These assumptions are very difficult to check and the researcher must realize that hidden problems may affect the outcome of the analysis.

The probability distribution, critical values, etc. of the hypothesis test are detailed in the example.

E. Examining the residuals

Before using your model for prediction, you must check for patterns in the residual (unexplained) variation (the difference between the observed and predicted values). By examining the residuals, you may detect patterns or points that influence the fit of the OLS line, or which violate the assumptions of regression. Often these patterns are too subtle to see in a scatter plot.

SPSS calculates the residual variation for each point in the dataset. For each point, the actual residual ( - ) is converted to a standaridized residual by dividing by the standard error (which is similar to standard deviation). This process, analogous to producing Z scores, removes the units of the error term and allows comparison of residuals between different regression models. The standardized residuals are plotted on a scatter plot as follows:

The convention for residual plots is to put the standardized residuals (ZRESID in SPSS)on the Y axis and the observed values of Y on the X axis. This format will highlight the differences between the observed and predicted values of Y.
Two reference lines are added on the Y axis at -2 and +2. These lines show the outer limits of acceptable error.

The diagram below shows how the errors around the line of best fit are plotted as standardized residuals.

How to interpret the residual plot
In the residual plot, the 0 line is the Line of Best Fit (the predicted values). The residuals show the distance between the predicted values and observed values. If a residual is close to the 0 line, the distance, or error is small. If the residual is far from 0, the error between the predicted value and the observed value is large. Negative residuals mean the equation over-predicted (the predicted values are higher than the observed values). Positive residuals mean the equation under-predicted. The ideal pattern is a random scatter of residuals (positive and negative) between the two reference lines.

Some problems that you may encounter:

Outliers: An outlier outside the references lines may reduce the strength of the regression model (large error = lower ). Check for data entry errors. If the outlier is a real value, test how much it affects the model by removing it and re-running the analysis. If there is a major change in the strength of the model, you may decide to remove this point. However, you must report the outlier in your results.
You may also find a residual lying at the far end of the plot, directly on the 0 line. This outlier has pulled the line of best fit towards it. This type of outlier may boost the (this outlier has no error because the line has passed right through it). However, this point has influenced the direction of the line of best fit so your equation and predictions are likely to be incorrect.
Fan shape: A fan shape in the residuals indicates that the amount of error is not constant along the regression line. At the lower end of the line, the errors are small; at the high end, the errors are large. This pattern is called heteroscedastic (hetero = different, scedastic = scatter). When the error is constant along the regression line, the pattern called homoscedastic (same scatter). You often encounter heteroscedasticity when an additional variable is influencing the relationship.
Curve: Residuals in a curve indicate that the relationship between the two variables is not linear. You need to reconsider your variables or conduct non-linear regression analysis (not covered in this lab).

F. Prediction and confidence intervals

If the

is significant and the residuals have an acceptable pattern, you probably have a useful model. SPSS calculates the values for

and

(also called the regression coefficients). You use these values to specify the regression equation. Now you want to use the equation for prediction. There will always be some error associated with the line of best fit so we must construct a confidence interval around our predicted value. This confidence interval specifies an interval on either side of the regression line where the true value is expected to lie.

Notice that the confidence interval is curved along the regression line. At the edges of the line, there are few points to support the position of the line. A small change in the position of the line may greatly affect predicted value in these areas (see diagram on right). We are less certain about our estimate so the interval is larger to compensate. In the middle of the line, there are more points to support the position of the line. Therefore, we are more confidence about the position of the line and our interval can be narrower.

There are two formulas for calculating confidence intervals. The choice of formula depends upon the type of dataset you have:

If the dataset contains individual observations (i.e. 10 rainfall and runoff observations from one weather station), the formula is:

where:

the standard error of the estimate

If the dataset contains mean observations compiled from other datasets (i.e. mean rainfall and runoff compiled from 25 sets of observations), we remove the +1. This formula will give narrower confidence intervals than the formula above.

(the terms
are defined above)

Why two different formulas? When a dataset contains means, we are more confident that our sample data represent the population. Therefore, we are more confident that our line of best fit represents the true relationship between the variables. We will be more confident about our predictions so our confidence interval can be narrower.

Application of Regression

We will use the rainfall/runoff example to work through 7 Steps to Regression Analysis.

Step 1: Establish your theory

State what type of relationship you expect between the variables and describe the theory supporting this relationship. Identify one variable as dependent and the other as independent. Go to step 2 if you believe the relationship is supported by theory.

Step 2: Examine the data

Construct a scatter plot to view the relationship between the variables. Calculate the correlation between the variables to assess the strength of the linear association. Go to step 3 if the scatter plot shows a roughly linear relationship, and the correlation analysis indicates a moderate/strong relationship.

This scatter plot shows a strong positive linear relationship between rainfall and runoff. The correlation (r = 0.91) confirms the direction and strength of the relationship.

Step 3: Specify the regression model

Lay out the model you will use (in algebra) and define the X and Y terms. This step may seem redundant in simple regression (with few variables), but it is extremely important when conducting advanced, multi-variate regression analysis.

where:
Y = Runoff (mm) - dependent
X = Rainfall (mm) - independent

= error term

Step 4: Conduct the regression analysis and evaluate the model's performance

Interpret and test the significance of to determine the performance of the model. Go to step 5 if is significant.

SPSS

Assumptions:

Hypotheses:
: the independent variable (rainfall) does not explain the variation in the dependent (runoff) (i.e. is not different from 0)
: the independent variable explains a significant portion of the variation in the dependent (i.e. is significantly different from 0)

Significance level: We will use = 0.05 (95% confidence level)

Probability distribution: The hypothesis test uses the F probability distribution. The F distribution is similar to the Chi-square distribution (starts at 0, has only positive values, and is positively skewed). Click here for online F Table

Critical values: Critical F values are based on and 2 degrees of freedom:

Regression: the number of independent variables (across top of F table)
Residual: the sample size - 2 (down side of F table)

Test statistic:
The formula for the test statistic is:

The SPSS output gives the following values:

Regression sum of squares: 1,459,528.290
Residual sum of squares: 304,089.443

These formulas are given so that you understand how SPSS calculates the F value. We will use the F value from the SPSS output. For details on conducting the regression and interpreting the output, see the SPSS Help Manual.

Decision rule: The decision rule for this test is: reject if F* >
In this example, F* (63.47) is greater than (4.7) so we reject . The p-value calculated by SPSS (p=0.000) confirms this decision.

Inference: We infer with 95% confidence that the independent variable (rainfall) significantly explains the variation in the dependent variable (runoff) (i.e. our is significantly different from 0)

Step 5: Examine the residuals

Construct a residual plot: the dependent variable on the X-axis, the standardized residuals along the Y-axis. Comment on any problems in the residuals. Go to step 6 if the residuals have a random distribution within the two reference lines. (There are advanced techniques for dealing with heteroscedasticity, autocorrelation, and non-linear relationships but these are not covered in this course.)

The residual plot shows that the residuals are randomly distributed within the -2, +2 reference lines. This plot indicates that the assumption are probably not violated.

Step 6: Specify the regression equation

Write out the equation using the and values calculated by SPSS. Go to step 7 if you are using the equation for estimation purposes.

Step 7: Construct confidence intervals around your predicted value

Select the appropriate confidence interval formula (see critieria). Calculate a confidence interval for each predicted value.

Predict: We want to predict runoff when rainfall = 1,600 mm. We substitute X = 1,600 into the equation to get Y.

For rainfall of 1,600 mm, the predicted runoff is 607.2 mm.

Confidence Interval:

We express our predicted runoff as follows: 607.2 341.1 mm

By checking our original data, we see that for 1,594 mm of rainfall (close to 1,600 mm), the observed runoff is 641 mm. Our predicted value is similar to the observed value and falls within the confidence interval [266.1 to 948.3].

Limitations to Regression

Extrapolation
The line of best fit should only be used to predict values within or close to those contained in the dataset. Far beyond the dataset, the trend may change and the predictions will be incorrect. The diagram below shows the true relationship between rainfall and runoff. Runoff is less than predicted when rainfall is low, because the water is stored in the soil. Runoff is higher than predicted when rainfall is high, because the soil is saturated and the water runs over the ground surface.

Generalization
The regression equation developed from one dataset should not be applied to data collected in other regions or areas. A regression model developed from a specific dataset should not be used to generalize about other regions or the general occurrence of the phenomena. For example, although we find that rainfall and runoff have a particular relationship in this watershed, we cannot generalize about rainfall and runoff patterns in all watersheds.

Causation
Although we imply a causal relationship when we use regression, regression analysis will not prove causality between two variables. Recall that this procedure is strictly numeric; you may find that two totally unrelated variables give a significant . The researcher must understand the phenomena being researched and the regression procedure in order to interpret the results appropriately.

The 7 steps to regression analysis

Establish theory
Explore data
Specify regression model
Evaluate model performance
Evaluate residuals
Specify regression equation
Develop confidence intervals for predicted values