LAB 8: Correlation

 


Recommended Web Links:
Interpreting Correlation Results: Background information on correlation from GraphPad
Correlation in Finland: Move your cursor over the sliding scale and see how the scatter plot changes.
 
Two correlation applets from the University of Illinois:


Introduction to Relationships

In this lab, we will examine the relationship or 'association' between two variables in a sample. Our questions are: what kind of relationship exists between the variables? How strong is this relationship?
 

Describing Relationships

There are two ways to summarize the relationship between two variables:

  1. Graphically using scatterplots (introduced in Lab 1)
  2. Numerically using a correlation coefficient.
     

1. Scatterplots

These graphs are an excellent way to get a sense of the relationship that exists between two variables. Scatterplots are usually based on the Cartesian coordinate system, where one variable is represented on the vertical or Y-axis and the other is represented on the horizontal or X-axis. The resulting pattern of points in the scatterplot shows the relationship between the variables, which can be described in terms of three criteria:

  1. Direction: positive or negative
  2. Strength: how closely the points follow the overall pattern (strong, moderate, or weak)
  3. Form: linear or nonlinear

Example A: The table below contains 6 observations collected for a botany project. The diagram on the right presents a scatterplot of the data. What kind of relationship exists between tree height and soil moisture?
 

Tree ID   Soil moisture (cm/m)   Tree height (m)
A         9                      17
B         11                     16
C         13                     20
D         16                     21
E         17                     25
F         19                     24

The scatterplot shows that soil moisture and tree height have a positive relationship: as soil moisture increases, tree height also increases. We can confirm this relationship by observing trees in a forest. Trees growing in dry or rocky areas tend to be smaller than trees growing in wetter areas or near creeks.

Example B: In a study on fuel efficiency and vehicle weight, data were collected from 6 vehicles. The data are presented as a scatterplot (below). What kind of relationship exists between these variables?
 

This scatterplot shows that fuel efficiency and vehicle weight have a negative relationship: as vehicle weight increases, fuel efficiency decreases.

However, this relationship may be weak as other factors - driving style, engine size, cargo weight - also affect fuel efficiency. Notice that the points are fairly spread out and do not form a definite line.

What kind of relationship would you expect between the following variables:

  1. number of cars per capita and carbon monoxide emissions?
  2. stream velocity and bedload particle size?
  3. elevation and average air temperature?
     

2. Correlation Coefficients

The descriptive terms associated with scatterplots are useful. However, statisticians often want a more quantitative description of the relationship between two variables. Correlation coefficients and correlation analysis provide a numerical measure of the direction and strength of the relationship. The correlation between two variables ranges from -1 to +1 (see diagram below).

Some examples of correlation coefficients (and the scatterplots) are shown below:

(Scatterplots not shown.) Perfect, negative · Moderate, negative · Weak, positive · Strong, positive

There are two widely used correlation coefficients:

A. Pearson's r

Pearson's r is the most commonly used correlation coefficient that measures the linear association between two variables. This coefficient is based on:

  1. The standard deviations of X and Y (Sx and Sy), which measure the amount of variation in each variable. These are the square roots of the variances (S²x and S²y) introduced in Lab 1.

  2. The covariance of XY (Sxy), which measures how X and Y vary or 'change' together.

The covariance is influenced by the units of the variables (meters, kilograms, mm, etc.) and is difficult to interpret. Dividing the covariance of XY by the product of the standard deviations of X and Y removes the influence of units (this process is called 'standardizing'). The result, r, is a ratio that ranges between -1 and +1.

Formula: r = Sxy / (Sx × Sy)

where:
Sx and Sy = the standard deviations of X and Y
Sxy = the covariance of XY

Note: We will not calculate this statistic by hand in this lab.
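We will not calculate r by hand, but the formula is easy to check with a short script. The following is a minimal Python sketch (not part of the lab procedure) that applies the formula to the Example A botany data above:

```python
import math

# Example A data: soil moisture (cm/m) and tree height (m) for trees A-F
x = [9, 11, 13, 16, 17, 19]   # soil moisture
y = [17, 16, 20, 21, 25, 24]  # tree height

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance of XY: how the two variables change together
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Standard deviations of X and Y (square roots of the variances)
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Pearson's r: the standardized covariance, always between -1 and +1
r = s_xy / (s_x * s_y)
print(round(r, 3))  # 0.919 -- a strong positive correlation
```

The result, r ≈ 0.92, agrees with the visual impression from the Example A scatterplot: a strong positive linear relationship.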

Pearson's r requires ratio or interval data that are normally distributed (it is a parametric statistic). For ordinal data, or for ratio/interval data that are skewed or bimodal, we use Spearman's rank correlation coefficient.
 

B. Spearman's rank (rs)

Spearman's rs measures the association between two sets of data that have been ranked.

Formula: rs = 1 - (6 × Σd²) / (N(N² - 1))

where:
d² = the squared difference between the ranks for each observation
N = number of observations

Example: The table below presents population and area information for 8 regions in a country. The data are presented as ranks from 1 to 8 (1 = smallest, 8 = largest).
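The ranked table for this example did not survive conversion, but the rank formula can still be sketched in code. The area ranks below are hypothetical (our own invention), chosen so that Σd² = 40, which reproduces the rs* = 0.524 used in the significance test later in this lab:

```python
# Hypothetical ranks for 8 regions (1 = smallest, 8 = largest).
# The area ranks are illustrative only -- chosen so that the sum of
# squared rank differences (40) matches the rs* = 0.524 in the example.
population_rank = [1, 2, 3, 4, 5, 6, 7, 8]
area_rank       = [5, 2, 3, 1, 8, 4, 6, 7]

n = len(population_rank)
# d²: squared difference between the two ranks for each region
sum_d2 = sum((p - a) ** 2 for p, a in zip(population_rank, area_rank))

# Spearman's rank correlation: rs = 1 - 6*Σd² / (N(N² - 1))
rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(sum_d2, round(rs, 3))  # 40 0.524
```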

Spearman's rs and Pearson's r have the same properties: both range from -1 to +1, with the sign indicating the direction of the relationship and the magnitude indicating its strength.

Correlation coefficients are often applied in exploratory analysis. However, they can also be used to make inferences about population parameters, where ρ (called 'rho') is the population correlation coefficient. The hypothesis tests for r and rs are summarized below.
 

 



Testing for Significant Correlation

Pearson's r

Used to determine if an association exists between two variables. In this test, the null hypothesis is that no significant correlation exists between the two variables (H0: ρ = 0). The alternate hypotheses are:

HA: ρ ≠ 0 (non-directional)
HA: ρ > 0 or HA: ρ < 0 (directional)

Generally, if we do not have a strong theory for the direction of the relationship, we use non-directional hypotheses. But if we know enough about the variables and can expect a certain direction of relationship, we can use directional hypotheses.

Assumptions

Probability distribution and Test statistic

The hypothesis test for r can be conducted in two ways.
  1. calculate a test statistic (t*) using r and compare t* to a critical value on the Student's t distribution
  2. compare r to a critical value based on a probability distribution of r values. The r probability distribution has a similar shape to the Z distribution.
In this lab, we will use the second method, as it requires no additional calculations. We will use SPSS to calculate r as the formula is long and complex (as we saw above).

Critical values

Critical values are based on α (the significance level) and the number of degrees of freedom (n - 2, where n = sample size). All values in the table are positive. If you are using a lower tailed hypothesis, make the critical value negative. Go to the critical r table.

Decision rule

For non-directional hypotheses: reject H0 if r* is greater than the critical value or less than the negative of the critical value.

For directional hypotheses: reject H0 if r* is greater than the critical value (upper tailed), or less than the negative of the critical value (lower tailed).
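The decision rules can be expressed as a small helper function. This is a sketch for illustration only; the function name, parameters, and the example values are our own, not part of the lab procedure:

```python
def reject_h0(r_star, r_crit, tail="two"):
    """Apply the correlation decision rule.

    r_star: calculated correlation coefficient
    r_crit: critical value from the table (always positive)
    tail:   "two" (non-directional), "upper", or "lower"
    """
    if tail == "two":
        # Reject if r* falls in either tail
        return r_star > r_crit or r_star < -r_crit
    if tail == "upper":
        return r_star > r_crit
    if tail == "lower":
        # Make the critical value negative for a lower tailed test
        return r_star < -r_crit
    raise ValueError("tail must be 'two', 'upper', or 'lower'")

print(reject_h0(0.85, 0.632, "two"))     # True: reject H0
print(reject_h0(0.524, 0.643, "upper"))  # False: cannot reject H0
```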

Multiple correlations

Analysts often use SPSS to calculate multiple correlation coefficients. When conducting multiple significance tests, you can summarize the information that is identical in each test in a table (assumptions, probability distribution, etc).
 

Spearman's rs

Used to determine if there is a significant association between two variables. The test uses the same null and alternate hypotheses as Pearson's r.

Assumptions

Probability distribution and Test statistic

The hypothesis test for rs can also be conducted in two ways.
  1. calculate a test statistic (Z*) using rs and compare Z* to a critical value on the normal Z distribution
  2. compare rs to a critical value based on a probability distribution of rs values.
In this lab, we will also use the second method. We will obtain rs from SPSS.

Critical values

Critical values are based on α and the degrees of freedom (n, where n = sample size). All values in the table are positive; if you are using a lower tailed hypothesis, make the critical value negative. Go to the critical rs table.

Decision rule

Use the same decision rules given for Pearson's r.

Example

We will continue with the population and area example from above:

2. Check assumptions

3. State hypotheses
If you have some background information about the relationship, you can establish directional hypotheses. For example, for the relationship between population and area, we may expect a positive relationship (small population in a small region, larger population in a large region). Therefore, we can test for a significant positive correlation. If you are unsure of the type of relationship, you would use a non-directional hypothesis which will test for a significant correlation (it may be positive or negative).

In our example, we expect a positive relationship so we will establish directional (upper tail) hypotheses.

4. Select significance level
We will use the standard α = 0.05 (95% confidence level)

5. Select probability distribution
Because rs is our test statistic, we use the rs probability distribution for our critical values

6. Establish critical value
At α = 0.05 and n = 8, the critical rs = 0.643

7. Calculate test statistic
From above, rs* = 0.524

8. Compare using decision rule
Rule (for upper tailed hypothesis): reject H0 if rs* > the critical value
rs* (0.524) is less than the critical value (0.643), so we cannot reject H0

The p-value of 0.21 obtained from SPSS confirms the decision not to reject H0.

9. State inference
We infer with 95% confidence that there is no significant correlation between population and area in the 8 regions.
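Steps 6 through 8 can be mirrored in a few lines of Python, using the test statistic and critical value from the steps above:

```python
# Values from steps 6 and 7 above
rs_star = 0.524   # calculated Spearman's rs
rs_crit = 0.643   # critical value at alpha = 0.05, n = 8

# Upper tailed decision rule: reject H0 only if rs* exceeds the critical value
decision = "reject H0" if rs_star > rs_crit else "cannot reject H0"
print(decision)  # cannot reject H0
```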
 

Cautions with Correlation

Causality and Spurious Correlation

Causality is defined as "the relation between a cause and its effect" (Webster's dictionary). Often, researchers are tempted to believe that a strong correlation indicates a causal relationship between two variables (i.e. variable A causes an effect in variable B).

When two variables seem to be correlated but are actually both dependent on a third variable (or several other variables), the correlation is called spurious.

Important Note: Because correlation analysis (like all other statistical analyses) depends only on the data values, you may find 'significant' correlations between any two variables.

Example: The scatter plot on the right shows a moderate negative correlation (r = -0.51) between two variables.

The catch: the data for these variables were generated using a random number table. The relationship between the two variables is purely numeric.
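The point generalizes: run enough correlations on pure noise and some will look 'significant'. A quick Python simulation illustrates this; the sample size, number of trials, and critical value (0.632, the two-tailed critical r at α = 0.05 for df = 8) are our choices for illustration, not part of the lab:

```python
import math
import random

random.seed(42)  # fixed seed so the run is repeatable

def pearson_r(x, y):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# 200 trials: correlate two samples of pure random numbers (n = 10 each)
trials = [pearson_r([random.random() for _ in range(10)],
                    [random.random() for _ in range(10)])
          for _ in range(200)]

# By chance alone, roughly 5% of trials should exceed the critical value
spurious = sum(1 for r in trials if abs(r) > 0.632)
print(spurious, "of 200 random pairings look 'significant'")
```

Every 'significant' result here is purely numeric, exactly like the random-number-table example above.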

Aggregation

In geographic research, data are often summarized by aggregating individual observations (people, households, etc.) into groups. Aggregation gives researchers an overall view of patterns in the data. However, because aggregation averages individual observations, the true relationship may be hidden or a false relationship may develop (based purely on numerical association). Different aggregation strategies will often produce different results.
 

Other problems

Five sets of variables all have a correlation of 0.80. However, when you view the scatterplots (below), five very different pictures emerge.

(Scatterplots A through E not shown.)