LAB 8: Correlation

 


Recommended Web Links:
Interpreting Correlation Results: Background information on correlation from GraphPad
Correlation in Finland: Move your cursor over the sliding scale and see how the scatter plot changes.
 
Two correlation applets from the University of Illinois:


Introduction to Relationships

In this lab, we will examine the relationship or 'association' between two variables in a sample. Our questions are: what kind of relationship exists between the variables? How strong is this relationship?
 

Describing Relationships

There are two ways to summarize the relationship between two variables:

  1. Graphically using scatterplots (introduced in Lab 1)
  2. Numerically using a correlation coefficient.
     

1. Scatterplots

These graphs are an excellent way to get a sense of the relationship that exists between two variables. Scatterplots are usually based on the Cartesian coordinate system, where one variable is represented on the vertical or Y-axis and the other is represented on the horizontal or X-axis. The resulting pattern of points in the scatterplot shows the relationship between the variables, which can be described in terms of three criteria:

  1. Direction: positive or negative
  2. Strength: how closely the points follow the overall pattern (strong, moderate, or weak)
  3. Form: linear or nonlinear

Example A: The table below contains 6 observations collected for a botany project. The diagram on the right presents a scatterplot of the data. What kind of relationship exists between tree height and soil moisture?
 

Tree ID   Soil moisture (cm/m)   Tree height (m)
A         9                      17
B         11                     16
C         13                     20
D         16                     21
E         17                     25
F         19                     24

The scatterplot shows that soil moisture and tree height have a positive relationship: as soil moisture increases, tree height also increases. We can confirm this relationship by observing trees in a forest. Trees growing in dry or rocky areas tend to be smaller than trees growing in wetter areas or near creeks.

Example B: In a study on fuel efficiency and vehicle weight, data were collected from 6 vehicles. The data are presented as a scatterplot (below). What kind of relationship exists between these variables?
 

This scatterplot shows that fuel efficiency and vehicle weight have a negative relationship: as vehicle weight increases, fuel efficiency decreases.

However, this relationship may be weak as other factors - driving style, engine size, cargo weight - also affect fuel efficiency. Notice that the points are fairly spread out and do not form a definite line.

What kind of relationship would you expect between the following variables:

  1. number of cars per capita and carbon monoxide emissions?
  2. stream velocity and bedload particle size?
  3. elevation and average air temperature?
     

2. Correlation Coefficients

The descriptive terms associated with scatterplots are useful. However, statisticians often want a more quantitative description of the relationship between two variables. Correlation coefficients and correlation analysis provide a numerical measure of the direction and strength of the relationship. The correlation between two variables ranges from -1 to +1 (see diagram below).

Some examples of correlation coefficients (and the scatterplots) are shown below:

(Scatterplots not shown.) Perfect, negative · Moderate, negative · Weak, positive · Strong, positive

There are two widely used correlation coefficients:

A. Pearson's r

Pearson's r is the most commonly used correlation coefficient that measures the linear association between two variables. This coefficient is based on:

  1. The standard deviations of X and Y (Sx and Sy), which measure the amount of variation in each variable. These are the square roots of the variances (S²x and S²y) introduced in Lab 1.

  2. The covariance of XY (Sxy), which measures how X and Y vary or 'change' together.

The covariance is influenced by the units of the variables (meters, kilograms, mm, etc.) and is difficult to interpret. Dividing the covariance of XY by the product of the standard deviations of X and Y removes the influence of units (this process is called 'standardizing'). The result, r, is a ratio that ranges between -1 and +1.

Formula: r = Sxy / (Sx × Sy)

where:
Sx and Sy = the standard deviations of X and Y
Sxy = the covariance of XY

Note: We will not calculate this statistic by hand in this lab.
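We will not calculate r by hand, but the formula is easy to check with a short script. The following is a minimal Python sketch (not part of the lab procedure) that applies the formula to the Example A botany data above:

```python
import math

# Example A data: soil moisture (cm/m) and tree height (m) for trees A-F
x = [9, 11, 13, 16, 17, 19]   # soil moisture
y = [17, 16, 20, 21, 25, 24]  # tree height

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance of XY: how the two variables change together
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Standard deviations of X and Y (square roots of the variances)
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Pearson's r: the standardized covariance, always between -1 and +1
r = s_xy / (s_x * s_y)
print(round(r, 3))  # 0.919 -- a strong positive correlation
```

The result, r ≈ 0.92, agrees with the visual impression from the Example A scatterplot: a strong positive linear relationship.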

Pearson's r requires ratio or interval data that are normally distributed (it is a parametric statistic). For ordinal data, or for ratio/interval data that are skewed or bimodal, we use Spearman's rank correlation coefficient.
 

B. Spearman's rank (rs)

Spearman's rs measures the association between two sets of data that have been ranked.

Formula: rs = 1 - (6 × Σd²) / (N(N² - 1))

where:
d² = the squared difference between the ranks for each observation
N = number of observations

Example: The table below presents population and area information for 8 regions in a country. The data are presented as ranks from 1 to 8 (1 = smallest, 8 = largest).
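The ranked table for this example did not survive conversion, but the rank formula can still be sketched in code. The area ranks below are hypothetical (our own invention), chosen so that Σd² = 40, which reproduces the rs* = 0.524 used in the significance test later in this lab:

```python
# Hypothetical ranks for 8 regions (1 = smallest, 8 = largest).
# The area ranks are illustrative only -- chosen so that the sum of
# squared rank differences (40) matches the rs* = 0.524 in the example.
population_rank = [1, 2, 3, 4, 5, 6, 7, 8]
area_rank       = [5, 2, 3, 1, 8, 4, 6, 7]

n = len(population_rank)
# d²: squared difference between the two ranks for each region
sum_d2 = sum((p - a) ** 2 for p, a in zip(population_rank, area_rank))

# Spearman's rank correlation: rs = 1 - 6*Σd² / (N(N² - 1))
rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(sum_d2, round(rs, 3))  # 40 0.524
```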

Spearman's rs and Pearson's r have the same properties: both range from -1 to +1, with the sign indicating the direction of the relationship and the magnitude indicating its strength.

Correlation coefficients are often applied in exploratory analysis. However, they can also be used to make inferences about population parameters, where ρ (called 'rho') is the population correlation coefficient. The hypothesis tests for r and rs are summarized below.
 

 



Testing for Significant Correlation

Pearson's r

Used to determine if an association exists between two variables. In this test, the null hypothesis is that no significant correlation exists between the two variables (H0: ρ = 0). The alternate hypotheses are:

HA: ρ ≠ 0 (non-directional)
HA: ρ > 0 or HA: ρ < 0 (directional)

Generally, if we do not have a strong theory for the direction of the relationship, we use non-directional hypotheses. But if we know enough about the variables and can expect a certain direction of relationship, we can use directional hypotheses.

Assumptions

Probability distribution and Test statistic

The hypothesis test for r can be conducted in two ways.
  1. calculate a test statistic (t*) using r and compare t* to a critical value on the Student's t distribution
  2. compare r to a critical value based on a probability distribution of r values. The r probability distribution has a similar shape to the Z distribution.
In this lab, we will use the second method, as it requires no additional calculations. We will use SPSS to calculate r as the formula is long and complex (as we saw above).

Critical values

Critical values are based on α (the significance level) and the number of degrees of freedom (n - 2, where n = sample size). All values in the table are positive. If you are using a lower tailed hypothesis, make the critical value negative. Go to the critical r table.

Decision rule

For non-directional hypotheses: reject H0 if r* is greater than the critical value or less than the negative of the critical value.

For directional hypotheses: reject H0 if r* is greater than the critical value (upper tailed), or less than the negative of the critical value (lower tailed).
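The decision rules can be expressed as a small helper function. This is a sketch for illustration only; the function name, parameters, and the example values are our own, not part of the lab procedure:

```python
def reject_h0(r_star, r_crit, tail="two"):
    """Apply the correlation decision rule.

    r_star: calculated correlation coefficient
    r_crit: critical value from the table (always positive)
    tail:   "two" (non-directional), "upper", or "lower"
    """
    if tail == "two":
        # Reject if r* falls in either tail
        return r_star > r_crit or r_star < -r_crit
    if tail == "upper":
        return r_star > r_crit
    if tail == "lower":
        # Make the critical value negative for a lower tailed test
        return r_star < -r_crit
    raise ValueError("tail must be 'two', 'upper', or 'lower'")

print(reject_h0(0.85, 0.632, "two"))     # True: reject H0
print(reject_h0(0.524, 0.643, "upper"))  # False: cannot reject H0
```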

Multiple correlations

Analysts often use SPSS to calculate multiple correlation coefficients. When conducting multiple significance tests, you can summarize the information that is identical in each test in a table (assumptions, probability distribution, etc).
 

Spearman's rs

Used to determine if there is a significant association between two variables. The test uses the same null and alternate hypotheses as Pearson's r.

Assumptions

Probability distribution and Test statistic

The hypothesis test for rs can also be conducted in two ways.
  1. calculate a test statistic (Z*) using rs and compare Z* to a critical value on the normal Z distribution
  2. compare rs to a critical value based on a probability distribution of rs values.
In this lab, we will also use the second method. We will obtain rs from SPSS.

Critical values

Critical values are based on α and the degrees of freedom (n, where n = sample size). All values in the table are positive; if you are using a lower tailed hypothesis, make the critical value negative. Go to the critical rs table.

Decision rule

Use the same decision rules given for Pearson's r.

Example

We will continue with the population and area example from above:

2. Check assumptions

3. State hypotheses
If you have some background information about the relationship, you can establish directional hypotheses. For example, for the relationship between population and area, we may expect a positive relationship (small population in a small region, larger population in a large region). Therefore, we can test for a significant positive correlation. If you are unsure of the type of relationship, you would use a non-directional hypothesis which will test for a significant correlation (it may be positive or negative).

In our example, we expect a positive relationship so we will establish directional (upper tail) hypotheses.

4. Select significance level
We will use the standard α = 0.05 (95% confidence level)

5. Select probability distribution
Because rs is our test statistic, we use the rs probability distribution for our critical values

6. Establish critical value
At α = 0.05 and n = 8, the critical rs = 0.643

7. Calculate test statistic
From above, rs* = 0.524

8. Compare using decision rule
Rule (for upper tailed hypothesis): reject H0 if rs* > the critical value
rs* (0.524) is less than the critical value (0.643), so we cannot reject H0

The p-value of 0.21 obtained from SPSS confirms the decision not to reject H0.

9. State inference
We infer with 95% confidence that there is no significant correlation between population and area in the 8 regions.
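Steps 6 through 8 can be mirrored in a few lines of Python, using the test statistic and critical value from the steps above:

```python
# Values from steps 6 and 7 above
rs_star = 0.524   # calculated Spearman's rs
rs_crit = 0.643   # critical value at alpha = 0.05, n = 8

# Upper tailed decision rule: reject H0 only if rs* exceeds the critical value
decision = "reject H0" if rs_star > rs_crit else "cannot reject H0"
print(decision)  # cannot reject H0
```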
 

Cautions with Correlation

Causality and Spurious Correlation

Causality is defined as "the relation between a cause and its effect" (Webster's dictionary). Often, researchers are tempted to believe that a strong correlation indicates a causal relationship between two variables (i.e. variable A causes an effect in variable B).

When two variables seem to be correlated but are actually both dependent on a third variable (or several other variables), the correlation is called spurious.

Important Note: Because correlation analysis (like all other statistical analyses) depends only on the data values, you may find 'significant' correlations between any two variables.

Example: The scatter plot on the right shows a moderate negative correlation (r = -0.51) between two variables.

The catch: the data for these variables were generated using a random number table. The relationship between the two variables is purely numeric.
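The point generalizes: run enough correlations on pure noise and some will look 'significant'. A quick Python simulation illustrates this; the sample size, number of trials, and critical value (0.632, the two-tailed critical r at α = 0.05 for df = 8) are our choices for illustration, not part of the lab:

```python
import math
import random

random.seed(42)  # fixed seed so the run is repeatable

def pearson_r(x, y):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# 200 trials: correlate two samples of pure random numbers (n = 10 each)
trials = [pearson_r([random.random() for _ in range(10)],
                    [random.random() for _ in range(10)])
          for _ in range(200)]

# By chance alone, roughly 5% of trials should exceed the critical value
spurious = sum(1 for r in trials if abs(r) > 0.632)
print(spurious, "of 200 random pairings look 'significant'")
```

Every 'significant' result here is purely numeric, exactly like the random-number-table example above.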

Aggregation

In geographic research, data are often summarized by aggregating individual observations (people, households, etc.) into groups. Aggregation gives researchers an overall view of patterns in the data. However, because aggregation averages individual observations, the true relationship may be hidden or a false relationship may develop (based purely on numerical association). Different aggregation strategies will often produce different results.
 

Other problems

Five sets of variables all have a correlation of 0.80. However, when you view the scatterplots (below), five very different pictures emerge.

(Scatterplots A through E not shown.)