Recommended Web Links:
- Interpreting Correlation Results: background information on correlation from GraphPad
- Correlation in Finland: an interactive applet; move your cursor over the sliding scale and see how the scatterplot changes
- Two correlation applets from the University of Illinois
In this lab, we will examine the relationship, or 'association', between two variables in a sample. Our questions are: what kind of relationship exists between the variables, and how strong is it?
There are two ways to summarize the relationship between two variables:
- graphically, using a scatterplot
- numerically, using a correlation coefficient
These graphs are an excellent way to get a sense of the relationship between two variables. Scatterplots are usually based on the Cartesian coordinate system, where one variable is represented on the vertical (Y) axis and the other on the horizontal (X) axis. The resulting pattern of points shows the relationship between the variables, which can be described in terms of three criteria: the direction of the relationship (positive or negative), its strength, and its form (whether the points follow a straight line or a curve).
Example A: The table below contains 6 observations collected for a botany project.
A scatterplot of the data is presented below.
What kind of relationship exists between tree height and soil moisture?
The scatterplot shows that soil moisture and tree height have a positive relationship: as soil moisture increases, tree height also increases. We can confirm this relationship by observing trees in a forest. Trees growing in dry or rocky areas tend to be smaller than trees growing in wetter areas or near creeks.
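If you want to reproduce a plot like this outside of SPSS, the sketch below shows one way to draw a scatterplot in Python with matplotlib. The soil-moisture and tree-height values are hypothetical (the lab's actual table is not reproduced here); they were invented only to illustrate a positive relationship like Example A's.

```python
# A minimal scatterplot sketch; all data values are hypothetical.
import matplotlib.pyplot as plt

soil_moisture = [10, 15, 22, 30, 38, 45]          # hypothetical moisture (%), X-axis
tree_height = [4.0, 5.5, 7.0, 9.0, 11.5, 13.0]    # hypothetical height (m), Y-axis

plt.scatter(soil_moisture, tree_height)
plt.xlabel("Soil moisture (%)")   # one variable on the horizontal (X) axis
plt.ylabel("Tree height (m)")     # the other on the vertical (Y) axis
plt.title("Tree height vs. soil moisture")
plt.show()
```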
Example B: In a study on fuel efficiency and vehicle weight, data were collected
from 6 vehicles. The data are presented as a scatterplot (below). What kind of relationship
exists between these variables?
This scatterplot shows that fuel efficiency and vehicle weight have a negative relationship: as vehicle weight increases, fuel efficiency decreases. However, this relationship may be weak, as other factors (driving style, engine size, cargo weight) also affect fuel efficiency. Notice that the points are fairly spread out and do not form a definite line.
What kind of relationship would you expect between the following variables:
The descriptive terms associated with scatterplots are useful. However, statisticians often want a more quantitative description of the relationship between two variables. Correlation coefficients and correlation analysis provide a numerical measure of the direction and strength of the relationship. The correlation between two variables ranges from -1 to +1 (see diagram below).
Some examples of correlation coefficients (and the scatterplots) are shown below:
[Example scatterplots: perfect negative; moderate negative; weak positive; strong positive]
There are two widely used correlation coefficients: Pearson's r and Spearman's rank correlation coefficient (rs).
Pearson's r is the most commonly used correlation coefficient; it measures the linear association between two variables. This coefficient is based on the covariance of X and Y (how the two variables vary together) and the standard deviations of X and Y (how each variable varies on its own).
The standard deviations and the covariance are influenced by the units of the variables (meters, kilograms, mm, etc.) and are difficult to interpret on their own. Dividing the covariance of X and Y by the product of the standard deviations of X and Y removes the influence of units (this process is called 'standardizing'). The result, r, is a ratio that ranges between -1 and +1.
Formula:
$$r = \frac{S_{XY}}{S_X S_Y}$$
where $S_X$ and $S_Y$ are the standard deviations of X and Y, and $S_{XY}$ is the covariance of X and Y.
Note: We will not calculate this statistic by hand in this lab.
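Although we will not calculate r by hand, a short sketch can make the standardizing step concrete. The Python code below is an illustration only (not part of the lab procedure, which uses SPSS), and the X and Y values in it are made up:

```python
# A sketch of Pearson's r = Sxy / (Sx * Sy); the data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # hypothetical X values
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])  # hypothetical Y values

s_xy = np.cov(x, y, ddof=1)[0, 1]  # covariance of X and Y (unit-dependent)
s_x = np.std(x, ddof=1)            # standard deviation of X
s_y = np.std(y, ddof=1)            # standard deviation of Y

r = s_xy / (s_x * s_y)             # standardizing removes the units
print(round(r, 3))                 # same value as np.corrcoef(x, y)[0, 1]
```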
Pearson's r requires ratio or interval data that are normally distributed (it is a parametric statistic). For ordinal data, or for ratio/interval data that are skewed or bimodal, we use Spearman's rank correlation coefficient (rs).
Formula:
$$r_s = 1 - \frac{6 \sum d^2}{N(N^2 - 1)}$$
where $d^2$ is the squared difference between the ranks for each observation and $N$ is the number of observations.
Example: The table below presents population and area information for 8 regions in a country. The data are presented as ranks from 1 to 8 (1 = smallest, 8 = largest). Steps 1 and 2: rank each variable, then calculate the difference between the ranks (d) and its square (d²) for each observation:
| Region | Population Rank | Area Rank | d | d² |
|---|---|---|---|---|
| A | 1 | 1 | 0 | 0 |
| B | 2 | 2 | 0 | 0 |
| C | 3 | 8 | -5 | 25 |
| D | 4 | 5 | -1 | 1 |
| E | 5 | 3 | 2 | 4 |
| F | 6 | 6 | 0 | 0 |
| G | 7 | 4 | 3 | 9 |
| H | 8 | 7 | 1 | 1 |
| Sum of squared differences (Σd²) | | | | 40 |
Step 3: Fill in the formula using the sum of squared differences (Σd² = 40) and N = 8:
$$r_s = 1 - \frac{6(40)}{8(8^2 - 1)} = 1 - \frac{240}{504} = 0.524$$
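For readers who want to check the arithmetic, here is a short Python sketch (not part of the lab, which uses SPSS) that computes rs for the region data directly from the formula:

```python
# Spearman's rs for the 8 regions, straight from the formula
# rs = 1 - 6 * sum(d^2) / (N * (N^2 - 1)), using the ranks in the table above.
pop_rank = [1, 2, 3, 4, 5, 6, 7, 8]   # population ranks, regions A-H
area_rank = [1, 2, 8, 5, 3, 6, 4, 7]  # area ranks, regions A-H

d_squared = [(p - a) ** 2 for p, a in zip(pop_rank, area_rank)]
n = len(pop_rank)

r_s = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))
print(sum(d_squared), round(r_s, 3))  # prints 40 and 0.524, as above
```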
Note: if you have ratio or interval data that are skewed or bimodal, you can still use Spearman's rs. However, the data must be converted into ranks before calculating the coefficient.
Spearman's rs and Pearson's r have the same properties: both range from -1 to +1, the sign indicates the direction of the relationship, and the absolute value indicates its strength.
Correlation coefficients are often applied in exploratory analysis. However, they can also be used to make inferences about population parameters, where ρ (called 'rho') is the population correlation coefficient. The hypothesis tests for r and rs are summarized below.
Generally, if we do not have a strong theory for the direction of the relationship, we use non-directional hypotheses. But if we know enough about the variables and can expect a certain direction of relationship, we can use directional hypotheses.
2. Check assumptions
The data are ordinal (ranks), so Spearman's rs, a non-parametric statistic, is the appropriate coefficient.
3. State hypotheses
If you have some background information about the relationship, you can establish directional
hypotheses. For example, for the relationship between population and area, we may expect a
positive relationship (small population in a small region, larger population in a large region).
Therefore, we can test for a significant positive correlation. If you are unsure of the type
of relationship, you would use a non-directional hypothesis which will test for a
significant correlation (it may be positive or negative).
In our example, we expect a positive relationship, so we will establish directional (upper-tail) hypotheses: H0: ρs ≤ 0 versus HA: ρs > 0.
4. Select significance level
We will use the standard α = 0.05 (95% confidence level).
5. Select probability distribution
Because rs is our test statistic, we use the rs probability distribution for our critical values.
6. Establish critical value
At α = 0.05 and N = 8, the critical value is rs = 0.643.
7. Calculate test statistic
From above, the test statistic is rs* = 0.524.
8. Compare using decision rule
Rule (for an upper-tailed hypothesis): reject H0 if rs* > rs. Here rs* (0.524) is less than rs (0.643), so we cannot reject H0.
The p-value of 0.21 obtained from SPSS confirms the decision not to reject H0.
9. State inference
We infer with 95% confidence that there is no significant correlation between
population and area in the 8 regions.
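The same test can be reproduced outside SPSS. The sketch below uses Python's scipy as a stand-in (an assumption on our part; scipy approximates the p-value with a t-distribution, so it may differ slightly from the value SPSS reports):

```python
# Re-running the upper-tailed Spearman test for the region example.
from scipy.stats import spearmanr

pop_rank = [1, 2, 3, 4, 5, 6, 7, 8]
area_rank = [1, 2, 8, 5, 3, 6, 4, 7]

r_s, p_two_sided = spearmanr(pop_rank, area_rank)
p_upper = p_two_sided / 2 if r_s > 0 else 1 - p_two_sided / 2  # one-tailed p

critical = 0.643  # critical rs at alpha = 0.05, one-tailed, N = 8 (table below)
print(round(r_s, 3), round(p_upper, 3))
print("reject H0" if r_s > critical else "cannot reject H0")  # cannot reject
```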
Causality is defined as "the relation between a cause and its effect" (Webster's dictionary). Often, researchers are tempted to believe that a strong correlation indicates a causal relationship between two variables (i.e. variable A causes an effect in variable B).
Example: suppose a study of a city's neighbourhoods finds a strong positive correlation between the proportion of apartments and the proportion of seniors. Does apartment living cause aging? No. There are many factors that influence the age structure of a neighbourhood. Seniors often move into apartments when maintaining a house becomes too much work. Therefore, neighbourhoods with many apartments tend to have a larger proportion of seniors. The two variables are related but not linked in a 'cause and effect' way. The relationship may also be invalid if the study included more areas (e.g. the downtown area, where the average age is lower).
Important Note: Because correlation analysis (like all other statistical analyses) depends on the data values, you may find significant correlations between any two variables. Example: the scatterplot on the right shows a moderate negative correlation (r = -0.51) between two variables. The catch: the data for these variables were generated using a random number table. The relationship between the two variables is purely numeric.
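To see this for yourself, the sketch below (illustrative only; every value is random and the seed is arbitrary) draws many pairs of purely random variables and reports the strongest correlation it stumbles on. With enough pairs, a 'moderate' correlation appears by chance alone:

```python
# With many pairs of random variables, some moderate correlations appear
# purely by chance.
import numpy as np

rng = np.random.default_rng(seed=1)
strongest = 0.0
for _ in range(100):                 # 100 pairs of random variables
    x = rng.random(10)               # 10 random values for X
    y = rng.random(10)               # 10 random values for Y
    r = np.corrcoef(x, y)[0, 1]
    strongest = max(strongest, abs(r))

print(round(strongest, 2))           # often 0.6 or more - from pure noise
```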
In geographic research, data are often summarized by aggregating individual observations
(people, households, etc.) into groups. Aggregation gives researchers an overall view of
patterns in the data. However, because aggregation averages individual observations,
the true relationship may be hidden or a false relationship may develop (based purely on
numerical association). Different aggregation strategies will often produce different
results.
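A small simulation can make this concrete. The sketch below (synthetic data, illustrative only) builds individual-level observations where within-group noise swamps the relationship, then aggregates the individuals into group means; the aggregated correlation comes out far stronger than the individual-level one:

```python
# How aggregation can inflate a correlation: noisy individual observations
# show a weak r, but their group means show a very strong r.
import numpy as np

rng = np.random.default_rng(seed=2)

true_x = np.array([10, 20, 30, 40, 50])  # underlying group levels for X
true_y = np.array([5, 10, 15, 20, 25])   # underlying group levels for Y

# 50 individuals per group, with large independent noise around each level
x_groups = [mx + rng.normal(0, 15, size=50) for mx in true_x]
y_groups = [my + rng.normal(0, 15, size=50) for my in true_y]

x_indiv = np.concatenate(x_groups)
y_indiv = np.concatenate(y_groups)

r_indiv = np.corrcoef(x_indiv, y_indiv)[0, 1]        # weak: noise dominates
r_agg = np.corrcoef([g.mean() for g in x_groups],
                    [g.mean() for g in y_groups])[0, 1]  # near 1 after averaging
print(round(r_indiv, 2), round(r_agg, 2))
```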
Five sets of variables all have a correlation of 0.80. However, when you view the scatterplots (below), five very different pictures emerge.
[Five scatterplots, labelled A through E, each with r = 0.80 but a very different pattern of points]
Critical values of Pearson's r
---|---|---|---|---|---
 | Level of significance (α) for a one-tailed test | | | | |
df (n−2) | 0.05 | 0.025 | 0.01 | 0.005 | 0.0005 |
 | Level of significance (α) for a two-tailed test | | | | |
 | 0.1 | 0.05 | 0.025 | 0.01 | 0.001 |
1 | 0.988 | 0.997 | 1.000 | 1.000 | 1.000 |
2 | 0.900 | 0.950 | 0.980 | 0.990 | 0.999 |
3 | 0.805 | 0.878 | 0.934 | 0.959 | 0.991 |
4 | 0.729 | 0.811 | 0.882 | 0.917 | 0.974 |
5 | 0.669 | 0.755 | 0.833 | 0.875 | 0.951 |
6 | 0.622 | 0.707 | 0.789 | 0.834 | 0.925 |
7 | 0.582 | 0.666 | 0.750 | 0.798 | 0.898 |
8 | 0.549 | 0.632 | 0.716 | 0.765 | 0.872 |
9 | 0.521 | 0.602 | 0.685 | 0.735 | 0.847 |
10 | 0.497 | 0.576 | 0.658 | 0.708 | 0.823 |
11 | 0.476 | 0.553 | 0.634 | 0.684 | 0.801 |
12 | 0.458 | 0.532 | 0.612 | 0.661 | 0.780 |
13 | 0.441 | 0.514 | 0.592 | 0.641 | 0.760 |
14 | 0.426 | 0.497 | 0.574 | 0.623 | 0.742 |
15 | 0.412 | 0.482 | 0.558 | 0.606 | 0.725 |
16 | 0.400 | 0.468 | 0.543 | 0.590 | 0.708 |
17 | 0.389 | 0.456 | 0.529 | 0.575 | 0.693 |
18 | 0.378 | 0.444 | 0.516 | 0.561 | 0.679 |
19 | 0.369 | 0.433 | 0.503 | 0.549 | 0.665 |
20 | 0.360 | 0.423 | 0.492 | 0.537 | 0.652 |
25 | 0.323 | 0.381 | 0.445 | 0.487 | 0.597 |
30 | 0.296 | 0.349 | 0.409 | 0.449 | 0.554 |
35 | 0.275 | 0.325 | 0.381 | 0.418 | 0.519 |
40 | 0.257 | 0.304 | 0.358 | 0.393 | 0.490 |
45 | 0.243 | 0.288 | 0.338 | 0.372 | 0.465 |
50 | 0.231 | 0.273 | 0.322 | 0.354 | 0.442 |
60 | 0.211 | 0.250 | 0.295 | 0.325 | 0.408 |
70 | 0.195 | 0.232 | 0.274 | 0.302 | 0.380 |
80 | 0.183 | 0.217 | 0.257 | 0.283 | 0.357 |
90 | 0.173 | 0.205 | 0.242 | 0.267 | 0.338 |
100 | 0.164 | 0.195 | 0.230 | 0.254 | 0.321 |
Critical values of Spearman's rs
---|---|---|---
 | Significance level (α) for a one-tailed test | | |
N | 0.05 | 0.025 | 0.01 |
 | Significance level (α) for a two-tailed test | | |
 | 0.1 | 0.05 | 0.025 |
5 | 0.900 | ||
6 | 0.829 | 0.886 | 0.943 |
7 | 0.714 | 0.786 | 0.893 |
8 | 0.643 | 0.738 | 0.833 |
9 | 0.600 | 0.683 | 0.783 |
10 | 0.564 | 0.648 | 0.746 |
12 | 0.506 | 0.591 | 0.712 |
14 | 0.456 | 0.545 | 0.645 |
16 | 0.425 | 0.507 | 0.601 |
18 | 0.399 | 0.476 | 0.564 |
20 | 0.377 | 0.450 | 0.534 |
22 | 0.359 | 0.428 | 0.508 |
24 | 0.343 | 0.409 | 0.485 |
26 | 0.329 | 0.392 | 0.465 |
28 | 0.317 | 0.377 | 0.448 |
30 | 0.306 | 0.364 | 0.432 |
© University of Victoria 2000-2001
Geography 226 - Lab 8. Developed by S. Adams and M. Flaherty. Updated: September 29, 2001.