LAB 1: Data and Descriptive Statistics

 

In this lab, we will examine:
Recommended web links:
The Shodor Education Foundation, Inc : Try the Histogram, Measures, Normal distribution or Skew distribution Java applets. For each applet, click the HOW button for instructions.

STEPS Glossary from the University of Glasgow: Great definitions of statistical terms.

Statistics issues in the news: Majority of Canadians back missile shield

 
 

Understanding Data

What are data?

DATA - numbers, symbols or words: mm, 1,172.5

INFORMATION - data in context:

A newspaper article reports that 1,172.5 mm is the mean rainfall for July recorded in Vancouver.

KNOWLEDGE - practical or theoretical understanding: based on your knowledge of the climate in south coastal BC, you know that 1,172.5 mm of rainfall in July is simply not possible.

What is a variable?

A variable is a characteristic of an entity that may take on different values. If you are studying local climate, your variables might include temperature, precipitation, wind speed, humidity and hours of sunlight. The values of each variable change over space and time (see the table of weather records for two locations). You may find trends and patterns in the variations that will help to explain local climate events.

Example: Weather conditions on July 24 at Victoria Airport

Variable             Location A    Location B
Day temperature      22            26
Night temperature    6             8
Wind speed           10 km/h       2 km/h
Precipitation        2 mm          10 mm

 

Characteristics of Data

Data and variables are classified in three ways:

A. Quantitative
numeric data
OR Qualitative
number, text or symbol used as codes
 
B. Discrete
events or objects are counted in whole numbers (e.g. 1 earthquake or 2 earthquakes, not 1.5 earthquakes)
OR Continuous
variable may take on any value on a number scale (e.g. 1.6 kg)
 
C. Level of measurement: indicates the amount of information a variable contains and which mathematical operations can be performed on the data. There are 4 levels of measurement:
 
1. NOMINAL: Numbers are used as labels, but there is no hierarchy implied by the codes. You can count the observations in each category, but you cannot use the codes to establish an order.

Example: features on a map are coded as follows:

    1 = Lake
    2 = River
    3 = Road
    4 = Campsite
    5 = Urban area

You can count the number of lakes, roads, etc. on the map but you cannot order, add, multiply, or divide the numerical codes.

 
2. ORDINAL: Observations can be ranked or have a rating scale attached. You can count and order the observations.

Example: in movie theaters, you can buy a small, medium or large popcorn. While you know a large popcorn is bigger than small or medium, you do not know exactly how much bigger.

 
3. INTERVAL: The distance interval between units on a scale can be determined, but the zero point is arbitrary. Values can be added and subtracted but not meaningfully multiplied or divided.

Example: an interval of 365 days separates the start of 1998, 1999 and 2000, but the zero point (0 AD) is arbitrary as time did not begin then. You can say that 1990 AD minus 1885 AD equals 105 years. However, 1990 AD divided by 300 AD equals 6.6, which is not meaningful (i.e. you cannot say that 1990 AD is '6.6 times bigger' than 300 AD). Additional examples include elevation (the zero point of 'mean sea level' is arbitrary) and temperature in °C or °F (0° is arbitrary on these scales).

 
4. RATIO: Values have a natural zero that indicates the absence of the property under consideration. Values can be multiplied and divided meaningfully.

Example: in a fish tagging survey, salmon A is 20.3 cm long and salmon B is 46.7 cm long. B is 26.4 cm longer than A (46.7 - 20.3), or 2.3 times as long as A (46.7 / 20.3).


 
The diagram below summarizes the classification of variables:

Practice: classify the data from this forestry survey:

Tree ID          Species        Height (m)   Elevation (m)   # of scars   Damage rating
B-11             Pine           23.5         347             1            Low
B-15             Pine           15.6         960             5            Severe
C-42             Spruce         22.3         826             3            Medium
F-01             Douglas-fir    45.2         450             0            Low
Qual or Quant
D or C
Level
Check your answers

The statistical tests that you can apply depend on the measurement level of the data you collect. If you collect data on people's responses to political statements using rankings (1 = strongly disagree to 5 = strongly agree), the data are ordinal, so you cannot meaningfully calculate statistics that require interval or ratio data (such as the mean or standard deviation).
 

Precision, Accuracy and Validity

Precision: the exactness of the measurement. Often, precision depends on the measurement instrument. With a ruler, you could measure the width of a tree branch to within 1 mm. With digital calipers, you could read to within 0.1 mm or even 0.01 mm. Therefore, the calipers are more precise than the ruler.

Accuracy: how close the measured value is to the true value. If the calipers are improperly calibrated, your measurements may be several millimetres over or under the true value. The measurements would be precise, but not accurate, as they are not close to the true value. If measurements are always over or always under, they are considered to be consistently or systematically biased.

Validity: in its simplest form, validity is the degree to which a variable measures what it is supposed to measure. For many physical variables, validity is not an issue - stream depth is stream depth. However, surrogate variables are often used to measure and understand general concepts, such as quality-of-life, human development and ecological productivity. How would you 'measure' quality-of-life? Is tree height a valid measure of ecological productivity?

 

What are DESCRIPTIVE STATISTICS?

These are numerical or graphical tools for describing, exploring, summarizing, and presenting data.

Questions you might ask:            Tools for finding the answers:
What is the pattern of values?      Distributions: frequencies, histogram, skewness, kurtosis
What is the typical value?          Measures of central tendency: mean, median, mode
How spread apart are the values?    Measures of dispersion: standard deviation, interquartile range, range

 

A. Data Distributions

Consider a dataset containing the heights of 30 adults. This collection of heights is called a distribution of data.

Height (cm)
158 165 168 168 170
171 172 172 172 174
174 174 176 176 177
177 177 177 177 179
179 179 181 181 183
183 185 188 192 195

What do you expect the distribution to look like if you lined the group of people up according to height? There are several ways to view the pattern of your distribution. One way is to construct a dot plot, showing the values as dots on a number line. Whenever you have 2 or more subjects with the same value, you stack the dots. You can then begin to see the shape of the distribution: many dots in the middle, a few at each end.

Constructing dot plots is very time-consuming when you have more than about 20 points. A more efficient method is the histogram, which is based on a frequency table. To build a histogram you must establish classes or intervals for your data (if you have nominal or ordinal data, your classes are already defined).

Guidelines for defining intervals for interval or ratio data:

  1. Use simple limits for your intervals: 10.0 to 14.9, 15.0 to 19.9, etc., are better limits than 3.4 to 7.8, 7.9 to 12.2, etc. Start your lowest limit at the first round number below the first observation and your upper limit just above the highest observation.
     
  2. Make sure your intervals are all the same width; otherwise, your summary may be very misleading.
     
  3. Make sure your intervals include all the observations and do not overlap. If you use the intervals 10 to 15 and 15 to 20, where would you place the value 15? Also, it's a good idea to specify your intervals to one more decimal place than your most precise observation so that it is clear how your data fit into the intervals.
     
  4. Select an appropriate number of groups:
    • Order your data and look at the distribution of values - are there any natural gaps or groups? Try to make your intervals capture these trends in the distribution (refer back to the dot plot of heights).
       
    • Find a balance between data completeness and simplicity. If you use many groups, it will be difficult to see a pattern as each group may only have one or two observations. If you use only a few groups, your observations may be lumped together and you miss important trends.

     

In this case, we divided the data into 5 groups, starting at 150 cm and using intervals of 10 cm. You could divide the data into more or fewer intervals or start at a different place (perhaps 155 cm). Once the intervals are established, we count the number of observations in each interval to find the frequency. We can also calculate the relative frequency, which is the frequency of each interval divided by the total number of observations. Relative frequencies make it easier to compare distributions with different sample sizes or different measurement units. Relative frequencies can be expressed as a decimal or a percentage.

Frequency Table for Height Sample
Interval         Frequency   Relative frequency   Percent
150.0 - 159.9    1           0.03                 3%
160.0 - 169.9    3           0.10                 10%
170.0 - 179.9    18          0.60                 60%
180.0 - 189.9    6           0.20                 20%
190.0 - 199.9    2           0.07                 7%
Total            30          1.00                 100%
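The counting behind a frequency table can be sketched in a few lines of Python. This is only an illustration, assuming the 30 heights listed earlier and the classes described in the text (start at 150 cm, width 10 cm, 5 classes); the function name `frequency_table` is made up for this example.

```python
# Heights (cm) of the 30 adults listed in the lab.
heights = [
    158, 165, 168, 168, 170,
    171, 172, 172, 172, 174,
    174, 174, 176, 176, 177,
    177, 177, 177, 177, 179,
    179, 179, 181, 181, 183,
    183, 185, 188, 192, 195,
]


def frequency_table(values, start, width, n_classes):
    """Count observations in equal-width classes [lower, lower + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    total = len(values)
    rows = []
    for i, count in enumerate(counts):
        lower = start + i * width
        # Relative frequency = class frequency / total number of observations.
        rows.append((lower, lower + width, count, count / total))
    return rows


table = frequency_table(heights, start=150, width=10, n_classes=5)
for lower, upper, freq, rel in table:
    print(f"{lower} to <{upper}: frequency {freq:2d}, relative frequency {rel:.2f} ({rel:.0%})")
```

Changing `start` or `width` (for example, starting at 155 cm) is a quick way to see how the choice of intervals alters the apparent shape of the distribution.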

Drawing a histogram of the frequencies provides a visual picture of the data. The x-axis contains the categories or intervals; the y-axis is labeled with the frequencies (relative frequencies are also shown on our graph). Notice how the rectangles in the middle of the graph are taller than those at either end of the graph - the size and area of each rectangle should be proportional to the frequency of observations in each interval.

Notice how the histogram makes the distribution look 'blocky'. This is fine for nominal, ordinal or discrete data, and spaces are usually left between the categories on the graph in these cases. For continuous data (where the data may have any value within the distribution), a frequency polygon is a better representation of the frequencies. The area of the polygon is equal to the sum of the rectangle areas.

The shape of this polygon is still crude because it is based on few observations. If we had a large dataset (with thousands of observations), the polygon would start to smooth out into a curve. The area under this curve represents the frequencies in your distribution. The concept of area under the curve is very important in inferential statistics, as you will see in later labs.

In the diagram above, the curve is bell-shaped, has one peak in the middle and is symmetrical (the left side is a mirror-image of the right side). Distributions with this shape are considered normal as many distributions in nature follow this shape. For example, if you measure the size of fish caught in a net, you would probably find a few small fish, a few big fish, while the bulk of the fish would be somewhere between the two extremes.

Characteristics of the normal curve

However, there are distributions that do not follow this ideal shape. The table below shows three ways of describing the shape of your curve. Compare these shapes to the normal curve above.

Peakedness or kurtosis (not required for Geog 226)
  • Mesokurtic: distribution has central peak but data also spread to ends (also called "bell-shaped")
  • Leptokurtic: observations are very clustered in a central peak.
  • Platykurtic: distribution is flat - peak is minor or absent

Number of peaks
  • Unimodal: one main peak
  • Bimodal: two peaks (there may be two separate factors influencing the distribution of data, e.g. one peak in height for males, one for females)
  • Multi-modal: many peaks

Symmetry or skewness
  • Symmetrical: the two sides of the distribution are equal
  • Positive skew: the long tail of the distribution stretches to the positive (large values) end of the x-axis
  • Negative skew: the long tail of the distribution stretches to the negative (small values) end of the x-axis

The width and number of intervals selected will influence the shape of your histogram and your conclusions about the nature of the distribution. Your histogram should highlight the characteristics of your data, not hide them. The diagram below shows how different interval widths can change the shape of a histogram, and your interpretation of the distribution. The distribution on the left would be considered platykurtic and unimodal, the distribution on the right is mesokurtic and bimodal.

Many statistical tests use the normal curve as a model for the 'expected' distribution. If a distribution is highly skewed or has multiple peaks, you may not be able to use certain tests. Always look at your data before plunging into statistical analysis!

 

B. Measures of Central Tendency

An important characteristic of a distribution is the location of the typical value or 'center' of the data distribution. There are several numerical measures of central tendency, each with certain advantages and disadvantages.
 

1. Arithmetic Mean

The arithmetic mean, \(\mu\), represents the 'average' observation in a data set. It is also the center of gravity or balancing point in the data set. The mean is the sum of all observations divided by the number of observations. The mean can be calculated for interval or ratio data only.

Formula: \(\mu = \frac{\sum X_i}{N}\)

where:
\(\mu\) is the mean (population)
\(\sum\) means 'take the sum of'
\(X_i\) = each observation in the dataset
\(N\) = number of observations (population)

Example: given stream velocities (m/sec): 2, 6, 17, 49, 88, 56, 33, 25, 4

\(\mu = \frac{2 + 6 + 17 + 49 + 88 + 56 + 33 + 25 + 4}{9} = \frac{280}{9} = 31.1\) m/sec

The mean is the most widely used measure of central tendency. However, it is not always the best measure. It may be influenced by extreme values (outliers).

Example: the mean of {2, 3, 4, 5, 8, 9, 11, 15, 223} is also 31 m/sec (280/9 = 31.1). But if you remove the value 223, the mean drops to about 7 m/sec (57/8 = 7.1).
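The effect of a single outlier on the mean is easy to check numerically. The sketch below uses the two data sets from the examples above; the helper name `mean` is chosen for this illustration only.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)


velocities = [2, 6, 17, 49, 88, 56, 33, 25, 4]
with_outlier = [2, 3, 4, 5, 8, 9, 11, 15, 223]
# Drop the extreme value 223 to see how much it pulls the mean upward.
without_outlier = [v for v in with_outlier if v != 223]

print(round(mean(velocities), 1))
print(round(mean(with_outlier), 1))
print(round(mean(without_outlier), 1))
```

Both nine-value data sets sum to 280, so they share the same mean even though their values are distributed very differently.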
 

2. Median

The median is the midpoint of an ordered data set: 50% of the observations fall above it, 50% fall below it. The median can be calculated for ordinal, interval or ratio data.

Interval or ratio data:

Position of Md = (n+1)/2 where: n = the number of ranked observations. The median (Md) is the value of the observation at that position in the ordered data.

Example:
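As a hedged illustration of the position formula, the sketch below finds the median of the stream velocities used for the mean. It assumes the common convention that, when (n+1)/2 falls halfway between two observations (even n), the median is the average of those two neighbours; the function name `median` is chosen for this example.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2          # 1-based position in the ordered list
    lower = ordered[int(position) - 1]
    if position.is_integer():       # odd n: the position is a whole number
        return lower
    upper = ordered[int(position)]  # even n: average the two neighbours
    return (lower + upper) / 2


velocities = [2, 6, 17, 49, 88, 56, 33, 25, 4]
print(median(velocities))                     # 5th of 9 ordered values
print(median([2, 3, 4, 5, 8, 9, 11, 15]))     # halfway between 4th and 5th
```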

For categorical data, use the formula (n+1)/2 to find out where (in which category) the median lies. This method can be used for ordinal data and interval or ratio data that have been organized into categories.

Example: The table below contains income data. What is the median income level?

Income category Number of people
$0 to $14,999 546
$15,000 to $29,999 1,220
$30,000 to $44,999 3,275
$45,000 to $59,999 60
Total observations (n) 5,101

Using the formula (n+1)/2, we find (5,101+1)/2 = 2551. This indicates that the median is the 2,551st observation. The 2,551st observation lies in the 3rd category (count observations starting at lowest category). Therefore, the median income level is $30,000 to $44,999.
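The same search can be written as a running (cumulative) count over the categories. This is a minimal sketch using the income table above; `median_category` is a name made up for the example, and it assumes the median position does not fall exactly between two categories.

```python
income_counts = [
    ("$0 to $14,999", 546),
    ("$15,000 to $29,999", 1220),
    ("$30,000 to $44,999", 3275),
    ("$45,000 to $59,999", 60),
]


def median_category(categories):
    """Return the category holding the (n + 1) / 2-th observation."""
    n = sum(count for _, count in categories)
    position = (n + 1) / 2
    cumulative = 0
    for label, count in categories:   # categories must be in ascending order
        cumulative += count
        if position <= cumulative:
            return label, position
    raise ValueError("no observations")


label, position = median_category(income_counts)
print(f"Observation {position:.0f} lies in the category {label}")
```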

The median is a good descriptive measure of central tendency for ordinal data and skewed interval or ratio data. However, it is insensitive to the spread of values in a data set.

Example:

3. Mode

The mode is the most frequently occurring value in a dataset. There can be more than one mode if two or more values are equally common.

Example: the data below are the number of children per family for 100 families.

Number of children Frequency
0 5
1 9
2 55
3 24
4 7

The mode is 2 children per family because 2 is the most frequently occurring value. In this case, it may be preferable to report the mode rather than to say there are, on average, 2.2 children per family.
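A short sketch of how the mode and the mean can be read from this frequency table. The variable names are illustrative; the list `modes` allows for ties, since a data set can have more than one mode.

```python
# Number of children per family -> number of families (from the table above).
children = {0: 5, 1: 9, 2: 55, 3: 24, 4: 7}

highest = max(children.values())
# Every value that reaches the highest frequency is a mode.
modes = [value for value, freq in children.items() if freq == highest]

n = sum(children.values())
mean_children = sum(value * freq for value, freq in children.items()) / n

print(modes, n, round(mean_children, 2))
```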

The mode can be used for any level of measurement. Technically, nominal data have no 'center' because the data cannot be ordered. However, the mode can be used to identify the value with the highest frequency. Continuous data sets often do not contain identical values so the mode is not always a practical measure. However, if continuous data are grouped into categories, the modal class is the category with the highest frequency.
 

Which measure of central tendency should I use?
The mean is the most commonly used measure of central tendency as it takes into account the magnitude (or value) of each observation. However, in some cases, the mean may not be appropriate.

Which measure(s) of central tendency would you use for:

  1. Ordinal data?
  2. Distributions with extreme values?
  3. Nominal data?
  4. Bimodal distributions?
  5. Skewed distributions?
     

Interaction between the mean, median and mode

If data follow the normal distribution, the mean, median and mode all have the same value. Because the normal distribution is symmetrical, the mean falls in the middle of the distribution and the median divides the distribution in half. The modal class, with the highest frequency, also lies in the middle of the distribution.
If the distribution is skewed, the mean, median and mode no longer have the same value. For example, if the distribution is negatively skewed, the mode lies where the highest number of observations are. The mean is closer to the long tail because the extreme values influence its value. The median falls between the mean and mode as it divides the distribution in half.
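The pull of the long tail on the mean can be checked with the outlier data set from the section on the arithmetic mean. That data set is positively skewed (one very large value), so it shows the mirror image of the negative-skew case described above: the mean sits above the median. The sketch uses Python's standard `statistics` module.

```python
import statistics

# Positively skewed data: one extreme value in the upper tail.
skewed = [2, 3, 4, 5, 8, 9, 11, 15, 223]

mean_value = statistics.mean(skewed)
median_value = statistics.median(skewed)

# The mean is dragged toward the long tail; the median stays near the bulk.
print(round(mean_value, 1), median_value)
```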

 

C. Special Topic: Weighted Means

This section presents 3 methods for calculating means with weighting factors:

  1. Overall mean of several samples,
  2. Means using counts (areas or populations),
  3. Means using categories.
     

a) Overall Mean

You may need to calculate the overall mean for several samples. If each sample has the same size, you could simply average the means. If the sample sizes are different, your mean will be incorrect if you use this method. One option is to combine all the data in one large spreadsheet and calculate 'the' mean. However, if you do not have access to the original data, you could calculate the overall mean by weighting each sample mean by the sample size.

Formula: \(\bar{X}_w = \frac{\sum n_i \bar{X}_i}{\sum n_i}\) where:
\(\bar{X}_i\) = mean of each sample
\(n_i\) = number of observations in each sample

To find the overall mean:

  1. Multiply each mean by its sample size
  2. Sum the results
  3. Divide by the sum of the sample sizes

Example: To determine average tree height in a forested area, researchers laid out four 30 m by 30 m sample plots. At each plot, they measured the height of every 10th tree. The mean heights and sample sizes are summarized below. The individual tree heights are not available, so we cannot combine all the data and use the simple arithmetic mean.

Sample Mean height (\(\bar{X}_i\)) Sample size (\(n_i\))
A 15.6 m 49
B 21.9 m 26
C 25.3 m 17
D 28.8 m 8
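The three steps listed above (multiply, sum, divide) can be sketched as follows for the four plots. The function name `overall_mean` is chosen for this illustration.

```python
# Plot -> (mean height in m, sample size), from the table above.
plots = {
    "A": (15.6, 49),
    "B": (21.9, 26),
    "C": (25.3, 17),
    "D": (28.8, 8),
}


def overall_mean(samples):
    """Weight each sample mean by its sample size, then divide by total n."""
    samples = list(samples)
    total_n = sum(n for _, n in samples)
    weighted_sum = sum(mean * n for mean, n in samples)
    return weighted_sum / total_n


result = overall_mean(plots.values())
print(f"Overall mean height: {result:.1f} m")
```

Because plot A contributes almost half of the trees, its low mean height counts for more than the mean of the sparsely sampled plot D.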

If you had simply calculated the average of the means alone, what would your result be? Would this result be correct?
 

b) Means using counts

You can also use areas or population counts as the weighting factor if required.

Formula: \(\bar{X}_w = \frac{\sum w_i X_i}{\sum w_i}\) where:
\(X_i\) = each observation in the dataset
\(w_i\) = weighting factor

Practice: the mean income and the population for five enumeration areas are summarized on the map below. Calculate the overall mean income for the census tract.

\(\bar{X}_w\) = sum of (Population * Income) for each enumeration area, divided by total census population

= 245($1,500) + 75($2,560) + 150($1,250) + ...

= Check your answer


 

c) Means using categories

If you have interval or ratio data in categories, you can calculate the overall mean using the number of observations as the weights and the category midpoints as approximate mean values.

Formula: \(\bar{X} = \frac{\sum f_i m_i}{n}\) where:
\(f_i\) = the frequency of each category (the number of observations)
\(m_i\) = the midpoint value of each category
\(n\) = total number of observations

Example: The data below are income categories and frequency counts; calculate the average income:

Income category Frequency
$15,000 to $29,999 25
$30,000 to $44,999 60
$45,000 to $59,999 38

Step 1 - Calculate the midpoint of each category as follows:

  1. Subtract the low value from the high value
  2. Divide this difference by 2
  3. Add this amount to the low value.

    Example: The midpoint for the $15,000 to $29,999 category is:

      mp = low value + (high value - low value)/2
      mp = $15,000 + ($29,999 - $15,000)/2
      mp = $15,000 + $7,499.50
      mp = $22,499.50

    The midpoint for the $30,000 to $44,999 category is:

      mp = $30,000 + ($44,999 - $30,000)/2
      mp = $30,000 + $7,499.50
      mp = $37,499.50

Step 2 - Multiply the midpoint by the frequency for each category. Add the frequencies to get n and calculate the sum of \(f_i m_i\):

Category             Midpoint (\(m_i\))   Frequency (\(f_i\))   \(f_i m_i\)
$15,000 - $29,999    $22,499.50           25                    562,487.5
$30,000 - $44,999    $37,499.50           60                    2,249,970
$45,000 - $59,999    $52,499.50           38                    1,994,981
Sum                                       123                   4,807,439

Step 3 - Divide \(\sum f_i m_i\) by n to obtain the mean: 4,807,439 / 123 = $39,084.87.
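Steps 1 to 3 can be sketched together in Python. The helper names `midpoint` and `grouped_mean` are invented for this example; the data are the three income categories from the table.

```python
# (low value, high value, frequency) for each income category.
categories = [
    (15000, 29999, 25),
    (30000, 44999, 60),
    (45000, 59999, 38),
]


def midpoint(low, high):
    """Step 1: low value plus half the distance to the high value."""
    return low + (high - low) / 2


def grouped_mean(cats):
    """Steps 2 and 3: sum of frequency * midpoint, divided by n."""
    n = sum(freq for _, _, freq in cats)
    total = sum(freq * midpoint(low, high) for low, high, freq in cats)
    return total / n


print(midpoint(15000, 29999))
print(round(grouped_mean(categories), 2))
```

The result is an approximation, since every observation in a category is treated as if it sat at the category midpoint.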

 

D. Measures of Dispersion

While measures of central tendency identify the average or most frequent value in a distribution, they give no indication about the spread or dispersion of the values. Are the values very similar (low dispersion) or is there a big difference between the smallest and largest values (high dispersion)? There are several measures of dispersion which 'measure' the degree of spread from the 'center' of the data set.
 

1. Range

The range is the difference between the maximum and minimum values in a dataset. It can be used for interval or ratio data.

Formula: Range = Xmax - Xmin

Example: your data are rainfall (mm): 102, 153, 185, 142, 112, 197, 138, 166, 88

Range = 197 - 88 = 109 mm

The range provides a very limited insight into the amount of dispersion in a distribution as only the largest and smallest values are considered. Also, the range may be greatly influenced by a single extreme value in the distribution.

Example: 2nd sample of rainfall (mm): 73, 56, 61, 82, 64, 57, 165, 93, 85
The range is: 165 - 56 = 109 mm. However, 8 of the 9 values lie between 56 and 93.
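A quick numerical check of both rainfall samples, with the extreme value of the second sample removed for comparison. The function name `value_range` is chosen for this illustration.

```python
def value_range(values):
    """Range: maximum value minus minimum value."""
    return max(values) - min(values)


sample_1 = [102, 153, 185, 142, 112, 197, 138, 166, 88]
sample_2 = [73, 56, 61, 82, 64, 57, 165, 93, 85]
# Second sample without its single extreme value (165 mm).
sample_2_trimmed = [v for v in sample_2 if v != 165]

print(value_range(sample_1), value_range(sample_2), value_range(sample_2_trimmed))
```

The two samples have the same range even though eight of the nine values in the second sample span only 37 mm, which shows how little the range says about the bulk of the data.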
 

2. Inter-quartile Range (IQR)

As the range is influenced by extreme values, it is often better to calculate the dispersion for the middle of the dataset. Quartiles are the values of the observations that divide the dataset into 4 equal parts. If you have 100 observations, the first quartile is the value of the 25th observation; the 3rd quartile is the value of the 75th observation. The Inter-Quartile Range is the difference between the first quartile and third quartile of the dataset. It can be calculated for interval and ratio data.

Formula: IQR = Q3 - Q1

How to calculate the IQR:

  1. order your dataset
  2. count the number of observations (n)
  3. find where Q1 and Q3 lie in the distribution (their positions in the ordered data) using: position of Q1 = (n+1)/4 and position of Q3 = 3(n+1)/4
  4. find the values of Q1 and Q3
  5. IQR = Q3 - Q1

Example: using the sample of rainfall (mm):

1. Order the data: 88, 102, 112, 138, 142, 153, 166, 185, 197

2. Count observations: n = 9

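3. Find where Q1 and Q3 lie in the distribution: position of Q1 = (9+1)/4 = 2.5; position of Q3 = 3(9+1)/4 = 7.5

4. Find values for Q1 and Q3: Q1 lies halfway between the 2nd and 3rd observations, (102 + 112)/2 = 107 mm; Q3 lies halfway between the 7th and 8th observations, (166 + 185)/2 = 175.5 mm

5. Calculate IQR: IQR = 175.5 - 107 = 68.5 mm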

The inter-quartile range tells us where the bulk of the observations lie. Half of the observations in any dataset lie between Q1 and Q3; a quarter (25%) of observations are larger than Q3 and a quarter are smaller than Q1. The IQR is less likely to be influenced by extreme values than the range. FYI: the median is considered to be the 2nd quartile as it divides the distribution in two groups of equal size.
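The five steps can be sketched as below for the rainfall sample. One assumption goes beyond the text: when a quartile position is fractional (for example 2.5), the value is taken by linear interpolation between the two neighbouring observations. Other textbooks and software packages use slightly different quartile conventions, so results may differ a little. The function names are made up for this example.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return float(lower)
    upper = ordered[whole]
    # Assumed convention: interpolate linearly between the two neighbours.
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Q1 and Q3 using the positions (n + 1) / 4 and 3 (n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations")
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3


rainfall = [102, 153, 185, 142, 112, 197, 138, 166, 88]
q1, q3 = quartiles(rainfall)
iqr = q3 - q1
print(q1, q3, iqr)
```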

3. Variance and Standard Deviation

The variance, \(\sigma^2\), is a measure of the average squared difference between the observations and the mean. When an observation is bigger than the mean, the difference is positive. When an observation is smaller than the mean, the difference is negative. If the differences are simply added together, the positive and negative differences will cancel each other out and you may believe there is no difference. To obtain a measure of total difference, each difference is squared to remove the negative (-) sign. To calculate the average difference, the sum of squared differences is divided by N: \(\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}\).

The variance is used often in inferential statistics (as we will see in later labs). However, in practical situations, the variance is difficult to use because the units are squared. Therefore, we typically use the standard deviation, \(\sigma\), to express the dispersion of observations around the mean. Taking the square root of the variance returns the measure to the same units as the mean (i.e. cm, mm, etc.).

The standard deviation can be calculated for ratio or interval data.

Formula: \(\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}\) where
\(X_i\) = each observation in the dataset
\(\mu\) = the arithmetic mean
\(N\) = number of observations

How to calculate:

  1. calculate the mean of the data set
  2. subtract the mean from each observation
  3. square this difference for each observation
  4. sum all squared differences
  5. divide by N (this is the variance)
  6. take the square root (to get the standard deviation)

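Example: calculate \(\sigma\) for the rainfall data (mm): 102, 153, 185, 142, 112, 197, 138, 166, 88

\(\mu = 1283/9 = 142.6\) mm; sum of squared differences = 11,000.2; \(\sigma^2 = 11,000.2/9 = 1,222.2\); \(\sigma = \sqrt{1,222.2} = 34.96\) mm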

The standard deviation is usually reported with the mean. In this case, you would say "the mean is 142.6 mm with a standard deviation of 34.96 mm".
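The six calculation steps can be followed line by line in a short sketch. It divides by N, as in the formula used in this lab (the population form); the function name `population_variance` is chosen for this illustration.

```python
import math

rainfall = [102, 153, 185, 142, 112, 197, 138, 166, 88]


def population_variance(values):
    """Mean of the squared differences from the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n                             # step 1
    squared = [(x - mu) ** 2 for x in values]        # steps 2 and 3
    return sum(squared) / n                          # steps 4 and 5


mu = sum(rainfall) / len(rainfall)
variance = population_variance(rainfall)
sd = math.sqrt(variance)                             # step 6

print(round(mu, 1), round(variance, 1), round(sd, 2))
```

Note that many calculators and software packages divide by N - 1 by default (the sample form), which gives a slightly larger value.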

The standard deviation defines an interval around the mean, from one standard deviation below the mean to one standard deviation above it. If your distribution follows the shape of the normal curve (symmetrical, bell-shaped, unimodal), about 68% of the observations will fall inside this interval. Therefore, the shape of your distribution is related to the size of the standard deviation.

Example: consider two datasets where N = 100 and \(\mu\) = 60 cm. In Case A, \(\sigma\) = 5 cm; in Case B, \(\sigma\) = 10 cm. What shape will these distributions have?

4. Coefficient of Variation

Often you may want to compare the amount of dispersion contained in different data sets or samples. If the means are equal, the data set with the smaller standard deviation is less dispersed. However, if the means are different, comparing the standard deviations may be misleading. Why? A sample with a mean of 5,000 will usually have a larger standard deviation than a sample with a mean of 100. In this case, it is better to calculate a relative measure of dispersion in the two samples. The coefficient of variation, CV, is the ratio of the standard deviation to the mean.

Formula: \(CV = \sigma / \mu\) where \(\sigma\) = the standard deviation and \(\mu\) = the mean

Example: summary data for rainfall (mm) at 3 weather stations

Station      Mean        Standard deviation   CV
Station P    123.1 mm    33.1 mm              0.27
Station Q    56.3 mm     17.8 mm              0.32
Station R    264.9 mm    43.3 mm              0.16

Although Station R has the largest standard deviation, the CV shows that it has the smallest variation in rainfall.
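The comparison can be reproduced with a few lines of Python, using the station means and standard deviations from the table. The dictionary layout and variable names are illustrative only.

```python
# Station -> (mean rainfall in mm, standard deviation in mm).
stations = {
    "P": (123.1, 33.1),
    "Q": (56.3, 17.8),
    "R": (264.9, 43.3),
}

# CV = standard deviation / mean; the units cancel, so CV has no units.
cv = {name: sd / mean for name, (mean, sd) in stations.items()}

for name, value in cv.items():
    print(f"Station {name}: CV = {value:.2f}")

least_variable = min(cv, key=cv.get)
largest_sd = max(stations, key=lambda name: stations[name][1])
print(least_variable, largest_sd)
```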

The CV is dimensionless (has no units). When \(\sigma\) = 0 (no dispersion), CV is 0. When \(\sigma\) is large, what happens to the CV?

 

More GRAPHING METHODS

You have used two graphical methods to view your data: the dot plot and the histogram. Four more ways to view your data are presented briefly below; some are appropriate for all levels of measurement, some may apply to numeric data only.

1. Bar chart

Bar charts are an excellent way of showing relative magnitudes between categories, intervals, or locations. Histograms are bar charts, but not all bar charts are histograms. The data may be presented vertically or horizontally, using real frequency, relative frequency or data values. You can also break down the data inside each major category to provide more detail (as in component bar charts). These graphs are appropriate for all levels of measurement.

2. Pie chart

Pie charts show the breakdown of a whole into parts, using relative frequencies or percentages represented by degrees of a circle. Pie charts are often used to show relative magnitude, similar to bar charts. Pie charts are appropriate for all levels of measurement, provided that interval or ratio data are collapsed into categories. For any data type, use pie charts when you have relatively few categories (fewer than 6 or 7). Pie charts become unreadable if too many categories/slices are used.

3. Line graph

Line graphs are used to show the changes in a variable over time. A point is marked at the intersection of time (on the horizontal x-axis) and the values of the variable (on the vertical y-axis). The points can then be linked by a line to show the trend over time. Line graphs are appropriate for numeric data (interval or ratio).

4. Scatter plot

Scatter plots are commonly used to illustrate the association, if any, between two variables. The pattern of points indicates the type (positive or negative) and strength (weak or strong) of the association. Scatter plots are appropriate for numeric data (interval or ratio). Note that it would be inappropriate to join the points on this graph by a line.



Practice Answers

Data Classification:

Variable         Quantitative or Qualitative   Discrete or Continuous   Level of Measurement
Tree ID          Qual                          Discrete                 Nominal
Species          Qual                          Discrete                 Nominal
Height           Quant                         Continuous               Ratio
Elevation        Quant                         Continuous               Interval
# scars          Quant                         Discrete                 Ratio
Damage rating    Qual                          Discrete                 Ordinal

Weighted mean:
Sum of Population (\(w_i\)) * Income (\(X_i\)) for the five enumeration areas = 902,190
Total population (sum of \(w_i\)) = 637
Weighted mean = 902,190/637 = $1,416.31