LAB 4: Statistical Estimation

 

In this lab, we will examine:

Recommended web links:
Confidence Interval Applets: Experiment with confidence intervals and significance levels.
 


 

Towards statistical inference...

The main goal of inferential statistics is to make inferences about population parameters based on sample data. There are two ways of making inferences about a population:

  1. estimate the value of a population parameter using sample data
  2. test a hypothesis about a population parameter or distribution using sample data.

Both methods rely upon probability and the sampling distribution theory that we explored in previous labs. In this lab, we will concentrate on estimation methods. In the next lab, we will examine hypothesis testing, which will form the basis for all subsequent statistical methods examined in this course.
 

There are two types of estimation procedures:

A. Point estimation

A single number, calculated from the sample data, is used as the best estimate of a population parameter. Point estimates can be developed for:

1. Population mean ($\mu$): There are several ways of estimating $\mu$. You could use the median, the mode or the mean. However, if a sample is randomly selected from the population and has a normal frequency distribution, there is a high probability that the sample mean ($\bar{x}$) is close to the mean of the sampling distribution. The central limit theorem tells us that the mean of the sampling distribution equals the population mean $\mu$. Therefore, $\bar{x}$ is the best estimate of $\mu$.


2. Population proportion ($\pi$): The best estimate of a population proportion is the sample proportion, p, calculated from the sample.

Point estimates are useful as they give us an estimated value for the parameter of interest (mean or proportion). We know that a statistic calculated from any one sample ($\bar{x}$ or p) is likely to be close to the value of the population parameter ($\mu$ or $\pi$). However, it is unlikely that a sample statistic will be identical to the population parameter and we may doubt the accuracy of our estimate. Therefore, it is useful to define a range or interval around our point estimate that is likely to include the population parameter.
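As a small illustration, both point estimates can be computed directly from sample data. The velocity readings and the age flags below are made-up values chosen only for this sketch; they are not data from the lab.

```python
from statistics import mean

# Hypothetical sample of stream velocities (m/sec); illustrative values only.
velocities = [10.2, 11.5, 9.8, 12.1, 10.4]

# Point estimate of the population mean: the sample mean.
x_bar = mean(velocities)

# Hypothetical sample of fish, flagged True if older than 4 years.
over_four = [True, False, False, True, False, False, True, False]

# Point estimate of the population proportion: the sample proportion.
p = sum(over_four) / len(over_four)

print(f"sample mean = {x_bar:.2f}, sample proportion = {p:.3f}")
```

Each result is a single number, which is why an interval is needed to express how far the estimate might be from the true parameter.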
 

B. Interval estimation

Two numbers define the interval within which the population parameter is thought to lie with a certain probability.

From Lab 3, we know that if we draw several different samples from the same population, the statistics calculated from each sample will differ. However, using information on probabilities and sampling distributions, we can specify an interval around our best estimate ($\bar{x}$ or p) so that there is a high probability that the population parameter lies within the interval. In other words, P(interval contains population parameter) = 0.95.

We will continue with the stream example from above where our best estimate of mean velocity was 10.8 m/sec.

In the example above, the size of the interval was given. In reality, you will not know how wide the interval should be. You could determine the size of the confidence interval by drawing all possible samples, plotting the sample means on a number line and finding the exact interval that contains 95% of the sample means. However, because we know that the sampling distribution of means follows the normal distribution (from the Central Limit Theorem), you can draw one sample and use a formula that incorporates Z scores to calculate a confidence interval around your best estimate.
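The contrast between the two approaches can be sketched with a short simulation. Drawing all possible samples is not practical, so the sketch draws a large number of random samples instead; the population values, the sample size of 50 and the fixed random seed are all assumptions made for illustration.

```python
import random
from statistics import fmean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 velocity values (m/sec).
population = [random.gauss(10.8, 3.0) for _ in range(10_000)]
n = 50  # size of each sample

# Brute-force approach: draw many samples and record each sample mean.
sample_means = sorted(
    fmean(random.sample(population, n)) for _ in range(5_000)
)

# Range that holds the central 95% of the sample means.
lower = sample_means[int(0.025 * len(sample_means))]
upper = sample_means[int(0.975 * len(sample_means))]

# Formula approach from ONE sample: half-width = 1.96 * s / sqrt(n).
one_sample = random.sample(population, n)
half_width = 1.96 * stdev(one_sample) / n ** 0.5

print(f"central 95% of sample means: {lower:.2f} to {upper:.2f}")
print(f"half-width from one sample:  {half_width:.2f}")
```

The half-width of the simulated range and the half-width from the single sample should be of similar size, which is the reason one sample and a Z score can stand in for the full sampling distribution.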

Before we work with the formula, we need to examine the confidence interval probabilities more closely.
 

Confidence and Significance

Confidence intervals can be developed for any probability level:

When selecting a probability, you must consider the accuracy requirements of your analysis. For example:

These probabilities are called confidence levels and are expressed as (1 - $\alpha$). The symbol $\alpha$ (called 'alpha') refers to the probability that the parameter is outside the interval. When $\alpha$ = 0.05, the confidence level is (1 - 0.05) = 0.95 or 95%. Alpha is also known as the significance level.

Significance Level ($\alpha$)   Confidence Level
0.10                            (1 - 0.10) = 0.90 or 90%
0.05                            (1 - 0.05) = 0.95 or 95%
0.01                            (1 - 0.01) = 0.99 or 99%

When calculating a confidence interval, $\alpha$ is evenly divided between the two tails of the sampling distribution. Therefore, the probability that the population mean is less than the lower limit of the confidence interval is $\alpha/2$; the probability that the population mean is greater than the upper limit is also $\alpha/2$.

Since the sampling distribution has the same shape as a normal distribution, you can express the confidence level probabilities using a Z score (from the standard Z distribution). To do this, you need to find the value of Z that defines your confidence interval. The steps to finding Z for a 95% confidence interval are outlined below and in the accompanying diagram.

  1. The confidence level is 95% so the probability that the parameter lies inside the interval is 0.95. The probability that the parameter lies outside the interval is $\alpha$ = 0.05
     
  2. $\alpha$ is shared between both ends of the curve: $\alpha/2$ = 0.025
     
  3. Because the Z distribution is symmetrical, you need only work with one side of the distribution. Therefore, we need to find Z at 0.500 - 0.025 = 0.475.
     
  4. The Z value at P=0.475 is 1.96 (look up the probability in the Z table and work outwards to obtain the Z value).
     
  5. Use Z = -1.96 and Z = +1.96 to define the limits of the 95% confidence interval.
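The same lookup can be sketched in Python with the standard library's `statistics.NormalDist`. One difference from the Z table used in the steps: `inv_cdf` expects the cumulative probability to the left of Z, so the value looked up is 0.500 + 0.475 = 0.975 rather than 0.475. The function name `z_two_tailed` is just a label chosen for this sketch.

```python
from statistics import NormalDist

def z_two_tailed(confidence):
    """Return the Z value that bounds a two-tailed confidence interval."""
    alpha = 1 - confidence   # probability outside the interval
    tail = alpha / 2         # alpha is shared between both tails
    # inv_cdf takes the area to the LEFT of Z, so use 1 - alpha/2.
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_two_tailed(level):.3f}")
```

Printed Z tables usually round these values to two decimals, so small differences in the last digit are expected.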
     


 

Probability and Sample Size

In the steps outlined above, we used the standard normal probability distribution (Z distribution) to model the probabilities in the sampling distribution. However, with a small sample, the probabilities associated with the Z distribution may underestimate the true variation in the population. For small samples, it is more appropriate to use the t-distribution. Therefore, we use the following rule when developing confidence intervals:

  1. With a large sample (n ≥ 30), you specify the confidence level probabilities using the Z distribution.
     
  2. With a small sample (n < 30), you use the t-distribution to specify the confidence probabilities. In small samples, the standard deviation, s, may underestimate the true variation in the population. Therefore, there is a chance that your interval will be too narrow at a given confidence level. As the t-distribution is wider than the Z distribution for small values of n, it will compensate for this potential bias.

    The break between large and small samples (n = 30) is not an absolute rule. It is a guideline or convention used by many statisticians. Any analyst could disregard the guideline as long as he or she can provide a reasonable justification (e.g. you might use the t-distribution for a sample of n = 35 because you want to be more conservative in your estimate).
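To see how much wider the t-distribution is for small n, the sketch below computes two-tailed 95% t values numerically and compares them with Z = 1.96. It uses only the standard library (Simpson's rule plus bisection) and assumes df = n - 1; it is a rough illustration, not a replacement for a t table or a statistics package. The helper names are made up for this example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_area(t, df, steps=1000):
    """Area to the right of t (t >= 0), using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_prob, df):
    """Find the t value whose upper-tail area is tail_prob (bisection)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_upper_area(mid, df) > tail_prob:
            lo = mid   # area still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2

# Two-tailed 95% interval: alpha/2 = 0.025 in each tail.
for n in (5, 12, 30, 100):
    print(f"n = {n:3d}: t = {t_critical(0.025, n - 1):.2f}  (Z = 1.96)")
```

The t values shrink toward 1.96 as n grows, which is why the choice between Z and t matters mainly for small samples.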

Important note:
Confidence intervals should only be developed for variables with normal or approximately normal distributions. If the sample distribution is non-normal (highly skewed or bimodal), you should not calculate an interval.
 


Calculating Confidence Intervals

In this lab, we will calculate the following confidence intervals:

  1. Confidence interval for a mean (based on a large sample)
  2. Confidence interval for a proportion (based on a large sample)
  3. Confidence interval for a mean (based on a small sample)
  4. Special case: one-tailed confidence intervals

Note: In Lab 2, we started with an interval on the Z distribution and calculated the probability within the interval. In this lab, we start with a probability (or confidence level) and calculate the interval for this probability.

a) Confidence interval around a mean:

This formula is used to calculate confidence intervals (CI) for large samples:

Formula: $\bar{x} \pm Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$

where:
$\bar{x}$ = sample mean (best estimate of $\mu$)
$Z_{\alpha/2}$ = Z value associated with desired confidence level
s = sample standard deviation
n = sample size
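A minimal Python sketch of this calculation, assuming the usual large-sample form (sample mean plus or minus Z times s divided by the square root of n). The function name `mean_ci_large` is hypothetical; the numbers are the trout sample from Practice Question 2.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_large(x_bar, s, n, confidence=0.95):
    """Two-tailed confidence interval for a mean, large sample (Z-based)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)   # Z times the standard error of the mean
    return x_bar - margin, x_bar + margin

# Trout sample: n = 50, mean length 35.6 cm, s = 9.4 cm.
low, high = mean_ci_large(35.6, 9.4, 50, confidence=0.95)
print(f"95% CI: {low:.1f} cm to {high:.1f} cm")
```

Changing `confidence` to 0.90 narrows the interval, because a smaller Z value is multiplied by the same standard error.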

b) Confidence interval for a proportion:

The confidence intervals for proportions are constructed in a similar way to the intervals for means. The confidence intervals for proportions should only be calculated for very large samples (n > 100). For smaller sample sizes, the sampling distribution of p follows the binomial probability distribution (we are not covering this distribution in this course).

For a large sample, the formula is:

Formula: $p \pm Z_{\alpha/2} \cdot \sqrt{\frac{p(1 - p)}{n}}$

where:
p = sample proportion (best estimate of $\pi$)
$Z_{\alpha/2}$ = Z value associated with desired confidence level
n = sample size
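The proportion interval can be sketched the same way, assuming the standard error of a proportion is the square root of p(1 - p)/n. The function name `proportion_ci` is hypothetical; the numbers come from the second trout sample in the practice questions.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p, n, confidence=0.95):
    """Two-tailed confidence interval for a proportion (large sample)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / n)   # Z times the standard error of p
    return p - margin, p + margin

# 200 trout, of which a proportion of 0.37 were over 4 years of age.
low, high = proportion_ci(0.37, 200)
print(f"95% CI: {low:.2f} to {high:.2f}")
```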

 

c) Confidence interval for a mean (small sample):

The formula for small sample confidence intervals is:

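Formula: $\bar{x} \pm t_{\alpha/2,\,df} \cdot \frac{s}{\sqrt{n}}$

where:
$\bar{x}$ = sample mean (best estimate of $\mu$)
$t_{\alpha/2,\,df}$ = t value associated with desired confidence level & sample size
s = sample standard deviation
n = sample size
v or df = degrees of freedom (n - 1)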
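A short sketch of the small-sample calculation, assuming the t value has already been read from a t table for df = n - 1. The sample values (n = 12, mean 10.8 m/sec, s = 2.4 m/sec) are hypothetical and chosen only to show the arithmetic; the t value of 2.20 is the two-tailed 95% value for df = 11.

```python
from math import sqrt

def mean_ci_small(x_bar, s, n, t_value):
    """Two-tailed confidence interval for a mean from a small sample.

    t_value is the critical t for the chosen confidence level with
    df = n - 1, looked up in a t table by the caller.
    """
    margin = t_value * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical sample: 12 velocity readings, mean 10.8 m/sec, s = 2.4 m/sec.
# 95% confidence, df = 12 - 1 = 11, so t = 2.20 from the table.
low, high = mean_ci_small(10.8, 2.4, 12, t_value=2.20)
print(f"95% CI: {low:.2f} to {high:.2f} m/sec")
```

Passing the t value in as an argument keeps the sketch dependency-free and mirrors the manual table lookup used in this lab.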

 

d) One-tailed confidence intervals

The confidence intervals calculated above are considered two-tailed intervals as the probability covers both ends (tails) of the distribution. This is the standard type of confidence interval. However, one-tailed confidence intervals (lower or upper limit) can be calculated if you are interested in a minimum or maximum value only.
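As a sketch of a one-tailed interval, the function below computes a lower limit for a mean from a large sample. It assumes the one-tailed bound uses the same standard error as the two-tailed interval, with all of alpha placed in one tail, so Z is found at alpha rather than alpha/2. The function name and the reuse of the trout sample values are choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def lower_bound(x_bar, s, n, confidence=0.95):
    """One-tailed lower confidence limit for a mean (large sample)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha)   # all of alpha sits in one tail
    return x_bar - z * s / sqrt(n)

# Trout sample: n = 50, mean length 35.6 cm, s = 9.4 cm.
minimum = lower_bound(35.6, 9.4, 50, confidence=0.95)
print(f"95% confident the mean length is at least {minimum:.1f} cm")
```

An upper limit would add the same margin to the sample mean instead of subtracting it.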

 

The 6 Steps for Calculating Confidence Intervals

  1. Determine what kind of point estimate you are using:
    • mean
    • proportion
       
  2. Select the appropriate probability distribution:
    • If mean and large sample (n ≥ 30), use Z
    • If mean and small sample (n < 30), use t
    • If proportion and large sample (n > 100), use Z
       
  3. Determine the appropriate or required confidence level:
    • 90%
    • 95% (standard confidence level)
    • 99%
    • other
       
  4. Determine what kind of interval you need to calculate:
    • two-tailed
    • one-tailed (upper or lower)
      Read the question carefully and draw a quick sketch.
       
  5. Determine the Z or t value (at $\alpha$ for a one-tailed interval or $\alpha/2$ for a two-tailed interval).
     
  6. Calculate the interval or bound using the appropriate formula and present your results.
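Steps 1 to 5 amount to a small decision procedure, sketched below. The function name `plan_interval` and its argument names are hypothetical, and treating n = 30 exactly as a large sample is an assumption, since the guideline is a convention rather than a strict rule.

```python
def plan_interval(estimate, n, confidence, tails):
    """Work through steps 1-5; return (distribution, tail probability).

    estimate   : "mean" or "proportion"              (step 1)
    n          : sample size                         (step 2)
    confidence : confidence level, e.g. 0.95         (step 3)
    tails      : 2 for two-tailed, 1 for one-tailed  (step 4)
    """
    if estimate == "mean":
        # Assumption: n = 30 exactly is treated as a large sample.
        distribution = "Z" if n >= 30 else "t"
    elif estimate == "proportion":
        if n <= 100:
            raise ValueError("proportion intervals need n > 100")
        distribution = "Z"
    else:
        raise ValueError("estimate must be 'mean' or 'proportion'")

    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")

    alpha = 1 - confidence
    # Step 5: use alpha for one-tailed, alpha/2 for two-tailed intervals.
    tail_probability = alpha if tails == 1 else alpha / 2
    return distribution, round(tail_probability, 4)

print(plan_interval("mean", 12, 0.95, tails=2))
print(plan_interval("proportion", 180, 0.99, tails=2))
```

The returned tail probability is the value to look up in the Z or t table before applying the formula in step 6.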

Remember that your confidence interval is based on one sample. There is a small probability (1%, 5%, etc.) that your sample mean or proportion actually falls at an extreme end of the sampling distribution. If you are using your confidence interval or estimate to recommend a course of action, it is often a good idea to take more samples to confirm the original estimate or interval.




Practice Questions

Q1. Complete the table below:

     Estimate    Interval type   Confidence level   n     Z or t?   Z or t value
A.   $\bar{x}$   two tailed      90%                45
B.   $\bar{x}$   two tailed      95%                12
C.   $\bar{x}$   one tailed      95%                36
D.   p           two tailed      99%                180
E.   $\bar{x}$   two tailed      80%                60
F.   $\bar{x}$   one tailed      99%                23

Q2. Calculate the following confidence intervals:

  1. Calculate a two-tailed 95% confidence interval around the estimate of mean length for a sample of 50 trout, where $\bar{x}$ = 35.6 cm and s = 9.4 cm.
     
  2. Calculate a 90% confidence interval for the sample above.
     
  3. In another sample of 200 trout, the proportion of fish over 4 years of age was 0.37. Calculate a two-tailed 95% confidence interval around this estimate.

Answers

Question 1

     $\alpha$   Z or t?   Z or t value
A    0.10       Z         1.65
B    0.05       t         2.20
C    0.05       Z         1.65
D    0.01       Z         2.58
E    0.20       Z         1.28
F    0.01       t         2.51

Question 2:

  1. 35.6 cm ± 2.6 cm. We estimate with 95% confidence that the true mean length lies between 33.0 cm and 38.2 cm.

  2. 35.6 cm ± 2.2 cm. We estimate with 90% confidence that the true mean length lies between 33.4 cm and 37.8 cm.

  3. 0.37 ± 0.07. We estimate with 95% confidence that the proportion of trout over 4 years of age lies between 0.30 and 0.44.


© University of Victoria 2000/2001       Updated: October 4, 2002