Measures of Variation
How do we make sense of a value of standard deviation?
The range rule of thumb, the empirical rule, and Chebyshev's theorem.
Disadvantage of Variance
The variance is a statistic used in some statistical methods, but for our present purposes, the variance has the serious disadvantage of using units that are different than the units of the original data set. This makes it difficult to understand variance as it relates to the original data set.
Variance
The variance of a set of values is a measure of variation equal to the square of the standard deviation.
Mean Absolute Deviation (MAD)
Adding the deviations isn't good, because the sum will always be zero. To get a statistic that measures variation, it's necessary to avoid cancelling out of negative and positive numbers. One approach is to add absolute values. If we find the mean of that sum, we get the mean absolute deviation (or MAD), which is the mean distance of the data from the mean.
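The cancellation described above, and the fix using absolute values, can be sketched in a few lines of Python. The sample values and the function name `mean_absolute_deviation` are made up for illustration.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Mean distance of the data values from their mean."""
    center = mean(values)
    return mean(abs(x - center) for x in values)

sample = [1, 3, 14]  # hypothetical sample values
deviations = [x - mean(sample) for x in sample]

print(sum(deviations))                  # the signed deviations cancel out
print(mean_absolute_deviation(sample))  # the absolute deviations do not
```

Here the signed deviations are -5, -3, and 8, which sum to zero, while the absolute deviations 5, 3, and 8 average to 16/3.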
Why divide by n-1?
After finding all of the individual values of (x - x-bar) squared, we combine them by finding their sum. We then divide by n-1 because there are only n-1 values that can be assigned without constraint. With a given mean, we can use any numbers for the first n-1 values, but the last value will then be automatically determined. With division by n-1, sample variances tend to center around the value of the population variance; with division by n, sample variances tend to underestimate the value of the population variance.
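One way to see the effect of the divisor is to list every possible sample of size 2 (drawn with replacement) from a tiny population and average the sample variances. The sketch below assumes a made-up population of 1, 2, 5 and uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    m = Fraction(sum(values), len(values))
    return sum((x - m) ** 2 for x in values) / divisor

population = [1, 2, 5]  # hypothetical population
pop_var = variance(population, len(population))

# All 9 equally likely samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

mean_n_minus_1 = sum(variance(s, len(s) - 1) for s in samples) / len(samples)
mean_n = sum(variance(s, len(s)) for s in samples) / len(samples)

print(pop_var)         # 26/9
print(mean_n_minus_1)  # 26/9, matches the population variance
print(mean_n)          # 13/9, too small on average
```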
Advantage of Standard Deviation
Because it is based on the square root of a sum of squares, the standard deviation closely parallels distance formulas found in algebra. There are many instances where a statistical procedure is based on a similar sum of squares. Consequently, instead of using absolute values, we square all deviations (x - x-bar) so that they are nonnegative, and those squares are used to calculate the standard deviation.
Why not use the MAD Instead of the Standard Deviation?
Computation of the mean absolute deviation uses absolute values, so it uses an operation that is not "algebraic." (The algebraic operations include addition, multiplication, extracting roots, and raising to powers that are integers or fractions.) The use of absolute values would be simple, but it would create algebraic difficulties in inferential methods of statistics. The standard deviation has the advantage of using only algebraic operations.
Why is Standard Deviation Defined with Sample Formula?
In measuring variation in a set of sample data, it makes sense to begin with the individual amounts by which values deviate from the mean. For a particular data value x, the amount of deviation is x - (x-bar). It makes sense to somehow combine those deviations into one number that can serve as a measure of the variation.
Comparing Variation in Different Samples or Populations
It's good practice to compare two sample standard deviations only when the sample means are approximately the same. When comparing variation in samples or populations with very different means, it is better to use the coefficient of variation. Also use the coefficient of variation to compare variation from two samples or populations with different scales or units of values.
General Procedure for Finding Standard Deviation
Step 1: Compute the mean. Step 2: Subtract the mean from each individual sample value. Step 3: Square each of the deviations obtained from step 2. Step 4: Add all of the squares obtained from step 3. Step 5: Divide the total from step 4 by the number n-1, which is 1 less than the total number of sample values present. Step 6: Find the square root of the result of step 5. The result is the standard deviation, denoted by s.
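The six steps can be followed literally in code. This is a minimal sketch with one line per step; the function name and the sample values are illustrative only.

```python
import math

def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n                    # Step 1: compute the mean
    deviations = [x - mean for x in values]   # Step 2: subtract the mean
    squares = [d ** 2 for d in deviations]    # Step 3: square each deviation
    total = sum(squares)                      # Step 4: add the squares
    variance = total / (n - 1)                # Step 5: divide by n-1
    return math.sqrt(variance)                # Step 6: take the square root

sample = [1, 3, 14]  # hypothetical sample values
print(sample_standard_deviation(sample))  # 7.0
```

For this sample the mean is 6, the squared deviations are 25, 9, and 64, their sum is 98, and 98 / 2 = 49, whose square root is 7.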
Standard Deviation of a Population
The definition of standard deviation applies to the standard deviation of sample data. A slightly different formula is used to calculate the standard deviation of a population. Instead of dividing by n-1, we divide by the population size N.
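Python's standard `statistics` module offers both versions: `stdev` divides by n-1 and `pstdev` divides by N. The sketch below applies both to the same made-up values to show that the results differ.

```python
from statistics import pstdev, stdev

data = [1, 3, 14]  # hypothetical values

s = stdev(data)       # treats data as a sample: divides by n-1
sigma = pstdev(data)  # treats data as a whole population: divides by N

print(s)      # 7.0
print(sigma)  # smaller, because the divisor is larger
```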
Advantage/Disadvantage of Chebyshev's Theorem
The empirical rule applies only to data sets with bell-shaped distributions, but Chebyshev's theorem applies to any data set. Unfortunately, Chebyshev's theorem gives only lower limits ("at least"), not close estimates of the actual proportions, so it has limited usefulness.
Chebyshev's Theorem
The proportion of any set of data lying within K standard deviations of the mean is always at least 1 - 1/K squared, where K is any positive number greater than 1. For K = 2 and K = 3, we get the following statements: At least 3/4 (or 75%) of all values lie within 2 standard deviations of the mean. At least 8/9 (or about 89%) of all values lie within 3 standard deviations of the mean.
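The bound 1 - 1/K squared is easy to compute and to compare against real proportions. The sketch below uses a deliberately skewed, made-up data set; the helper names are hypothetical, and the population standard deviation is used as an assumption for the check.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def proportion_within(values, k):
    """Actual proportion of values within k standard deviations of the mean."""
    center = mean(values)
    spread = pstdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)

skewed = [1] * 19 + [100]  # hypothetical, far from bell-shaped

for k in (2, 3):
    print(k, chebyshev_lower_bound(k), proportion_within(skewed, k))
```

In both cases the actual proportion is well above the guaranteed minimum, which illustrates why the theorem's "at least" statements are often of limited practical use.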
Range
The range of a set of data values is the difference between the maximum data value and the minimum data value. Range = (maximum data value) - (minimum data value)
Range Rule of Thumb for Understanding Standard Deviation
The range rule of thumb is a crude but simple tool for understanding and interpreting standard deviation. It is based on the principle that for many data sets, the vast majority (such as 95%) of sample values lie within 2 standard deviations of the mean.
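Applied to a known mean and standard deviation, the principle gives an interval that should contain the vast majority of values. A minimal sketch, assuming a made-up mean of 100 and standard deviation of 15; the function names are illustrative, not standard terminology.

```python
def two_sd_limits(mean, sd):
    """Interval expected to hold the vast majority of values (mean +/- 2 sd)."""
    return mean - 2 * sd, mean + 2 * sd

def is_outside_two_sd(value, mean, sd):
    """True when a value falls outside the mean +/- 2 sd interval."""
    low, high = two_sd_limits(mean, sd)
    return value < low or value > high

# Hypothetical mean and standard deviation.
print(two_sd_limits(100, 15))           # (70, 130)
print(is_outside_two_sd(140, 100, 15))  # True
print(is_outside_two_sd(110, 100, 15))  # False
```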
Properties of the Range
The range uses only the maximum and the minimum data values, so it is very sensitive to extreme values. The range is not resistant. Because the range uses only the maximum and minimum values, it does not take every value into account and therefore does not truly reflect the variation among all of the data values.
Unbiased Estimator
The sample variance s squared is an unbiased estimator of the population variance, which means that values of s squared tend to center around the value of the population variance instead of systematically tending to overestimate or underestimate the population variance.
Properties of Standard Deviation
The standard deviation is a measure of how much data values deviate away from the mean. The value of the standard deviation s is never negative. It is zero only when all of the data values are exactly the same. Larger values of s indicate greater amounts of variation. The standard deviation s can increase dramatically with one or more outliers. The units of standard deviation s (such as minutes, feet, pounds) are the same as units of the original data values. The sample standard deviation s is a biased estimator of the population standard deviation, which means that values of the sample standard deviation s do not center around the value of the parameter.
Standard Deviation
The standard deviation of a set of sample values, denoted by s, is a measure of how much data values deviate away from the mean.
Properties of Variance
The units of the variance are the squares of the units of the original data values. The value of the variance can increase dramatically with the inclusion of outliers. (The variance is not resistant.) The value of the variance is never negative. It is zero only when all of the data values are the same number. The sample variance is an unbiased estimator of the population variance.
Empirical Rule for Data with a Bell-Shaped Distribution
This rule states that for data sets having a distribution that is approximately bell-shaped, the following properties apply: About 68% of all values fall within 1 standard deviation of the mean. About 95% of all values fall within 2 standard deviations of the mean. About 99.7% of all values fall within 3 standard deviations of the mean.
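The three percentages can be checked against an exactly normal (bell-shaped) distribution, taken here as an assumption about what "bell-shaped" means. The sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 1))
```

The rounded results are close to the 68%, 95%, and 99.7% stated in the rule.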
Range Rule of Thumb for Estimating a Value of the Standard Deviation s
To roughly estimate the standard deviation from a collection of known sample data, divide the range by 4: s is approximately equal to range / 4.
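A common statement of this rule is that s is roughly the range divided by 4. The sketch below assumes that form and compares the estimate with the computed sample standard deviation for a made-up sample.

```python
from statistics import stdev

def data_range(values):
    """Range = maximum data value - minimum data value."""
    return max(values) - min(values)

def estimate_sd_from_range(values):
    """Crude estimate of s, assuming the form s ~ range / 4."""
    return data_range(values) / 4

sample = [62, 65, 68, 70, 71, 73, 75, 78, 80, 84]  # hypothetical values

print(estimate_sd_from_range(sample))  # 5.5
print(stdev(sample))                   # the computed value, for comparison
```

The two numbers are in the same general neighborhood but not equal, which is consistent with the rule being a rough estimate only.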
Round-off Rule for the Coefficient of Variation
Round the coefficient of variation to one decimal place (such as 25.3%).
Coefficient of Variation (CV)
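The coefficient of variation for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean: CV = (s / x-bar) x 100% for a sample, or (standard deviation / mean) x 100% using the population values for a population.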
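Because the coefficient of variation divides the standard deviation by the mean, it does not change when every value is rescaled, which is what makes it suitable for comparing data in different units. A minimal sketch for sample data, with made-up values:

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, as a percent."""
    return stdev(values) / mean(values) * 100

original = [1, 3, 14]   # hypothetical sample
rescaled = [2, 6, 28]   # same data in units half as large

print(stdev(original), stdev(rescaled))  # standard deviations differ
print(round(coefficient_of_variation(original), 1))  # 116.7
print(round(coefficient_of_variation(rescaled), 1))  # 116.7
```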
Biased Estimator
The sample standard deviation s is a biased estimator of the population standard deviation, which means that the values of the sample standard deviation s do not tend to center around the value of the population standard deviation. While individual values of s could equal or exceed the value of the population standard deviation, values of s generally tend to underestimate the population standard deviation.
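The bias can be seen by listing every sample of size 2 (drawn with replacement) from a tiny population and averaging the resulting values of s. The population 1, 2, 5 is made up for illustration.

```python
from itertools import product
from statistics import mean, pstdev, stdev

population = [1, 2, 5]  # hypothetical population
sigma = pstdev(population)

# All 9 equally likely samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))
mean_s = mean(stdev(s) for s in samples)

print(round(sigma, 3))   # population standard deviation
print(round(mean_s, 3))  # average of s over all samples, noticeably smaller
```

Some individual samples, such as (1, 5), give a value of s larger than the population standard deviation, but the average over all samples falls below it.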
Round-Off Rule for Measures of Variation
When rounding the value of a measure of variation, carry one more decimal place than is present in the original set of data.