*STAT 1401 - Chapter 3: Measures of Center, Spread, and Position
Visualizing Quartiles
- the quartiles divide the data set into four parts, with approximately 25% of the data in each part.
Percentiles
divide a data set into hundredths. for a number p between 1 and 99, the pth percentile separates the lowest p% of the data from the highest (100-p)%.
Chebyshev's Inequality
in any data set, the proportion of the data that will be within K standard deviations of the mean is at least 1 - 1/K^2. specifically, by setting K = 2 or K = 3, we obtain the following results: at least 3/4, or 75%, of the data are within two standard deviations of the mean. at least 8/9, or 89%, of the data are within three standard deviations of the mean.
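The bound 1 - 1/K^2 can be evaluated directly. The following is a minimal sketch; the function name `chebyshev_lower_bound` is a hypothetical helper introduced here, not part of the course notes.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of any data set within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the expression 1 - 1/k^2 is zero or negative, so it says nothing useful.
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3):
    print(f"K = {k}: at least {chebyshev_lower_bound(k):.1%} of the data")
```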
Compare the Properties of the Mean and Median
Both the mean and median are frequently used as measures of center. One important difference is that the mean is more influenced by extreme values than the median. A statistic is resistant if its value is not affected much by extreme values (large or small) in the data set. The median is resistant, but the mean is not.
Compute the Coefficient of Variation, or CV
CV = σ / μ or S / x̄. the coefficient of variation, or CV, tells how large the standard deviation is relative to the mean. it can be used to compare the spreads of data sets whose values have different units. the coefficient of variation is found by dividing the standard deviation by the mean.
Bell-Shaped Histogram
many histograms have a single mode near the center of the data, and are approximately symmetric. such histograms are often referred to as bell-shaped.
Interquartile Range (IQR)
one method for detecting outliers involves a measure called the interquartile range (IQR). the interquartile range is found by subtracting the first quartile from the third quartile. IQR = Q3 - Q1
Computing Percentiles of a Data Set
quartiles describe the shape of a distribution by dividing it into fourths. sometimes it is useful to divide a data set into a greater number of pieces to get a more detailed description of the distribution. percentiles do this by dividing the data set into hundredths.
Population Variance
population variance is denoted by σ^2. let x1, x2, x3, ..., xN denote the values in a population of size N, and let μ denote the population mean. σ^2 = ∑(xi - μ)^2 / N
Standard Deviation and Resistance
recall that a statistic is resistant if its value is not affected much by extreme values (large or small) in the data set. the standard deviation is not resistant. the standard deviation is affected by extreme values.
Calculator Procedures for Grouped Data (TI-84 Plus) (Demonstration videos on D2L)
Grouped Data
Population Mean
If x1, x2, ... xN is a population, then the population mean is given by: μ = ∑xi / N
Sample Mean
If x1, x2, ... xn is a sample, then the sample mean is given by: x̄ = ∑xi / n
SECTION 3.1
MEASURES OF CENTER Compute the mean of a data set. Compute the median of a data set. Compare the properties of the mean and median. Find the mode of a data set. Approximate the mean using grouped data.
SECTION 3.3
MEASURES OF POSITION Compute and interpret z-scores. Compute percentiles of a data set. Compute the quartiles of a data set. Compute the five-number summary for a data set. Understand the effects of outliers.
SECTION 3.2
MEASURES OF SPREAD Compute the range of a data set. Compute the variance of a population and a sample. Compute the standard deviation of a population and a sample. Approximate the standard deviation using grouped data. Use the Empirical Rule to summarize data that are unimodal and approximately symmetric. Use Chebyshev's Inequality to describe a data set.
Notation: Population vs. Sample
Recall that a population consists of an entire collection of individuals about which information is sought, and a sample consists of a smaller group drawn from the population. the method for calculating the mean is the same for both samples and populations - but the notation is different.
Sample Standard Deviation
S = √ S^2
Example: Compute the population variance for the San Francisco temperatures. (Table on pg. 1 of 3.2 Notes)
Solution: Step 1: Compute the population mean, μ. μ = ∑xi / N μ = (51 + 54 + 55 + 56 + 58 + 60 + 60 + 61 + 63 + 62 + 58 + 52) / 12 μ = 690 / 12 = 57.5 Step 2: For each population value xi, compute the deviation xi - μ. Step 3: Square the deviations obtained in Step 2 to get (xi - μ)^2. Step 4: Sum the squared deviations. ∑(xi - μ)^2 = 42.25 + 12.25 + 6.25 + 2.25 + 0.25 + 6.25 + 6.25 + 12.25 + 30.25 + 20.25 + 0.25 + 30.25 = 169 Step 5: Divide the sum obtained in Step 4 by the population size N to obtain the population variance. σ^2 = ∑(xi - μ)^2 / N = 169 / 12 = 14.083
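The five steps can be checked with a few lines of code. This is a sketch under one assumption: the twelve monthly values are taken to be the list below, where the fourth value (56) is inferred from μ = 57.5, N = 12, and the squared deviation 2.25 rather than read from the table itself.

```python
# San Francisco monthly temperatures; the value 56 is inferred, not quoted from the table.
temps = [51, 54, 55, 56, 58, 60, 60, 61, 63, 62, 58, 52]

N = len(temps)
mu = sum(temps) / N                            # Step 1: population mean
squared_devs = [(x - mu) ** 2 for x in temps]  # Steps 2-3: deviations, then squares
total = sum(squared_devs)                      # Step 4: sum of squared deviations
sigma_sq = total / N                           # Step 5: divide by the population size

print(mu, total, round(sigma_sq, 3))
```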
Example: The table at the bottom of the page (p. 3) presents the annual rainfall, in inches, in Los Angeles during the month of February over several years. Compute the 60th percentile for the data.
Solution: Note that there are 45 values and they are already in increasing order. L = (60 / 100) * 45 = 27. Since 27 is a whole number, the 60th percentile is the average of the numbers in the 27th and 28th positions. (3.58 + 3.71) / 2 = 3.645 60th percentile = 3.645
Quartiles
the mean and median of a data set are measures of center; sometimes it is useful to compute measures of position other than the center to get a more detailed description of the distribution. quartiles divide a data set into four approximately equal pieces. every data set has three quartiles: -the first quartile, denoted Q1, separates the lowest 25% of the data from the highest 75%. -the second quartile, denoted Q2, separates the lowest 50% from the highest 50%; Q2 is the same as the median. -the third quartile, denoted Q3, separates the lowest 75% of the data from the highest 25%.
Measure of Center
the mean of a data set is a measure of center. if we imagine each data value to be a weight, then the mean is the point at which the data set balances.
IQR Method for Detecting Outliers
the most frequent method used to detect outliers in a data set is the IQR method. The procedure for the IQR method is: Step 1: Find the first quartile (Q1) and the third quartile (Q3). Step 2: Compute the interquartile range: IQR = Q3 - Q1 Step 3: Compute the outlier boundaries. These boundaries are the cutoff points for determining outliers: -Lower Outlier Boundary = Q1 - 1.5(IQR) -Upper Outlier Boundary = Q3 + 1.5(IQR) Step 4: Any data value that is less than the lower outlier boundary or greater than the upper outlier boundary is considered to be an outlier.
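The four steps of the IQR method can be sketched in code. The helper names `quartile` and `iqr_outliers` are hypothetical, and the quartiles are assumed to follow the L = 0.25n and L = 0.75n positioning rule used in these notes (other software may use a different quartile convention). The data are the student absences from the outlier exercise in these notes.

```python
def quartile(sorted_data, numerator):
    """Q1 (numerator=1) or Q3 (numerator=3) using L = 0.25n or L = 0.75n."""
    n = len(sorted_data)
    whole, remainder = divmod(numerator * n, 4)
    if remainder == 0:
        # L is a whole number: average the values in positions L and L + 1 (1-based).
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    # L is not a whole number: round up, i.e. 1-based position whole + 1.
    return sorted_data[whole]


def iqr_outliers(data):
    values = sorted(data)
    q1, q3 = quartile(values, 1), quartile(values, 3)  # Step 1
    iqr = q3 - q1                                      # Step 2
    lower = q1 - 1.5 * iqr                             # Step 3
    upper = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower or x > upper]  # Step 4
    return q1, q3, lower, upper, outliers


absences = [67, 71, 57, 51, 49, 44, 41, 49, 42, 56, 45, 77, 44, 42, 46, 100, 59, 53, 31]
q1, q3, lower, upper, outliers = iqr_outliers(absences)
print(q1, q3, lower, upper, outliers)
```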
Five-Number Summary
the preferred numerical summary when data are very skewed or outliers are present. includes these five values: -Minimum -First Quartile (Q1) -Median -Third Quartile (Q3) -Maximum
Range
the range of a data set is the difference between the largest value and the smallest value. Range = Largest Value - Smallest Value
Although range is easy to compute, it is not often used in practice.
the reason is that the range involves only two values from the data set, the smallest and the largest.
Sample Variance
the sample variance is denoted by S^2. S^2 = ∑(xi - x̄)^2 / (n - 1) when the data values come from a sample rather than a population, the variance is called the sample variance. the procedure for computing the sample variance is a bit different from the one used to compute a population variance. in the formula, the population mean μ is replaced by the sample mean x̄ and the denominator is n - 1 instead of N.
z-score
the z-score of an individual data value tells how many standard deviations that value is from its population mean. for example, a value of one standard deviation above the mean has a z-score of z = 1 and a value two standard deviations below the mean has a z-score of z = -2.
Example: National Weather Service records show that over a thirty-year period, the annual precipitation in Atlanta, Georgia had a mean of 49.8 inches with a standard deviation of 7.6 inches, and the annual temperature had a mean of 62.2 Fahrenheit with a standard deviation of 1.3 degrees. Compute the coefficient of variation for precipitation and for temperature. Which has greater spread relative to its mean?
Solution: CV for precipitation = standard deviation for precipitation / mean precipitation CV = 7.6 / 49.8 CV = 0.15 CV for temperature = standard deviation for temperature / mean temperature CV = 1.3 / 62.2 = 0.02 the CV for precipitation is larger than the CV for temperature, so precipitation has a greater spread relative to its mean.
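The comparison above can be reproduced in a few lines. This is a minimal sketch; `coefficient_of_variation` is a hypothetical helper name, and the inputs are the means and standard deviations quoted in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean (unitless)."""
    return std_dev / mean


cv_precip = coefficient_of_variation(7.6, 49.8)  # inches
cv_temp = coefficient_of_variation(1.3, 62.2)    # degrees Fahrenheit

print(round(cv_precip, 2), round(cv_temp, 2))
print("precipitation" if cv_precip > cv_temp else "temperature", "has the greater relative spread")
```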
Example: Recall the Los Angeles annual rainfall data. Compute the five-number summary.
Solution: Minimum = 0.00 Q1 = 0.92 Q2 (Median) = 3.21 Q3 = 4.89 Maximum = 13.68
Example: The variance of the lifetimes for a sample of six batteries is S^2 = 2. Find the sample standard deviation.
Solution: S = √ S^2 S = √ 2 S = 1.414
Example: The population variance of temperatures in San Francisco is σ^2 = 14.083. Find the population standard deviation.
Solution: σ = √ σ^2 σ = √ 14.083 σ = 3.753
Example: Eight patients undergo a new surgical procedure and the number of days spent in recovery for each is as follows. Find the median number of days in recovery. 20, 15, 12, 27, 13, 19, 13, 21
Solution: Arrange in increasing order: 12, 13, 13, 15, 19, 20, 21, 27. The median is the average of the middle two numbers. (15 + 19) / 2 = 17 Median = 17
Example: During a semester, a student took five exams. The population of exam scores is: 78, 83, 92, 68, and 85. Find the median of the exam scores.
Solution: Arrange the data values in increasing order: 68, 78, 83, 85, 92. The median is the middle number, 83. Median = 83
Example: The table at the bottom of the page (p. 2) presents the annual rainfall, in inches, in Los Angeles during the month of February over the last several years. Compute the quartiles for the data.
Solution: Note that there are 45 data values, and that the data are already in increasing order. First Quartile: L = 0.25(45) = 11.25 since this is not a whole number, round up to 12. Q1 is the number in the 12th position. Q1 = 0.92 Third Quartile: L = 0.75(45) = 33.75 since this is not a whole number, round up to 34. Q3 is the number in the 34th position. Q3 = 4.89 Second Quartile: Q2 = the same as the Median. Q2 = 3.21
Example: The table at the bottom of this page (p. 3) presents the annual rainfall in Los Angeles during the month of February over several years. One year, the rainfall was 1.90. What percentile does this correspond to?
Solution: Note that there are 45 values and that they are already in increasing order. There are 17 values less than 1.90 Percentile = 100 * (17 + 0.5) / 45 = 38.9 Round 38.9 up to 39. The value 1.90 corresponds to the 39th percentile.
Example: The average monthly temperatures, in degrees Fahrenheit, for San Francisco and St. Louis are: (see table on page 1 of 3.2 notes). Compute the range of the San Francisco temperatures.
Solution: The Range of the San Francisco temperatures is 63-51 = 12, therefore the Range = 12 degrees Fahrenheit.
Example: Five families have annual incomes of $25,000, $31,000, $34,000, $44,000, and $56,000. One family, whose income is $25,000, wins a million dollar lottery, so their income increases to $1,025,000. Before the lottery win, the mean and median are: Mean = $38,000 Median = $34,000 After the lottery win, the mean and median are: Mean = $238,000 Median = $44,000
Solution: The extreme value of $1,025,000 influences the mean quite a lot, increasing it from $38,000 to $238,000. In comparison, the median has been influenced much less, increasing from $34,000 to $44,000. That is, the median is resistant.
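The effect of the extreme value can be checked with Python's standard `statistics` module. This is a small sketch using the five incomes from the example; the variable names are illustrative only.

```python
from statistics import mean, median

before = [25_000, 31_000, 34_000, 44_000, 56_000]
after = [1_025_000, 31_000, 34_000, 44_000, 56_000]  # the $25,000 family wins the lottery

# The mean shifts by $200,000, while the median moves only to the next data value.
print("mean:  ", mean(before), "->", mean(after))
print("median:", median(before), "->", median(after))
```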
Example: During a semester, a student took five exams. The population of exam scores is 78, 83, 92, 68, 85. Find the mean.
Solution: The mean is given by: (78 + 83 + 92 + 68 + 85) / 5 = 406 / 5 = 81.2 Note that the mean is rounded to one more decimal place than the original data. This is generally considered to be good practice.
Example: A new type of battery is being tested for laptop computers. The lifetimes, in hours, of six batteries are 3, 4, 6, 5, 4, 2. Find the sample variance of the lifetimes.
Solution: The sample mean is x̄ = (3 + 4 + 6 + 5 + 4 + 2) / 6 = 4. The sample variance is given by: S^2 = ∑(xi - x̄)^2 / (n - 1) Step 1: S^2 = [(3-4)^2 + (4-4)^2 + (6-4)^2 + (5-4)^2 + (4-4)^2 + (2-4)^2] / (6 - 1) Step 2: S^2 = 10 / 5 Step 3: S^2 = 2
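The same computation can be written out in code. This is a minimal sketch using the six battery lifetimes from the example; it also takes the square root to get the sample standard deviation S.

```python
from math import sqrt

lifetimes = [3, 4, 6, 5, 4, 2]  # hours

n = len(lifetimes)
x_bar = sum(lifetimes) / n                                  # sample mean
s_sq = sum((x - x_bar) ** 2 for x in lifetimes) / (n - 1)   # divide by n - 1, not n
s = sqrt(s_sq)                                              # sample standard deviation

print(x_bar, s_sq, round(s, 3))
```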
Example: As part of a public health study, systolic blood pressure was measured for a large group of people. The mean was 120 and the standard deviation was 10. What information does Chebyshev's Inequality provide about these data?
Solution: We compute the following: x̄ - 2S = 120 - 2(10) = 100 x̄ + 2S = 120 + 2(10) = 140 x̄ - 3S = 120 - 3(10) = 90 x̄ + 3S = 120 + 3(10) = 150 at least 3/4, 75%, of the people had systolic blood pressures between 100 and 140. at least 8/9, 89%, of the people had systolic blood pressures between 90 and 150.
Example: The following table presents the US Census Bureau projection for the percentage of the population aged 65 and over for each state and the District of Columbia. Use the Empirical Rule to describe the data. (Table in 3.2 Notes, page 3)
Solution: We first note that the histogram is approximately bell-shaped and we may use the TI-84 Plus calculator, or other technology, to compute the population mean and standard deviation. Mean: μ = 13.25 Standard Deviation: σ = 1.693
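Given the mean and standard deviation above, the Empirical Rule intervals follow by adding and subtracting multiples of σ. The sketch below assumes the data are approximately bell-shaped, as the solution states; `empirical_rule_intervals` is a hypothetical helper name.

```python
def empirical_rule_intervals(mean, std_dev):
    """Intervals mean ± k*std_dev for k = 1, 2, 3, with the Empirical Rule coverage."""
    coverage = {1: "approximately 68%", 2: "approximately 95%", 3: "all or almost all"}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, label)
        for k, label in coverage.items()
    }


intervals = empirical_rule_intervals(13.25, 1.693)
for k, (low, high, label) in intervals.items():
    print(f"{label} of the data lie between {low:.3f} and {high:.3f}")
```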
Example: The following table presents the number of students absent in a middle school in northwestern Montana for each school day in January. 67, 71, 57, 51, 49, 44, 41, 49, 42, 56, 45, 77, 44, 42, 46, 100, 59, 53, 31 Identify any outliers.
Solution: ______________
Example: The temperature in a downtown location is measured for eight consecutive days during the summer. The readings, in Fahrenheit, are: 81.2, 85.6, 89.3, 91.0, 83.2, 8.45, 79.5, 87.8 Which reading is an outlier? Is the outlier an error or is it possible that it is correct?
Solution: ______________________
Compute the Median of a Data Set
The median is another measure of center. The median is a number that splits the data set in half, so that half the data values are less than the median and half are greater than the median. The procedure for computing the median differs, depending on whether the number of observations in the data set is even or odd. If n is odd, the median is the middle number. If n is even, the median is the average of the two middle numbers.
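The odd/even rule translates directly into code. This is a minimal sketch (the standard library's `statistics.median` does the same job); it is checked against the recovery-days and exam-score examples in these notes.

```python
def median(data):
    values = sorted(data)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]                       # n odd: the middle number
    return (values[mid - 1] + values[mid]) / 2   # n even: average of the two middle numbers


print(median([20, 15, 12, 27, 13, 19, 13, 21]))  # recovery days, n = 8
print(median([78, 83, 92, 68, 85]))              # exam scores, n = 5
```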
Example: A National Center for Health Statistics study states that the mean height for adult men in the U.S. is μ = 69.4 inches, with a standard deviation of σ = 3.1 inches. The mean height for adult women is μ = 63.8 inches, with a standard deviation of σ = 2.8 inches. Who is taller relative to their gender, a man 73 inches tall, or a woman 68 inches tall?
Solution: z-score = (x - μ) / σ z for the man's height = (73 - 69.4) / 3.1 = 1.16 z for the woman's height = (68 - 63.8) / 2.8 = 1.50 since the z-score of the woman's height is larger than the z-score of the man's height, the woman is taller relative to the population of women's heights than the man is relative to the population of men's heights.
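The two z-scores can be computed with a one-line function. This is a sketch using the means and standard deviations quoted in the example; `z_score` is an illustrative name.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


z_man = z_score(73, 69.4, 3.1)
z_woman = z_score(68, 63.8, 2.8)

print(round(z_man, 2), round(z_woman, 2))
```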
Approximating the Mean with Grouped Data
Sometimes we don't have access to the raw data in a data set, but we are given a frequency distribution; in these cases we can approximate the mean using the following steps: Step 1: Compute the midpoint of each class. The midpoint is found by taking the average of the lower class limit and the lower class limit of the next larger class. Step 2: For each class, multiply the class midpoint by the class frequency. Step 3: Add the products (Midpoint) x (Frequency) over all classes. Step 4: Divide the sum obtained in Step 3 by the sum of the frequencies.
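The grouped-data steps can be sketched as follows. The frequency distribution used here is hypothetical (it does not come from the notes), and `grouped_mean` is an illustrative name. The list of lower limits is assumed to carry one extra entry, the lower limit the next class would have, so that the last class also gets a midpoint.

```python
def grouped_mean(lower_limits, frequencies):
    """Approximate mean from a frequency distribution.

    lower_limits has one more entry than frequencies: the final entry is the
    lower limit that the next larger class would have.
    """
    # Step 1: midpoint = average of a class's lower limit and the next class's lower limit.
    midpoints = [(low + nxt) / 2 for low, nxt in zip(lower_limits, lower_limits[1:])]
    # Steps 2-3: multiply each midpoint by its frequency and add the products.
    total = sum(m * f for m, f in zip(midpoints, frequencies))
    # Step 4: divide by the sum of the frequencies.
    return total / sum(frequencies)


# Hypothetical classes 0-9, 10-19, 20-29 with frequencies 2, 5, 3.
approx_mean = grouped_mean([0, 10, 20, 30], [2, 5, 3])
print(approx_mean)
```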
Calculator Procedures for Standard Deviation (TI-84 Plus) (Demonstration videos on D2L)
Standard Deviation
Computing Quartiles of a Data Set
Step 1: Arrange the data in increasing order. Step 2: Let n be the number of data values in the data set. To compute the second quartile, simply compute the median. For the first and third quartiles, proceed as follows: for the first quartile, compute L = 0.25n; for the third quartile, compute L = 0.75n. Step 3: If L is a whole number, the quartile is the average of the number in position L and the number in position L + 1. If L is not a whole number, round it up to the next higher whole number. The quartile is the number in the position corresponding to the rounded-up value.
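The quartile procedure can be sketched in code. The function name `quartiles` is hypothetical, and the whole-number test is done with integer arithmetic (0.25n is whole exactly when n is divisible by 4) to avoid floating-point surprises. The sample data are the eight recovery times from the median example in these notes.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the L = 0.25n / 0.75n positioning rule."""
    values = sorted(data)  # Step 1
    n = len(values)        # Step 2

    def at(numerator):
        # L = numerator * n / 4; whole is its integer part.
        whole, remainder = divmod(numerator * n, 4)
        if remainder == 0:
            # Step 3, whole-number case: average positions L and L + 1 (1-based).
            return (values[whole - 1] + values[whole]) / 2
        # Step 3, otherwise: round L up, i.e. 1-based position whole + 1.
        return values[whole]

    return at(1), median(values), at(3)


print(quartiles([20, 15, 12, 27, 13, 19, 13, 21]))
```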
+ INSERT IMAGE, p. 5
TI-84 PLUS SCREENSHOT
Construct Boxplots to Visualize the Five-Number Summary and Outliers
a boxplot is a graph that presents the five-number summary along with some additional information about a data set. there are several kinds of boxplots. the one we describe here is sometimes called a ____. + INSERT IMAGE, p. 5
Computing the Mean
a list of n numbers is denoted by x1, x2, ..., xn ∑x represents the sum of these numbers: ∑x = x1 + x2 + x3 + ... +xn
Variance
a measure of how far the values in a data set are from the mean, on average. the variance is computed slightly differently for populations and samples.
Both Chebyshev's Inequality and the Empirical Rule provide information about the proportion of a data set within a given number of standard deviations of the mean.
an advantage of Chebyshev's Inequality is that it applies to any data set, whereas the Empirical Rule only applies to data sets that are approximately bell-shaped. a disadvantage of Chebyshev's Inequality is that for most data sets, it provides only a very rough approximation.
Understand the Effects of Outliers
an outlier is a value that is considerably larger or considerably smaller than most of the values in a data set. some outliers result from errors; for example a misplaced decimal point may cause a number to be much larger or much smaller than the other values in a data set. some outliers are correct values, and simply reflect the fact that the population contains some extreme values.
Compute the Standard Deviation of a Population and a Sample
because the variance is computed using squared deviations, the units of the variance are the squared units of the data. in most situations, it is better to use a measure of spread that has the same units as the data. we do this simply by taking the square root of the variance. this quantity is called the standard deviation. the standard deviation of a sample is denoted by S, and the standard deviation of a population is denoted by σ.
Determining the Shape of a Data Set from a Boxplot
boxplots can be used to determine skewness in a data set. if the median is closer to the first quartile than to the third quartile, or the upper whisker is longer than the lower whisker, the data are skewed to the right. + INSERT IMAGE, p. 5 if the median is closer to the third quartile than to the first quartile, or the lower whisker is longer than the upper whisker, the data are skewed to the left. + INSERT IMAGE, p. 5 if the median is approximately halfway between the first and third quartiles, and the two whiskers are approximately equal in length, the data are approximately symmetric. + INSERT IMAGE, p. 6
Computing Percentiles
there are several methods for computing percentiles, all of which give similar results. Step 1: Arrange the data in increasing order. Step 2: Let n be the number of values in the data set. For the pth percentile, compute: L = (p / 100) * n Step 3: If L is a whole number, the pth percentile is the average of the number in position L and the number in position L + 1. If L is not a whole number, round it up to the next higher whole number. The pth percentile is the number in the position corresponding to the rounded-up value.
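The three steps can be sketched as a small function. The name `percentile` and the ten-value data set are hypothetical, p is assumed to be a whole number from 1 to 99, and the whole-number test on L uses integer arithmetic rather than floating-point division. Library routines such as NumPy's percentile functions use other conventions by default and may give slightly different answers.

```python
def percentile(data, p):
    """pth percentile using L = (p / 100) * n, for an integer p from 1 to 99."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    values = sorted(data)                    # Step 1
    n = len(values)
    whole, remainder = divmod(p * n, 100)    # Step 2: L = whole + remainder / 100
    if remainder == 0:
        # Step 3, whole-number case: average positions L and L + 1 (1-based).
        return (values[whole - 1] + values[whole]) / 2
    # Step 3, otherwise: round L up, i.e. 1-based position whole + 1.
    return values[whole]


data = list(range(1, 11))  # hypothetical data: 1, 2, ..., 10
print(percentile(data, 25), percentile(data, 60))
```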
z-scores and the empirical rule
since the z-score is the number of standard deviations from the mean, we can easily interpret the z-score for bell-shaped populations using the Empirical Rule. when a population has a histogram that is approximately bell-shaped, then: -approximately 68% of the data will have z-scores between -1 and 1. -approximately 95% of the data will have z-scores between -2 and 2. -all, or almost all, of the data will have z-scores between -3 and 3.
Computing a Percentile from a Given Data Value
sometimes we are given a value from a data set and wish to compute the percentile corresponding to that value. the procedure for doing this is described in the steps below. Step 1: Arrange the data in increasing order. Step 2: Let x be the data value whose percentile is to be computed. Use the following formula to compute the percentile: Percentile = 100 * ((Number of values less than x) + 0.5) / (Number of values in the data set) Round the result to the nearest whole number. This is the percentile corresponding to the value x.
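The formula can be sketched as a short function. The name `percentile_of_value` and the data set are hypothetical: the 45 values 1 through 45 are chosen only so that 17 values fall below x = 18, matching the counts in the rainfall example (n = 45, 17 values below).

```python
def percentile_of_value(data, x):
    """Percentile corresponding to x: 100 * (count below x + 0.5) / n, rounded."""
    below = sum(1 for v in data if v < x)
    # Note: Python's round() sends exact halves to the nearest even integer.
    return round(100 * (below + 0.5) / len(data))


data = list(range(1, 46))  # hypothetical stand-in with n = 45
print(percentile_of_value(data, 18))
```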
Empirical Rule for Samples
the Empirical Rule can be used for samples as well as populations. when we work with a sample, we use x̄ in place of μ, and S in place of σ.
The Empirical Rule
when a data set has a bell-shaped histogram, it is often possible to use the standard deviation to provide an approximate description of the data using a rule known as The Empirical Rule. when a population has a histogram that is approximately bell-shaped, then: approximately 68% of the data will be within one standard deviation of the mean. approximately 95% of the data will be within two standard deviations of the mean. all, or almost all, of the data will be within three standard deviations of the mean.
Use Chebyshev's Inequality to describe a data set
when a distribution is bell-shaped, we use The Empirical Rule to approximate the proportion of data within one or two standard deviations of the mean. another rule, called Chebyshev's Inequality, holds for any data set.
Why divide by n-1?
when computing the sample variance, we use the sample mean to compute the deviations. for the population variance we use the population mean for the deviations. it turns out that the deviations using the sample mean tend to be a bit smaller than the deviations using the population mean. if we were to divide by n when computing a sample variance, the value would tend to be a bit smaller than the population variance. it can be shown mathematically that the appropriate correction is to divide the sum of the squared deviations by n-1 rather than n.
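A small exhaustive check can make this concrete. The sketch below uses a hypothetical four-value population and assumes samples of size 2 drawn with replacement, each ordered sample equally likely. It averages the two candidate variance formulas over every possible sample, using exact fractions so no rounding is involved. It is an illustration, not a proof.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # population variance


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))  # all 16 ordered samples, with replacement

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:     ", sigma_sq)
print("average when dividing n:  ", avg_div_n)
print("average when dividing n-1:", avg_div_n_minus_1)
```

In this setup, dividing by n gives an average below the population variance, while dividing by n - 1 matches it on average.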
Compute the Variance of a Population and a Sample
when the data set has a small amount of spread, most of the values will be close to the mean. when a data set has a large amount of spread, most of the values will be far from the mean.
Notation: Sample Mean
x̄ = x-bar; the sample mean
let x be a value from a population with mean μ and standard deviation σ
z-score for x is: z = (x - μ) / σ
Notation: Population Mean
μ = Mu; the population mean
Population Standard Deviation
σ =√ σ^2