Statistics Chapter 3
Identifying Outliers for Modified Boxplots
1. Find the quartiles Q1, Q2, and Q3. 2. Find the interquartile range (IQR), where IQR = Q3 − Q1. 3. Evaluate 1.5 × IQR. 4. In a modified boxplot, a data value is an outlier if it is above Q3 by an amount greater than 1.5 × IQR, or below Q1 by an amount greater than 1.5 × IQR.
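The outlier check in steps 2 through 4 can be sketched in Python as follows. This is only an illustration: the function names are made up for this sketch, and the quartiles Q1 = 7.9 and Q3 = 21.5 are the ones reported for the Verizon airport data speeds elsewhere in this chapter.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs used by a modified boxplot."""
    iqr = q3 - q1            # step 2: interquartile range
    margin = 1.5 * iqr       # step 3: 1.5 x IQR
    return q1 - margin, q3 + margin


def is_outlier(x, q1, q3):
    """Step 4: x is an outlier if it lies beyond either cutoff."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper


q1, q3 = 7.9, 21.5                 # quartiles from the 5-number summary
print(is_outlier(77.8, q1, q3))    # maximum data speed
print(is_outlier(0.8, q1, q3))     # minimum data speed
```

With these quartiles the upper cutoff is about 41.9 Mbps, so the maximum of 77.8 Mbps is flagged while the minimum of 0.8 Mbps is not.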
Important Properties of z Scores
1. A z score is the number of standard deviations that a given value x is above or below the mean. 2. z scores are expressed as numbers with no units of measurement. 3. A data value is significantly low if its z score is less than or equal to −2, and significantly high if its z score is greater than or equal to +2. 4. If an individual data value is less than the mean, its corresponding z score is a negative number.
Procedure for Constructing a Boxplot
1. Find the 5-number summary (minimum value, Q1, Q2, Q3, maximum value). 2. Construct a line segment extending from the minimum data value to the maximum data value. 3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the value of Q2 (median).
Example: Comparing a Baby's Weight and Adult Body Temperature (2 of 3)
4000 g birth weight:
5-Number Summary
For a set of data, the 5-number summary consists of these five values: 1. Minimum 2. First quartile, Q1 3. Second quartile, Q2 (same as the median) 4. Third quartile, Q3 5. Maximum
Modified Boxplots
A modified boxplot is a regular boxplot constructed with these modifications: 1. A special symbol (such as an asterisk or point) is used to identify outliers as defined above, and 2. the solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier.
z Scores
A z score (or standard score or standardized value) is the number of standard deviations that a given value x is above or below the mean. The z score is calculated by using one of the following: Sample: z = (x − x̄)/s. Population: z = (x − µ)/σ.
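As a small Python sketch of this definition, a z score is the distance from the mean divided by the standard deviation. The function name is hypothetical, and the inputs reuse the IQ figures from this chapter (mean 100, standard deviation 15).

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z = z_score(130, 100, 15)
print(round(z, 2))   # rounded to two decimal places, as the round-off rule asks
```

The same function serves for a sample (x̄ and s) or a population (µ and σ); only the inputs change.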
Example: The Empirical Rule (1 of 2)
IQ scores have a bell-shaped distribution with a mean of 100 and a standard deviation of 15. What percentage of IQ scores are between 70 and 130?
Example: Computing Grade-Point Average
In her first semester of college, a student of the author took five courses. Her final grades, along with the number of credits for each course, were A (3 credits), A (4 credits), B (3 credits), C (3 credits), and F (1 credit). The grading system assigns quality points to letter grades as follows: A = 4; B = 3; C = 2; D = 1; F = 0. Compute her grade-point average.
Calculating the Mean from a Frequency Distribution
First multiply each frequency and class midpoint; then add the products and divide by the sum of the frequencies: x̄ = ∑(f · x) / ∑f, where f is a class frequency and x is the corresponding class midpoint.
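A minimal Python sketch of this calculation follows. The midpoints and frequencies are hypothetical values chosen only to show the arithmetic; they are not the table used in the chapter's example.

```python
def mean_from_frequencies(midpoints, frequencies):
    """Approximate the mean by treating every value as its class midpoint."""
    total = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f * x
    return total / sum(frequencies)                             # divide by sum of f


# Hypothetical frequency distribution (not from the text)
midpoints = [10, 20, 30]
frequencies = [2, 3, 5]
print(mean_from_frequencies(midpoints, frequencies))
```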
Example: Median with an Even Number of Data Values
Repeat the previous example after including the sixth data speed of 24.5 Mbps. That is, find the median of these data speeds: 38.5, 55.6, 22.4, 14.1, 23.1, 24.5 (all in Mbps). Solution First arrange the values in ascending order: 14.1 22.4 23.1 24.5 38.5 55.6 Because the number of data values is an even number (6), the median is found by computing the mean of the two middle numbers, which are 23.1 and 24.5. Median = (23.1 + 24.5)/2 = 23.80 Mbps.
Round-off Rule for the Coefficient of Variation
Round the coefficient of variation to one decimal place (such as 25.3%).
Round-off Rule for z Scores
Round z scores to two decimal places (such as 2.31).
Important Properties of the Mean
Sample means drawn from the same population tend to vary less than other measures of center. The mean of a data set uses every data value. A disadvantage of the mean is that just one extreme value (outlier) can change the value of the mean substantially. (Using the following definition, we say that the mean is not resistant.)
Standard Deviation of a Sample
Sample standard deviation: s = √[ ∑(x − x̄)² / (n − 1) ]
Standard Deviation of a Sample
Shortcut formula for sample standard deviation (used by calculators and software): s = √[ (n∑x² − (∑x)²) / (n(n − 1)) ]
Using z Scores to Identify Significant Values
Significant values are those with z scores ≤ −2.00 or ≥ 2.00.
Range Rule of Thumb for Identifying Significant Values
Significantly low values are µ − 2σ or lower. Significantly high values are µ + 2σ or higher. Values not significant are between (µ − 2σ) and (µ + 2σ).
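The three cases of the range rule of thumb can be sketched in Python as below. The function name and the returned labels are illustrative choices; the inputs reuse the IQ mean of 100 and standard deviation of 15 from this chapter.

```python
def classify(x, mu, sigma):
    """Label x using the cutoffs mu - 2*sigma and mu + 2*sigma."""
    if x <= mu - 2 * sigma:
        return "significantly low"
    if x >= mu + 2 * sigma:
        return "significantly high"
    return "not significant"


for score in (65, 100, 130):
    print(score, classify(score, 100, 15))
```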
Example: Finding a Percentile (2 of 3)
Solution From the sorted list of airport data speeds in the table, we see that there are 20 data speeds less than 11.8 Mbps out of 50 data values, so the percentile of 11.8 = (20/50) · 100 = 40.
Example: Finding a Percentile (1 of 3)
The airport Verizon cell phone data speeds listed below are arranged in increasing order. Find the percentile for the data speed of 11.8 Mbps.
Example: Computing the Mean from a Frequency Distribution (1 of 2)
The first two columns of the table shown here are the same as the frequency distribution of Table 2-2 from Chapter 2. Use the frequency distribution in the first two columns to find the mean.
Standard Deviation of a Sample (1 of 2)
The standard deviation of a set of sample values, denoted by s, is a measure of how much data values deviate from the mean. Notation: s = sample standard deviation; σ = population standard deviation.
Why Divide by (n − 1)?
There are only n − 1 values that can be assigned without constraint. With a given mean, we can use any numbers for the first n − 1 values, but the last value will then be automatically determined. With division by n − 1, sample variances s² tend to center around the value of the population variance σ²; with division by n, sample variances s² tend to underestimate the value of the population variance σ².
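One way to see this claim concretely is to list every possible sample and average the resulting variances. The Python sketch below does that for a tiny population; the population values 4, 5, 9 and the sample size of 2 (drawn with replacement) are assumptions made for illustration and do not come from the text.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [4.0, 5.0, 9.0]                     # hypothetical population
samples = list(product(population, repeat=2))    # all 9 samples of size 2

# statistics.variance divides by n - 1; statistics.pvariance divides by n
avg_dividing_by_n_minus_1 = mean([variance(s) for s in samples])
avg_dividing_by_n = mean([pvariance(s) for s in samples])
sigma_squared = pvariance(population)            # true population variance

print(sigma_squared, avg_dividing_by_n_minus_1, avg_dividing_by_n)
```

In this small case the variances computed with n − 1 average out to σ² exactly, while those computed with n average to half of it.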
Calculation and Notation of the Median
To find the median, first sort the values (arrange them in order) and then follow one of these two procedures: If the number of data values is odd, the median is the number located in the exact middle of the sorted list. If the number of data values is even, the median is found by computing the mean of the two middle numbers in the sorted list.
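The two procedures can be sketched in Python as a single function. The function name is illustrative, and the data are the Verizon data speeds used in this chapter's median examples.

```python
def median(values):
    """Middle value of the sorted list, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                   # odd count: exact middle
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2     # even count: mean of middle pair


print(median([38.5, 55.6, 22.4, 14.1, 23.1]))         # five values
print(median([38.5, 55.6, 22.4, 14.1, 23.1, 24.5]))   # six values
```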
Other Mode Examples
Two modes: The data speeds (Mbps) of 0.3, 0.3, 0.6, 4.0, and 4.0 have two modes: 0.3 Mbps and 4.0 Mbps. No mode: The data speeds (Mbps) of 0.3, 1.1, 2.4, 4.0, and 5.0 have no mode because no value is repeated.
Example: Calculating Standard Deviation
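Use the sample standard deviation formula to find the standard deviation of these Verizon data speeds (in Mbps): 38.5, 55.6, 22.4, 14.1, 23.1. Solution The mean is x̄ = 153.7/5 = 30.74 Mbps. The sum of the squared deviations is ∑(x − x̄)² = 1083.052, so s = √(1083.052/4) = √270.763 ≈ 16.45 Mbps.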
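The same calculation can be checked with a short Python sketch that follows the definition step by step: find the mean, sum the squared deviations, divide by n − 1, and take the square root. The function name is an illustrative choice.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return sqrt(squared_deviations / (n - 1))


speeds = [38.5, 55.6, 22.4, 14.1, 23.1]   # Verizon data speeds in Mbps
print(round(sample_sd(speeds), 2))
```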
Example: Constructing a Boxplot (1 of 2)
Use the Verizon airport data speeds to construct a boxplot.
Example: Finding a 5-Number Summary (1 of 3)
Use the Verizon airport data speeds to find the 5-number summary.
Finding the Mode
A data set can have one mode, more than one mode, or no mode. When two data values occur with the same greatest frequency, each one is a mode and the data set is said to be bimodal. When more than two data values occur with the same greatest frequency, each is a mode and the data set is said to be multimodal. When no data value is repeated, we say that there is no mode.
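These cases can be sketched in Python with a frequency count. The function name is hypothetical, and returning an empty list for "no mode" is a convention chosen for this sketch; the data are the two small sets of data speeds from the "Other Mode Examples" slide. The sketch assumes a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the greatest frequency, or [] if none repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                  # no value is repeated, so there is no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([0.3, 0.3, 0.6, 4.0, 4.0]))   # bimodal
print(modes([0.3, 1.1, 2.4, 4.0, 5.0]))   # no mode
```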
Important Properties of the Midrange
Because the midrange uses only the maximum and minimum values, it is very sensitive to those extremes so the midrange is not resistant. In practice, the midrange is rarely used, but it has three redeeming features: 1. The midrange is very easy to compute. 2. The midrange helps reinforce the very important point that there are several different ways to define the center of a data set. 3. The value of the midrange is sometimes used incorrectly for the median, so confusion can be reduced by clearly defining the midrange along with the median.
Example: Comparing a Baby's Weight and Adult Body Temperature (3 of 3)
Interpretation
Example: Finding a Percentile (3 of 3)
Interpretation A data speed of 11.8 Mbps is in the 40th percentile. This can be interpreted loosely as this: A data speed of 11.8 Mbps separates the lowest 40% of values from the highest 60% of values. We have P40 = 11.8 Mbps.
Quartiles
Quartiles are measures of location, denoted Q1, Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group.
Range Rule of Thumb for Estimating a Value of the Standard Deviation s
To roughly estimate the standard deviation from a collection of known sample data, use s ≈ range/4.
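As a rough Python sketch, the estimate is the range divided by 4. The function name is illustrative, and the data are the five Verizon data speeds used elsewhere in this chapter; with so few values the estimate is only a crude one.

```python
def estimate_sd(values):
    """Range rule of thumb: s is roughly (max - min) / 4."""
    return (max(values) - min(values)) / 4


speeds = [38.5, 55.6, 22.4, 14.1, 23.1]   # Verizon data speeds in Mbps
print(estimate_sd(speeds))
```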
Example: Converting a Percentile to a Data Value (1 of 4)
Refer to the sorted data speeds below. Find the 40th percentile, denoted by P40.
Example: Comparing a Baby's Weight and Adult Body Temperature (2 of 3)
Solution
Example: Computing Grade-Point Average
Solution
Example: Chebyshev's Theorem
Solution Applying Chebyshev's theorem with a mean of 100 and a standard deviation of 15, we can reach the following conclusions:
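The calculation behind such conclusions can be sketched in Python. Chebyshev's theorem says that the proportion of any data set lying within K standard deviations of the mean is at least 1 − 1/K², for K > 1. The function name below is an illustrative choice.

```python
def chebyshev_interval(mean, sd, k):
    """Return (minimum proportion, lower limit, upper limit) for k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    proportion = 1 - 1 / k ** 2
    return proportion, mean - k * sd, mean + k * sd


for k in (2, 3):
    proportion, low, high = chebyshev_interval(100, 15, k)
    print(f"At least {proportion:.0%} of IQ scores are between {low} and {high}")
```

For K = 2 the bound is 3/4, and for K = 3 it is 8/9.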
Example: Finding a 5-Number Summary (2 of 3)
Solution Because the Verizon airport data speeds are sorted, it is easy to see that the minimum is 0.8 Mbps and the maximum is 77.8 Mbps.
Example: Converting a Percentile to a Data Value (3 of 4)
Solution Since L = 20 is a whole number, we proceed to the box located at the right. We now see that the value of the 40th percentile is midway between the Lth (20th) value and the next value in the original set of data. That is, the value of the 40th percentile is midway between the 20th value and the 21st value.
Example: Converting a Percentile to a Data Value (4 of 4)
Solution The 20th value in the table is 11.6 and the 21st value is 11.8, so the value midway between them is 11.7 Mbps. We conclude that the 40th percentile is P40 = 11.7 Mbps.
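The locator procedure can be sketched in Python as follows. Two things here are assumptions: the handling of a locator that is not a whole number (round it up and take that value), which the slides above do not show, and the ten-value data set, which is hypothetical because the chapter's sorted table is not reproduced here. The function name is also illustrative, and k is taken to be a whole number from 1 to 99.

```python
def percentile_value(sorted_values, k):
    """Find P_k using the locator L = (k / 100) * n on already sorted data."""
    n = len(sorted_values)
    if (k * n) % 100 == 0:
        # L is a whole number: P_k is midway between the Lth and (L + 1)th values
        L = k * n // 100
        return (sorted_values[L - 1] + sorted_values[L]) / 2
    # Assumed rule: L is not whole, so round it up and take the Lth value
    L = k * n // 100 + 1
    return sorted_values[L - 1]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical sorted data
print(percentile_value(data, 40))         # L = 4, midway between 4th and 5th values
print(percentile_value(data, 25))         # L = 2.5, rounded up to the 3rd value
```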
Example: Constructing a Boxplot (2 of 2)
Solution The boxplot uses the 5-number summary found in the previous example: 0.8, 7.9, 13.9, 21.5, and 77.8 (all in units of Mbps). Below is the boxplot representing the Verizon airport data speeds.
Example: The Empirical Rule (2 of 2)
Solution The key is to recognize that 70 and 130 are each exactly 2 standard deviations away from the mean of 100. 2 standard deviations = 2s = 2(15) = 30. The values 2 standard deviations from the mean are 100 − 30 = 70 and 100 + 30 = 130. About 95% of all IQ scores are between 70 and 130.
Example: Finding a 5-Number Summary (3 of 3)
Solution The value of the first quartile is Q1 = 7.9 Mbps. The median is equal to Q2, and it is 13.9 Mbps. Also, we can find that Q3 = 21.5 Mbps by using the same procedure for finding P75. The 5-number summary is therefore 0.8, 7.9, 13.9, 21.5, and 77.8 (all in units of Mbps).
Example: Computing Grade-Point Average
Solution Use the numbers of credits as weights: w = 3, 4, 3, 3, 1. Replace the letter grades of A, A, B, C, and F with the corresponding quality points: x = 4, 4, 3, 2, 0.
Example: Computing the Mean from a Frequency Distribution
Solution When working with data summarized in a frequency distribution, we make calculations possible by pretending that all sample values in each class are equal to the class midpoint. The result of x̄ = 160.5 seconds is an approximation because it is based on the use of class midpoint values instead of the original list of service times.
Example: Computing Grade-Point Average
Solution The result is a first-semester grade-point average of 3.07. (In using the preceding round-off rule, the result should be rounded to 3.1, but it is common to round grade-point averages to two decimal places.)
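The grade-point calculation can be reproduced with a short Python sketch of the weighted mean, ∑(w · x) / ∑w. The function name is illustrative; the weights and quality points are the ones given in this example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


credits = [3, 4, 3, 3, 1]          # weights w
quality_points = [4, 4, 3, 2, 0]   # x values for A, A, B, C, F
print(round(weighted_mean(quality_points, credits), 2))
```

The weighted sum is 43 quality points over 14 credits.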
Example: Converting a Percentile to a Data Value (2 of 4)
Solution We can proceed to compute the value of the locator L. In this computation, we use k = 40 because we are attempting to find the value of the 40th percentile, and we use n = 50 because there are 50 data values.
Standard Deviation of a Population
Standard Deviation of a Population A different formula is used to calculate the standard deviation σ of a population: Instead of dividing by n − 1 for a sample, we divide by the population size N.
Example: Median with an Odd Number of Data Values
Find the median of the first five data speeds for Verizon: 38.5, 55.6, 22.4, 14.1, and 23.1 (all in megabits per second, or Mbps). Solution First sort the data values by arranging them in ascending order: 14.1 22.4 23.1 38.5 55.6 Because the number of data values is an odd number (5), the median is the number located in the exact middle of the sorted list, which is 23.1 Mbps.
Example: Midrange
Find the midrange of these Verizon data speeds: 38.5, 55.6, 22.4, 14.1, and 23.1 (all in Mbps). Solution The midrange is found as follows: Midrange = (maximum data value + minimum data value)/2 = (55.6 + 14.1)/2 = 34.85 Mbps.
Example: Mode
Find the mode of these Sprint data speeds (in Mbps): Solution The mode is 0.3 Mbps, because it is the data speed occurring most often (three times).
Example: Range
Find the range of these Verizon data speeds (Mbps): 38.5, 55.6, 22.4, 14.1, 23.1. Solution Range = (maximum value) − (minimum value) = 55.6 − 14.1 = 41.50 Mbps
Example: Calculating Standard Deviation Using Shortcut Formula
Find the standard deviation of the Verizon data speeds (Mbps) of 38.5, 55.6, 22.4, 14.1, 23.1 Solution
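A Python sketch of the shortcut calculation is given below. It uses only the sums ∑x and ∑x², so the mean never has to be computed first. The function name is an illustrative choice.

```python
from math import sqrt


def sample_sd_shortcut(values):
    """Shortcut formula: sqrt( (n * sum(x^2) - (sum(x))^2) / (n * (n - 1)) )."""
    n = len(values)
    sum_x = sum(values)
    sum_x_squared = sum(x * x for x in values)
    return sqrt((n * sum_x_squared - sum_x ** 2) / (n * (n - 1)))


speeds = [38.5, 55.6, 22.4, 14.1, 23.1]   # Verizon data speeds in Mbps
print(round(sample_sd_shortcut(speeds), 2))
```

For these speeds ∑x = 153.7 and ∑x² = 5807.79, and the result agrees with the value from the defining formula.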
Chebyshev's Theorem
IQ scores have a mean of 100 and a standard deviation of 15. What can we conclude from Chebyshev's theorem?
Mean (or Arithmetic Mean)
The mean (or arithmetic mean) of a set of data is the measure of center found by adding all of the data values and dividing the total by the number of data values.
Important Properties of the Median
The median does not change by large amounts when we include just a few extreme values, so the median is a resistant measure of center. The median does not directly use every data value. (For example, if the largest value is changed to a much larger value, the median does not change.)
Median
The median of a data set is the measure of center that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude.
Midrange
The midrange of a data set is the measure of center that is the value midway between the maximum and minimum values in the original data set. It is found by adding the maximum data value to the minimum data value and then dividing the sum by 2, as in the following formula: Midrange = (maximum data value + minimum data value)/2
Important Properties of the Mode
The mode can be found with qualitative data. A data set can have no mode or one mode or multiple modes.
Mode
The mode of a data set is the value(s) that occur(s) with the greatest frequency.
Finding the Percentile of a Data Value
The process of finding the percentile that corresponds to a particular data value x is given by the following (round the result to the nearest whole number): Percentile of value x = (number of values less than x)/(total number of values) · 100
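A minimal Python sketch of this process counts the values below x and expresses that count as a percentage of all values. The function name and the ten-value data set are hypothetical and serve only to show the arithmetic.

```python
def percentile_of(values, x):
    """Percentile of x: (number of values less than x) / (total number) * 100."""
    count_below = sum(1 for v in values if v < x)
    return round(100 * count_below / len(values))   # nearest whole number


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical data
print(percentile_of(data, 5))             # 4 of the 10 values are below 5
```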
Range
The range of a set of data values is the difference between the maximum data value and the minimum data value. Range = (maximum data value) − (minimum data value)
Range Rule of Thumb for Understanding Standard Deviation
The range rule of thumb is a crude but simple tool for understanding and interpreting standard deviation. The vast majority (such as 95%) of sample values lie within 2 standard deviations of the mean.
Biased and Unbiased Estimators
The sample standard deviation s is a biased estimator of the population standard deviation σ, which means that values of the sample standard deviation s do not tend to center around the value of the population standard deviation σ. The sample variance s² is an unbiased estimator of the population variance σ², which means that values of s² tend to center around the value of σ² instead of systematically tending to overestimate or underestimate σ².
Important Properties of Standard Deviation
The standard deviation s can increase dramatically with one or more outliers. The units of the standard deviation s (such as minutes, feet, pounds) are the same as the units of the original data values. The sample standard deviation s is a biased estimator of the population standard deviation σ, which means that values of the sample standard deviation s do not center around the value of σ.
Important Properties of Variance
The units of the variance are the squares of the units of the original data values. The value of the variance can increase dramatically with the inclusion of outliers. (The variance is not resistant.) The value of the variance is never negative. It is zero only when all of the data values are the same number. The sample variance s² is an unbiased estimator of the population variance σ².
Variance of a Sample and a Population
The variance of a set of values is a measure of variation equal to the square of the standard deviation. Sample variance: s² = square of the standard deviation s. Population variance: σ² = square of the population standard deviation σ.
Key Concept
This section introduces measures of relative standing, which are numbers showing the location of data values relative to the other values within the same data set. The most important concept in this section is the z score. We also discuss percentiles and quartiles, which are common statistics, as well as another statistical graph called a boxplot.
Key Concept
Variation is the single most important topic in statistics, so this is the single most important section in this book. This section presents three important measures of variation: range, standard deviation, and variance. These statistics are numbers, but our focus is not just computing those numbers but developing the ability to interpret and understand them.
Weighted Mean
When different x data values are assigned different weights w, we can compute a weighted mean.
Example: Comparing a Baby's Weight and Adult Body Temperature (1 of 3)
Which of the following two data values is more extreme relative to the data set from which it came?
Notation
n = total number of values in the data set
k = percentile being used (Example: For the 25th percentile, k = 25.)
L = locator that gives the position of a value (Example: For the 12th value in the sorted list, L = 12.)
Pk = kth percentile (Example: P25 is the 25th percentile.)
Notation Summary
s = sample standard deviation
s² = sample variance
σ = population standard deviation
σ² = population variance
Notation
µ is pronounced "mu" and is the mean of all values in a population.
Notation
∑ denotes the sum of a set of data values. x is the variable usually used to represent the individual data values. n represents the number of data values in a sample. N represents the number of data values in a population.
Descriptions of Quartiles (1 of 2)
Q1 (First quartile): Same value as P25. It separates the bottom 25% of the sorted values from the top 75%. Q2 (Second quartile): Same as P50 and same as the median. It separates the bottom 50% of the sorted values from the top 50%.
Boxplot (or Box-and-Whisker Diagram)
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile Q1, the median, and the third quartile Q3.
Skewness
A boxplot can often be used to identify skewness. A distribution of data is skewed if it is not symmetric and extends more to one side than to the other.
Mean
Caution Never use the term average when referring to a measure of center. The word average is often used for the mean, but it is sometimes used for other measures of center. The term average is not used by statisticians, the statistics community, or professional journals.
Comparing Variation in Different Samples or Populations
Coefficient of Variation: The coefficient of variation (or CV) for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean, and is given by the following: Sample: CV = (s/x̄) · 100%. Population: CV = (σ/µ) · 100%.
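A brief Python sketch of the sample version follows: the sample standard deviation divided by the sample mean, times 100. The function name is illustrative, and the data are the five Verizon data speeds used elsewhere in this chapter.

```python
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """Sample CV as a percent: (s / x-bar) * 100, for nonnegative data."""
    return stdev(sample) / mean(sample) * 100   # stdev divides by n - 1


speeds = [38.5, 55.6, 22.4, 14.1, 23.1]   # Verizon data speeds in Mbps
print(f"{coefficient_of_variation(speeds):.1f}%")   # one decimal place
```

Because the CV has no units, it allows variation to be compared across data sets measured on different scales.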
Example: Mean (1 of 2)
Data Set 32 "Airport Data Speeds" in Appendix B includes measures of data speeds of smartphones from four different carriers. Find the mean of the first five data speeds for Verizon: 38.5, 55.6, 22.4, 14.1, and 23.1 (all in megabits per second, or Mbps).
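The mean for this example can be checked with a short Python sketch that adds the data values and divides by their number. The function name is an illustrative choice.

```python
def arithmetic_mean(values):
    """Mean: the sum of the data values divided by the number of data values."""
    return sum(values) / len(values)


speeds = [38.5, 55.6, 22.4, 14.1, 23.1]   # first five Verizon data speeds in Mbps
print(round(arithmetic_mean(speeds), 2))
```

The five values sum to 153.7, so the mean is 153.7/5 = 30.74 Mbps.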
Descriptions of Quartiles (2 of 2)
Q3 (Third quartile): Same as P75. It separates the bottom 75% of the sorted values from the top 25%. Caution Just as there is not universal agreement on a procedure for finding percentiles, there is not universal agreement on a single procedure for calculating quartiles, and different technologies often yield different results.
Percentiles
Percentiles are measures of location, denoted P1, P2, . . . , P99, which divide a set of data into 100 groups with about 1% of the values in each group.
Empirical Rule for Data with a Bell-Shaped Distribution
The empirical rule states that for data sets having a distribution that is approximately bell-shaped, the following properties apply. About 68% of all values fall within 1 standard deviation of the mean. About 95% of all values fall within 2 standard deviations of the mean. About 99.7% of all values fall within 3 standard deviations of the mean.
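The intervals named by the empirical rule can be sketched in Python as below. The function name and the dictionary layout are illustrative choices, the rule applies only to data that are approximately bell-shaped, and the inputs reuse the IQ mean of 100 and standard deviation of 15 from this chapter.

```python
def empirical_rule(mean, sd):
    """Map each approximate percentage to its (low, high) interval."""
    return {
        percent: (mean - k * sd, mean + k * sd)
        for k, percent in ((1, 68.0), (2, 95.0), (3, 99.7))
    }


for percent, (low, high) in empirical_rule(100, 15).items():
    print(f"About {percent}% of values fall between {low} and {high}")
```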