Chapter 1.3 Describing Quantitative Data with Numbers
The 1.5 × IQR rule for outliers
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
Summary: Quartiles, 1st Quartile, 3rd Quartile, Interquartile Range (IQR), Outlier
When you use the median to indicate the center of a distribution, describe its spread using the quartiles. The first quartile Q1 has about one-fourth of the observations below it, and the third quartile Q3 has about three-fourths of the observations below it. The interquartile range (IQR) is the range of the middle 50% of the observations and is found by IQR = Q3 − Q1. An extreme observation is an outlier if it is smaller than Q1 − (1.5 × IQR) or larger than Q3 + (1.5 × IQR).
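The rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names outlier_fences and is_outlier and the quartile values used below are hypothetical, chosen only to show the arithmetic.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs given by the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, q1, q3):
    """True if x is smaller than the lower cutoff or larger than the upper cutoff."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper


# Hypothetical quartiles, not taken from any data set in these notes.
q1, q3 = 10, 35                        # IQR = 25, so 1.5 x IQR = 37.5
print(outlier_fences(q1, q3))          # (-27.5, 72.5)
print(is_outlier(85, q1, q3))          # True
print(is_outlier(40, q1, q3))          # False
```

An observation exactly equal to a cutoff is not flagged, because the rule says "smaller than" and "larger than".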
Measuring Spread: Range and Interquartile Range (IQR)
The first quartile Q1 lies one-quarter of the way up the list. The second quartile is the median, which is halfway up the list. The third quartile Q3 lies three-quarters of the way up the list. These quartiles mark out the middle half of the distribution. The interquartile range (IQR) measures the range of the middle 50% of the data: IQR = Q3 − Q1.
Summary: Variance, standard deviation, resistant
The variance and especially its square root, the standard deviation sx, are common measures of spread about the mean. The standard deviation sx is zero when there is no variability and gets larger as the spread increases. The median is a resistant measure of center because it is relatively unaffected by extreme observations. The mean is nonresistant. Among measures of spread, the IQR is resistant, but the standard deviation and range are not.
Resistant measure
The mean is sensitive to the influence of extreme observations. These may be outliers, but a skewed distribution that has no outliers will also pull the mean toward its long tail. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.
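A small experiment makes the idea concrete. The sketch below uses two hypothetical data sets that differ only in their largest value and compares what happens to the mean and to the median. The helper name mean is an illustration only.

```python
from statistics import median


def mean(values):
    """Add the values and divide by how many there are."""
    return sum(values) / len(values)


# Hypothetical data: the two lists differ only in the largest observation.
original = [10, 12, 13, 15, 20]
with_extreme = [10, 12, 13, 15, 200]

print(mean(original), median(original))          # 14.0 13
print(mean(with_extreme), median(with_extreme))  # 50.0 13
```

One extreme observation moves the mean from 14 to 50, while the median stays at 13.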
Summary: The 5-number summary, boxplots
The five-number summary consisting of the median, the quartiles, and the maximum and minimum values provides a quick overall description of a distribution. The median describes the center, and the IQR and range describe the spread. Boxplots based on the five-number summary are useful for comparing distributions. The box spans the quartiles and shows the spread of the middle half of the distribution. The median is marked within the box. Lines extend from the box to the smallest and the largest observations that are not outliers. Outliers are plotted as isolated points.
The Five-Number Summary
The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. That is, the five-number summary is: Minimum, Q1, Median, Q3, Maximum. These five numbers divide each distribution roughly into quarters. About 25% of the data values fall between the minimum and Q1, about 25% are between Q1 and the median, about 25% are between the median and Q3, and about 25% are between Q3 and the maximum.
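Here is one way the five numbers might be computed. The sketch assumes a common textbook convention that these notes do not spell out: Q1 is the median of the observations to the left of the median's position in the ordered list, and Q3 is the median of the observations to its right. Statistical software often uses other quartile conventions, so results can differ slightly. The function name and the data are hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of at least two values."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # observations left of the median's position
    upper = ordered[n - half:]    # observations right of the median's position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


# Hypothetical data, used only for illustration.
data = [30, 5, 85, 10, 20, 40, 10, 25, 15]
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)   # 5 10.0 20 35.0 85
print(q3 - q1)                         # IQR: 25.0
```

When n is odd, the median itself is left out of both halves. When n is even, the ordered list splits cleanly in two.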
Summary: Mean and Median
The mean and the median describe the center of a distribution in different ways. The mean is the average of the observations, and the median is the midpoint of the values.
Choosing measures of center and spread
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use the mean x̄ and the standard deviation sx only for reasonably symmetric distributions that don't have outliers.
Median
The median is the midpoint of a distribution, the number such that about half the observations are smaller and about half are larger. To find the median of a distribution: 1. Arrange all observations in order of size, from smallest to largest. 2. If the number of observations n is odd, the median is the center observation in the ordered list. 3. If the number of observations n is even, the median is the average of the two center observations in the ordered list.
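The three steps translate directly into code. This is a minimal sketch that assumes a non-empty list of numbers. The function name median_by_steps is hypothetical.

```python
def median_by_steps(observations):
    """Find the median by following the three steps for a non-empty list."""
    ordered = sorted(observations)       # Step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # Step 2: n odd, take the center observation
        return ordered[mid]
    # Step 3: n even, average the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_steps([7, 1, 5]))      # 5
print(median_by_steps([7, 1, 5, 3]))   # 4.0
```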
Mean
The most common measure of center. To find the mean x̄ (pronounced "x-bar") of a set of observations, add their values and divide by the number of observations: x̄ = (x1 + x2 + ... + xn) / n = (Σ xi) / n. The Σ (capital Greek letter sigma) in the formula for the mean is short for "add them all up." The subscripts on the observations xi are just a way of keeping the n observations distinct. They do not necessarily indicate order or any other special facts about the data. Actually, the notation x̄ refers to the mean of a sample. Most of the time, the data we'll encounter can be thought of as a sample from some larger population. When we need to refer to a population mean, we'll use the symbol μ (Greek letter mu, pronounced "mew"). If you have the entire population of data available, then you calculate μ in just the way you'd expect: add the values of all the observations, and divide by the number of observations.
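As a quick check on the arithmetic, the calculation can be written out as below. The function name sample_mean and the three data values are hypothetical. The same calculation gives μ when the list holds an entire population.

```python
def sample_mean(observations):
    """Add the values and divide by the number of observations."""
    n = len(observations)
    total = sum(observations)    # the "add them all up" step
    return total / n


print(sample_mean([2, 4, 9]))    # 5.0
```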
The Standard Deviation
The square root of the variance. The standard deviation sx measures the typical distance of the values in a distribution from the mean. It is calculated by finding an average of the squared deviations and then taking the square root. This average squared deviation is called the variance. The properties: sx measures spread about the mean and should be used only when the mean is chosen as the measure of center. sx is always greater than or equal to 0. sx = 0 only when there is no variability. This happens only when all observations have the same value. Otherwise, sx > 0. As the observations become more spread out about their mean, sx gets larger. sx has the same units of measurement as the original observations. For example, if you measure metabolic rates in calories, both the mean x̄ and the standard deviation sx are also in calories. This is one reason to prefer sx to the variance sx², which is in squared calories. Like the mean x̄, sx is not resistant. A few outliers can make sx very large. The use of squared deviations makes sx even more sensitive than x̄ to a few extreme observations. For example, the standard deviation of the travel times for the 15 North Carolina workers is 15.23 minutes. If we omit the maximum value of 60 minutes, the standard deviation drops to 11.56 minutes.
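The calculation can be sketched as follows. One detail is an assumption, because these notes say only "an average of the squared deviations": the sketch divides by n − 1, the usual convention for a sample. The function names and the data are hypothetical, and at least two observations are required.

```python
from math import sqrt


def sample_variance(observations):
    """Average squared deviation from the mean (assumption: divide by n - 1)."""
    n = len(observations)
    x_bar = sum(observations) / n
    squared_deviations = [(x - x_bar) ** 2 for x in observations]
    return sum(squared_deviations) / (n - 1)


def sample_sd(observations):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(observations))


# Hypothetical data, used only for illustration.
print(sample_sd([2, 4, 6]))     # 2.0  (squared deviations 4, 0, 4; 8 / 2 = 4)
print(sample_sd([5, 5, 5]))     # 0.0  (no variability)
print(sample_sd([2, 4, 6, 40]) > sample_sd([2, 4, 6]))   # True: one extreme value inflates sx
```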