01.02 Describing Data
Examples of Shape
1. Symmetric 2. Skewed right (a graph that has a long tail on the right side of the data set) 3. Skewed left (a graph that has a long tail on the left side of the data set)
Skewed left
A graph that has a long tail on the left side of the data set
Skewed right
A graph that has a long tail on the right side of the data set
Resistant value
A measure whose value is not strongly affected by adding extreme values to the data set
Standard deviation of a population
Calculated by finding an average of the squared deviations and then taking its square root
Range
Maximum value − Minimum value
Mode
The mode is the number in a data set that occurs most often. It is not used frequently at this level of statistics.
Variance
The average squared distance from the mean
Center
The center of a data set is described by either the mean or the median of the set of values. Unless the data set is symmetric, the median, rather than the mean, should be used to describe the center, because the median is a resistant measure whereas the mean is not.
Uniform
The data do not appear to have any distinct modes; there are no clear peaks on the graph
Shape
The overall form the graph of the data takes (the graph may be a histogram, stem-and-leaf plot, dotplot, or boxplot)
Standard deviation of a sample
The square root of the variance
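The difference between the population and sample versions can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function names and the data values are made up, and the sample version assumes the usual convention of dividing by n − 1 rather than n.

```python
import math

def population_sd(values):
    """Population SD: average the squared deviations (divide by n), then take the square root."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)

def sample_sd(values):
    """Sample SD: same idea, but divide by n - 1 (assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5
print(population_sd(data))  # 2.0
print(sample_sd(data))      # slightly larger, because of the n - 1 divisor
```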
Does standard deviation describe the spread?
Yes, the more spread out the data, the greater the standard deviation
Is standard deviation based on mean?
Yes. Because the mean is not resistant, the standard deviation is not resistant either.
Quartiles
A specific type of percentile: Q1, the median, and Q3 are the 25th, 50th, and 75th percentiles
If the scores on a 22-point quiz from a class have been gathered and you know the five-number summary for the class is 5, 12, 14, 17, and 21, you can tell the following:
1. The minimum score in the class was 5 points. 2. The first quartile (Q1) is 12. Twenty-five percent of students earned 12 points or less, and 75% earned 12 points or more. 3. The median is 14. Fifty percent of students earned 14 points or less and fifty percent of students earned 14 points or more. 4. The third quartile (Q3) is 17. Seventy-five percent of students earned 17 points or less, and 25% earned 17 points or more. 5. The maximum score in the class was 21 points.
Symmetric
A graph in which the left and right sides are mirror images, either exactly or roughly
Interquartile Range (IQR)
Another way to find the spread of a data set; unlike range, it's resistant because it is not affected by extreme values (IQR = Q3 - Q1)
Outliers
Any unusual parts of the data set that do not fit the pattern of the data set. Informally, outliers can be found by looking at the data or graph; but when you use this method to describe the data, you have to say, "the outliers appear to be ...." Formally, an outlier is a point that falls more than 1.5 times the IQR above the third quartile or below the first quartile. Lower limit: Outlier < Q1 − 1.5(IQR). Upper limit: Outlier > Q3 + 1.5(IQR).
Standard deviation
Can be used to describe the spread; measures the average distance of the observations from the mean; not a resistant measure of spread
Five-number summary of a distribution
Consists of the smallest number, the first quartile, the median, the third quartile, and the largest number, written in order. The summary is: Minimum, Q1, Median, Q3, Maximum.
Finding first quartile
Find the median of the lower half of the data (the lower half of the data does not include the median of the data)
Finding the third quartile
Find the median of the upper half of the data (the upper half of the data does not include the median of the data)
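The quartile rule on these cards (median of each half, leaving the overall median out of both halves) could be sketched as follows. The function names and the two small data sets are made up for illustration. Other software may use a different quartile convention and give slightly different values.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

odd_count = [1, 3, 5, 7, 9, 11, 13]  # made-up data, 7 values
even_count = [1, 3, 5, 7, 9, 11]     # made-up data, 6 values
print(quartiles(odd_count))   # (3, 7, 11)
print(quartiles(even_count))  # (3, 6.0, 9)
```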
Resistance
How a measure is influenced by extreme values
Outlier Example (solution)
IQR = Q3 − Q1 = 20 − 10 = 10. Outlier limits: Q1 − 1.5(IQR) = 10 − 1.5(10) = 10 − 15 = −5; Q3 + 1.5(IQR) = 20 + 1.5(10) = 20 + 15 = 35. Therefore, any value less than −5 or greater than 35 is an outlier. Because 51 is greater than 35, it is an outlier. We indicate the outlier with a dot, and only draw the whisker to the next greatest value, which is 22.
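The same check can be written as a short Python sketch. The function names are hypothetical. The 15 values are the salary list from this example, with Q1 = 10 and Q3 = 20 taken from the solution rather than recomputed.

```python
def outlier_limits(q1, q3):
    """Return the (lower, upper) limits from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Return every value strictly below the lower limit or above the upper limit."""
    lower, upper = outlier_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]

salaries = [8, 8, 8, 10, 10, 10, 16, 18, 18, 18, 18, 20, 20, 22, 51]

lower, upper = outlier_limits(10, 20)
print(lower, upper)                     # -5.0 35.0
print(find_outliers(salaries, 10, 20))  # [51]

# The upper whisker stops at the largest value that is not an outlier.
whisker_end = max(x for x in salaries if x <= upper)
print(whisker_end)                      # 22
```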
Spread
The spread of a data set is used to describe the variability in the data. One way to describe spread is to find the range of the data, subtracting the smallest point of data from the largest point of data.
Outlier Example (Part 1)
Mean: To find the new mean, add 51 to the total salaries and add 1 to the number of employees: (204 + 51)/(14 + 1) = 255/15 = 17.00. The mean has increased by nearly $2.50! Median: To find the new median, list the values in ascending order. Now that there are 15 values, the median is the middle (or eighth) value. 8, 8, 8, 10, 10, 10, 16, 18, 18, 18, 18, 20, 20, 22, 51 The new median is 18.00, so the median has increased by 1. Standard deviation: Because the mean has changed, the standard deviation calculation will change as well, to about 10.31. This means that most of the employees should be making between $6.69 and $27.31 (one standard deviation below and above the mean).
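A quick way to see the effect of the added value is to compute all three statistics before and after. This sketch uses Python's standard `statistics` module. It assumes the population formula (`pstdev`), which appears to be the one that reproduces the 10.31 quoted above. The variable names are made up.

```python
import statistics

original = [8, 8, 8, 10, 10, 10, 16, 18, 18, 18, 18, 20, 20, 22]
with_outlier = original + [51]

for label, data in [("without 51", original), ("with 51", with_outlier)]:
    mean = statistics.mean(data)
    med = statistics.median(data)
    sd = statistics.pstdev(data)  # population standard deviation (assumed)
    print(f"{label}: mean={mean:.2f} median={med:.2f} sd={sd:.2f}")
```

The mean and the standard deviation both move noticeably when 51 is added, while the median moves by only 1, which is what it means for the median to be resistant.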
NBA salaries (US$, millions): 17.1, 5.8, 5.0, 4.5, 4.3, 4.2, 3.1, 2.1, 2.0, 1.0, 1.0, 0.8, 0.7, 0.3. Find the five-number summary of the data set.
Minimum: 0.3; Quartile 1: 1.0; Median: 2.6; Quartile 3: 4.5; Maximum: 17.1
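The answer can be checked with a small Python sketch. The helper names are hypothetical, and the quartiles follow the median-of-halves rule used in this set. A library routine with a different quartile convention may not give the same Q1 and Q3.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the overall median is left out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

nba_salaries = [17.1, 5.8, 5.0, 4.5, 4.3, 4.2, 3.1, 2.1, 2.0, 1.0, 1.0, 0.8, 0.7, 0.3]
minimum, q1, med, q3, maximum = five_number_summary(nba_salaries)
print(minimum, q1, round(med, 2), q3, maximum)
```

With 14 values, the median is the average of the 7th and 8th sorted values (2.1 and 3.1), and each half contains 7 values.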
Does the value of the mean influence the magnitude of the standard deviation?
No. Imagine adding 5 to every value in the data set. That would change the value of the mean, but it would not change the spread of the data.
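The thought experiment is easy to try. In this sketch the five data values are made up, and `statistics.pstdev` (population standard deviation) is used as one reasonable choice. The sample version behaves the same way.

```python
import math
import statistics

data = [8, 10, 16, 18, 22]        # made-up values
shifted = [x + 5 for x in data]   # add 5 to every value

mean_change = statistics.mean(shifted) - statistics.mean(data)
sd_before = statistics.pstdev(data)
sd_after = statistics.pstdev(shifted)

print(mean_change)                        # the mean moves up by about 5
print(math.isclose(sd_before, sd_after))  # True: the spread is unchanged
```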
Does standard deviation depend on the size of the data set?
No. Because it is the average distance from the mean, adding more values does not necessarily change the standard deviation.
Mathematically, how are outliers found using the interquartile range?
Outlier < Q1 − 1.5(IQR) → anything less than this value is an outlier; Outlier > Q3 + 1.5(IQR) → anything greater than this value is an outlier. A point needs to meet only one of the two conditions.
Sample
Part of the population from which information is collected; used to draw conclusions about the entire population
SOCS
Shape, Outliers, Center, Spread
Bimodal
The data have exactly two clear modes, shown by two peaks of similar size on the graph
Multimodal
The data have multiple modes, shown by more than two peaks of similar size on the graph
Unimodal
The data set has one clear mode, shown by one peak on the graph
Population
The entire group of individuals about which we want information
Median
The median is the number that falls in the middle when the numbers are arranged in order from least to greatest
Mean
The most common measure of center is the mean, which is the arithmetic average of a set of data
Modes
The number that occurs most frequently in a set; can be used to describe data as the number of peaks represented in a display; peaks represent possible modes
First quartile (Q1)
The point at which 25% of the data is below that point and 75% is above that point
Second quartile (median)
The point at which 50% of the data is below and 50% is above that point
Third quartile (Q3)
The point at which 75% of the data is below and 25% is above that point
Percentiles
The values that divide a rank-ordered set of elements into 100 equal parts
Outlier Example (Part 2)
Using a boxplot to represent data graphically can often help you to recognize outliers. In a boxplot, outliers fall significantly below the first quartile, or significantly above the third quartile. We measure the significance according to the interquartile range (IQR), which is Q3 − Q1, and it is another measure of spread. An outlier is a point that falls more than 1.5 times the IQR above the third quartile or below the first quartile. Lower limit: Outlier < Q1 − 1.5(IQR). Upper limit: Outlier > Q3 + 1.5(IQR).
A percentile
A value that describes how one value in a data set compares with all other values in the set