5.04 Data Sets and Outliers
Example 1: Sylvia made a dot plot to display the hourly pay of all of the company's employees, including drivers, warehouse workers, sales reps, and managers. The boss claimed that Sylvia's salary is higher than the average salary. Is he correct? Find the mean and compare it to Sylvia's pay rate of $16 per hour.
(8 + 8 + 8 + 10 + 10 + 10 + 16 + 18 + 18 + 18 + 18 + 20 + 20 + 22) / 14 = 204 / 14 ≈ 14.57
The average salary is about $14.57 per hour, which is less than Sylvia's salary of $16 per hour. Sylvia's boss is correct.
Formulas
An outlier is a value that is more than 1.5 times the IQR below the first quartile, or more than 1.5 times the IQR above the third quartile.
Lower Limit: Outlier < Q1 − 1.5(IQR)
Upper Limit: Outlier > Q3 + 1.5(IQR)
To find the limits for the outliers: find the IQR, multiply it by 1.5, then subtract this number from Q1 and add it to Q3.
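The two limits can be sketched in Python. The function name `outlier_limits` is an assumption made for illustration; the sample quartiles Q1 = 10 and Q3 = 20 are the ones used for Sylvia's company when the owner's wage is included.

```python
def outlier_limits(q1, q3):
    """Return the (lower, upper) outlier limits from the two quartiles."""
    iqr = q3 - q1        # interquartile range
    step = 1.5 * iqr     # 1.5 times the IQR
    return q1 - step, q3 + step


lower, upper = outlier_limits(10, 20)
print(lower, upper)      # -5.0 35.0
print(51 > upper)        # True: a wage of 51 is above the upper limit
```

Any value below the lower limit or above the upper limit would be flagged as an outlier.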
Comparing Data: Sylvia's friend Carlos also works for a vending machine company and has given her the salary data for his company's employees. While Sylvia has determined that she is underpaid within her company, she wants to compare data across the industry to see how her salary relates. She made a dot plot for Carlos's company. Then, she drew two box plots on the same number line. Note that Sylvia did not include the owner's salary information. https://lti.flvsgl.com/flvs-cat-content/nkctr37dsd4up5slbncu31uu0o/flvs-cat-session/temp_project_03/module05/lesson04/images/05_04_05_dotplot2.jpg Compare the salaries between the two companies in terms of center, spread, and shape.
Center: Sylvia's company has a mean salary of $14.57 and the mean at Carlos's company is $14.53, which is nearly identical. The median salary at Sylvia's company is $17, as compared to $15 at Carlos's company.
Spread: Both companies have the same range and nearly the same IQR: 8 at Sylvia's company and 9 at Carlos's.
Shape: In looking at the dot plots, the salaries at Sylvia's company are concentrated at the low and high ends, with a gap in the middle. At Carlos's company, there is also a cluster of values at the bottom, but the salaries are more evenly distributed throughout the range and the IQR. On the box plots, Carlos's company's median wage falls roughly in the middle of the two end points, whereas Sylvia's median lies toward the top end of the range. In Carlos's distribution, the mean and median are nearly equal, which suggests that the data set has a symmetrical distribution. Sylvia's data is not symmetrical and is skewed to the left. This implies that the mean is less than the median and is not a good descriptor of center.
Center, Spread, and Shape: A distribution is described by its center, spread, and shape. Let's look at an example. A student named Niko decides to collect data by asking 32 random students how long they thought Field Day at their school should last. https://lti.flvsgl.com/flvs-cat-content/nkctr37dsd4up5slbncu31uu0o/flvs-cat-session/temp_project_03/module05/lesson04/images/05_04_02_01.png
Center: Take a look at the data table. Do you notice anything about the data? Look at where most of the tally marks are located. This area shows the most common response. In statistics, this is known as the "center of the distribution." It is important to know that the center of the distribution is not the center of a table or of a graph. The location of the center depends on the responses. There are various ways to measure the center of the distribution. However, without doing any calculations, you can probably guess that the center in Niko's data is around 4 hours. This means a typical student may choose a response around 4 hours as the ideal amount of time for Field Day.
Spread: The spread of the data describes how the data vary. There are two different ways to look for spread (also known as "variability" or "scatter") in a data table. First, you can look at the overall extent of the data: the data in Niko's table vary from 0 hours to 5 hours, so the tally marks are spread between the values of 0 and 5. Second, you can look at how far the data are from the center of the distribution. Some of the values on the table, such as 0 hours or 1 hour, are far from the center.
Shape: The shape of the data refers to the overall look of the data. You can see the data vary from one end of the data table to the other. There are a lot of tally marks at the bottom of the table and very few tally marks at the top of the table. If the majority of the data were grouped in the middle of the table, the distribution would be symmetric. The shape would rise toward the middle and then fall. Niko's data is not symmetric.
Sum It Up
Data sets can be analyzed in terms of center (using the mean and median), spread (using the range, IQR, and standard deviation), and the overall shape of the data.
If the mean and median are about the same, this suggests the data are symmetrically distributed.
If the mean is less than the median, this suggests the data are skewed to the left (tail on the left).
If the mean is greater than the median, this suggests the data are skewed to the right (tail on the right).
Since outliers in a data set stand out, you may be able to identify them just by looking at the data. However, outliers are formally defined according to the interquartile range (IQR). Including outliers during data analysis can significantly affect the mean and standard deviation. Therefore, if you have evidence to suggest that the outliers are not typical measurements, you may want to throw them out. Removing outliers generally changes the mean and standard deviation, but does not significantly affect the median or IQR.
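As a rough illustration of that last point, the sketch below summarizes the six-value data set from the Outliers section (1, 50, 51, 52, 54, 56) with and without the value 1. The helper name `summarize` is hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the data, which is an assumption about the convention (other quartile methods can give slightly different values).

```python
import statistics

def summarize(values):
    """Return mean, median, population standard deviation, and IQR."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])               # median of lower half
    q3 = statistics.median(data[len(data) - half:])   # median of upper half
    return {
        "mean": round(statistics.mean(data), 2),
        "median": statistics.median(data),
        "stdev": round(statistics.pstdev(data), 2),
        "iqr": q3 - q1,
    }


data = [1, 50, 51, 52, 54, 56]
print(summarize(data))                            # outlier included
print(summarize([x for x in data if x != 1]))     # outlier removed
```

Under these assumptions the mean moves from 44 to 52.6 and the standard deviation drops from about 19.3 to about 2.2, while the median (51.5 to 52) and the IQR (4 to 4.5) barely change.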
Mean
How to Find the Mean: The mean is the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are. In other words, it is the sum divided by the count.
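As a quick check of the sum-divided-by-count rule, here is a minimal Python sketch using the 14 hourly wages from Sylvia's company. The helper name `mean` is only illustrative.

```python
def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)


wages = [8, 8, 8, 10, 10, 10, 16, 18, 18, 18, 18, 20, 20, 22]
print(sum(wages), len(wages))    # 204 14
print(round(mean(wages), 2))     # 14.57
```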
What Do We Do With Outliers?
Outliers may be included in or excluded from the data, depending on the situation, for example when they can be explained or when they would affect the decisions made from the data analysis. Each case must be examined individually, and decisions should be made after analyzing the data both with and without the outliers. Example: 12, 89, 84, 83, 86. Here, 12 is the outlier.
Example 1: Julie bought a catapult that launches water balloons. The company that makes the catapult claims it can launch water balloons from 25 to 30 feet with excellent consistency. She launches 10 water balloons and measures the distance traveled by each, rounding to the nearest foot. https://lti.flvsgl.com/flvs-cat-content/nkctr37dsd4up5slbncu31uu0o/flvs-cat-session/temp_project_03/module05/lesson04/images/05_04_04_dotplot.jpg Find the mean and median distance. What do these numbers tell us about the data?
Mean: (21 + 23 + 25 + 25 + 25 + 27 + 27 + 28 + 29 + 30) / 10 = 260 / 10 = 26
Median: List the numbers in order: 21, 23, 25, 25, 25, 27, 27, 28, 29, 30. The numbers 25 and 27 both fall in the middle, so the median is the average of these two numbers: (25 + 27) / 2 = 26
Since both the mean and median are 26, the data are balanced around the center even though the graph does not look particularly symmetric, with 5 data points on each side of the measures of center.
Example 2: The original mean was about $14.57, and the median was $17.00. Calculate the mean and median, including the owner's salary in the data.
Mean: To find the new mean, add 51 to the total salaries and 1 to the number of employees.
(204 + 51) / (14 + 1) = 255 / 15 = 17.00
The mean has risen by nearly $2.50!
Median: To find the new median, list the values in ascending order. Now that there are 15 values, the median is the middle (or 8th) value.
8, 8, 8, 10, 10, 10, 16, 18, 18, 18, 18, 20, 20, 22, 51
The new median is $18.00. The median has risen by $1.
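The before-and-after comparison can be reproduced with Python's standard `statistics` module. This is only a sketch; the variable names are assumptions, and the wages are the values listed above.

```python
import statistics

wages = [8, 8, 8, 10, 10, 10, 16, 18, 18, 18, 18, 20, 20, 22]
with_owner = wages + [51]    # add the owner's hourly wage

for label, data in (("without owner", wages), ("with owner", with_owner)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label}: mean {mean:.2f}, median {median:.2f}")
# without owner: mean 14.57, median 17.00
# with owner: mean 17.00, median 18.00
```

One extreme value moves the mean by about $2.43 but the median by only $1.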
Median
Median: The middle number; found by ordering all data points and picking out the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers). Example: The median of 4, 1, and 7 is 4 because when the numbers are put in order (1, 4, 7), the number 4 is in the middle.
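A minimal Python sketch of this definition follows, covering both the odd and even cases. The function name `median` is illustrative; the second call uses the ten launch distances from the catapult example.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # odd count: one middle value
    return (data[mid - 1] + data[mid]) / 2    # even count: average the two


print(median([4, 1, 7]))                                   # 4
print(median([21, 23, 25, 25, 25, 27, 27, 28, 29, 30]))    # 26.0
```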
How can this information be used to evaluate Sylvia's situation? https://lti.flvsgl.com/flvs-cat-content/nkctr37dsd4up5slbncu31uu0o/flvs-cat-session/temp_project_03/module05/lesson04/images/05_04_02_boxplot.jpg
Remember that after adding in the owner's salary, the mean changed from $14.57 to $17 and the median changed from $17 to $18. Let's determine if any of Sylvia's data would be considered an outlier and create a box plot. (Include the owner's wage in the data.)
With the owner's wage included, Q1 = 10 and Q3 = 20, so the IQR is 20 − 10 = 10.
Lower limit: Q1 − 1.5(IQR) = 10 − 1.5(10) = 10 − 15 = −5
Upper limit: Q3 + 1.5(IQR) = 20 + 1.5(10) = 20 + 15 = 35
Since 51 is greater than 35, it is an outlier. We indicate the outlier with a dot, and only draw the whisker to the next greatest value, 22. https://lti.flvsgl.com/flvs-cat-content/nkctr37dsd4up5slbncu31uu0o/flvs-cat-session/temp_project_03/module05/lesson04/images/05_04_02_boxplot2.jpg
Standard Deviation
Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean or expected value). A low standard deviation means that most of the numbers are close to the average, while a high standard deviation means that the numbers are more spread out. To calculate the standard deviation:
1) Work out the mean (the simple average of the numbers).
2) Then for each number: subtract the mean and square the result.
3) Then work out the mean of those squared differences.
4) Take the square root of that and we are done!
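The four steps can be sketched in Python as follows. Because step 3 divides by the full count, this is the population standard deviation, which the standard library also provides as `statistics.pstdev`. The function name `standard_deviation` is an assumption, and the sample data are the ten launch distances from the catapult example.

```python
import math
import statistics

def standard_deviation(values):
    mean = sum(values) / len(values)                      # step 1: the mean
    squared_diffs = [(x - mean) ** 2 for x in values]     # step 2: squared differences
    variance = sum(squared_diffs) / len(squared_diffs)    # step 3: mean of those
    return math.sqrt(variance)                            # step 4: square root


distances = [21, 23, 25, 25, 25, 27, 27, 28, 29, 30]
print(round(standard_deviation(distances), 2))     # 2.61
print(round(statistics.pstdev(distances), 2))      # 2.61, same result
```

Here the mean is 26, the squared differences add up to 68, and the square root of 68 / 10 = 6.8 is about 2.61 feet.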
Skew Versus Outliers
Take a look at the data set below:
34, 56, 43, 21, 63, 60, 38, 47, 55
At first glance, it may appear that 21 is an outlier in this data set. Now let's order the data set from least to greatest:
21, 34, 38, 43, 47, 55, 56, 60, 63
Now that 21 is no longer placed between 43 and 63, it does not look as out of place as before. It is not an outlier, but the data are skewed. Most of the values are in the 20 units between 43 and 63, so 21 causes a tail. Even though the data set has a wide spread, you can see the values are relatively close to each other. Therefore, this data set does not contain an outlier, because 21 is only 13 units from 34. If the values 34 and 38 were not in this data set, 21 would be an outlier because it would be a whopping 22 units away from the next data value. Outliers can cause data to be skewed; however, skewed data do not always have outliers.
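The conclusion for the nine-value data set can also be checked against the 1.5 × IQR limits. This is a sketch under one assumption: the quartiles are computed as the medians of the lower and upper halves, leaving the overall median out of both halves when the count is odd. The function name `outlier_limits` is hypothetical.

```python
import statistics

def outlier_limits(values):
    """Return the (lower, upper) 1.5 x IQR limits for a data set."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])               # median of lower half
    q3 = statistics.median(data[len(data) - half:])   # median of upper half
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


data = [34, 56, 43, 21, 63, 60, 38, 47, 55]
lower, upper = outlier_limits(data)
print(lower, upper)                                   # 3.0 91.0
print([x for x in data if x < lower or x > upper])    # [] (no outliers)
```

With Q1 = 36 and Q3 = 58, the lower limit is 3, so 21 falls inside the limits and is part of the tail rather than an outlier.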
Range
The range is the difference between the lowest and highest values. Example: In {4, 6, 9, 3, 7} the lowest value is 3 and the highest is 9, so the range is 9 − 3 = 6.
Inter-Quartile Range (IQR)
The difference between the first and third quartiles. (Note that the first and third quartiles are sometimes called the lower and upper quartiles.) We can find the interquartile range, or IQR, in four simple steps:
1) Order the data from least to greatest.
2) Find the median.
3) Calculate the median of both the lower and upper half of the data.
4) The IQR is the difference between the upper and lower medians.
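The four steps can be sketched in Python as follows. The function name `quartiles` is an assumption, as is the convention that the median itself is left out of both halves when the number of values is odd. The data are the wages at Sylvia's company with the owner's wage of 51 included.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)                             # step 1: order the data
    half = len(data) // 2                             # step 2: split at the median
    q1 = statistics.median(data[:half])               # step 3: lower-half median
    q3 = statistics.median(data[len(data) - half:])   # step 3: upper-half median
    return q1, q3


wages = [8, 8, 8, 10, 10, 10, 16, 18, 18, 18, 18, 20, 20, 22, 51]
q1, q3 = quartiles(wages)
print(q1, q3, q3 - q1)    # 10 20 10  (step 4: IQR = Q3 - Q1)
```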
Outliers Example: Mr. Perez promised his class a pizza party if their average grade on the midterm was at least 80. The 15 students in the class scored a total of 1,191 points, for a mean of 79.4. However, Julian's score of 57 was an outlier in the data. Julian was ill for much of the semester and missed a lot of classes. Find the mean if Julian's score is not included. Does this new mean qualify the class for a pizza party?
To find the new mean, subtract Julian's score from the total points and subtract 1 from the number of students.
(1191 − 57) / (15 − 1) = 1134 / 14 = 81
This new mean gives the class an average over 80, so they would be able to have the pizza party if Julian's score was not included. Should the class get their pizza party? Because Julian missed so much school, his score can be explained by an unusual circumstance, and it is appropriate to throw it out as an outlier. Without his score, the class has a high enough average to earn the party. It is not fair to punish the rest of the class for this circumstance.
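The same arithmetic works whenever only the total and the count are known. The helper name `mean_without` below is hypothetical, and the numbers are the ones from Mr. Perez's class.

```python
def mean_without(total, count, removed):
    """Mean after removing one value from a known total and count."""
    return (total - removed) / (count - 1)


print(round(1191 / 15, 1))           # 79.4, original class mean
print(mean_without(1191, 15, 57))    # 81.0, mean without Julian's score
```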
Skew https://lti.flvsgl.com/flvs-cat-content/nkctr37dsd4up5slbncu31uu0o/flvs-cat-session/temp_project_03/module05/lesson04/images/05_04_02_3.png https://lti.flvsgl.com/flvs-cat-content/nkctr37dsd4up5slbncu31uu0o/flvs-cat-session/temp_project_03/module05/lesson04/images/05_04_02_4.png
When data are displayed graphically, some distributions are not symmetric. Sometimes, they can have a long tail that trails off to one side, which is called "skew." Distributions with a long tail on the right are said to be "skewed right." This means the tail of values on the right side pulls data measurements to the right, away from the bulk of the values. Distributions with a long tail on the left are said to be "skewed left." This means the tail of values on the left pulls data measurements to the left, away from the bulk of the values. The direction of the pull is in the direction of the tail.
Outliers
Whenever data are collected to answer a statistical question, the data have a distribution. The distribution of data simply shows how often each value in the data set occurs. For some data sets, some values can be much larger or smaller than the other data values. Such a value is called an outlier. It lies outside most of the other values in the data set. This, in turn, can also affect the shape of the distribution. Sometimes, this can cause the data to be lopsided. Example: 1, 50, 51, 52, 54, 56. In this example, 1 is the outlier.