Statistics Chapter 2

Variance

A measure of how far a set of numbers is spread out. When you are analyzing the variance of a data set, the larger the variance, the larger the spread. In the lesson's example, a variance of 329.72 tells us that the data has a large spread and that the numbers are very different from the mean.

When should the median be used as the center of data for a set of data?

The median should be used as the center of data when the numbers in the data set contain one or more outliers.

Ordinal ranking

A system of ordering where each mathematical value is given a certain position in a sequence of numbers and no positions are equal. Ex: We already ordered our data, so let's take a look at it one more time: 62, 53, 45, 45, 39, 34, 29.5, 28.5, 20. According to the ordinal ranking system, no two values can share a position. Therefore, the competition officials would have to come up with a way to break the tie between the two people who each ate 45 hot dogs, like a sudden death hot dog eating round.

bimodal distribution

The data set has two modes. Ex: Professor Greenfield wants to know the breakdown of the math test. He reviews each test and figures out how many students got each answer right. Then, he creates a histogram to represent this data.

Spread in data

the measure of how far the numbers in a data set are away from the mean or median

standard deviation

the square root of the variance

Symmetrical Representation (a normal distribution is one example)

A representation where the shape of the graph of a data set is mirrored nearly perfectly across a line. In a symmetrical distribution with a single central peak, the mean, median and mode are all the same. Ex: This is a graph of the high temperatures Katelyn observed during the past two weeks. The horizontal axis represents the temperature, while the vertical axis represents the number of days that temperature was observed. Notice that there is a central peak on this graph with each side of the peak mirroring the other. This is a symmetrical data distribution.

Linear transformation

A linear transformation is when a variable is multiplied by a constant and then added to a constant. When using linear transformations on a data set, all values in the data set are transformed. We can transform the data by using the following formula for linear transformations: a + bx. In this case, x = the number in the data set, a = the constant being added to the variable and b = the constant being multiplied by the variable. Ex: Professor Shannon is a science professor at a local university. Recently, Professor Shannon administered a test to one of her science classes. These are the percentage grades each of the students received on their tests: 31, 17, 27, 35, 22, 35, 15, 17, 21, 27, 17. The best way to figure out how to transform this data is to look at the mean, 24. Professor Shannon feels that it would be fair for the average student to score a 75% on this test. If the average score is 24, then we can use the formula for linear transformations with a = 27 and b = 2 to change the average score of 24 to an average score of 75 like this: 27 + 2(24) = 75. Applying the same formula to every score, the highest score changes from a 35 to a 97, and the lowest score changes from a 15 to a 57. Therefore, with the new scores, only one student failed the science test.
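The transformation in this card can be sketched in a few lines of Python. The function name `linear_transform` and the passing mark of 60 are assumptions made for illustration; the card itself only says that one student failed.

```python
def linear_transform(values, a, b):
    """Apply the linear transformation a + b*x to every value."""
    return [a + b * x for x in values]


# Professor Shannon's original percentage grades
scores = [31, 17, 27, 35, 22, 35, 15, 17, 21, 27, 17]

# a = 27 and b = 2 are the constants used in the card
curved = linear_transform(scores, a=27, b=2)

original_mean = sum(scores) / len(scores)
curved_mean = sum(curved) / len(curved)

PASS_MARK = 60  # assumption: the card does not state the passing mark
failed = [score for score in curved if score < PASS_MARK]

print(original_mean, curved_mean)
print(max(curved), min(curved))
print(len(failed))
```

Under these assumptions the mean moves from 24 to 75, and the only curved score below 60 is the 57.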

Method for finding a value in a data set corresponding to a specific percentile

(1) Multiply the total number of values in the data set by the percentile (written as a decimal); this will give you the index. (2) Order all of the values in the data set in ascending order (least to greatest). (3) If the index is a whole number, count the values in the data set from least to greatest until you reach the index, then take the value at the index and the next greatest value and find their average. (4) If the index is not a whole number, round the index up, then count the values in the data set from least to greatest until you reach the index; that value is the answer.

Method for finding the percentile when you know the total number of values in the data set and the rank, or index, of one of those values

(k + .5r) / n = p. In this formula, k = the number of values in the data set below the rank, r = the number of values in the data set equal to the rank, n = the number of values in the data set and p = the percentile.

Gap

A gap is an area in the data set where no observations have been made. Ex: This is a graph of the low temperatures Katelyn observed during the past month. The horizontal axis represents the temperature, while the vertical axis represents the number of days that temperature was observed. Notice that the majority of the observations are concentrated to the left side of the graph. Also, notice that there aren't any days that had a low temperature of 62 degrees. This is called a gap in the data. Therefore, this data is skewed right and has a gap in the distribution.

How to find variance

(1) Find the mean of the set of data. (2) Subtract each number from the mean. (3) Square each result. (4) Add the squared results together. (5) Divide the sum by the total number of numbers in the data set.

When should the mean be used as the center of data for a set of data?

The mean is best used for a data set whose numbers are close together, with no outliers.

At halftime of our game, the Bears quarterback has passes of 3, 8, 9, 12, 12 and 15 yards. Let's find the median pass thrown by the Bears quarterback.

Occasionally there may be an even number of values, which would provide you with two numbers in the middle. If this occurs, you will need to average the two values. The first step is to make sure your numbers are in order from least to greatest, which they are in this problem. The next step is to find the middle number. Since there are six numbers in this set, the middle numbers are the third and fourth values, 9 and 12. Since there are two numbers in the middle, you will average them together. Add the two numbers together: 9 + 12 = 21. Then, divide by 2: 21 ÷ 2 = 10.5. The median of this set of data is 10.5. In the first half, the Bears quarterback had a median passing yardage of 10.5 yards.
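The two cases (one middle value or two) can be sketched as a small Python function. The name `median` here is a hand-written helper for illustration, not a library call.

```python
def median(values):
    """Return the middle value, averaging the two middle values if needed."""
    ordered = sorted(values)           # step 1: least to greatest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]         # odd count: a single middle value
    # even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


passes = [3, 8, 9, 12, 12, 15]         # the quarterback's passes in yards
print(median(passes))
```

For the six passes in this card, the two middle values are 9 and 12, so the function returns 10.5.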

Interquartile Range.

The difference between the upper and lower quartile values. For this method we will have to find each quartile in the data set.

What Can Standard Deviation Tell Me About My Data?

The standard deviation can help us determine if our data is a normal distribution. In a normal distribution, most of your data (about 68%) will fall within one standard deviation of your mean. In the fishing example, the average weight of the fish caught was 56 lbs. The standard deviation was 16.9. So to find the range of where most of the information will be, we will add and subtract the standard deviation to the mean: 56 + 16.9 = 72.9 and 56 - 16.9 = 39.1. This tells us that the majority of data for this set will be between 39.1 and 72.9, which represents one standard deviation of the mean. This is assuming that the data has a normal distribution.
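As a rough check of this idea, the sketch below counts how many of the ten fish weights from the fishing competition card fall within one standard deviation of the mean. It uses the population standard deviation (`statistics.pstdev`), which matches the divide-by-10 calculation in that card; the variable names are illustrative.

```python
import statistics

# Total catch weights (lbs.) from the fishing competition card
weights = [23, 37, 82, 49, 56, 70, 63, 72, 63, 45]

mean = statistics.mean(weights)
sd = statistics.pstdev(weights)        # population standard deviation

low, high = mean - sd, mean + sd
within = [w for w in weights if low <= w <= high]

print(round(low, 1), round(high, 1))
print(len(within), "of", len(weights), "weights are within one standard deviation")
```

Seven of the ten weights land inside the interval, which is close to what a normal distribution would suggest, although ten values are too few to say much about the shape.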

Reporting bias

The type of bias that involves choice of answers for a statistical study

Explain the impact of linear transformation on various measures

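When you use a linear transformation on a data set, the mean and any other measures of center are transformed by the same formula as the data, a + bx, so the new mean is a plus b times the old mean. The standard deviation and any other measures of spread are multiplied by b (more precisely, by the absolute value of b); adding the constant a does not change the spread. In the formula, x = the number in the data set, a = the constant being added to the variable and b = the constant being multiplied by the variable. With the constants used for Professor Shannon's test (a = 27, b = 2), both the center and the spread increase.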
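A short check with Professor Shannon's grades can make the effect on center and spread concrete. This is a sketch using the standard library's `statistics` module; the constants a = 27 and b = 2 come from the linear transformation card, and the variable names are illustrative.

```python
import statistics

scores = [31, 17, 27, 35, 22, 35, 15, 17, 21, 27, 17]
a, b = 27, 2

transformed = [a + b * x for x in scores]   # full transformation a + b*x
shifted = [a + x for x in scores]           # adding the constant only

mean_before = statistics.mean(scores)
mean_after = statistics.mean(transformed)

sd_before = statistics.pstdev(scores)
sd_after = statistics.pstdev(transformed)
sd_shifted = statistics.pstdev(shifted)

print(mean_before, mean_after)   # the mean follows a + b * mean
print(sd_before, sd_after)       # the spread is scaled by b
print(sd_shifted)                # adding a alone leaves the spread unchanged
```

The comparison shows that the added constant moves the center without touching the spread, while the multiplier scales both.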

Standard competition ranking

A ranking system in which the mathematical values that are equal are given equal rank, and the next, lesser value is given the rank it would have had without the tie, which leaves a gap in the ranks. Ex: To figure out Mickey's ranking, we first need to order the data greatest to least. Why greatest to least? In this competition, the more hot dogs you eat, the better. Therefore, the person with the greatest number of hot dogs is the best person. According to this order, the first place person ate 62 hot dogs and second place ate 53, but we have a problem when it comes to third place. There are two contestants that ate 45 hot dogs. According to standard competition ranking, values that are equal are given equal rank. Therefore, both contestants would receive third place, no one would receive fourth place, and the contestant that ate 39 hot dogs would receive fifth place.

Fractional ranking

A system of ordering in which the mathematical values that are equal are given the mean of the ranking positions. Ex: Once again, let's have a last look at our data: 62, 53, 45, 45, 39, 34, 29.5, 28.5, 20. According to the fractional ranking system, the mathematical values that are equal are given the mean of the ranking positions. Therefore, the competition officials would have to take third and fourth place and find the mean, like this: (3 + 4) / 2 = 3.5. So, according to the fractional ranking system, both competitors that ate 45 hot dogs would receive a ranking of 3.5. This isn't really a practical ranking system for this particular competition. However, if you are looking at other forms of data, this may be the best method to use.
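The three ranking systems in this set (ordinal, standard competition and fractional) can be compared side by side on the hot dog data. The function names are made up for this sketch, and breaking ordinal ties by order of appearance is an assumption; the cards suggest a tie-breaker round instead.

```python
def standard_competition_ranks(scores):
    """Equal values share a rank; the ranks after a tie are skipped."""
    return [1 + sum(other > s for other in scores) for s in scores]


def fractional_ranks(scores):
    """Equal values get the mean of the positions they occupy."""
    ranks = []
    for s in scores:
        higher = sum(other > s for other in scores)
        tied = scores.count(s)
        # tied values occupy positions higher+1 .. higher+tied
        ranks.append(higher + (tied + 1) / 2)
    return ranks


def ordinal_ranks(scores):
    """No equal positions; ties broken by order of appearance (assumption)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


hot_dogs = [62, 53, 45, 45, 39, 34, 29.5, 28.5, 20]
print(standard_competition_ranks(hot_dogs))
print(fractional_ranks(hot_dogs))
print(ordinal_ranks(hot_dogs))
```

The only place the three systems disagree is the tie at 45 hot dogs: ranks 3 and 3, 3.5 and 3.5, or 3 and 4.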

unimodal distribution

The data set has one mode. Ex: Professor Greenfield surveyed his students to see how many hours each student studied for the math test. Professor Greenfield came up with the following numbers: 1, 3, 2, 1, 5, 1, 4, 3, 2, 1, 1. Each number represents the number of hours each student spent studying. Look at this chart to see how the information is organized:
# of Study Hours | # of Students
1 | 5
2 | 2
3 | 2
4 | 1
5 | 1
6 | 0
7 | 0
Notice that five of the students spent an hour studying for the math test. That's more than any other number of hours. We can also see this data represented in a histogram.
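The chart of study hours can be rebuilt by counting how often each value appears. This sketch uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# Hours each student spent studying, from Professor Greenfield's survey
hours = [1, 3, 2, 1, 5, 1, 4, 3, 2, 1, 1]

counts = Counter(hours)

# Print the frequency table, one row per number of hours
for hour in sorted(counts):
    print(hour, counts[hour])

# The mode is the value (or values) with the highest count
highest = max(counts.values())
modes = [hour for hour, count in counts.items() if count == highest]
print(modes)
```

Because only one value reaches the highest count, the list of modes has a single entry and the distribution is unimodal.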

maximum

the largest mathematical value in the data set

minimum

the smallest mathematical value in the data set

Holly's professor posts a list of grades, without the names, on the blackboard. There are 15 students in the class and 15 grades on the board. The grades are: 85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97. Dave scored in the 80th percentile. Find his score.

Holly's friend, Dave, scored in the 80th percentile. We can use the same process to figure out Dave's grade. First, multiply the percentile by the number of values in the data set: 15 * .80 = 12. Next, order the grades from least to greatest: 34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86, 87, 94, 97. The 12th number in this list is 86. However, because our index turned out to be a whole number, we need to take the 12th number and the 13th number and find the average: (86 + 87) / 2 = 86.5. From this information, we know that Dave scored at or better than 86.5. We used the average in this step because the percentiles don't always divide out perfectly. Since percentiles tell us that the value is at or better than the rest of the population, we have to use the average in this particular instance. In this case, we know that Dave actually scored better than 86.5 to be in the 80th percentile.

Issues with presenting and interpreting data to audience

Jack can mislead his audience into interpreting his data inaccurately in a few ways: (1) misleading graphs, (2) ranking issues and (3) qualifying issues.
Ranking issues: For example, if Jack said that, of Easter holiday concerns, people ranked bunny rescue as the most important, then you might assume that many people are concerned about the issue of abandoned bunnies. However, if Jack were ranking bunny rescue among other issues such as finding that last Easter egg before it rots and who ate the ears off my rabbit, then of course a living creature would probably rank pretty high. Additionally, ranked statistics don't always specify the sample group. If the people who ranked bunny rescue as number one were the volunteers in Jack's organization, then that would be a very biased statistic.
Qualifying issues: Often, you will hear statistics represented with misleading qualifiers, such as 'On Easter Monday, bunnies are the most abandoned suburban pet.' This is comparing bunnies to other suburban pets on Easter Monday to make the number of abandoned bunnies sound impressive. However, this gives us too little information to properly interpret the data. For example, you may think that there are poor abandoned bunnies everywhere! However, because the qualifiers 'on Easter Monday' and 'suburban pets' are being used in this statement, it really limits the range of abandoned animals that bunnies are being compared with. There may be a lot of abandoned bunnies on Easter Monday, but there may be fewer in the grand scheme of abandoned animals. When a statistic uses qualifiers to narrow down a category while keeping it vague, then it can be very misleading.

According to the U.S. Bureau of Labor Statistics, the gas prices for each month of the year in 2000 were as follows, rounded to the nearest hundredth of a decimal: 1.30, 1.37, 1.54, 1.51, 1.50, 1.62, 1.59, 1.51, 1.58, 1.56, 1.56, 1.49. Calculate mean, median, mode and range

Mean: The mean of a data set tells us on average how much gas cost in the year 2000. We can find the mean by adding all of the numbers up and dividing by 12, which is the number of months in the year and how many numbers we have in this data set. 1.30 + 1.37 + 1.54 + 1.51 + 1.50 + 1.62 + 1.59 + 1.51 + 1.58 + 1.56 + 1.56 + 1.49 = 18.13, and 18.13 / 12 = 1.51 when rounded to the nearest hundredth. 1.51 is the mean for this data set. This number tells us on average the price of gas for the entire year. You will notice that 1.51 appears in the data set. Sometimes you will have an average that does not appear in the data set, but will still show you the big picture of the numbers given.
Median: The median of a data set tells us what number falls directly in the middle. This is useful if you have one or two numbers that are greatly larger or smaller than the rest of the numbers in the data set. If the numbers are all pretty close together, then the mean and the median will be very close to the same number. First, arrange the numbers in either ascending or descending order: 1.30, 1.37, 1.49, 1.50, 1.51, 1.51, 1.54, 1.56, 1.56, 1.58, 1.59, 1.62. Now, eliminate numbers until you are down to the middle, taking one number from each end at a time. We are left with 1.51 and 1.54, with five numbers crossed out on each side. Sometimes you will have data sets that have an odd amount of numbers. When this happens you are left with one number as a median. In this case, we have two numbers because our data set has an even amount of numbers. When you are left with two numbers as the median, you need to find the average by adding the two numbers and dividing by 2: (1.51 + 1.54) / 2 = 1.525, which is 1.53 when rounded to the nearest hundredth. The median is 1.53, and this tells us that half of the data set is greater than 1.53 and half of the data set is less than 1.53. Although it isn't the same number, the mean and the median for this set of numbers are very close, meaning that the numbers in the data set are very close together.
Mode: The mode is the number you will see the most in the data set. While mean and median give you a big picture idea, the mode gives you an idea of what number you are most likely to encounter. Let's look at the numbers that repeat in the data set. There are two 1.51 values and two 1.56 values. In this data set, there are no other numbers that repeat. So in this case, we have two modes: 1.51 and 1.56.
Range: You can find the range in the data set by taking the largest number and subtracting the smallest number. This will show you the spread in the numbers and how much difference there is between them. In this data set, our largest number is 1.62 and our smallest number is 1.30. Subtract those numbers: 1.62 - 1.30 = .32. This is our range. So now we know that over the course of the year 2000, the price of gas fluctuated a total of 32 cents.
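The four measures for the gas prices can be checked with Python's standard `statistics` module. As a design choice for this sketch, the prices are stored in whole cents rather than dollars so that rounding of decimal fractions does not blur the results; the variable names are illustrative.

```python
import statistics

# Monthly gas prices for 2000, converted from dollars to cents
prices_cents = [130, 137, 154, 151, 150, 162, 159, 151, 158, 156, 156, 149]

mean_cents = statistics.mean(prices_cents)        # sum of 1813 divided by 12
median_cents = statistics.median(prices_cents)    # average of the two middle values
modes_cents = statistics.multimode(prices_cents)  # every value tied for most frequent
range_cents = max(prices_cents) - min(prices_cents)

print(round(mean_cents))   # mean to the nearest cent
print(median_cents)
print(modes_cents)
print(range_cents)
```

The median comes out as 152.5 cents, which is the 1.525 dollars that the card rounds up to 1.53. `statistics.multimode` requires Python 3.8 or later.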

Skewed Representation

If your visual representation of a data set is not symmetrical, then it might be skewed, which is where the shape of a graph peaks to the left or the right of the center. Ex: This is a graph of the low temperatures Katelyn observed during the past month. The horizontal axis represents the temperature, while the vertical axis represents the number of days that temperature was observed. Notice that the majority of the observations are concentrated to the left side of the graph. Therefore, this data is skewed right. If the observations were concentrated to the right side instead, the data would be skewed left.

So, our data set is 6, 3, 8, 11, 7. The first thing we need to do is order the data like this: 3, 6, 7, 8, 11. Pause the video here to see if you can find the quartiles and the interquartile range.

Okay, my median in the whole data set is 7 - I didn't have to find the mean this time because my data set contains an odd number of values. My first quartile is 4.5 because I had to find the mean of 3 and 6. My third quartile is 9.5 because I had to find the mean of 8 and 11. The interquartile range for this data set is 9.5 - 4.5 = 5.

Find IQR

(1) Order the data from least to greatest. (2) Find the median of the data set and divide the data set into two halves. (3) Find the median of each of the two halves (the quartiles). (4) Subtract the lower quartile from the upper quartile.
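These steps can be sketched in Python using the data set 6, 3, 8, 11, 7 from the quartile example in this set. The helper name `quartiles` is made up here, and leaving the median out of both halves when the count is odd is an assumption that matches that example; other conventions for quartiles exist and can give different answers.

```python
import statistics


def quartiles(values):
    """Return (lower quartile, upper quartile) as medians of the two halves."""
    ordered = sorted(values)               # step 1: least to greatest
    half = len(ordered) // 2
    lower_half = ordered[:half]            # values below the median
    upper_half = ordered[len(ordered) - half:]   # values above the median
    # step 3: the median of each half is a quartile
    return statistics.median(lower_half), statistics.median(upper_half)


data = [6, 3, 8, 11, 7]
q1, q3 = quartiles(data)
iqr = q3 - q1                              # step 4: upper minus lower quartile
print(q1, q3, iqr)
```

For this data set the halves are 3, 6 and 8, 11, so the quartiles are 4.5 and 9.5 and the interquartile range is 5.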

Bob wants to analyze the data of his sales staff. He wants to send his lowest salesperson to training and give a bonus to his top salesperson. Bob also wants to see if there are any salespeople that are significantly under-performing. First, let's order this data from least to greatest: 50, 52, 53, 67, 80. Okay, so the minimum and maximum values should be pretty easy to identify.

Our maximum value here is 80, and our minimum value is 50. Bob decides to send Jim to sales training, and he gives a bonus to Sally, because he doesn't want to give the bonus to himself.

Let's look at an example data set and calculate the variance. In the annual fishing competition, there were 10 competitors who caught fish. Each participant weighed their total catch and recorded their weights. The total weights of their fish were 23 lbs., 37 lbs., 82 lbs., 49 lbs., 56 lbs., 70 lbs., 63 lbs., 72 lbs., 63 lbs. and 45 lbs. Then calculate the standard deviation.

The first step to calculate the variance is to find the mean of the data set. To calculate the mean, we will add 23 + 37 + 82 + 49 + 56 + 70 + 63 + 72 + 63 + 45, which equals 560. Then, divide 560 ÷ 10 = 56. The average weight of fish caught was 56 lbs. The next step to calculate the variance is to subtract the mean from each value. The best way to set this up is in a table. Looking at the table, you can add a column for the mean to make the subtracting easier. To get these differences, we will now subtract: 23 - 56 = -33; 37 - 56 = -19; 82 - 56 = 26; 49 - 56 = -7; 56 - 56 = 0; 70 - 56 = 14; 63 - 56 = 7; 72 - 56 = 16; 63 - 56 = 7; 45 - 56 = -11. Next, we will take each of these differences and square it. So we will calculate: (-33)^2 = 1,089; (-19)^2 = 361; 26^2 = 676; (-7)^2 = 49; 0^2 = 0; 14^2 = 196; 7^2 = 49; 16^2 = 256; 7^2 = 49; (-11)^2 = 121. Finally, to calculate the variance, we will average these squared differences. To do so, add 1,089 + 361 + 676 + 49 + 0 + 196 + 49 + 256 + 49 + 121 = 2,846. Next, take the total, 2,846, and divide by the 10 data points. 2,846 ÷ 10 = 284.6, so the variance of this data set is 284.6. Calculating the standard deviation is simple once we've found the variance. To find the standard deviation, we will simply take the square root of the variance. The square root of 284.6 is 16.9 when rounded to the tenths place. The standard deviation for the total weights of fish caught was 16.9 lbs.
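The same calculation can be followed step by step in Python. This is a sketch of the divide-by-the-number-of-values (population) variance used in this card; the function name `population_variance` is made up for illustration.

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean."""
    mean = sum(values) / len(values)                    # find the mean
    squared_diffs = [(x - mean) ** 2 for x in values]   # subtract, then square
    return sum(squared_diffs) / len(values)             # add up and divide


# Total catch weights (lbs.) of the 10 competitors
weights = [23, 37, 82, 49, 56, 70, 63, 72, 63, 45]

variance = population_variance(weights)
standard_deviation = math.sqrt(variance)

print(variance)
print(round(standard_deviation, 1))
```

The squared differences add up to 2,846, giving a variance of 284.6 and a standard deviation of about 16.9 lbs.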

limiting question

These questions limit the response of the participant. For example, let's say that one of Jack's survey questions says, 'Every year, thousands of Easter bunnies are abandoned because they are no longer wanted as an Easter gift. If every rescue person were willing to house just two bunnies, we could make a great difference. Are you willing to house two bunnies? Yes or No?' This is a limiting question. The participants of the survey may be willing to house one rescue bunny, but unable to house two. Answering 'yes' or 'no' does not give Jack the full picture of this data and limits the information he can gather. Additionally, there may be some participants who are willing to foster a bunny or spend time volunteering. If Jack is looking for information about how many surveyed participants are interested in getting involved in bunny rescue, then this question limits the information he can gather.

Holly's professor posts a list of grades, without the names, on the blackboard. There are 15 students in the class and 15 grades on the board. The grades are: 85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97. Find Holly's grade. She scored in the 90th percentile

To find Holly's grade, we need to do the following steps: (1) Multiply the total number of values in the data set by the percentile, which will give you the index. (2) Order all of the values in the data set in ascending order (least to greatest). (3) If the index is a whole number, count the values in the data set from least to greatest until you reach the index, then take the value at the index and the next greatest value and find the average. (4) If the index is not a whole number, round the number up, then count the values in the data set from least to greatest until you reach the index. Let's start with step number one. The total number of values in the data set is 15. I found this number by looking at how many students there are in the class. The percentile is 90 because that is the percentile Holly's professor said she scored in. Therefore, 15 * .90 = 13.5. Now, according to step two, I need to order the grades from least to greatest: 34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86, 87, 94, 97. The index is not a whole number, so step four applies: I need to round my index up from 13.5 to 14. Next, I need to count from the smallest number up to the 14th number in the list. The 14th number in this list is 94. That tells us that Holly scored a 94 on her math test and only one person scored higher than she did.
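The steps above can be sketched as a Python function and tried on both percentile questions about this class (the 90th and the 80th). The function name `value_at_percentile` is made up for this sketch, and taking the percentile as a whole number such as 90 instead of .90 is a design choice that keeps the whole-number test on the index exact.

```python
def value_at_percentile(values, percent):
    """Find the value at a percentile given as a whole number from 1 to 99."""
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    ordered = sorted(values)                     # least to greatest
    # index = n * percent / 100, split into whole part and remainder
    whole, remainder = divmod(len(ordered) * percent, 100)
    if remainder == 0:
        # whole-number index: average that value and the next greatest one
        return (ordered[whole - 1] + ordered[whole]) / 2
    # otherwise round the index up; position whole + 1 is list index whole
    return ordered[whole]


grades = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]

print(value_at_percentile(grades, 90))   # index 13.5 rounds up to 14
print(value_at_percentile(grades, 80))   # index 12 is a whole number
```

With 15 grades, the 90th percentile gives the 14th value, 94, and the 80th percentile gives the average of the 12th and 13th values, 86.5.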

Let's put our new skills into practice with an example. Let's find the mean, median, mode and range of how many medals the U.S. has won over the last six summer Olympics.

To find the mean of this data set, we would add 104 + 110 + 101 + 94 + 101 + 108, and then divide by 6 because there are six values. So, 104 + 110 + 101 + 94 + 101 + 108 = 618. And, 618 ÷ 6 = 103. So, over the past six Summer Olympics, the United States has been awarded an average of 103 medals. To find the median, we must first put the data in order from least to greatest. So our numbers in order from least to greatest would be 94, 101, 101, 104, 108, 110. The middle of this data set is actually two numbers (101 and 104). To find the median, we will need to add these two numbers together and divide by 2. 101 + 104 = 205, then dividing by 2 makes the median 102.5. Looking at this data set, we can see that there is only one number that repeats itself, which is 101. This means that the mode of the data set is 101. The range of this data set is found by taking the largest value (110) and the smallest value (94) and subtracting. So, 110 - 94 = 16. The range of this data set is 16.

Dave has a fantasy football league with his coworkers. Dave's team is ranked third out of 35 teams in his office. In what percentile is Dave's team?

We can find this information using the following formula: (k + .5r) / n = p. First, we must consider Dave's rank as x. In this formula k = the number of teams below x, r = the number of teams equal to x, n = the number of teams and p = percentile. The first variable is k, which is the number of teams below Dave's rank at third place. Since there are 35 teams in Dave's fantasy football league, that means there are 32 teams that rank under Dave. Therefore, k = 32. The next variable is r, which is the number of teams equal to Dave's rank at third place. Since there are no ties for third place, we can assume that only one team is in third place. Therefore, r = 1. Finally, we know that n = 35 because there are 35 teams in the league. Now our formula looks like this: (32 + .5(1)) / 35 = p. Solve the equation: 32.5 / 35 = 0.929 when rounded, or 92.9%. Dave's team is in the 93rd percentile. That means that he is doing at least as well as about 93% of the teams in his league.
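The formula can be written as a one-line Python function. The name `percentile_of_rank` is made up for this sketch; the arguments follow the symbols k, r and n used in the card.

```python
def percentile_of_rank(k, r, n):
    """(k + .5r) / n, where k values are below the rank and r are equal to it."""
    return (k + 0.5 * r) / n


# Dave's team: 32 teams below it, 1 team at its rank, 35 teams in total
p = percentile_of_rank(k=32, r=1, n=35)

print(round(100 * p, 1))   # percentile as a percentage, to one decimal place
print(round(100 * p))      # nearest whole percentile
```

For Dave's team this gives 32.5 / 35, which rounds to the 93rd percentile.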

Let's take the data sets from our previous examples and see how variance can make a difference in how we interpret the data. Ruby's class has a total population of six students. The students have the following reading speeds: 12, 8, 10, 10, 8, 12.

We start with step one. I've listed my data set in the first column. We already know the mean, which is 10, so let's put that underneath step one. Step two, we need to take each number from the data set and subtract it from the mean: 10 - 12, 10 - 8, 10 - 10, 10 - 10, 10 - 8 and 10 - 12. Third, square each number. If you look at the first row, you will see that I have 10 - 12 under step two, then the result, -2, is squared. I also have: 2^2, 0^2, 0^2, 2^2 and (-2)^2. Step four, add all of the numbers together. You'll see that under step four, I've listed the results of squaring the numbers: 4, 4, 0, 0, 4 and 4. Add all of those together for a result of 16. Step five: divide the result by the total numbers in the data set. I have six total numbers in my data set, so I will divide 16 by 6, and that will give me a result of 2.67. I am dividing by six here because we are looking at the entire population. The variance 2.67 describes the spread of the numbers throughout the data set. This shows us that all of the numbers are relatively close together.
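The remark about dividing by six because the class is the entire population can be illustrated with Python's standard `statistics` module, which offers both versions. The comparison with the sample variance (dividing by one less than the number of values) is an addition for context and is not part of the original card.

```python
import statistics

# Reading speeds of the six students in Ruby's class
speeds = [12, 8, 10, 10, 8, 12]

# Population variance: sum of squared differences divided by n (16 / 6)
population_var = statistics.pvariance(speeds)

# Sample variance: same sum divided by n - 1 (16 / 5), shown for comparison only
sample_var = statistics.variance(speeds)

print(round(population_var, 2))
print(sample_var)
```

Because Ruby's class is treated as the whole population, the population version is the one that matches the 2.67 in the card.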

Histogram

a graphical representation of the distribution of data

frequency polygon. Purposes?

A line graph created by joining all of the top points of a histogram. Frequency polygons: show the shape of a distribution of data; can be seen with or without the histogram bars; have end points that lie on the x-axis; can be used to compare two sets of data; and are best for comparisons of data that have the same sample size.

Percentile

A measure that indicates what percent of the given population scored at or below the measure. Often, schools and colleges will use percentiles to rank students based on their academic performance. If Holly scored in the 90th percentile, then that means she scored at or better than 90% of her class. Since percentiles are based on a percentage, you will only see percentiles in the range of 0 to 100.

Center of data

a single number that summarizes the entire data set

Outlier

A value that is much larger or smaller than the other values in a data set. Ex: From this chart you can see that our data set is 6, 7, 8, 8, 6, 0 and 7. You might notice that we have several numbers that are close together and one number that is a bit off, which is 0.

