AP Stats Data Analysis 31-60

Given these parallel boxplots, which of the following is true? (A) All three distributions have the same range. (B) All three distributions have the same interquartile range. (C) All three medians are between 9 and 13. (D) All three distributions appear to be skewed right. (E) All three distributions can reasonably be assumed to be samples from normally distributed populations.

(A) All three have the same range: 22 - 2 = 20. The interquartile ranges of the first and third sets are 6, while the interquartile range of the middle set is 4. The medians are 12, 10, and 8, respectively. Only the lower distribution appears to be skewed right. With the given outliers, none of these distributions appears symmetric, and therefore none appears normal (if the sample is small, any outliers in the sample call into question the use of t-procedures, which depend upon a normality/large sample assumption).

Given this histogram, and using the most commonly accepted definition of outliers, what values would be considered outliers? (A) Between 115 and 120. (B) Between 110 and 120. (C) Between 50 and 55, or between 115 and 120. (D) Between 50 and 55, or between 110 and 120. (E) There are no outliers.

(E) Counting boxes (area) we note that Q₁ = 70 and Q₃ = 90. The interquartile range IQR is Q₃ - Q₁ = 90 - 70 = 20. Outliers would be values greater than Q₃ + 1.5 (IQR) = 120 or less than Q₁ - 1.5 (IQR) = 40. In this case, there are no such values.

Which of the following statements is true? (A) Two students working with the same set of data may come up with histograms that look different. (B) Displaying outliers is less problematic when using histograms than when using stemplots. (C) Histograms are more widely used than stemplots or dotplots because histograms display the values of individual observations. (D) Unlike other graphs, histogram axes do not need to be labeled. (E) A histogram of a categorical variable can pinpoint clusters and gaps.

(A) Choice of interval width, and therefore number of bins, changes the appearance of a histogram. Displaying outliers is more problematic with histograms depending on the bin widths. Histograms do not show individual observations. All graphs need to be labeled. Histograms, stemplots, and boxplots make no sense with categorical variables.

Given this back-to-back stemplot, which of the following is incorrect? (A) The distributions have the same mean. (B) The distributions have the same range. (C) The distributions have the same interquartile range. (D) The distributions have the same standard deviation. (E) The distributions have the same variance.

(A) Note that adding 10 to every score in set A results in set B. Thus the means differ by 10, but measures of variability (for example, range, interquartile range, standard deviation, and variance) remain the same.
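
The shift argument above can be checked numerically. The following is a minimal sketch, assuming made-up scores (the original stemplot is not available here); `set_a`, `set_b`, and `spread` are hypothetical names used only for illustration.

```python
import statistics

# Hypothetical scores for illustration; set_b is set_a with 10 added to each value.
set_a = [52, 55, 61, 64, 70, 73, 78]
set_b = [x + 10 for x in set_a]

def spread(data):
    """Return (range, IQR, standard deviation, variance) of a sample."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (max(data) - min(data), q3 - q1,
            statistics.stdev(data), statistics.variance(data))

# The means differ by the shift, while every measure of spread is unchanged.
mean_shift = statistics.mean(set_b) - statistics.mean(set_a)
print(mean_shift)
print(spread(set_a))
print(spread(set_b))
```

Note that `statistics.quantiles` uses one particular quartile convention, which may differ slightly from a textbook's; the invariance under a shift holds for any of them.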

A simple random sample of 25 world-ranked tennis players provides the following statistics: Number of hours of practice per day: x̄ = 7.3, Sx = 1.2. Yearly winnings: ȳ = $1,820,000, Sy = $310,000. Correlation r = .23. Based on this data, what is the resulting linear regression equation? (A) Winnings = 1,390,000 + 59,400 hours (B) Winnings = 1,300,000 + 71,300 hours (C) Winnings = -63,400 + 258,000 hours (D) Winnings = -443,000 + 310,000 hours (E) Winnings = -10,000,000 + 1,620,000 hours

(A) Slope = 0.23 (310,000/1.2) ≈ 59,400 and intercept = 1,820,000 - 59,400 (7.3) ≈ 1,390,000.
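
As a rough check of the arithmetic above, the slope and intercept can be computed from the summary statistics alone. This sketch uses only the numbers given in the question; the variable names are assumptions chosen for readability.

```python
# Summary statistics taken from the question.
x_bar, s_x = 7.3, 1.2              # hours of practice per day
y_bar, s_y = 1_820_000, 310_000    # yearly winnings in dollars
r = 0.23

slope = r * s_y / s_x               # b = r * (Sy / Sx)
intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)

print(round(slope), round(intercept))
```

Using the unrounded slope gives an intercept a little different from the one obtained with 59,400, but both round to about 1,390,000.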

Given the above two histograms, which of the following statements is incorrect? (A) As long as the sample sizes are large enough, the Empirical Rule applies to both sets. (B) The standard deviation of set A is greater than 5. (C) The standard deviation of set B is greater than 5. (D) Both appear to have similar ranges. (E) A greater percentage of the values in set A are at least 10 units from its mean than is the case for set B.

(A) The Empirical Rule applies to bell-shaped data, which set A clearly is not. For bell-shaped data, roughly 95 percent of the values fall within two standard deviations of the mean. However, although distribution B is roughly bell-shaped, less than 95 percent of the data are between 80 and 100 (two standard deviations from the mean if the standard deviation were only 5), so the standard deviation is greater than 5. In set A the data generally are even further from the mean than in set B, so the standard deviation is even greater.

Suppose the correlation between two variables is r = .28. What will the new correlation be if .17 is added to all values of the x-variable, every value of the y-variable is doubled, and the two variables are interchanged? (A) .28 (B) .45 (C) .56 (D) .90 (E) -.28

(A) The correlation is not changed by adding the same number to every value of one of the variables, by multiplying every value of one of the variables by the same positive number, or by interchanging the x- and y-variables.
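
The invariance described above can be illustrated with a small sketch. The data are made up (their correlation is not .28), and `pearson_r` is a hypothetical helper written out by hand so the example depends only on the standard library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

r_original = pearson_r(xs, ys)

# Add .17 to every x, double every y, then interchange the two variables.
new_xs = [x + 0.17 for x in xs]
new_ys = [2 * y for y in ys]
r_transformed = pearson_r(new_ys, new_xs)

print(r_original, r_transformed)
```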

To which of the histograms below can the boxplot correspond? (A) (B) (C) (D) (E)

(B) From the boxplot, we see that the median is 50 and Q₁ = 10. The only histogram which looks to have 50 percent of the area on each side of 50, and 25 percent of the area between 0 and 10, is B.

Using the most commonly accepted definition of outliers, a set has five outliers. If every value of the set is increased by 20 percent, how many outliers will there now be? (A) Fewer than five (B) Five (C) Six (D) More than six (E) It is impossible to determine without further information.

(B) Increasing every value by 20 percent will also increase Q₁, Q₃, and the IQR as well as both Q₁ - 1.5 (IQR) and Q₃ + 1.5 (IQR) by the same 20 percent, so exactly the same values remain outliers.
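
A short sketch of this scaling argument follows, assuming a made-up data set with three outliers rather than five. `count_outliers` is a hypothetical helper, and `statistics.quantiles` applies one particular quartile convention that may differ slightly from a textbook's.

```python
import statistics

def count_outliers(data):
    """Count values outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in data if v < low or v > high)

# Hypothetical data for illustration only.
data = [10, 48, 50, 51, 52, 53, 54, 55, 56, 58, 95, 120]
scaled = [1.2 * v for v in data]   # every value increased by 20 percent

print(count_outliers(data), count_outliers(scaled))
```

Because the quartiles and fences scale by the same factor as the data, each value keeps its position relative to the fences.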

These dotplots for randomly selected male and female students at a particular high school show the number of times per week they eat at fast food restaurants. Which of the following is a true statement? (A) One distribution is roughly symmetric; the other is skewed left. (B) The difference in their means is less than the difference in their medians. (C) The ranges are both 7 - 0 = 7. (D) The standard deviations are the same. (E) Combining the male and female times into one set of students' times will increase the range to the sum of the individual ranges.

(B) The "Male" distribution is roughly symmetric, while the "Female" distribution is skewed right, not left. The medians are 3 and 1, respectively; the means are closer because in the right-skewed distribution the mean is greater than the median, while in the roughly symmetric distribution the mean and median are close. Both have range 5 - 0 = 5. The male distribution is clustered more tightly around its mean, so it has a smaller standard deviation. Combining the male and female times into one set of students' times will result in a new set whose smallest and largest values are still 0 and 5.

A histogram of the educational level (in number of years of schooling) of the adult population of the United States would probably have which of the following characteristics? (A) Symmetry (B) Clusters around 8, 12, and 16 years (C) A gap around 12 years (D) Skewness to the right (E) A normal distribution

(B) There will be clusters around 8 (finished middle school), 12 (finished high school), and 16 (finished college) years of schooling. Small segments of the population will have very few, if any, years of schooling, and so the distribution will be skewed to the left. There are very few normal distributions of individual observations. (Normal distributions in statistical inference mainly arise in sampling distributions.)

A scatterplot of a company's revenues versus time indicates a possible exponential relationship. A linear regression on y = log(revenue in $1,000) against x = years since 2005 gives ŷ = 0.75 + 0.63x with r = .68. Which of the following is a valid conclusion? (A) On the average, revenue goes up 0.63 thousand dollars per year. (B) The predicted revenue for year 2009 is approximately 1,862 thousand dollars. (C) Forty-six percent of the variation in revenue can be explained by variation in time. (D) Sixty-eight percent of the variation in revenue can be explained by variation in time. (E) None of the above are valid conclusions.

(B) log(revenue in $1,000), not revenue, goes up an average of 0.63 per year. log(revenue in $1,000) = 0.75 + 0.63 (4) = 3.27 gives revenue = 10^3.27 = 1,862 thousand dollars. The coefficient of determination, r² = (.68)² = .46, gives that 46 percent of the variation in log(revenue in $1,000), not revenue, can be explained by variation in time.
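
The back-transformation above can be sketched as follows. The coefficients come from the question and the base-10 logarithm follows the answer's use of 10^3.27; the variable names are assumptions.

```python
# Fitted line on the log scale: log10(revenue in $1,000) = 0.75 + 0.63 * x
intercept, slope = 0.75, 0.63
years_since_2005 = 2009 - 2005

log_revenue = intercept + slope * years_since_2005   # prediction on the log scale
revenue_thousands = 10 ** log_revenue                # undo the base-10 log

r = 0.68
r_squared = r ** 2   # share of variation in log(revenue), not revenue, explained by time

print(round(log_revenue, 2), round(revenue_thousands), round(r_squared, 2))
```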

Which of the following statements about the correlation coefficient r is incorrect? (A) It is not affected by changes in the measurement units of the variables. (B) It is not affected by which variable is called x and which is called y. (C) It is not affected by extreme values. (D) It gives information about a linear relationship, not about causation. (E) It always takes values between -1 and 1, even if the association is nonlinear.

(C) Correlation has the formula r = (1/(n - 1)) ∑ ((xᵢ - x̄)/Sx)((yᵢ - ȳ)/Sy). We see that x and y are interchangeable, and so the correlation does not distinguish between which variable is called x and which is called y. The formula is also seen to be based on standardized scores (z-scores), and so changing units does not change the correlation. Since means and standard deviations can be strongly influenced by outliers, the correlation is also strongly affected by extreme values. Correlation measures association, but says nothing about cause-and-effect. The correlation is always between -1 and 1.

Consider the following total sales picture: Which of the following is a true statement? (A) Each year since 2005 the total sales have increased. (B) The average sale has increased during each of the three given time periods. (C) It is possible that the total sales per year decreased every year between 2005 and 2013. (D) This picture may be misleading, but it is still a histogram. (E) To make sales projections, a boxplot would be more informative for this data.

(C) Labeling the horizontal axis with different year spans results in a misleading picture. Taking into account the number of years represented by each class, the actual total sales per year could be decreasing ((200,000/2) = 100,000), ((250,000/3) ≈ 83,333), and ((300,000/4) = 75,000). This picture is really a bar chart; histograms show relative frequencies through relative areas, and this picture doesn't. A boxplot of the yearly total sales amounts would give no indication of a trend.
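
The per-year averages quoted above can be reproduced with a few lines. This is only a sketch of the arithmetic: the totals and spans are the ones in the answer, and the names `totals_and_spans` and `per_year` are assumptions.

```python
# (total sales, number of years in the class), as quoted in the answer.
totals_and_spans = [(200_000, 2), (250_000, 3), (300_000, 4)]

# Average sales per year within each class.
per_year = [total / years for total, years in totals_and_spans]

# The class totals rise while the per-year averages fall.
is_decreasing = all(a > b for a, b in zip(per_year, per_year[1:]))

print([round(v) for v in per_year], is_decreasing)
```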

The amount of Omega 3 fish oil in capsules labeled 1,000 mg is measured for four manufacturers' products yielding the following: Which of the manufacturers' samples has the smallest range? (A) A (B) B (C) C (D) D (E) There is insufficient information to answer this question.

(C) Outliers do influence the range, that is, the range is sensitive to extreme values (while, for example, the interquartile range is resistant to extreme values). The ranges for the four samples are 1,020 - 970 = 50, 1,030 - 980 = 50, 1,020 - 980 = 40, and 1,030 - 970 = 60, respectively.

Which of the following are possible residual plots? (A) I only (B) II only (C) III only (D) I and II (E) I, II, and III

(C) The mean of the residuals is always zero, thus ruling out plot I, and the regression line for a residual plot is a horizontal line, thus ruling out plot II. A residual plot as in plot III may indicate that a nonlinear fit would be better, but it is still a possible residual plot.
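
Both properties of residuals used above can be verified on a small example. The sketch below assumes made-up curved data and a hand-written `least_squares` helper (a hypothetical name); the residuals show a curved pattern yet still average zero and have zero slope against x.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical curved data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 4, 9, 16, 25, 36]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

residual_mean = sum(residuals) / len(residuals)
_, residual_slope = least_squares(xs, residuals)   # regress the residuals on x

print([round(e, 2) for e in residuals])
print(round(residual_mean, 9), round(residual_slope, 9))
```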

Which of the following statements is incorrect? (A) The range of the sample data set can never be greater than the range of the population. (B) While the range is affected by outliers, the interquartile range is not. (C) Changing the order from ascending to descending changes the sign of the range. (D) The range is a single number, not an interval of values. (E) The interquartile range is the range of the middle half of the data.

(C) The sample comes from the population, so the smallest value in the sample cannot be smaller than the smallest value in the population, and, similarly, the largest value in the sample cannot be larger than the largest value in the population. Outliers are extreme values, and while they affect the range, they do not affect the interquartile range (we say the range is sensitive, that is, nonresistant, to extreme values while the IQR is resistant to extreme values). The range is a single number, the largest value minus the smallest value, which is the same no matter how the set is arranged. IQR = Q₃ - Q₁ and the middle half of the data is between the quartiles Q₁ and Q₃.

A doctor wishes to compare the resting heart rates of his younger patients (younger than 30 years old) versus his older patients (older than 30 years old). Which of the following graphical displays is inappropriate? (A) Back-to-back stemplot (B) Parallel boxplots (C) Side-by-side histograms (D) Scatterplot (E) All the above displays are appropriate.

(D) A scatterplot is used to study a relationship between two quantitative variables. If the example had been stated in terms of family members coming into the doctor's office in pairs, one younger and one older person in each pair, then perhaps some information could have been noted from a scatterplot, but this wasn't the case.

Consider the following three scatterplots: Which has the greatest correlation coefficient r? (A) I (B) II (C) III (D) They all have the same correlation coefficient. (E) The question cannot be answered without additional information.

(D) The correlation coefficient is not changed by adding the same number to each value of one of the variables or by multiplying each value of one of the variables by the same positive number.

Suppose a study finds that the correlation coefficient relating job satisfaction to salary is r = +1. Which of the following is a proper conclusion? (A) High salary causes high job satisfaction. (B) Low salary causes low job satisfaction. (C) There is a 100% cause-and-effect relationship between salary and job satisfaction. (D) There is a very strong association between salary and job satisfaction. (E) None of the above are proper conclusions.

(D) The correlation r measures association, not causation.

A random sample of golf scores gives the following summary statistics: n = 20, x̄ = 84.5, Sx = 11.5, minX = 68, Q₁ = 78, Med = 86, Q₃ = 91, maxX = 112. What can be said about the number of outliers? (A) 0 (B) 1 (C) 2 (D) At least 1 (E) At least 2

(D) The interquartile range IQR = Q₃ - Q₁ = 13. The most commonly accepted definition of an outlier is any value greater than Q₃ + 1.5 (IQR) = 110.5 or less than Q₁ - 1.5 (IQR) = 58.5. Looking at the minimum and maximum values, we see that there are no outliers on the lower end, and at least one outlier on the upper end.
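
The fence calculation above takes only a few lines to reproduce. This sketch uses the summary statistics from the question; the variable names are assumptions.

```python
# Five-number summary values taken from the question.
q1, q3 = 78, 91
min_x, max_x = 68, 112

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# From the summary alone we can only tell whether each extreme lies beyond a fence.
low_outlier_exists = min_x < lower_fence
high_outlier_exists = max_x > upper_fence

print(iqr, lower_fence, upper_fence, low_outlier_exists, high_outlier_exists)
```

Since only the maximum is known, the summary shows that at least one value exceeds the upper fence but cannot say how many do.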

A data set includes two outliers, one at each end. If both these outliers are removed, which of the following is a possible result? (A) Both the mean and standard deviation remain unchanged. (B) Both the median and standard deviation remain unchanged. (C) Both the standard deviation and variance remain unchanged. (D) Both the mean and median remain unchanged. (E) Both the mean and standard deviation increase.

(D) When the extreme values are removed, the range, standard deviation, and variance will all decrease. The median, or middle value, remains the same if one extreme value at each end is removed, and it is possible that the mean remains unchanged.
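
A tiny example shows how the mean and median can both stay the same while the spread shrinks. The data below are made up for illustration, with one extreme value at each end placed symmetrically about the center.

```python
import statistics

# Hypothetical data: 1 and 99 are the extreme values, equally far from 50.
data = [1, 48, 50, 52, 99]
trimmed = data[1:-1]   # remove one extreme value from each end

print(statistics.mean(data), statistics.mean(trimmed))
print(statistics.median(data), statistics.median(trimmed))
print(statistics.stdev(data), statistics.stdev(trimmed))
```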

A linear regression analysis is performed on the data from two scatterplots, A and B, resulting in identical least squares regression lines with positive slopes. Which of the following statements is true? (A) The sum of the squares of the residuals in A equals the sum of the squares of the residuals in B. (B) The correlation in A equals the correlation in B. (C) If the sum of the squares of the residuals in A is greater than the sum of the squares of the residuals in B, then the correlation in A will be greater than the correlation in B. (D) If the sum of the squares of the residuals in A is greater than the sum of the squares of the residuals in B, then the correlation in A will be less than the correlation in B. (E) None of the above are true statements.

(E) Even if the regression lines are identical, the points may be more closely or less closely scattered about this line, resulting in different sums of squares of the residuals and different correlations. It might seem that a greater sum of the squares of the residuals leads to a lesser correlation, but this depends upon the number of points. For example, {(1, 1), (2, 3), (3, 2)} has a lesser sum of squares of the residuals and a lesser correlation than {(1, 1), (2, 3), (3, 2), (4, 2), (4, 4)}.
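
The counterexample above can be checked directly. The two point sets are the ones given in the answer; `fit_summary` is a hypothetical helper that returns the least squares line, the sum of squared residuals, and the correlation.

```python
import math

def fit_summary(points):
    """Return (intercept, slope, sum of squared residuals, r) for (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    return intercept, slope, sse, sxy / math.sqrt(sxx * syy)

set_1 = [(1, 1), (2, 3), (3, 2)]
set_2 = [(1, 1), (2, 3), (3, 2), (4, 2), (4, 4)]

summary_1 = fit_summary(set_1)
summary_2 = fit_summary(set_2)
print(summary_1)
print(summary_2)
```

Both sets give the same line, yet the smaller set has both the smaller sum of squared residuals and the smaller correlation.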

Consider n pairs of numbers. Suppose x̄ = 4, Sx = 3, ȳ = 2, and Sy = 5. Of the following, which could be the least squares line? (A) ŷ = 2 + x (B) ŷ = -6 + 2x (C) ŷ = -10 + 3x (D) ŷ = 5/3 - x (E) ŷ = 6 - x

(E) The least squares line passes through (x̄, ȳ) = (4, 2), and the slope b satisfies b = r(Sy/Sx) = (5r/3). Since -1 ≤ r ≤ 1, we have -5/3 ≤ b ≤ 5/3. Only ŷ = 6 - x both passes through (4, 2) and has a slope in this interval.
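
The two conditions above can be applied to each answer choice mechanically. In this sketch the candidate lines are the ones listed in the question; `could_be_ls_line` is a hypothetical helper name.

```python
# Summary statistics from the question.
x_bar, s_x, y_bar, s_y = 4, 3, 2, 5
max_slope = s_y / s_x   # |b| = |r| * Sy / Sx cannot exceed Sy / Sx

# Answer choices as (intercept, slope).
candidates = {
    "A": (2, 1),
    "B": (-6, 2),
    "C": (-10, 3),
    "D": (5 / 3, -1),
    "E": (6, -1),
}

def could_be_ls_line(intercept, slope):
    """Check that the line passes through (x_bar, y_bar) with a feasible slope."""
    passes_through_means = abs(intercept + slope * x_bar - y_bar) < 1e-9
    return passes_through_means and abs(slope) <= max_slope

feasible = [name for name, (a, b) in candidates.items() if could_be_ls_line(a, b)]
print(feasible)
```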

Given two independent random variables, X with mean 12.3 and standard deviation 0.5, and Y with mean 9.1 and standard deviation 0.3, which of the following is a true statement? (A) The mean of X - Y is 21.4. (B) The median of X - Y is 3.2. (C) The range of X - Y is 21.4. (D) The standard deviation of X - Y is 0.8. (E) The variance of X - Y is .34.

(E) The mean of X - Y is E(X - Y) = E(X) - E(Y) = 12.3 - 9.1 = 3.2. The variance of X - Y is var(X - Y) = var(X) + var(Y) = (0.5)² + (0.3)² = 0.25 + 0.09 = 0.34. The standard deviation of X - Y is √0.34. The median and range of X - Y cannot be determined from the given information.
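
The rules for the difference of independent random variables can be sketched numerically. The means and standard deviations are those in the question; the variable names are assumptions.

```python
import math

mean_x, sd_x = 12.3, 0.5
mean_y, sd_y = 9.1, 0.3

mean_diff = mean_x - mean_y        # means subtract
var_diff = sd_x ** 2 + sd_y ** 2   # variances add for independent variables
sd_diff = math.sqrt(var_diff)      # standard deviations do not simply add

print(round(mean_diff, 2), round(var_diff, 2), round(sd_diff, 3))
```

The standard deviation comes out near 0.58, not the 0.8 obtained by adding 0.5 and 0.3.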

When a set of data has suspected outliers, which of the following are preferred measures of central tendency and variability? (A) Mean and standard deviation (B) Mean and variance (C) Mean and range (D) Median and range (E) Median and interquartile range

(E) The mean, standard deviation, variance, and range are all affected by outliers; the median and interquartile range are not.

To the right is a histogram of test scores. Which of the following is a true statement? (A) The median score was 75. (B) If 90 and above was an A, most students received an A. (C) More students scored below 70 than above 90. (D) More students scored above the median than below the median. (E) The mean score is probably less than the median score.

(E) The median score splits the area in half, so it is not 75. The area between 90 and 100 is more than the area between 50 and 70, but less than the area between 50 and 90. The same percentage of the data are above and below the median. With data skewed to the left, the mean is usually less than the median.

An AP Statistics teacher grades using z-scores. On the second major exam of the marking period, a student receives a grade with a z-score of -1.3. What is the correct interpretation of this grade? (A) The student's grade went down 1.3 points from the first exam. (B) The student's grade went down 1.3 points more than the average grade went down from the first exam. (C) The student scored 1.3 standard deviations lower on the second exam than on the first. (D) The student scored 1.3 standard deviations lower on the second exam than the class average on the first exam. (E) The student scored 1.3 standard deviations lower on the second exam than the class average on the second exam.

(E) There is no comparison between first and second exams. The z-score gives the number of standard deviations from the mean, or, in this example, the number of standard deviations from the class average.

Which of the following statements about the correlation r is true? (A) When r = 0, there is no relationship between the variables. (B) When r = .2, 20 percent of the variables are closely related. (C) When r = 1, there is a perfect cause-and-effect relationship between the variables. (D) A correlation close to 1 means that a linear model will give the best fit to the data. (E) All the statements are false.

(E) These are all misconceptions about correlation. Correlation measures only linearity, so when r = 0, there still may be a nonlinear relationship. Correlation shows association, not cause and effect. Curved data can have correlation near 1.

A study of weekly hours of television watched and SAT scores reports a correlation of r = -1.18. From this information, we can conclude that (A) students who watch more TV tend to have lower SAT scores. (B) the fewer the hours in front of a TV, the higher a student's SAT scores. (C) there is little relationship between weekly hours of television watched and SAT scores. (D) there is strong negative association between weekly hours of television watched and SAT scores, but it would be wrong to conclude causation. (E) a mistake in arithmetic has been made.

(E) The correlation r cannot take a value greater than 1 or less than -1.

