DS - Chapter 4, 5, and 6
5. 1. 27 This chart to the right summarizes explanations given for missing work. The data are the explanations given for 100 absences by employees on the assembly line, administration, and supervising managers. The explanations are classified as medical, family emergency, or other. (a) For which group are absences due to medical reasons most common? (b) Are the two variables associated? How can you tell?
(a): Assembly Line (b): Yes, because the bars and the segments of the given chart are not approximately identical for the three groups
5. 1. 47 The table below shows percentages of men and women employed in four industries. Use the table to complete parts (a) through (c) below.

Industry      Men    Women
Advertising   33%    67%
Publishing    36%    64%
Law firms     42%    58%
Banking       64%    36%

(a) Is there association between the gender of the employee and the industry? How can you tell? (b) Interpret the association (or lack thereof). Select all that apply.
(a): F. Yes, because the percentages differ among the columns. (b): A. It means that the percentages of men in different industries are not all the same. E. One consequence is that some industries have a higher proportion of male employees than female employees, assuming that approximately 50% of the workforce is male. F. One consequence is that some industries have a higher proportion of female employees than male employees, assuming that approximately 50% of the workforce is female. *Use JMP and create a table* Find the chi-squared statistic and Cramer's V if these data were derived from n=400 employees, with 100 in each industry. Repeat the process for data derived from n=1,200 employees, with 300 in each industry. Which statistic changes? Which remains the same? For n=400 employees, with 100 in each industry, χ² = 23.92. (Round to two decimal places as needed.) For n=400 employees, with 100 in each industry, V = 0.24. (Round to two decimal places as needed.) If instead the data were derived from n=1,200 employees, with 300 in each industry, which statistic changes and which remains the same? Select the correct choice below and fill in the answer box to complete your choice. (Type an integer or a fraction.) C. Chi-squared is 3 times larger and Cramer's V is unchanged.
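The JMP result can be checked by hand. The sketch below is one possible way to do that in Python, not the course's method: it rebuilds the counts from the percentages in the question, then computes chi-squared and Cramer's V. The function names (`chi_squared`, `cramers_v`, `counts`) are made up for this illustration.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic for a table of counts (a list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were not associated
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramer's V = sqrt(chi-squared / (n * min(rows - 1, cols - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_squared(table) / (n * k))

# Percent men, percent women in each industry, taken from the question
percents = [(33, 67), (36, 64), (42, 58), (64, 36)]

def counts(per_industry):
    """Turn the percentages into counts for a given number of employees per industry."""
    return [[m * per_industry // 100, w * per_industry // 100] for m, w in percents]

small = counts(100)   # n = 400
large = counts(300)   # n = 1,200

print(round(chi_squared(small), 2), round(cramers_v(small), 2))
print(round(chi_squared(large), 2), round(cramers_v(large), 2))
```

With 100 employees per industry this gives χ² = 23.92 and V = 0.24, matching the answers above. Tripling every count triples chi-squared but leaves V alone, because V divides chi-squared by n.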
4. 1. 1 Mode
A mode of a numerical variable identifies the position of an isolated cluster of values in the distribution. For a categorical variable, the mode is the most common category. In either case, the mode is the position of a peak in the histogram.
5. 1. 21 T/F If the percentage of female job candidates who are hired is larger than the percentage of male candidates who are hired, then there is association between the categorical variables Sex (male, female) and Hire (yes, no)
Answer: The statement is true. Reason= Two categorical variables are associated if the conditional distribution of one variable depends on the value of the other. Since the Hire variable depends on the Sex variable, the two variables are associated.
5. 1. 33 A study of purchases at a 24-hour supermarket recorded two categorical variables: the time of the purchase (8 A.M to 8 P.M vs. late night) and whether the purchase was made by someone with children present. Would you expect these variables to be associated?
Answer: Yes. Fewer shoppers with children present would be expected during late night. Reason= Late night purchases made by someone with children present are going to be much less common than daytime purchases made with children present, because most children are asleep at night. Thus, it is likely that these two variables will be associated.
4. 1. 23 If the median size used by 550 songs is 3.5 MB, will these all fit on a device that has 2 GB of storage? Can you tell?
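From the given information, it is not possible to determine if these songs will fit on the device. Reason= The total size is the number of songs times the mean size, and the median alone does not determine the mean.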
4. 1. 17 False: (The empirical rule indicates that the range from ȳ − s up to ȳ + s holds two-thirds of the distribution of any numerical variable.) True: (The empirical rule works well only when the distribution of the numerical variable is unimodal and symmetric.)
If the distribution of a numerical variable is bell-shaped (symmetric and unimodal), the empirical rule uses the standard deviation s to describe how the data cluster around the mean ȳ. According to the empirical rule, 68% (about two-thirds) of the data lie within one standard deviation of the mean, 95% of the data lie within two standard deviations, and almost all of the data fall within three standard deviations of the mean. Since not all distributions are symmetric, the statement is false.
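A quick simulation shows why the rule needs a bell shape. This is only a sketch with simulated data (not data from the textbook): `bell` is drawn from a normal distribution and `skewed` from an exponential one, and `share_within` is a helper name invented here.

```python
import random
import statistics

def share_within(data, k):
    """Fraction of the data lying within k standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    inside = sum(1 for y in data if mean - k * s <= y <= mean + k * s)
    return inside / len(data)

random.seed(0)  # fixed seed so the simulated data are repeatable
bell = [random.gauss(50, 10) for _ in range(10_000)]        # symmetric, unimodal
skewed = [random.expovariate(1.0) for _ in range(10_000)]   # strongly right-skewed

for name, data in [("bell", bell), ("skewed", skewed)]:
    print(name, round(share_within(data, 1), 3), round(share_within(data, 2), 3))
```

For the bell-shaped data the share within one standard deviation should come out near 68%. For the skewed data it is noticeably higher, so the two-thirds figure does not hold for every numerical variable.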
4. 1. 7 Variance
The average squared deviation from the mean is called the variance. The symbol s² is used for the variance, with the exponent 2 to remind you that the deviations are squared before averaging them.
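A small worked example of the definition, using made-up values. One assumption to flag: the "average" here divides by n − 1 (the usual sample variance), which is also what Python's `statistics.variance` does.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, not from the textbook

mean = sum(values) / len(values)
squared_deviations = [(y - mean) ** 2 for y in values]

# Sample variance: sum of squared deviations divided by n - 1 (assumed convention)
s2 = sum(squared_deviations) / (len(values) - 1)

print(mean, s2, statistics.variance(values))
```

The hand calculation and `statistics.variance` agree. Taking the square root of s² gives the standard deviation s, which is back in the original units.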
4. 1. 3 Interquartile Range (IQR)
The distance from the lower quartile to the upper quartile on a boxplot is known as the interquartile range. This is the length of the box in the boxplot.
4. 1. 5 Skewed
The extremes at the left and right of a histogram where the bars become short locate the tails of the distribution. If one tail of the distribution stretches out farther than the other, the distribution is skewed.
5. 1. 18 T/F If a variable X is associated with a variable Y, then Y is caused by X.
The statement is false. Association does not imply causation.
6. 1. 25 If the correlation between number of customers and sales in dollars in retail stores is r = 0.6, then what would be the correlation if the sales were measured in thousands of dollars? In euros? (1 euro is worth about 1.2 to 1.5 dollars.)
What would be the correlation if the sales were measured in thousands of dollars? Choose the correct answer below. Answer: C. There would be no change. Reason= Because the correlation has no units, it is unaffected by the scale of measurement. Recall that the correlation has no units and always lies between −1 and +1. What would be the correlation if the sales were measured in euros instead of dollars? Choose the correct answer below. Answer: There would be no change. Reason=Because the correlation has no units, it is unaffected by the scale of measurement. Recall that the correlation has no units and always lies between −1 and +1.
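To see the "no units" point in numbers, the sketch below computes r for the same sales figures in dollars, thousands of dollars, and euros. The store data are invented for illustration (they are not meant to give r = 0.6), and the exchange rate of 1.3 dollars per euro is an assumed value inside the range the question mentions.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical stores: number of customers and sales in dollars
customers = [120, 150, 90, 200, 170, 130]
sales_dollars = [2400.0, 2100.0, 1900.0, 3900.0, 2600.0, 3100.0]

sales_thousands = [s / 1000 for s in sales_dollars]   # thousands of dollars
sales_euros = [s / 1.3 for s in sales_dollars]        # assumed rate: 1 euro = 1.3 dollars

r_dollars = correlation(customers, sales_dollars)
r_thousands = correlation(customers, sales_thousands)
r_euros = correlation(customers, sales_euros)
print(r_dollars, r_thousands, r_euros)
```

All three values agree up to floating-point rounding, because dividing every sales figure by a positive constant rescales the deviations in the numerator and the denominator by the same factor.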
The data available below give the amount of CO2 produced in 40 nations along with the level of economic activity during a recent year. CO2 emissions are given in millions of metric tons. Economic activity is given by the gross domestic product (GDP), a summary of overall economic output. Complete parts (a) through (d) below.
(a) Make a scatterplot of CO2 emissions and GDP. Which variable have you used as the response and which as the explanatory variable? Which scatterplot below shows the data? *Make a scatterplot with JMP* Which variable have you used as the response and which as the explanatory variable? Answer: The response variable is CO2 emissions, and the explanatory variable is GDP. (b) Describe any association between CO2 emissions and GDP. Answer: D. Ignoring any outliers, there is a strong positive linear association. (c) Find the correlation between CO2 emissions and GDP. Answer: corr(x, y) = 0.661 (Round to three decimal places as needed.) (d) Which cases are outliers? How does the correlation change if outliers are removed? List all the points that are outliers. Use GDP as the first coordinate and CO2 emissions as the second coordinate. Answer: (10034.3, 7888.4), (11438.4, 8589.4) Answer: If the outliers are removed, the correlation increases to 0.876.
5. 1. 39-T A service station near an interstate highway sells three grades of gasoline: regular, plus, and premium. During the last week, the manager counted the number of cars that purchased these types of gasoline. He kept the counts separate for weekdays and the weekend. The accompanying data table has two categorical variables. One distinguishes weekdays from weekends, and the other indicates the type of gas (regular, plus, or premium). Complete parts (a) through (e) below (a) Find the contingency table defined by the day of the week and the type of gas. Include the marginal distributions. Complete the contingency table below, using counts throughout.
(a): Download the data, then input it into JMP; in Distribution, put grades in "y" and day in "by". (b): See the frequencies on the side (in JMP). (c): Now change the distribution: put days in "y" and grades in "by". (d): No, because these conditional distributions are not directly comparable. (e): Weekends; the percentage of all sales on those days that are premium sales is greatest.
4. 1. 51 The figure shows the histogram of the annual tuition at 69 top undergraduate business schools. (a)Estimate from the figure the center and spread of the data. Are the usual notions of center and spread useful for these data? (b)Describe the shape of the histogram. (c)If you were only shown the boxplot, would you be able to identify the shape of the distribution of these data? (d)Can you think of an explanation for the shape of the histogram?
(a): The median is about $14,000 and the interquartile range is about $22,000. No, because they do not capture the bimodal nature of the data. (b): The shape of the histogram is bimodal. (c): No. (d): The first mode represents public schools and the second mode represents private schools.
6. 1. 3 Match the value of the correlation to the data in the scatterplot. (a) r=0 (b) r=0.5 (c) r=0.8 (d) r=−0.6
(a): IV. Reason= When r equals 0, there is no pattern among the points. The scatterplot to the right is an example of a scatterplot with r approximately equal to 0. (b):III. Reason= The larger the magnitude of r, the tighter the points cluster along the diagonal line. The scatterplot to the right is an example of a scatterplot with r approximately equal to 0.5. (c): I. Reason= The larger the magnitude of r, the tighter the points cluster along the diagonal line. The scatterplot to the right is an example of a scatterplot with r approximately equal to 0.8. (d): II. Reason= The larger the magnitude of r, the tighter the points cluster along the diagonal line. The scatterplot to the right is an example of a scatterplot with r approximately equal to −0.6.
5. 1. 24 The accompanying table summarizes the status of 1000 loans made by a bank. Each loan either ended in default or was repaid. Loans were divided into large (more than $50,000) or small size.

         Repaid   Default
Large    30       10
Small    900      20

(a) What would it mean to find association between these variables? (b) Does the table show association? (You should not need to do much calculation.)
(a): Large and small loans have different chances of being repaid. (b): Yes, because the payment statuses among large and small loans are not approximately the same.
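The comparison behind (b) is between the row percentages for the two loan sizes. A minimal sketch using the four counts shown in the table (the helper name `row_percentages` is invented for this example):

```python
# Counts from the table in the question
loans = {
    "Large": {"Repaid": 30, "Default": 10},
    "Small": {"Repaid": 900, "Default": 20},
}

def row_percentages(table):
    """Conditional distribution of loan status within each loan size, in percent."""
    result = {}
    for size, counts in table.items():
        total = sum(counts.values())
        result[size] = {status: 100 * c / total for status, c in counts.items()}
    return result

rates = row_percentages(loans)
for size, dist in rates.items():
    print(size, {status: round(p, 1) for status, p in dist.items()})
```

Large loans default 25% of the time, while small loans default about 2.2% of the time. The conditional distributions differ, so loan size and payment status are associated.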
6. 1. 23 States in a country are allowed to set their own rates for sales taxes as well as taxes on services, such as phone calls. The scatterplot shown below graphs the state taxes charged for wireless phone calls (as a percentage) versus the state sales taxes (also as a percentage). Complete parts a through d. (a) Describe the association, if any, that you find in the scatterplot. There is ___. (b) Estimate the correlation between the two variables from the plot. Is it positive, negative, or zero? Is it closer to zero or to ±0.5? The correlation is ___. It is closer to ___. (c) What is the effect of the cluster of states with no sales tax in the lower left corner on the association? If these were excluded from the analysis, would the correlation change? (d) Would it be appropriate to conclude that states that have high sales tax charge more for services like wireless phone use?
(a): a weak positive association. (b): positive. ; 0.5. (c): The cluster of states increases the correlation. ; A. If the cluster of states were excluded from the analysis, the correlation would decrease. (d): D. No. When the outliers are excluded, the association is too weak to arrive at this conclusion.
4. 1. 11 False: (The boxplot shows the mean plus or minus one standard deviation of the data.) True: (The boxplot shows the median, with the lower edge at the 25th percentile point and the upper edge at the 75th percentile point.)
A boxplot is a graphical summary of a numerical variable that shows the five-number summary of a variable in a graph. Vertical lines locate the median and quartiles. Joining these lines with horizontal lines forms a box. The span of the box locates the middle half of the data, and the length of the box is equal to the interquartile range (IQR).
6. 1. 13 T/F If the correlation between the growth of a stock and the growth of the economy as a whole is close to 1, then this would be a good stock to hold during a recession when the economy shrinks.
Answer: C. False. The value of the stock would fall along with the economy. It would be better to have one that was negatively related to the overall economy. Reason=The correlation can reach 1, but only if all the data fall exactly on a diagonal line. Since the correlation is almost equal to 1, the relationship between the stock and the economy is strong positive. The stock's growth will match that of the economy. During recession, the economy will shrink, or decline, and the stock will also decline. A stock that has a negative relationship, or r<0, would rise during recession.
6. 1. 33 Which data do you think produce a larger correlation between the weight and the price of diamonds: using a collection of gems of various cuts, colors, and clarities, or a collection of stones that have the same cut, color, and clarity?
Answer: D. The correlation is larger among stones of the same cuts, colors and clarities. These factors add variation around the correlation line. By forcing these to be the same, the pattern is more consistent. Reason= It is assumed that the stones with similar attributes will have similar weights and prices, whereas, the gems that vary will be less consistent and have different weights and prices. The similar stones will have a consistent pattern and as such a larger correlation.
4. 1. 45 A survey in 2006 reported that the median household net worth in a country was $93,100 in 2004. In contrast, the mean household net worth was $448,200. How is it possible for the mean to be so much larger than the median?
Answer: The distribution of income is very right-skewed, with the upper tail reaching out to very high incomes. Reason= Income distributions are typically right-skewed, with the upper tail reaching out to very high incomes. As a result the mean is much larger than the median.
5. 1. 19 T/F If the categorical variable that identifies the supervising manager is associated with the categorical variable that indicates a problem with processing orders, then the manager is causing the problems.
Answer: False. Due to the possible presence of a lurking variable, association cannot be interpreted as causation. Reason= In all data where there is an association, it is possible that a lurking variable (a concealed variable that affects the apparent relationship between two other variables) exists. Thus, it cannot be assumed that association indicates causation. In the given situation, it cannot be assumed that the association between the supervising manager and a problem with processing orders indicates that the manager is causing the problems.
4. 1. 9 z-score
A z-score is the distance from the mean of a set of data, counted as a number of standard deviations.
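In symbols, z = (y − ȳ)/s. A short sketch with made-up values; `z_scores` is a name chosen for this example:

```python
import statistics

def z_scores(data):
    """Distance of each value from the mean, counted in standard deviations."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(y - mean) / s for y in data]

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, not from the textbook
z = z_scores(values)
print([round(v, 2) for v in z])
```

Values below the mean get negative z-scores and values above it get positive ones. The z-scores themselves always have mean 0 and standard deviation 1.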
5. 1. 46 After a collapse of the stock market, a business newspaper polled its readers and asked whether they expected another big drop in the market during the next 12 months. A contingency table of the responses is available below. (a)Quantify the amount of association between the respondents' stock ownership and expectation about the chance for another big drop in stock prices. (b)Reduce the table by combining the counts of very likely and somewhat likely and the counts of not very likely and not likely at all, so that the table has three rows: likely, not likely, and unsure. Compare the amount of association in this table to that in the original table.
(a) Compute the chi-squared statistic for the table. χ² = 3.0 (Round to one decimal place as needed.) Compute Cramer's V for the table. V = 0.09 (Round to two decimal places as needed.) Describe the association between the respondents' stock ownership and the expectation about the chance for another big drop in stock prices. Choose the correct answer below. = The association is weak. (b) Compute the chi-squared statistic for the reduced table. χ² = 0.5 (Round to one decimal place as needed.) Compute Cramer's V for the table. V = 0.04 (Round to two decimal places as needed.) Describe the association between the respondents' stock ownership and the expectation about the chance for another big drop in stock prices. Choose the correct answer below. = The association is weak. Compare the amount of association in this table to that in the original table. = The associations in both tables are weak.
6. 1. 39-T Each month, a government agency releases its latest estimate of construction activity in the housing industry. A key measure is the percentage change in the number of new homes under construction. Does the release of this number come with a change in the stock market? The accompanying data show the percentage change in the number of new, privately owned housing units started each month, as reported by the government agency. The data also include the percentage change in the S&P 500 index on the same day the government agency releases the housing results. Complete parts a through c.
(a) Draw the scatterplot of the percentage change in the S&P 500 index on the percentage change in the number of new housing units started. Describe the association. *Use JMP to create scatterplot* Answer: B. Describe the association. Choose the correct answer below. Answer: E. There is little or no association. (b) Find the correlation between the two variables in the scatterplot. What does the size of the correlation suggest about the strength of the association between these variables? *Have JMP show the line; the answer is under Summary Statistics, for Correlation and Value* Answer: −0.101 (Type an integer or decimal rounded to three decimal places as needed.) Describe the correlation. Choose the correct answer below. Answer: C. The size of the correlation suggests the association is slightly negative. (c) Suppose you know that there was a 5% increase in the number of new homes. From what you've seen, can you anticipate movements in the stock market? Answer: C. No, because the association is too weak to be useful for prediction.
6. 1. 31 The timeplot shown to the right shows the values of two indices of the economy in a country: Inflation (left axis, in red, measured as the year-over-year change in a price index) and the Survey of Consumer Sentiment (right axis, in blue, from a university). Both series are monthly and cover the time period January 2004 through December 2005. Complete parts a through e.
(a) From the chart, do you think that the two sequences are associated? Answer: C. Yes. Overall, they appear to move in opposite directions. (b) The scatterplot shown to the right displays the same time series, with Consumer Sentiment plotted versus Inflation. Does this scatterplot change your impression of the association between the two? Answer: D. No. The graph shows a negative association. (c) Estimate the correlation between these two series. Answer: The correlation r is approximately −0.30. (d) When looking at the relationship between two time series, what are the advantages of these two plots? Each shows some things, but hides others. Which helps you visually estimate the correlation? Which tells you the timing of the extreme lows and highs of each series? Answer: B. The timeplot shows the timing of the events. The scatterplot shows the contemporaneous association more clearly and reveals the linear association. (e) Does either plot prove that inflation causes changes in consumer sentiment? Answer: C. No. The plots are only able to show association, not causation. Other factors in the economy could cause both series to move.
6. 1. 47-T The data in the accompanying table describe housing prices near a large city. Each of the 40 data points of this data table describes a region of the metropolitan area. The column labeled Selling Price gives the median price for homes sold in that area during 1999 in thousands of dollars. The column labeled Crime Rate gives the number of crimes committed in that area, per 100,000 residents. Complete parts a through e.
(a) Make a scatterplot of the selling price on the crime rate. Which observation stands out from the others? Is this outlier unusual in terms of either marginal distribution? *Make scatterplot with JMP* Answer: plot with selling price on the y-axis. Review the scatterplot to determine which observation stands out from the others. Compare its x-coordinate to the other x-values and its y-coordinate to the other y-values to determine if this outlier is unusual in terms of either marginal distribution. Select the correct choice below and fill in the answer box to complete your choice. (Type an ordered pair. Use integers or decimals for any numbers in the expression.) Answer: B. The observation at (36.81, 96.719) is an outlier. This data point is unusual in terms of crime rate, but not selling price. (b) Find the correlation using all of the data as shown in the prior scatterplot. Answer: corr(x, y) = −0.224 (Round to three decimal places as needed.) (c) Exclude the distinct outlier and redraw the scatterplot focused on the rest of the data. Does your impression of the relationship between the crime rate and the selling price change? *Redraw the scatterplot. Choose the correct graph below.* Answer: C. Yes. The new scatterplot shows the pattern more clearly. There is now a weak negative trend that appears to be curved. (d) Compute the correlation without the outlier. (Round to three decimal places as needed.) Answer: corr(x, y) = −0.430 (e) Can we conclude from the correlation that crimes in this area cause a rise or fall in the value of real estate? Answer: A. No. The correlation measures the association between the two variables. It does not signify causation.
6. 1. 45-T The accompanying data report characteristics of 15 types of cars sold in a certain country last year. One column gives the official mileage (combined MPG), and another gives the rated horsepower. Complete parts (a) through (e) below.
(a) Make a scatterplot of the two variables. Which variable makes the most sense to put on the x-axis, and which belongs on the y-axis? Choose the correct answer below. Answer: C. Horsepower makes more sense on the x-axis, given that one would want to understand variation on MPG. MPG belongs on the y-axis. Make a scatterplot. Choose the correct graph below. *Use JMP to make scatterplot* Answer: A. (b) Describe any pattern in the plot. Be sure to identify any outliers. Choose the correct answer below. Answer: C. There is a strong, negative linear association with some variation. There are no outliers. (c) Find the correlation between these two variables. Answer: r = −0.852 (Round to three decimal places as needed.) (d) Interpret the correlation in the context of these data. Does the correlation provide a good summary of the strength of the relationship? Answer: C. The correlation shows that as horsepower increases, the MPG decreases. The correlation does provide a good summary of the strength because the scatterplot indicates that there is a strong linear association, with little variation. (e) Use the correlation line to estimate the mileage of a car with 200 horsepower. Does this seem like a sensible procedure? What if the car has a 1.6 liter engine? Answer: The estimated mileage of a car with 200 horsepower is 27 MPG. (Round to the nearest integer as needed.) Does this seem like a sensible procedure? Answer: B. This seems like a sensible procedure because the scatterplot appears to have a linear association. Answer: The estimated mileage of a car with a 1.6 liter engine is 25 MPG.
5. 1. 53 The data to the right compare the on-time arrival performance of two airlines, X and Y. The table shows the status of 13,767 arrivals for one year. Complete parts (a) through (c) below.
(a) On the basis of this initial summary, find the percentages (row or column) that are appropriate to comparing the on-time arrival rates of the two airlines. Which arrives on time more often?

           Airline X        Airline Y
On time    6,064 (80.9%)    5,030 (80.2%)
Delayed    1,431 (19.1%)    1,242 (19.8%)
Total      100%             100%

(Round to one decimal place as needed.) Which airline arrives on time more often? Answer: X (b) The next two tables organize these same flights by destination. The first also shows arrival time and the second shows airline. Does it appear that a lurking variable might be at work here? How can you tell? Select the correct choice below and, if necessary, fill in the answer boxes to complete your choice. Answer: Yes, because the on-time rate for flights to Denver is 81.2%, whereas the on-time rate for flights to Philadelphia is 80.6%. This is important since most of airline X's flights go to Denver, whereas most of airline Y's flights go to Philadelphia. (Round to one decimal place as needed.) (c) Each cell of the following table shows the number of on-time arrivals for each airline at each destination. Is the destination a lurking factor behind the original 2×2 table? Select the correct choice below and fill in the answer boxes to complete your choice. *Find by dividing numbers on this chart, by the second chart in (b)* Answer: Yes, because airline Y has a better on-time arrival rate in Denver (80.9% for X vs. 84.7% for Y) and in Philadelphia (76.8% for X vs. 80.9% for Y).
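The column percentages in part (a) can be reproduced from the four counts. This sketch covers only the overall 2×2 table, since the by-destination counts are not reproduced in these notes; `on_time_rate` is a helper name invented here.

```python
# Arrival counts for the year, from part (a)
arrivals = {
    "X": {"On time": 6064, "Delayed": 1431},
    "Y": {"On time": 5030, "Delayed": 1242},
}

def on_time_rate(counts):
    """Percentage of an airline's arrivals that were on time."""
    total = counts["On time"] + counts["Delayed"]
    return 100 * counts["On time"] / total

total_arrivals = sum(sum(c.values()) for c in arrivals.values())
print(total_arrivals)
for airline, counts in arrivals.items():
    print(airline, round(on_time_rate(counts), 1))
```

The counts add up to the 13,767 arrivals in the question, and the rates come out to 80.9% for X and 80.2% for Y. X looks better overall even though, per part (c), Y does better at each destination. That reversal is what a lurking variable such as destination can produce.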