DS101 Final answers
The manager of a warehouse monitors the volume of shipments made by the delivery team. The automated tracking system tracks every package as it moves through the facility. A sample of 25 packages is selected and weighed every day. On the basis of contracts with customers, the mean weight should be μ=22 pounds with σ=5 pounds. The standard error of the daily average is SE(Xbar)=1
True
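A quick check of the arithmetic behind this answer, as a minimal sketch. The function name standard_error is made up for illustration; the values σ=5 and n=25 come from the question.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma divided by sqrt(n)."""
    return sigma / math.sqrt(n)

# Values from the warehouse question: sigma = 5 pounds, n = 25 packages.
se_daily_mean = standard_error(sigma=5, n=25)
print(se_daily_mean)  # 1.0
```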
The number of rows in a data table is indicated by the symbol n.
True
The empirical rule indicates that the range from ybar−s up to ybar+s holds two-thirds of the distribution of any numerical variable.
The statement is false because the empirical rule works well only when the distribution of the numerical variable is unimodal and symmetric.
If a variable X is associated with a variable Y, then Y is caused by X.
The statement is false. Association does not imply causation.
The 50th percentile is equivalent to Q1
The statement is false. The 50th percentile is equivalent to Q2
A large VIF (e.g., 10 or more) would indicate multicollinearity.
true
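One way to see where the threshold of 10 comes from, assuming the usual definition VIF = 1/(1 − R²), where R² comes from regressing one explanatory variable on all the others. The helper name below is hypothetical.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """Variance inflation factor for one explanatory variable.

    r_squared is the R^2 from regressing that variable on the
    other explanatory variables in the model.
    """
    return 1.0 / (1.0 - r_squared)

# A VIF of about 10 corresponds to R^2 = 0.90 among the predictors.
vif_at_090 = vif_from_r_squared(0.90)
# With no collinearity at all (R^2 = 0) the VIF is exactly 1.
vif_at_000 = vif_from_r_squared(0.0)
print(round(vif_at_090, 2), vif_at_000)
```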
A line with positive slope describes a linear pattern with a positive direction.
true
A partial F test is used to assess when at least one variable in a subset of squared and interaction variables in the multiple regression model is significant.
true
A scatterplot graphically shows the relationship between two variables.
true
Adjusted R squared is less than regular R squared
true
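A small sketch of why this holds, assuming the standard formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) with k explanatory variables. The numbers R² = 0.80, n = 30, k = 3 are made up for illustration.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k explanatory variables and n cases."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical fit: R^2 = 0.80 with n = 30 cases and k = 3 predictors.
r2 = 0.80
adj_r2 = adjusted_r_squared(r2, n=30, k=3)
print(round(adj_r2, 4))
```

The penalty factor (n − 1)/(n − k − 1) is greater than 1 whenever k ≥ 1, so the adjusted value falls below R² unless R² is exactly 1.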
By increasing the sample size from n=100 to n=400, we can reduce the margin of error by 50%.
true
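The margin of error is proportional to 1/√n, so quadrupling n halves it. A minimal sketch, assuming a z-based margin z·σ/√n; the values z = 1.96 and σ = 10 are illustrative and not from the question.

```python
import math

def margin_of_error(z: float, sigma: float, n: int) -> float:
    """Margin of error of a z-interval for a mean: z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

# Illustrative values (assumptions): z = 1.96, sigma = 10.
moe_100 = margin_of_error(1.96, 10, 100)
moe_400 = margin_of_error(1.96, 10, 400)
ratio = moe_400 / moe_100
print(round(ratio, 3))  # 0.5
```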
Half of the numbers in a correlation matrix are redundant, including the diagonal.
true
If all of the data lie along a single line with nonzero slope, then the r^2 of the regression is 1. (Assume the values of the explanatory variable are not identical.)
true
If the correlation between the explanatory variable and response is zero, then the slope will also be zero.
true
If zero lies inside the 95% confidence interval for μ, then zero is also inside the 99% confidence interval for μ.
true
Prediction intervals get wider as you extrapolate outside the range of the data.
true
Regression predictions become less reliable as we extrapolate farther from the observed data.
true
Opposite of log transformation
exp(x)
The correlation between sales and advertising when both are measured in millions of dollars is 0.65. The correlation remains the same if these variables are converted into thousands of dollars.
true
The correlation coefficient is unit-free.
true
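The unit-free property can be checked numerically. The sketch below computes the Pearson correlation by hand and rescales both variables from millions to thousands of dollars. The data values are hypothetical and do not reproduce the 0.65 in the question.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data in millions of dollars (not from the course).
advertising_millions = [1.0, 2.0, 3.0, 4.0, 5.0]
sales_millions = [12.0, 15.0, 14.0, 19.0, 21.0]

# The same data expressed in thousands of dollars.
advertising_thousands = [1000 * a for a in advertising_millions]
sales_thousands = [1000 * s for s in sales_millions]

r_millions = pearson_r(advertising_millions, sales_millions)
r_thousands = pearson_r(advertising_thousands, sales_thousands)
print(math.isclose(r_millions, r_thousands))
```

The scale factors cancel between the numerator and the denominator, which is why the two results agree.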
The estimated value ŷ = B0 + B1x approximates the average value of the response when the explanatory variable equals x.
true
The explanatory variable defines the x-axis in a scatterplot.
true
The normal probability plot is a residual plot that checks the normality assumption.
true
The primary use of stepwise regression is to identify the most important ________ that should be included in the multiple regression model.
independent variables
A retailer maintains a Web site that it uses to attract shoppers. The average purchase amount is $80. The retailer is evaluating a new Web site that would, it hopes, encourage shoppers to spend more. Let μ represent the average amount spent per customer at its redesigned Web site. If the α-level of the test is α=0.05, then there is at most a 5% chance of incorrectly rejecting H0.
true
Shown in the bar chart of a categorical variable
marginal distribution
Largest α-level for which a test rejects the null hypothesis
p-value
Indicates a statistically significant result
p-value < alpha
Measure of association that lies between 0 and 1
Cramer's V
To estimate the value of p, the population proportion of successes, use the point estimate x.
False. To estimate the value of p, use the point estimate p̂ = x/n.
The value of r^2 is 1 if data lie along a single line. Is it possible to fit a linear regression for which r^2 is exactly equal to zero?
Yes, it is possible to fit a linear regression, but the slope, b1, will be zero.
Measure of association between two categorical variables that grows with increased sample size
chi-squared
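A sketch contrasting the two measures of association, assuming the usual formulas: chi-squared sums (observed − expected)²/expected over the cells, and Cramer's V = √(chi-squared / (n · min(r − 1, c − 1))). The 2×2 table is hypothetical.

```python
import math

def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramer's V, which rescales chi-squared to lie between 0 and 1."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_squared(table) / (n * k))

# Hypothetical table, then the same proportions with 10 times the cases.
small = [[30, 20], [20, 30]]
large = [[10 * count for count in row] for row in small]

print(chi_squared(small), chi_squared(large))
print(round(cramers_v(small), 3), round(cramers_v(large), 3))
```

With the proportions held fixed, chi-squared grows tenfold with the sample size while Cramer's V is unchanged.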
Occurs if the p-value is larger than α when H0 is false
Type II error
The Web site of a photo processor allows customers to send digital files of pictures to be printed on high-quality paper with durable inks. When the inks used by the processor start to run out, the color mix in the pictures gradually degrades. Complete parts (a) and (b) below. (a) How could the facility use a quality control process to identify when it needed to switch ink cartridges? Will it be necessary to check every photo or only a sample of photos? (b) If the facility samples photos, would it do better to group the photos into a large batch and then calculate a mean and standard deviation (for the measured color mix) or only wait a little while before calculating the mean and SD?
(a) Inspect a sample of photos using a system that can assess the color mix. It is not necessary to check every photo as a sample is enough. (b) If the problem is easily detected, only a few photos need to be batched. If the problem is subtle, the facility needs a large batch.
Consider each situation given in parts (a) through (d). Identify the population and the sample, explain what the parameter p or μ represents, and tell whether a confidence interval can be created. If so, indicate what the interval would say about the parameter. (a) Identify the parameter. Choose the correct answer below. (b) Determine whether the confidence interval can be calculated. Select all that apply. (c) If a confidence interval can be created, indicate what the interval would say about the parameter. Choose the correct answer below.
(a) The parameter is p, the proportion of all people who recently bought new kitchen appliances that expressed dissatisfaction with the salesperson. (b) The confidence interval can be calculated because all assumptions and conditions have either been satisfied or it is reasonable to assume that they have been satisfied. (c) The interval would give a range of plausible values in which the parameter is likely to lie.
Which of the following processes would you expect to be under control, and which would you expect not to be under control? Explain briefly why or why not. (a) Daily sales at each checkout line in a supermarket (b) Number of weekday calls to a telephone help line (c) Monthly volume of shipments of video game software (d) Dollar value of profits of a new startup company
(a) This process would likely be under control, unless there's a special sale or weekend shopping surge. (b) This process would likely be under control, unless some problem was discovered in their product, causing a surge in calls. (c) This process would likely be out of control, due to surges during the holiday season. (d) This process would likely be out of control, as sales typically have strong upward (or downward) trends for a startup company.
The correlation coefficient may assume any value between
-1 and +1
Dummy variables take on the values of ________ and are used to model the effects of different levels of qualitative variables.
0 or 1
What does the coefficient of determination equal if r = 0.89?
0.7921
What is a control chart? Describe its use.
A control chart is a graphical device used for monitoring process variation, identifying when to take action to improve the process, and assisting in diagnosing the causes of process variation.
A sales manager for an advertising agency believes that there is a relationship between the number of contacts that a salesperson makes and the amount of sales dollars earned. What is the dependent variable?
Amount of sales dollars
Identifies the intercept in a fitted line
B0
Identifies the slope in a fitted line
B1
The SRM assumes that the model errors have this property. A. Heteroscedasticity B. Scatterplot of Y on X C. Homoscedasticity D. Random sample from a population E. Outlier F. Durbin-Watson statistic G. Leveraged H. Normal quantile plot of residuals I. Timeplot of residuals J. Plot of residuals on x
C
Use this plot to check for dependence in data over time. A. Normal quantile plot of residuals B. Plot of residuals on x C. Timeplot of residuals D. Scatterplot of y on x
C
Percentage variation described by a fitted line
r^2
To identify the presence of curvature, it can be helpful to begin by fitting a line and plotting the residuals from the linear equation. A. This statement is false. Residuals cannot be calculated if an association is not linear. B. This statement is false. Fitting a curve and plotting the residuals from the nonlinear equation will help identify the presence of curvature. C. This statement is false. Residuals do not indicate whether or not an association is linear. D. This statement is true.
D
Use this plot to check the linear enough condition. A. Normal quantile plot of residuals or a scatterplot of y on x B. Plot of residuals on x or a normal quantile plot of residuals C. Normal quantile plot of residuals D. Scatterplot of y on x or a plot of residuals on x E. Timeplot of residuals F. Scatterplot of y on x or a timeplot of residuals G. Timeplot of residuals or a plot of residuals on x
D
An analyst is trying to purchase a large tract of land. The current owner of the tract has already subdivided the land into separate building lots and has prepared the lots by removing some of the trees. The developer wants to forecast the value of each lot. From previous experience, she knows that the most important factors affecting the price of a lot are size, number of mature trees, and distance to the lake. She runs the following multiple regression model for her analysis: Price = β0 + β1 LotSize + β2 Trees + β3 Distance + ε. Identify the dependent and independent variables.
Dependent variable: Price; Independent variables: Lot Size, Trees, Distance
x represents the number of home theater systems sold per month at an electronics store
Discrete variable
Statistic used to detect dependence in sequences of residuals A. Heteroscedasticity B. Leveraged C. Durbin-Watson statistic D. Random sample from a population E. Scatterplot of Y on X F. Timeplot of residuals G. Outlier H. Normal quantile plot of residuals I. Plot of residuals on x J. Homoscedasticity
Durbin-Watson statistic
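For reference, a minimal sketch of how the statistic is computed, assuming the standard definition: the sum of squared successive differences of the residuals divided by the sum of squared residuals. The residual sequences are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that flip sign every period (negative dependence).
alternating = [1, -1, 1, -1]
# Hypothetical residuals that stay on one side for a while (positive dependence).
drifting = [1, 1, 1, -1, -1, -1]

print(durbin_watson(alternating))         # 3.0
print(round(durbin_watson(drifting), 3))  # 0.667
```

The statistic ranges from 0 to 4. Values near 2 are consistent with uncorrelated residuals, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.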
Use this plot to check the similar variances condition. A. Outlier B. Timeplot of residuals C. Heteroscedasticity D. Normal quantile plot of residuals E. Plot of residuals on x F. Durbin-Watson statistic G. Homoscedasticity H. Leveraged I. Random sample from a population
E
What is the name of the variable that is used to predict another variable?
Explanatory
Consider the following simple linear regression model: y=B0+B1x+epsilon . The random error term is ________.
Epsilon
s/√n (s divided by the square root of n)
Estimated standard error of Ybar
Use this plot to check the nearly normal condition. A. Plot of residuals on x B. Leveraged C. Random sample from a population D. Durbin-Watson statistic E. Homoscedasticity F. Normal quantile plot of residuals G. Outlier H. Heteroscedasticity I. Timeplot of residuals J. Scatterplot of Y on X
F
Ninety-five percent z-intervals have the form of a statistic plus or minus 3 standard errors of the statistic.
False. Ninety-five percent z-intervals have the form of a statistic plus or minus about 2 (more precisely, 1.96) standard errors of the statistic.
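The multiplier can be looked up from the standard normal distribution. The sketch below uses Python's statistics.NormalDist. The helper z_interval is hypothetical, and the estimate of $50 with standard error 40/√100 = 4 is borrowed from the ATM withdrawal question purely as an illustration.

```python
from statistics import NormalDist

def z_interval(estimate, standard_error, confidence):
    """Two-sided z-interval: estimate plus or minus z * standard error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (estimate - z * standard_error, estimate + z * standard_error)

# The 95% multiplier is close to 2, not 3.
z_95 = NormalDist().inv_cdf(0.975)
print(round(z_95, 2))  # 1.96

# Illustrative intervals around $50 with SE = 4 at three confidence levels.
ci_90 = z_interval(50, 4, 0.90)
ci_95 = z_interval(50, 4, 0.95)
ci_99 = z_interval(50, 4, 0.99)
for low, high in (ci_90, ci_95, ci_99):
    print(round(low, 2), round(high, 2))
```

Higher confidence levels give wider intervals, so any value inside the 95% interval is also inside the 99% interval, and a 90% interval is shorter than a 99% interval.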
Consider the following simple linear regression model: y=B0+B1x+epsilon When determining whether there is a negative linear relationship between x and y, the alternative hypothesis takes the form ________.
H1: B1<0
Which alternative hypothesis should be used to test the significance of a positive slope in a regression model?
H1: B1>0
Term that describes data with unequal error variation (Outlier, Heteroscedasticity, Leveraged, Homoscedasticity)
Heteroscedasticity
Symbol for the standard deviation of the residuals
s_e (standard error of the regression)
A histogram with a long right tail. (interquartile range, standard deviation, skewed, variance, z-score)
Skewed
The square root of the variance.
Standard deviation
In practice, xbar and R-charts are used together to monitor a process. However, the R-chart should be interpreted before the xbar-chart. Why?
The control limits of the xbar-chart are a function of R, meaning that if the process variation is out of statistical control, the control limits of the xbar-chart have little meaning.
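The dependence of the xbar-chart on R can be made concrete. The sketch below assumes subgroups of size 5 and the commonly tabulated control chart constants for that size (A2 = 0.577, D3 = 0, D4 = 2.114). The subgroup data are hypothetical.

```python
# Tabulated control chart constants for subgroups of size 5 (assumption:
# other subgroup sizes need different constants).
A2, D3, D4 = 0.577, 0.0, 2.114

def control_limits(subgroups):
    """Return (lower, upper) limits for the R-chart and the xbar-chart."""
    means = [sum(group) / len(group) for group in subgroups]
    ranges = [max(group) - min(group) for group in subgroups]
    grand_mean = sum(means) / len(means)
    mean_range = sum(ranges) / len(ranges)
    return {
        "r_chart": (D3 * mean_range, D4 * mean_range),
        # The xbar limits are built from the average range, so they are
        # only meaningful if the R-chart shows stable variation.
        "xbar_chart": (grand_mean - A2 * mean_range, grand_mean + A2 * mean_range),
    }

# Hypothetical package weights in pounds, two subgroups of five.
limits = control_limits([[20, 21, 22, 23, 24], [19, 21, 22, 23, 25]])
print(limits)
```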
What does it mean to say "correlation does not imply causation"?
The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables.
Suppose you were looking at the histogram of the incomes of all of the households in the United States. Do you think that the histogram would be bell shaped? Skewed to the left or right?
The histogram would be heavily right-skewed.
What requirements are necessary for a normal probability distribution to be a standard normal probability distribution?
The mean must be μ=0 and the standard deviation must be σ=1.
A summary of sales made in the quarterly report of a department store says that the average retail purchase was $125 with a margin of error equal to $15. What does the margin of error mean in this context?
The population average of sales is within $15 of the estimate, with some degree of confidence.
In the sample regression equation ŷ = B0 + B1x, what is ŷ?
The predicted value of y given a specific x value
Describe the range of values for the correlation coefficient.
The range of values for the correlation coefficient is −1 to 1, inclusive.
An accountant at a retail shopping chain accidentally calculated the correlation between the phone number of customers and their outstanding debt. He should expect to find a substantial positive correlation.
The statement is false. The accountant should expect to find no correlation.
In a scatterplot, the response is shown on the horizontal axis with the explanatory variable on the vertical axis.
The statement is false. The explanatory variable is shown on the horizontal axis with the response on the vertical axis.
Cramer's V is 0 if the categorical variables are not associated.
The statement is true.
As the size of a sample increases, the standard deviation of the distribution of sample means increases.
This statement is false. A true statement is, "As the size of a sample increases, the standard deviation of the distribution of sample means decreases."
All other things the same, a 90% confidence interval is shorter than a 99% confidence interval.
True
Auditors at a bank randomly sample 100 withdrawal transactions made at ATM machines each day and use video records to verify that authorized users of the accounts made the transactions. The system records the amounts withdrawn. The average withdrawal is typically $50 with SD $40. Deposits are handled separately. A histogram of the average withdrawal amounts made daily over the span of a month should cluster around $50.
True
Auditors at a bank randomly sample 100 withdrawal transactions made at ATM machines each day and use video records to verify that authorized users of the accounts made the transactions. The system records the amounts withdrawn. The average withdrawal is typically $50 with SD $40. Deposits are handled separately. A histogram of the daily standard deviations of the withdrawal amounts over the span of a month should cluster around $40.
True
Cases is another name for the columns in a data table.
The statement is false. Cases is another name for the rows in a data table.
If the 90% confidence interval for the average purchase of customers at a department store is $40 to $120, then $100 is a plausible value for the population mean at this level of confidence.
True
Occurs if the p-value is less than the α-level when H0 is true
Type I error
Multicollinearity is suspected when ________.
there is a high R^2 coupled with insignificant explanatory variables
When estimating a population mean, are you more likely to be correct when you use a point estimate or an interval estimate? Explain your reasoning.
You are more likely to be correct using an interval estimate because it is unlikely that a point estimate will exactly equal the population mean.
The forward selection method of stepwise regression
adds predictors one at a time starting with the best single predictor.
Maximum tolerance for incorrectly rejecting H0
alpha level
A multiple regression model includes the product term X1·X2. The term is called ________.
an interaction
Counts cases that match values of two categorical variables
cell
Shown in a stacked bar chart
conditional distribution
Table of cross-classified counts
contingency table
x represents the volume of milk taken from one cow for a day
continuous
Autocorrelation occurs when the residuals are
correlated among each other
For the multiple regression model ŷ = 75 + 25x1 − 15x2 + 10x3, if we were to increase x2 by 5, holding x1 and x3 constant, the value of y will:
decrease on average by 75
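The change equals the coefficient times the increase: −15 × 5 = −75. A short check using the fitted equation from the question; the starting values of x1, x2, and x3 are arbitrary and chosen only for illustration.

```python
def predict(x1: float, x2: float, x3: float) -> float:
    """Fitted equation from the question: y-hat = 75 + 25*x1 - 15*x2 + 10*x3."""
    return 75 + 25 * x1 - 15 * x2 + 10 * x3

# Arbitrary starting point; only x2 changes (by 5) between the two calls.
before = predict(x1=1, x2=2, x3=3)
after = predict(x1=1, x2=7, x3=3)
change = after - before
print(before, after, change)  # 100 25 -75
```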
In the quadratic model y = β0 + β1x + β2x^2 + ε, a negative value of β1 indicates a downward concavity.
false (the sign of β2, not β1, determines the concavity)
In regression, multicollinearity is considered problematic when two or more explanatory variables are ________.
highly correlated
In regression analysis, which shape of residual plot demonstrates homoscedasticity?
horizontal band
Like a stacked bar chart but respecting the area principle
mosaic plot
If the simple correlation coefficient between two independent variables is greater than 0.90, then ________ is considered to be severe.
multicollinearity
In multiple regression analysis, when the independent variables are highly correlated, this situation is called ________.
multicollinearity
The Variance Inflation Factor (VIF) is used to assess ________________.
multicollinearity
The confidence interval estimate of the expected value of y for a given value x, compared to the prediction interval of y for the same given value of x and confidence level, will be:
narrower
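The difference comes from an extra "1 +" under the square root in the prediction standard error. A sketch assuming the standard simple regression formulas; s_e, n, the mean of x, and Sxx (the sum of squared deviations of x) are hypothetical values.

```python
import math

def se_mean_response(s_e, n, x, x_bar, sxx):
    """Standard error for estimating the mean of y at x."""
    return s_e * math.sqrt(1 / n + (x - x_bar) ** 2 / sxx)

def se_prediction(s_e, n, x, x_bar, sxx):
    """Standard error for predicting a single new y at x."""
    return s_e * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / sxx)

# Hypothetical regression summary: s_e = 2, n = 25, mean of x = 10, Sxx = 200.
fit = dict(s_e=2.0, n=25, x_bar=10.0, sxx=200.0)

near_mean = se_mean_response(x=10.0, **fit)
near_pred = se_prediction(x=10.0, **fit)
far_pred = se_prediction(x=30.0, **fit)
print(round(near_mean, 3), round(near_pred, 3), round(far_pred, 3))
```

Both standard errors grow with the distance of x from the mean of x, which is why intervals widen under extrapolation.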
In regression modeling, if you take a particular x value and plug it into a regression line equation, the result is a(n) ____________________.
point
Which process parameter is an xbar-chart used to monitor?
process mean
What characteristic of a process is an R-chart designed to monitor?
process variation
Simple linear regression analysis differs from multiple regression analysis in that ________.
simple linear regression uses only one explanatory variable
________ is an iterative variable selection procedure that allows an independent variable to be added to a multiple regression model in one iteration and deleted during the next iteration.
stepwise regression
Even if all the points on an xbar-chart fall between the control limits, the process may be out of control. Explain.
the process may be out of control because there may be nonrandom patterns of variation that have not yet broken through the control limits (or may never break through).
Autocorrelation is typically observed in ________.
time series data
The Cp statistic is used ________.
to choose the best model in regression model-building.
Symbol for the explanatory variable in a regression
x
Which statistic is the best unbiased estimator for μ?
xbar
Symbol for the response in a regression
y
Residual from an estimated regression equation
y − ŷ
Fitted value from an estimated regression equation
ŷ
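To make fitted values and residuals concrete, here is a minimal least squares sketch. The fitted value at each x is ŷ = b0 + b1·x and the residual is y − ŷ. The four data points are hypothetical.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

b0, b1 = least_squares(xs, ys)
fitted = [b0 + b1 * x for x in xs]               # y-hat for each case
residuals = [y - f for y, f in zip(ys, fitted)]  # y minus y-hat

print(round(b0, 3), round(b1, 3))
print([round(e, 3) for e in residuals])
```

Least squares residuals sum to zero (up to rounding) when the model includes an intercept.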
The number of standard deviations from the mean.
z-score
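As a small worked example, using μ=22 and σ=5 from the warehouse question and a hypothetical package weight of 32 pounds:

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# mu = 22 and sigma = 5 come from the warehouse question; 32 is hypothetical.
z = z_score(32, mean=22, sd=5)
print(z)  # 2.0
```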