STAT 326 Final Study
interpretation for R-squared in SLR
___ % of the variation in (y) is explained by the linear relationship between (x) and (y)
interpretation for R-squared in MR
___% of the variability in (y) is explained by the multiple regression model, including (list all explanatory variables)
List these from smallest margin of error to largest margin of error a) A 99% confidence interval with a sample size of 50. b) A 99% confidence interval with a sample size of 25. c) A 90 % confidence interval with a sample size of 50. d) A 95% confidence interval with a sample size of 50.
C, D, A, B
what is a parameter
a numerical description of a population
what is a statistic
a numerical description of a sample
correlation coefficient ("r")
a numerical measure of the direction and strength of a linear relationship between 2 quantitative variables
How does the standard error for an individual observation compare to the standard error for a mean response?
it will always be larger
what does overfitting do to a model? What does it do to PIs? What is the rule of thumb?
It reduces the model's predictive power and widens the prediction intervals (PIs). Rule of thumb: there must be at least 10 observations per explanatory variable.
How is the "independence" assumption checked?
look at prompt, see if sampling method was simple random sample
extrapolation
making predictions that are outside our range of x-values used to fit the LS model
as sample size increases
margin of error decreases
coefficient of determination (R-squared) in MR
measures how well our linear model fits our data
Multiple Regression (MR)
more than one explanatory variable & a numerical response variable
How do you calculate DF in SLR?
n-2
How do you calculate DF in MR?
n-p-1 (p= number of slope coefficients)
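The two degrees-of-freedom rules can be sketched as a pair of one-liners (the numbers below are illustrative):

```python
def df_slr(n):
    """Degrees of freedom in simple linear regression: n - 2."""
    return n - 2

def df_mr(n, p):
    """Degrees of freedom in multiple regression: n - p - 1,
    where p is the number of slope coefficients."""
    return n - p - 1

# e.g. n = 30 observations with p = 3 explanatory variables
print(df_slr(30))    # 28
print(df_mr(30, 3))  # 26
```

Note that the SLR rule is just the MR rule with p = 1: n - 1 - 1 = n - 2.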
How can we determine the value of the correlation coefficient (r) by only referring to the info from JMP?
need to figure out the direction of the relationship (can determine by looking at the estimated slope value or observing the scatterplot). Also need the R-squared value: r = ±√(R-squared), with the sign matching the direction
What does the null hypothesis for the F-test imply?
none of the explanatory variables contribute to helping predict the response (model is NOT useful)
in backward elimination, as you decrease the p to leave, what happens to the number of vars?
The number of vars in the final model decreases (a smaller p to leave makes it harder for variables to stay in the model)
True or false? If the sample mean of a random sample from an x distribution is relatively small, when the confidence level c is reduced, the confidence interval for μ becomes shorter.
True. As the level of confidence decreases, the maximal error of estimate decreases.
True or false? If the original x distribution has a relatively small standard deviation, the confidence interval for μ will be relatively short.
True. As σ decreases, E decreases, resulting in a shorter confidence interval.
true or false The value zc is a value from the standard normal distribution such that P(-zc < z < zc) = c.
True. By definition, critical values zc are such that 100c% of the area under the standard normal curve falls between -zc and zc.
True or false? Consider a random sample of size n from an x distribution. For such a sample, the margin of error for estimating μ is the magnitude of the difference between x̄ and μ.
True. By definition, the margin of error is the magnitude of the difference between x̄ and μ.
True or false? The point estimate for the population mean μ of an x distribution is x̄, computed from a random sample of the x distribution.
True. The mean of the x̄ distribution equals the mean of the x distribution, and the standard error of the x̄ distribution decreases as n increases.
calculate Margin of Error
(Upper bound - Lower bound) / 2
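A minimal sketch of recovering the margin of error (and the point estimate, which sits at the interval's midpoint) from reported bounds; the bounds below are illustrative:

```python
def margin_of_error(lower, upper):
    """Margin of error recovered from a CI: half the interval's width."""
    return (upper - lower) / 2

def point_estimate(lower, upper):
    """The point estimate is the midpoint of the interval."""
    return (lower + upper) / 2

# illustrative 95% CI of (10.1, 12.2):
print(margin_of_error(10.1, 12.2))  # ≈ 1.05
print(point_estimate(10.1, 12.2))   # ≈ 11.15
```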
population regression model
Uy = B0 + B1*x1 + B2*x2 + ... + Bp*xp (in SLR: Uy = B0 + B1*x)
what are some indicators of multicollinearity?
VIF ≥ 10, strong correlation between the x's, or slope signs opposite to what we expect. (An opposite slope sign is only a concern if the original correlation between that x and y is moderate or strong.)
interpretation of a CI for the mean response (Uy)
We are 95% confident that the MEAN (Y) will be between (lower bound) and (upper bound), for a given value of x
interpretation of a PI for a single response (y)
We are 95% confident that the actual response will be between (lower bound) and (upper bound), for a given value of x
interpretation of CI for a regression slope(B1) example
We are 95% confident that for every additional unit increase in x, the MEAN response will (increase/decrease) by between (lower bound) and (upper bound)
interpretation for RMSE
We expect approximately 95% of the actual observations to be within (2*RMSE) of their corresponding predicted values
Conclusion for F-test
We have/lack statistically significant evidence that this multiple regression model is useful in predicting the response (y), at the .05 alpha level
interpretation of bo example
When the explanatory variable is equal to 0, the PREDICTED response is (bo) units
Sam computed a 90% confidence interval for μ from a specific random sample of size n. He claims that at the 90% confidence level, his confidence interval contains μ. Is this claim correct? Explain.
Yes. The proportion of all confidence intervals based on random samples of size n that contain μ is 0.90.
how does mixed (stepwise) selection work?
You begin with no variables selected, like a forward model. Set a p to enter and a p to leave (keep p to enter ≤ p to leave to avoid infinite loops of adding and removing the same variable). At each step, add the candidate with the smallest p-value below the p to enter and remove any variable in the model whose p-value rises above the p to leave; stop when no variable can be added or removed.
what are the ways to detect multicollinearity?
Use a scatterplot matrix: look for pairs of explanatory variables with a strong correlation (visually, tight, narrow point clouds). Also check the correlation output to see how close any pairwise correlation between x's is to 1 (or -1), and check whether any VIF is ≥ 10.
what does it mean if vif is above 10?
You have multicollinearity
What happens to the width of a confidence interval as we increase the level of confidence ("C")?
it gets wider
How does R-squared adjusted compare to R-squared?
it is always smaller
How does the width of a CI (for the mean response Uy) compare to the width of a PI (for a single future observation y)?
it is narrower
interpretation of bo in MR
when all explanatory variables are equal to 0, the PREDICTED response (y hat) is ___
formula for PI for a single observation (y)
y hat +/- t* SE y hat (JMP refers to SE as standard error of individual)
formula for CI for an average/mean response (Uy)
y hat +/- t* SE(u hat) (JMP refers to SE as standard error of predicted)
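Both interval formulas share the same shape, y hat ± t*·SE; only the standard error changes. A sketch with made-up numbers (y hat, t*, and both SEs would come from JMP; t* = 2.045 is the 95% critical value for df = 29):

```python
def interval(y_hat, t_star, se):
    """Generic y_hat ± t*·SE; pass the SE of the predicted mean
    for a CI, or the SE of an individual for a PI."""
    return (y_hat - t_star * se, y_hat + t_star * se)

# illustrative values only:
ci = interval(50, 2.045, 1.2)  # SE of predicted mean (smaller)
pi = interval(50, 2.045, 3.4)  # SE of individual (always larger)
print(ci, pi)  # the PI comes out wider than the CI
```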
estimated/predicted model
y hat= bo + b1*x
does multicollinearity lead to misleading slopes and redundant information?
yes
how is the SLR line formed?
-determined as the line that minimizes the sum of squared vertical distances between the observations & the regression line (by *minimizing the summation of squared errors*, i.e. least squares)
what happens to R-squared when explanatory variables are ADDED to the model? (may not be meaningful)
-increases
How is the "normality of residuals/the error" assumption checked?
-normal quantile plot of residuals *want residual points to follow straight line and stay within the bands*
How is the "form" assumption checked?
-with residual plot (x-axis: predicted values, y-axis: residuals of response) *want to see both positive and negative residuals*
How is the "constant variance" assumption checked?
-with residual plot (x-axis: predicted values, y-axis: residuals of response) *want to see similar spread of residuals w/ no certain pattern*
What assumptions must hold in MR to make valid inferences?
1) form 2) constant variance 3) normality of residuals 4) independence
What assumptions must hold in SLR to make valid inferences?
1) form is linear 2) constant variance of residuals 3) normality of residuals 4) independence
how do you do the forward selection?
1) make an SLR for each of the explanatory vars 2) do a hypothesis test to see if there is a linear association with the response 3) look for the smallest p-value and make sure it's smaller than your p to enter 4) add that var to the model and make an MR model with it and each of the remaining vars in turn 5) repeat until your smallest p-value is bigger than the p to enter
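The forward-selection loop can be sketched generically; the model fitting and p-values are abstracted into a caller-supplied function, since in practice they come from JMP (the toy p-values below are made up for illustration):

```python
def forward_selection(candidates, p_to_enter, pvalue_fn):
    """Generic forward selection: repeatedly add the candidate with
    the smallest p-value, stopping once even the smallest p-value
    among the remaining candidates fails to clear p_to_enter.
    pvalue_fn(selected, candidate) -> p-value the candidate's slope
    would have if added to the current model."""
    selected, remaining = [], list(candidates)
    while remaining:
        pvals = {x: pvalue_fn(selected, x) for x in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_to_enter:
            break  # nothing left clears the p-to-enter threshold
        selected.append(best)
        remaining.remove(best)
    return selected

# toy, hard-coded p-values (illustration only):
toy = {"x1": 0.001, "x2": 0.03, "x3": 0.40}
result = forward_selection(toy, 0.05, lambda sel, x: toy[x])
print(result)  # ['x1', 'x2'] -- x3 never clears p to enter = 0.05
```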
how do you calc vif?
1/(1-R^2i), where R^2i is the coefficient of determination from regressing xi on the other explanatory variables
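A one-line sketch of the formula (the R²i values below are illustrative):

```python
def vif(r_squared_i):
    """Variance inflation factor for x_i: 1 / (1 - R²_i), where R²_i
    comes from regressing x_i on the other explanatory variables."""
    return 1 / (1 - r_squared_i)

print(vif(0.5))  # 2.0 -- mild
print(vif(0.9))  # ≈ 10, the rule-of-thumb cutoff for concern
```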
what is the minimum number of observations per explanatory var?
10
Sam computed a 95% confidence interval for μ from a specific random sample. His confidence interval was 10.1 < μ < 12.2. He claims that the probability that μ is in this interval is 0.95. What is wrong with his claim?
Either μ is in the interval or it is not. Therefore, the probability that μ is in this interval is 0 or 1.
True or false? A larger sample size produces a longer confidence interval for μ.
False. As the sample size increases, the maximal error decreases, resulting in a shorter confidence interval.
True or false? Every random sample of the same size from a given population will produce exactly the same confidence interval for μ.
False. Different random samples may produce different values, resulting in different confidence intervals.
True or false? If the sample mean of a random sample from an x distribution is relatively small, then the confidence interval for μ will be relatively short.
False. The maximal error of estimate controls the length of the confidence interval regardless of the value of x̄.
interpretation of b1 example
For every unit increase in the explanatory variable, the PREDICTED response will (increase or decrease) by ___.
what are the effects of multicollinearity?
Higher standard errors and potentially misleading slopes (you can believe x has a negative impact on y when in reality it doesn't; it just appears that way after accounting for the other explanatory vars)
let's say your vars all show the same p-value in JMP. What do you do?
Look at the t-ratio: the largest |t| corresponds to the smallest underlying p-value
How is the F-ratio calculated?
MSR/MSE or (SSR/p) / (SSE/(n-p-1)) *obtain from JMP*
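A sketch of the arithmetic with made-up sums of squares:

```python
def f_ratio(ssr, sse, n, p):
    """F-ratio = MSR / MSE = (SSR / p) / (SSE / (n - p - 1))."""
    msr = ssr / p
    mse = sse / (n - p - 1)
    return msr / mse

# illustrative: SSR = 300, SSE = 100, n = 30, p = 3
# MSR = 300/3 = 100, MSE = 100/26, so F = 26 (a large F favors "useful")
print(f_ratio(300, 100, 30, 3))
```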
can vars be removed after entering in forward selection?
NO
do any of the three procedures remove multicollinearity?
No, but they help reduce it
statistic for standard deviation
s
total sum of squares (SST)
SSR+SSE
how do you do backward elimination? What is a disadvantage?
Start with the full model, remove the var with the largest p-value (if it is above the p to leave), and keep going until all remaining p-values are below the p to leave. Disadvantage: you need lots of data to fit the full model.
What does the alternative hypothesis for the F-test imply?
at least 1 explanatory variable is helpful in predicting the response, however we do not know which or how many variables (model is useful)
how to calculate a CI for a regression slope(B1)
b1 +/- t* se(b1)
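A sketch with illustrative numbers (b1 and SE(b1) would come from JMP; t* = 2.048 is the 95% critical value for df = 28):

```python
def slope_ci(b1, t_star, se_b1):
    """CI for the regression slope: b1 ± t*·SE(b1)."""
    return (b1 - t_star * se_b1, b1 + t_star * se_b1)

# illustrative: b1 = 1.8, SE(b1) = 0.25
lo, hi = slope_ci(1.8, 2.048, 0.25)
print(lo, hi)  # ≈ (1.288, 2.312)
```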
How does the interpretation of the population slope (Bi) change in comparison to the interpretation of the sample slope(bi)?
change the predicted response to say MEAN (or average) response
how do you determine which var to remove?
It depends on things like interpretability of the x var, measurability (e.g., how much it would cost to collect), and its correlation with the other x vars in the model
What is the Global (F-test) used for?
determine if a model is useful
calculate a residual
e = Actual - Predicted; e = y - y hat
formula for t-statistic
estimate/ SE(estimate)
interpretation of bi in MR
for every unit increase in xi (one explanatory variable), the PREDICTED response (y hat) will increase/decrease by (bi), assuming all other explanatory variables are held constant
4 features to look for in a scatterplot
form, direction, strength, outliers
How is the F-test conducted?
hypothesis test that encompasses all B's by testing Ho: B1 = B2 = ... = Bp = 0 vs. Ha: at least one Bi ≠ 0
Simple Linear Regression (SLR)
one numerical explanatory variable & one numerical response variable
How does the p-value change for a one-sided hypothesis test?
p-value/2 (when the sample estimate falls in the direction of Ha)
RMSE
the estimated standard error of the regression model (residual standard error) =s (square root of MSE)
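RMSE can be recovered from the ANOVA table as the square root of MSE; a sketch with made-up numbers:

```python
import math

def rmse(sse, n, p):
    """Root mean squared error: s = sqrt(MSE) = sqrt(SSE / (n - p - 1))."""
    return math.sqrt(sse / (n - p - 1))

# illustrative: SSE = 100, n = 30, p = 3, so MSE = 100/26
s = rmse(100, 30, 3)
print(round(s, 3))  # 1.961
```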
as the confidence level (C) increases
the margin of error (E) increases
as the standard deviation increases
the margin of error (E) increases
as the standard deviation decreases
the margin of error decreases
as sample size decreases
the margin of error increases
what does "Uy" represent?
the mean response (Y), for a given value of x
What does a larger F-ratio value mean?
the model does a better/good job at predicting our response
in forward selection, the larger your p to enter...?
...the more vars in your final model
what is a sampling distribution
the probability distribution of a sample statistic
At what value of x can we predict the most precise intervals with?
the sample mean of x
model/regression sum of squares (SSR)
variation that can be explained through the multiple regression model
error sum of squares (SSE)
variation that remains unexplained
What is multicollinearity?
when 2 x's are highly correlated and so provide redundant info about the y