STAT 661
What are the regression coefficients?
B0 and B1
How can the strength of the linear relationship between two numerical variables be measured?
By the correlation coefficient.
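If log X and log Y (both variables logged)
A c-fold change in X is associated with a multiplicative change of c^B1 in the median of Y (for c > 1: if B1 > 0, an increase of (c^B1 - 1) x 100%; if B1 < 0, a decrease of (1 - c^B1) x 100%)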
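The percent-change rule above can be checked with a small calculation. This is only a sketch: the function name percent_change_in_median and the slope values are made up for illustration.

```python
def percent_change_in_median(b1, c):
    """Percent change in the median of Y when X is multiplied by c,
    for a model fitted with both X and Y on the log scale."""
    return (c ** b1 - 1) * 100


# Hypothetical slopes, chosen only to show the sign convention.
print(percent_change_in_median(0.5, 2))   # positive value: an increase
print(percent_change_in_median(-1.0, 2))  # negative value: a decrease
```

A negative result corresponds to the "decrease of (1 - c^B1) x 100%" case.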
How is r-squared calculated from ANOVA?
SSR/SST (equivalently, 1 - SSE/SST)
R squared is the mathematical notation for:
The coefficient of determination
Define regression model
a function used to describe the regression
T/F: The least squares method is resistant to outliers.
false
T/F: The 95% prediction interval succeeds in capturing the future Y in 95% of its applications.
true
T/F: The best fit, or least squares, line minimizes the sum of the squares of the residuals
true
T/F: The degree of freedom for standard error and standard deviation in least squares regression is n-2.
true
T/F: The mean of the residuals is 0
true
T/F: The method of least squares is a procedure for choosing estimates of parameters in regression and ANOVA.
true
T/F: The regression effect is a phenomenon where an attribute that is extreme on an initial measurement will tend to be closer toward the mean of a group on a subsequent measurement
true
T/F: The simple linear regression model demonstrates a way to determine a line that best fits a population of X and Y data with a linear relationship.
true
T/F: The standard deviation of data around the prediction line is known as the standard error of the estimate in the simple linear regression model.
true
T/F: There are no degrees of freedom for estimating variance.
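false (in simple linear regression the variance is estimated with n-2 degrees of freedom; there are no degrees of freedom left only when a separate mean is fitted to every observation without replication)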
T/F: if the odds of a yes outcome are w, the odds of no outcome are 1/w.
true
T/F: if the odds of a yes outcome are w, the probability of yes is pi=w/(1+w)
true
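The odds and probability conversions in the two cards above can be written out as a short sketch. The function names are illustrative, not from any course material.

```python
def prob_from_odds(w):
    """Probability of yes when the odds of yes are w."""
    return w / (1 + w)


def odds_from_prob(p):
    """Odds of yes when the probability of yes is p (p < 1)."""
    return p / (1 - p)


def odds_of_no(w):
    """Odds of no when the odds of yes are w (w > 0)."""
    return 1 / w


print(prob_from_odds(3))    # odds of 3 to 1
print(odds_from_prob(0.5))  # even odds
print(odds_of_no(4))
```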
T/F: The farther X is from the sample mean of the X's, the larger the SE of the estimated mean of Y
true
T/F: the slope and intercept of the regression line can be estimated by the method of least squares.
true
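As a rough illustration of least squares estimation, the sketch below computes the slope and intercept from the usual closed-form formulas. The data are invented for the example, and least_squares is a hypothetical helper name.

```python
def least_squares(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)

# Residuals are observed minus fitted values; they sum to zero
# (up to floating-point rounding) for a least squares fit.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, sum(residuals))
```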
T/F: The values for SST, SSR and SSE provide little information about the simple linear regression model.
true (The values for SST, SSR and SSE by themselves provide little information about the simple linear regression model. However, as part of other formulas, they can provide useful information about the results of the model)
What is R-squared when X is no help at all in predicting Y (SSregression is 0)?
0
What is R-squared if X can be used to predict Y exactly (all residuals are 0, so SSR = SST)?
1
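The two extreme cases above can be reproduced numerically. The sketch below is an assumption-laden example: r_squared is a hypothetical helper, the data are invented, and it uses R-squared = 1 - SSE/SST, which equals SSR/SST for a least squares fit.

```python
def r_squared(y, fitted):
    """R-squared computed as 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
    return 1 - sse / sst


y = [2, 4, 5, 4, 5]          # made-up responses
y_bar = sum(y) / len(y)

# X is no help: every fitted value is just the mean of Y.
print(r_squared(y, [y_bar] * len(y)))
# X predicts Y exactly: every residual is 0.
print(r_squared(y, y))
```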
What are the 2 consequences if equal variance is not met?
1. SEs inaccurately describe uncertainty 2. tests and CIs can be misleading
What are 2 violations of the linearity assumption that can make least squares results misleading (over- or underestimated means, inaccurate t-tests and CIs)?
1. a straight line may be an inadequate model (for example, the relationship may contain curvature) 2. outliers may skew the fit
What 4 things does the regression approach accomplish?
1. allows for interpolation 2. gives more degrees of freedom for error estimation 3. gives smaller standard errors for estimates of the mean responses 4. provides a simpler model
What are 2 benefits of replication?
1. an estimate of s.d.-squared that does not depend on any model being correct 2. a clearer picture of the relationship between the variance and the mean
What are the 3 inferences in simple linear regression?
1. inferences about B0 and B1 2. estimation of the mean of Y given X 3. prediction of a single, future value of Y given X
What is replication?
Applying the exact same treatment to more than one experimental unit.
If log X (explanatory variable logged)
Associated with each doubling of X is a B1[log(2)] change in the mean of Y
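A quick way to apply this rule is sketched below. It assumes X was logged with the natural logarithm, and the function name and slope value are illustrative only.

```python
import math


def change_in_mean_of_y(b1, factor=2):
    """Change in the mean of Y when X is multiplied by `factor`,
    assuming the model uses the natural log of X."""
    return b1 * math.log(factor)


# Hypothetical slope of 1.5: effect of doubling X and of a 10-fold change in X.
print(change_in_mean_of_y(1.5))
print(change_in_mean_of_y(1.5, factor=10))
```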
At what value of X will there be the most precise estimate of the mean of Y?
At the sample average of the X's used in the estimation.
At what value of X will there be the most precise prediction of a future Y?
At the sample average of the X's used in the estimation.
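Both answers follow from the standard error formulas, where the distance of X from the sample average enters as a squared term. The sketch below uses the usual textbook formulas; the function names, the X values, and the residual SD s = 1.0 are assumptions made for the example.

```python
import math


def se_mean(s, x, x0):
    """SE of the estimated mean of Y at X = x0 (s = residual SD estimate)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)


def se_pred(s, x, x0):
    """SE of prediction for a single future Y at X = x0."""
    return math.sqrt(s ** 2 + se_mean(s, x, x0) ** 2)


x = [1, 2, 3, 4, 5]  # made-up X values; their average is 3
for x0 in (3, 4, 5):
    print(x0, se_mean(1.0, x, x0), se_pred(1.0, x, x0))
```

Both standard errors are smallest at x0 = 3, the sample average, and grow as x0 moves away from it.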
What 2 parameters are involved in the simple regression model?
B0 intercept and B1 slope
Why is randomization important even in laboratory/controlled environments?
Confounding variables can still exist
What assumptions are made about the distribution of the explanatory variable in the normal simple linear regression model?
None
Why can an R-squared close to 1 not be used as evidence that the simple linear regression model is appropriate?
R-squared shows a strong linear relationship but does not account for possible curvature
What does the slope (b1) represent?
The estimated change in average Y per unit change in X.
What does the Y intercept (b0) represent?
The predicted value of Y when X = 0.
What information is contained in the coefficient of determination?
The proportion of total variation in Y that is explained by X
What does the standard error of the estimate measure?
The variation around the regression line
What is the purpose of a simple linear regression?
To predict scores on a dependent variable from scores on a single independent variable
A 95% confidence interval for B1 is determined to be (15,30). Interpret the meaning of this interval.
You can be 95% confident that the average value of Y will increase by between 15 and 30 units for every one-unit increase in X.
If, after performing a Student test for comparison of means, we obtain p = 0.0256, then: A. We reject H0 and accept HA B. We accept H0 C. We reject HA D. We cannot decide
a
The result of a statistical test, denoted p, shall be interpreted as follows: A. the null hypothesis H0 is rejected if p <0.05 B. the null hypothesis H0 is rejected if p> 0.05 C. the alternate hypothesis H1 is rejected if p> 0.05 D. the null hypothesis H0 is accepted if p <0.05
a
What is the residual sum of squares?
a measure of the distance between all responses and their fitted values
Define simple linear regression model
a particular regression model in which the regression is a straight-line function of a single explanatory variable
Which of the following apply: Which of the following tests are parametric tests: A. ANOVA B. Student C. Wilcoxon D. Kruskal-Wallis
a, b
Which of the following apply: The Student's t test is: A. a parametric test B. a nonparametric test C. a test for comparing averages D. a test for comparing variances
a, c
Which of the following apply: The fundamental statistical indicators are: A. Mean B. Median C. Variance D. Standard deviation
a, d
Consider the regression of weight on height for a sample of adults. Suppose the intercept (B0) is 5 kg. a) Does this imply that adults of height 0 weigh 5 kg? b) Does this mean simple linear regression is meaningless?
a. No. height = 0 is outside the range of observed values. b. No, it is useful for answering questions within the observed range.
T/F: The standard chi-squared test for a 2 by 2 contingency table is valid only if: (a) all the expected frequencies are greater than five (b) both variables are continuous (c) at least one variable is from a Normal distribution (d) all the observed frequencies are greater than five (e) the sample is very large
a. T b. F c. F d. F e. F
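The condition in (a) concerns expected frequencies, not observed ones. The sketch below computes expected counts under independence for a 2 by 2 table; the table values and function names are invented for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def all_expected_above_five(table):
    """Check the validity condition for the standard chi-squared test."""
    return all(e > 5 for row in expected_counts(table) for e in row)


observed = [[12, 8], [3, 17]]  # hypothetical counts; one observed cell is below 5
print(expected_counts(observed))
print(all_expected_above_five(observed))
```

Here one observed count is 3, yet every expected count exceeds 5, which is why (d) is false while (a) is true.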
Given regression equation: mean pH= 6.98 - 0.73(log time x) a. interpret B1 b. interpret B0
a. For each increase of 1 in log time, the mean pH is estimated to decrease by 0.73. b. The mean pH is estimated to be 6.98 when log time = 0.
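The fitted equation can be evaluated directly. In this sketch the function name is made up, and the argument is the already-logged time, since the card does not state which log base was used.

```python
def estimated_mean_ph(log_time):
    """Estimated mean pH from the fitted equation: 6.98 - 0.73 * log(time)."""
    return 6.98 - 0.73 * log_time


print(estimated_mean_ph(0))                         # the intercept, B0
print(estimated_mean_ph(2) - estimated_mean_ph(1))  # the slope, B1
```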
If logY (response variable logged)
as X increases by 1, the median of Y changes by a multiplicative factor of e^B1
Unless a relationship between X and Y is perfect, then predictions for Y A. will fall on a straight line. B. will be closer to the mean of Y. C. will be closer to the mean of X. D. will be invalid.
b
Why is normality important for prediction intervals?
A prediction interval relies on normality of the population distribution of Y itself, not of a sampling distribution, so a large sample does not rescue it.
A longitudinal or prospective study is also referred to as a(n): A. ecological study B. cross-sectional study C. cohort study D. observational study
c
An odds ratio = 1.0 means that the odds of the event is _______: a. 100 times greater in the control group compared to the treatment group. b. 100 times greater in the treatment group compared to the control group. c. the same in both the treatment and control group. d. 10 times greater in the treatment group compared to the control group.
c
The stage of a malignant disease (cancer) is recorded using the symbols 0, I, II, III, IV. We say that the scale used is: A. Alphanumeric B. Numerical C. Ordinal D. Nominal
c
What does it mean to say there is error in our regression? A. We calculated it wrong. B. There were data entry errors. C. We can not predict Y perfectly. D. The data points all fall on a straight line.
c
r2 tells us: A. how to determine someone's score. B. how to describe a relationship. C. the proportion of variability in Y accounted for by X. D. all of the above.
c
All of the following are true of the odds ratio except: A. It is an estimate of relative risk B. It is the only measure of risk that can be obtained directly from a case-control study C. It tends to be biased towards 1 (neither risk nor protection) at high rates of disease D. It is the ratio of incidence in exposed divided by incidence in nonexposed E. It can be calculated without data on rates (as in a case-control study)
d
Heteroscedasticity occurs when A. there are larger values on X than Y. B. there is a linear relationship between X and Y. C. more error is accounted for than remains. D. variability in Y depends on the exact value of X.
d
The line described by the regression equation attempts to: A. pass through as many points as possible. B. pass through as few points as possible. C. minimize the number of points it touches. D. minimize the squared distance from the points.
d
The standard deviation is to the mean as the ____________ is to the regression line. A. z-score B. SSR C. coefficient of determination D. standard error of the estimate
d
A regression analysis is inappropriate when: A. you have two variables that are measured on an interval or ratio scale. B. you want to make predictions for one variable based on information about another variable. C. the pattern of data points form a reasonably straight line. D. there is heteroscedasticity in the scatter plot.
d (A major assumption of the regression analysis is that there be equal variance at all values of X)
What do residuals represent?
difference between the actual Y values and the predicted Y values
T/F: Homoscedasticity is a term for the linearity of the random errors for all values of X.
false (Homoscedasticity is a term for equal variance among the random errors for all values of X. This is a requirement of the simple linear regression model)
T/F: 95% CI: With large samples, you know the mean with much more precision than you do with a small sample, so the confidence interval is quite wide when computed from a large sample.
false (CI is quite NARROW when computed from a large sample)
T/F: As sample size increases, SE increases
false (SE gets smaller)
T/F: The least-squares method is one way to determine the simple linear equation. The method maximizes the sum of the squared differences between actual and predicted values of Y.
false (The method minimizes the sum of the squared differences between actual and predicted values of Y. Minimizing this measurement of the differences can determine the coefficients needed to create the Y intercept and slope of a line needed to calculate the formula for a line that best fits the data)
T/F: A strong linear relationship of X and Y data indicates that X causes Y.
false (cannot by itself prove cause-effect)
T/F: SSE is the sum of squares residuals. The goal is to create a linear model that maximizes the SSE.
false (goal is to MINIMIZE the SSE)
T/F: Predicting values of Y from a given X within the data range is known as extrapolation.
false (interpolation)
T/F: The variance about the regression is estimated by the sum of squared residuals divided by the degrees of freedom n-1
false (n-2)
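The n-2 divisor can be seen in a short calculation. The residuals below are invented for the example, and residual_variance is a hypothetical helper name.

```python
import math


def residual_variance(residuals):
    """Estimated variance about the regression: SSE / (n - 2)."""
    n = len(residuals)
    sse = sum(e ** 2 for e in residuals)
    return sse / (n - 2)  # two df are used up estimating B0 and B1


# Made-up residuals from a simple linear regression fit to 5 points.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
variance = residual_variance(residuals)
print(variance, math.sqrt(variance))  # variance and its square root (same units as Y)
```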
T/F: it is possible to use a prediction interval to predict exactly about an individual in a population
false (not possible: a parameter can be estimated, but an individual value cannot be predicted exactly)
What is the estimated mean called?
fitted value or predicted value
What is wrong with the formula: Y = B0 + B1X?
implies exact relationship between Y and X. Should be the mean of Y as a function of X
What assumptions are used for exact justification of tests and CI for slope and intercept in simple regression?
independence, constant variance, linearity, normality (normality is least important)
What is a prediction interval?
indicates likely values for a future value of a response variable at some specified value of the explanatory variable
What is the difference between the observed response and its estimated mean?
residual
What is the standard error of prediction as the sample size approaches infinity?
the standard deviation of Y about the regression line
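The limit can be seen numerically. This sketch evaluates the standard errors at X equal to the sample average, where the distance term drops out, and treats the standard deviation sigma = 2.0 as known; both choices are simplifying assumptions for the example.

```python
import math


def se_mean_at_average(sigma, n):
    """SE of the estimated mean of Y at X = sample average of the X's."""
    return sigma / math.sqrt(n)


def se_pred_at_average(sigma, n):
    """SE of prediction for one future Y at X = sample average of the X's."""
    return sigma * math.sqrt(1 + 1 / n)


for n in (10, 100, 10_000, 1_000_000):
    print(n, se_mean_at_average(2.0, n), se_pred_at_average(2.0, n))
```

The SE of the mean shrinks toward 0 as n grows, while the SE of prediction levels off at sigma.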
What are residuals?
the differences between the observed and expected dependent variable scores
Define regression
the mean of a response variable as a function of the explanatory variable
T/F: A binary variable is a way of coding a 2-group categorical response - like dead or alive or diseased or not diseased - into a number (0, 1)
true
T/F: A high coefficient of determination (r2) indicates that the proportion of variation in Y is due more to X than to unexplained variation.
true
T/F: A proportion must be between 0 and 1; odds must be greater than or = to 0, but have no upper limit
true
T/F: Confidence intervals can be made arbitrarily smaller by increasing n, whereas prediction intervals cannot.
true
T/F: Fitted values should be closer to the regression line than sample mean values.
true
T/F: For a prediction interval, normality and equal variance become very important, but not sample size
true
T/F: If the OR = 1, the outcome of interest has the same probability of occurring in both groups.
true
T/F: If the outcome is the same in both groups the odds ratio will be 1, which implies there is no difference between the two arms of the study.
true
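An odds ratio can be computed from the four cells of a 2 by 2 table. The counts and the function name below are hypothetical.

```python
def odds_ratio(events_1, non_events_1, events_2, non_events_2):
    """Odds of the event in group 1 divided by the odds in group 2."""
    odds_1 = events_1 / non_events_1
    odds_2 = events_2 / non_events_2
    return odds_1 / odds_2


# Same outcome in both arms: odds ratio of 1.
print(odds_ratio(10, 90, 10, 90))
# More events in group 1: odds ratio above 1.
print(odds_ratio(20, 80, 10, 90))
```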
T/F: In a case-control study (retrospective), you cannot measure incidence, because you start with diseased people and non-diseased people, so you cannot calculate relative risk.
true
T/F: Plotting X against the residual (e) is a method for determining the linearity of X-Y data.
true
T/F: Plotting X against the variability of the residual (e) is a method for determining the variance of the difference between observed and predicted values of Y in X-Y data.
true
T/F: R-squared is unitless.
true
T/F: RMSE (root mean square error) has the same units as Y (response)
true
T/F: The 95% confidence interval defines a range of values that you can be 95% certain contains the population mean.
true
T/F: lack of independence causes no bias in least squares estimates of the coefficients BUT standard errors are seriously affected.
true
T/F: least squares estimates of B0 and B1 are those values for the intercept and slope that minimizes the sum of squared residuals.
true
T/F: A proportion of 1/2 corresponds to odds of 1
true (equal odds, even odds, odds are fifty-fifty)
T/F: If the 95% CI for the odds ratio does not include 1, then the p-value is <0.05.
true (moderately convincing evidence)
T/F: The prediction line only gives a point estimate of the value for Y for any given X.
true (The prediction line formula Ypredicted = b0 + b1X yields a single Y value for a given X. The uncertainty in this value can be described by a confidence interval for the mean of Y or a prediction interval for an individual Y)