QBA
What does Uy|x (mu of Y given x) stand for?
B0 + B1X, with no error term: the mean value of the dependent variable Y for a given value of the independent variable X.
What would a hypothesis test be? Would it involve the Y-int or the Slope?
It involves the slope: H0: B1 = 0 versus Ha: B1 does not equal 0. We would want to reject H0. That would mean there is evidence of some type of linear relationship between X and Y.
What is ANOVA?
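It is the analysis of variance. It analyzes why the dependent variable varies by splitting the total variation into the part explained by the regression and the part left unexplained (error).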
What is the mean for the random error E?
It is assumed to have a mean of zero.
What is the root mean squared error the same as?
It is the estimated standard deviation about the regression line (s).
What is the mean square for error?
MSE = SSE/dfE
What is the mean square for regression (MSR)?
MSR = SSR/dfR
For MLR, what is the degrees of freedom for SSE, SSR, and SST?
SSR: dfR = p = the number of independent variables. SSE: dfE = n - (p + 1), where p + 1 equals the number of parameters you have to estimate. SST: dfT = n - 1.
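The degrees of freedom can be checked mechanically. This is a minimal sketch, assuming n observations and p independent variables; the function name regression_df is made up for illustration.

```python
def regression_df(n, p):
    """Return (dfR, dfE, dfT) for n observations and p independent variables."""
    df_r = p              # one degree of freedom per slope
    df_e = n - (p + 1)    # p slopes plus the intercept are estimated
    df_t = n - 1
    return df_r, df_e, df_t


df_r, df_e, df_t = regression_df(n=20, p=3)
print(df_r, df_e, df_t)  # 3 16 19
```

Note that dfR + dfE always adds up to dfT, just as SSR + SSE adds up to SST.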
is r the same as R?
No, they are two different things. R is a residual; r is the correlation coefficient.
Define statistical relationship.
A relationship between X and Y that holds on average but doesn't have to be exact: the data points need not fall exactly on the line.
What is extrapolation?
When you take values of X that are outside the range of the data set and make predictions of Y from them. This is bad because the fitted relationship may not hold out there.
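A small sketch of an extrapolation check, assuming that "outside the data" means outside the range of the observed X values. The function name and the data are made up for illustration.

```python
def is_extrapolation(x_new, x_observed):
    """True when x_new lies outside the range of X values used to fit the line."""
    return x_new < min(x_observed) or x_new > max(x_observed)


x_observed = [1, 2, 3, 4, 5]
print(is_extrapolation(3.5, x_observed))  # False: not observed, but inside the range
print(is_extrapolation(9, x_observed))    # True: beyond the largest observed X
```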
How can we fix heteroscedasticity?
With weighted least squares: a way to correct for heteroscedasticity (non-constant variance) in a statistical model.
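A minimal sketch of weighted least squares for one independent variable. It assumes each observation gets a weight w, with less reliable (higher variance) points given smaller weights. The function name, the data, and the weights are all made up for illustration; how to choose the weights is not covered here.

```python
def weighted_fit(x, y, w):
    """Return (b0, b1) that minimize the sum of w * (Y - b0 - b1*X)^2."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total  # weighted mean of X
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total  # weighted mean of Y
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
equal = weighted_fit(x, y, [1, 1, 1, 1, 1])         # same as ordinary least squares
down = weighted_fit(x, y, [1, 1, 1, 0.25, 0.25])    # trust the last two points less
print(equal, down)
```

With equal weights the result matches the ordinary least squares line, so weighting only changes the fit when some points are trusted more than others.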
Which variable do we act like we know?
X
What is X? Is it a constant? What is another term for X?
X is the independent variable; it is treated as a constant. Another term is the predictor or explanatory variable.
What is the difference between Y and Yhat?
Y = the actually observed data value. Yhat = the predicted point on the line.
What is Y?
Y is the dependent variable; it is a random variable. It is also known as the response variable.
What is the formula for the regression line of estimated values?
Yhat = b0 + b1X (the estimated line has no error term E)
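A minimal sketch of how b0 and b1 come out of sample data, assuming the usual least squares formulas b1 = Sxy / Sxx and b0 = Ybar - b1 * Xbar. The data and the name fit_line are made up for illustration. The predicted values use only b0 and b1; no error term appears in Yhat.

```python
def fit_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]  # predicted points on the line
print(round(b0, 4), round(b1, 4))   # 2.2 0.6
```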
Can outliers mess up a data set?
YES.
Do the observations equal the sample size?
Yes, it is "n", which is the number of observations (not the number of variables).
Is it important to have a confidence interval for a slope? Why?
Yes, it is. If the interval contains 0, it is possible that the slope equals 0, and that is bad because it would mean there may be no linear relationship.
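A sketch of a 95% confidence interval for the slope, assuming the standard textbook form b1 +/- t * SE(b1) with SE(b1) = sqrRoot(MSE / Sxx). The data are made up, and the critical value 3.182 is read from a t table for n - 2 = 3 degrees of freedom rather than computed.

```python
import math

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the slope: sqrt(MSE / Sxx), with MSE = SSE / (n - 2)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(sse / (n - 2) / sxx)

t_crit = 3.182  # assumed t-table value: 95% confidence, 3 degrees of freedom
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
contains_zero = lower <= 0 <= upper

print(round(lower, 3), round(upper, 3), contains_zero)
```

For this tiny data set the interval runs from about -0.30 to 1.50, so it contains 0 and a slope of zero cannot be ruled out.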
Is the E term distributed?
Yes, E ~ N(0, sigma): normal with a mean of 0 and a standard deviation of sigma.
Is Yhat the same as Xbar for regression?
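Yes, in role only: Yhat is the statistic that estimates the mean Uy|x, just as Xbar is the statistic that estimates the population mean. The two are not equal in value.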
What is the formula for e?
e = Y - Yhat (this is also called a residual; it is the same thing)
What is the formula for the sum of squared errors?
Sum of e^2 = sum of (Y - Yhat)^2, added up over all observations.
What is the formula for Sum of Squared Errors? (SSE)
SSE = sum of ei^2 for i = 1 to n
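A short sketch of residuals and SSE, assuming the residual is e = Y - Yhat (not squared) and SSE is the sum of the squared residuals. The data are made up, and b0 = 2.2 and b1 = 0.6 are taken as the least squares estimates for that data.

```python
x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6    # assumed least squares estimates for this data

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # e = Y - Yhat
sse = sum(e ** 2 for e in residuals)               # SSE = sum of ei^2

print([round(e, 2) for e in residuals])
print(round(sse, 2))
```

Points above the line get positive residuals and points below get negative ones, and for a least squares line the residuals sum to zero.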
Should a pattern be detected on the residual plot to know that the assumption checks out?
no. You should not see a pattern.
Do we care much for the y-intercept?
nope.
Are independent variables a function of time?
nope. This would mean that they are dependent on each other.
Can we find EXACT values from the statistics?
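yes. A statistic is computed exactly from the sample data; it is the parameter it estimates that we cannot know exactly.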
Does the order matter for Degrees of Freedom for F distr.?
yes.
If the p value is less than alpha, do we reject?
yes.
Should the degrees of freedom always be greater than 0?
yes.
Should the number of independent variables and the sample size have an exponential relationship?
yes.
Will rho (p) and B1 always have the same sign?
yes.
X is the control variable?
yes.
is e an error term?
yes.
In SLR is Y a continuous random variable?
yes. But it is possible for Y to be DISCRETE.
Most likely the population slope is greater than 0?
yes. This is because of the p-value.
Is the notation Uy|x (mu of Y given x) a parameter?
yes. Yhat would be the statistic.
What numbers can r be between?
-1 <= r <= 1
What numbers can r^2 be between?
0<= r^2 <= 1
What are the four assumptions for regression?
1. The residuals are normally distributed (the e are distributed ~N(0, sigma)). 2. There exists a linear relationship between X and Y. 3. There is constant variance about the regression line (homoscedasticity). 4. The residuals are independent of one another.
What do we check assumption 1 with?
A normal quantile plot.
What do we check assumption 3 with?
A residual plot. We want to make sure it is homosc and not heterosc. Hetero might look like a funnel or bottleneck on the graph.
What is a regression analysis?
A statistical technique that uses observed data to relate the dependent variable to one or more independent variables.
What is the value of e if the data point is above or below line?
Above line = positive value. Below line = negative value.
What is F equal to?
F = MSR / MSE
What is the full equation and distribution for F?
F = MSR / MSE ~ F(dfR, dfE)
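A minimal sketch of the F statistic built from the ANOVA pieces, assuming MSR = SSR / dfR and MSE = SSE / dfE. The function name is made up, and the sums of squares (SSR = 3.6, SSE = 2.4) are from a made-up data set with n = 5 and p = 1.

```python
def f_statistic(ssr, sse, n, p):
    """Return (F, dfR, dfE) for n observations and p independent variables."""
    df_r = p
    df_e = n - (p + 1)
    msr = ssr / df_r  # mean square for regression
    mse = sse / df_e  # mean square for error
    return msr / mse, df_r, df_e


f, df_r, df_e = f_statistic(ssr=3.6, sse=2.4, n=5, p=1)
print(round(f, 3), df_r, df_e)  # 4.5 1 3
```

The statistic is then compared against the F(dfR, dfE) distribution, numerator degrees of freedom first.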
What is a test statistic used for?
For testing the null hypothesis.
What would an F-distribution hypothesis test be for regression?
H0: all population slopes in your model are zero. Ha: at least one slope does not equal zero. We want to reject H0.
What are the two degrees of freedom for F-distribution?
Numerator degree of freedom for regression, denominator degree of freedom for error. That order matters.
What are the classifications for B0 and B1?
Regression parameters
What is the simple linear regression formula? What is the multiregression formula?
SLR: Y = B0 + B1X + E. MLR: Y = B0 + B1X1 + B2X2 + ... + BpXp + E (multiple slopes).
What is the formula for SST (sum of squares total)?
SSE + SSR = SST
What is the formula for SSR, SSE, and SST? (all similar but different)
SSE: sum of (Y - Yhat)^2. SSR: sum of (Yhat - Ybar)^2. SST: sum of (Y - Ybar)^2.
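A sketch that computes all three sums of squares on a made-up data set and confirms that SSE + SSR = SST. It assumes each quantity is a sum over all observations and that the line is fit by least squares.

```python
x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
r_squared = ssr / sst

print(round(sse, 2), round(ssr, 2), round(sst, 2), round(r_squared, 2))
```

Here SSE = 2.4, SSR = 3.6, and SST = 6, so the regression explains 60% of the variation in Y.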
For SLR, what is the degrees of freedom for SSE, SSR, and SST?
SSR: dfR = 1. SSE: dfE = n - 2 (because of the two estimated parameters, the slope and the Y-int). SST: dfT = n - 1.
What is the sample standard deviation and sample variance for regression?
Sample standard deviation = s = sqrRoot(MSE) (root mean square error). Sample variance = s^2 = MSE (mean square error).
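A tiny worked example of s and s^2, assuming SSE = 2.4 from a made-up data set with n = 5 observations and p = 1 independent variable.

```python
import math

sse, n, p = 2.4, 5, 1       # made-up values for illustration
mse = sse / (n - (p + 1))   # sample variance for regression, s^2
s = math.sqrt(mse)          # root mean square error, s

print(round(mse, 4), round(s, 4))  # 0.8 0.8944
```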
What is the root mean square error? Do we want the value to be large or small?
Take the square root of the mean square error (MSE). It is like the sample standard deviation, for regression only. We want this value to be small, which means a better fit for the data on the line.
What is the purpose of regression?
The purpose is to build a regression model that can describe, predict, and control the dependent variable based on the independent variable.
What are the values for the x and y axis of the residual plot?
The Y axis is the residual e = Y - Yhat and the X axis is the predicted value Yhat (or the independent variable X).
Give the exact definition of the population slope.
The change in avg value of Y per UNIT change in X.
Is the dependent variable a random variable? What about the independent variable?
The dependent variable is; the independent variable is not.
The more spread out the data is....
The larger the error (the weaker the linear relationship is)
What is the mean of response? what does it do?
The mean of response is Ybar and it is the average of all Y values.
What is Y-hat?
The predicted value of the dependent variable, computed from the sample regression line. It is a statistic.
What is rho (P)?
The population correlation coefficient. The parameter for r
What is mu (u)?
The population mean (know the formula)
What is sigma^2?
The population variance (know the formula).
What is r? What does it do?
The sample correlation coefficient. Measures the strength of the linear relationship and states whether it is negative or positive.
What is x bar?
The sample mean (know the formula)
What is s^2?
The sample variance (know the formula)
What happens if Y is a discrete variable in regression? What is this called? How many outcomes?
There are two outcomes: Y can be either 0 or 1. This is called logistic regression.
What is the formula for MST?
There is no MST.
How do we check assumption 4?
Use a residual plot. We do NOT want to see a pattern; a pattern means that the residuals are dependent on one another.
Define simple linear regression.
Uses 1 independent variable to predict the outcome of 1 dependent variable.
What are we interested in looking for in SLR?
We are interested in the avg value of Y.
What can we explain and what can't we explain in the SLR formula?
We can't explain: the error term. We can explain: the population Y-int, the population slope, and the independent variable.
Can we explain SSE? how about SSR?
We cannot explain SSE but we can explain SSR
Do we know the exact values for B0 and B1? What must we do instead?
We do not know these exact values so we use sample data to estimate them. We use b0 and b1 as statistics.
Do we want B1 to be zero? why or why not?
We don't want B1 to be zero because that means there is no linear relationship between X and Y (the line is flat).
What is the F-distribution? explain what it is and what it does?
We use this for hypothesis testing. Always RIGHT-SKEWED distribution. Only starts at zero (F >= 0). It has two degrees of freedom.
Do we want homosc or heterosc?
We want homo.
What do we want r^2 to equal?
We want it close to 1.
What do we want r to be close to when looking at a graph? why?
We want r to be close to 1 or -1, meaning the data points are really close to the line. This means a STRONG linear relationship. We DO NOT want r to be close to 0. This means no linear relationship.
How do we fix a non-linear relationship between x and y on a residual plot? This would mean the graph is parabolic.
We would add a squared term (X^2) to the model, which would fix the non-linear relationship.
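A sketch of the fix, assuming that "squaring the term" means adding an X^2 term so the model becomes Yhat = b0 + b1X + b2X^2. The helper names and the data are made up; the fit solves the least squares normal equations with a small Gauss-Jordan routine so that nothing outside the standard library is needed.

```python
def solve(a, b):
    """Solve the linear system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(x, y):
    """Least squares fit of Yhat = b0 + b1*X + b2*X^2 via the normal equations."""
    cols = [[1.0] * len(x), [float(xi) for xi in x], [float(xi) ** 2 for xi in x]]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * yi for u, yi in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)


x = [-2, -1, 0, 1, 2]
y = [5, 2, 1, 2, 5]  # made-up parabolic data: exactly Y = 1 + X^2

# A straight line fit to this data is flat at Ybar, so its residuals are Y - Ybar
y_bar = sum(y) / len(y)
line_residuals = [yi - y_bar for yi in y]  # U-shaped pattern: 2, -1, -2, -1, 2

b0, b1, b2 = fit_quadratic(x, y)
quad_residuals = [yi - (b0 + b1 * xi + b2 * xi ** 2) for xi, yi in zip(x, y)]
print(line_residuals)
print([round(e, 6) for e in quad_residuals])
```

The straight-line residuals are positive at both ends and negative in the middle, which is the parabolic pattern on the residual plot; once the X^2 term is in the model the pattern disappears.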
When is the Y intercept (B0) relevant?
When the value X = 0 is within the range of the data set. Otherwise, interpreting B0 means extrapolating, which is not good.
What do we check assumption 2 with?
A residual plot. The points should look spread out evenly around zero, with no curve.
What are least squares estimates?
b0 and b1.
r and what other variable will always have the same sign?
b1
What are point estimates (aka least squares estimates) for regression?
b1 and b0
What is r^2?
coefficient of determination.
What is the Df equal to for SLR? How about MLR?
Regression df for SLR: always equal to 1 (p = 1 independent variable; this p is not the p-value). Regression df for MLR: equal to p, the number of independent variables.
What is the formula for r? Is r a statistic?
r is a statistic. The formula: r = sqrRoot(SSR / SST) if b1 is positive; r = -sqrRoot(SSR / SST) if b1 is negative (if the slope is negative).
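A sketch that computes r from SSR / SST with the sign of the slope, and checks it against the direct correlation formula Sxy / sqrRoot(Sxx * Syy). Both function names and the data are made up for illustration, and the shortcut through SSR / SST is assumed to apply to simple linear regression only.

```python
import math


def r_from_anova(ssr, sst, b1):
    """r = +sqrt(SSR/SST) when the slope is positive, -sqrt(SSR/SST) when negative."""
    r = math.sqrt(ssr / sst)
    return r if b1 >= 0 else -r


def r_direct(x, y):
    """Sample correlation coefficient computed straight from the data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]  # made-up data with SSR = 3.6, SST = 6, b1 = 0.6
y = [2, 4, 5, 4, 5]
r_pos = r_from_anova(ssr=3.6, sst=6.0, b1=0.6)
r_neg = r_from_anova(ssr=3.6, sst=6.0, b1=-0.6)  # same fit with the slope flipped
print(round(r_pos, 4), round(r_direct(x, y), 4), round(r_neg, 4))
```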
What is the difference between r and r^2?
r is used to determine linear strength and the sign of the slope. It is a number between -1 and 1. r^2 is what we use to determine how well the line fits the data. It is the percent of the variation in Y that is EXPLAINED by the regression equation.
What is r^2 adjusted? What does it do? what do you use it for?
r^2 adjusted penalizes you for every extra X added. It is only used in MULTIPLE linear regression.
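A sketch of the penalty, assuming the common textbook form adjusted r^2 = 1 - (1 - r^2)(n - 1) / (n - p - 1), with n observations and p independent variables. The function name and the numbers are made up for illustration.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted r^2 for n observations and p independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


# Same r^2 and sample size, but more X variables gives a lower adjusted value
few = adjusted_r_squared(0.6, n=30, p=2)
many = adjusted_r_squared(0.6, n=30, p=6)
print(round(few, 4), round(many, 4))
```

An extra X only raises adjusted r^2 if it improves r^2 by enough to offset the penalty.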
What are residuals? (R or e)
The vertical distance from the observed data points to the predicted points on a regression line.
The smaller the standard deviation for regression...
the more accurate the predictions (the data points sit closer to the line).
What does Uy (mu of Y) stand for?
the population avg of the variable Y.
Which is the most important assumption for regression?
the residuals are normally distributed -- E ~ N(0,sigma)
What is a statistic?
A number computed from sample data, used to estimate a parameter.