Stats Exam 3
What variables does Logistic Regression have?
-1 nominal (dependent): 2 values = binary, 3+ values = multinomial -1 measurement (independent)
What is the range of values for r?
-1 to 1
What variables does linear regression have?
-2 measurement OR -1 measurement, 1 ordinal/nominal
What variables does an ANCOVA have?
-2 measurement -1 nominal that divides the regressions into two or more sets
What variables does Correlation have?
-2 measurements (2 continuous, or 1 continuous & 1 dichotomous) -1 nominal that keeps the two measurements together in pairs
What variables does Multiple Linear Regression have?
-3+ measurements, one is the dependent variable and the rest are independent variables -Dependent variable is continuous.
What type of data violates the assumption of independence in correlation?
-Autocorrelation data (serial, spatial) -Autocorrelation can give you significant results much more often than 5% of the time (false positives).
Although multiple regression approaches may be appropriate for evaluating several independent variables against a dependent variable, you may consider other multivariate approaches such as...
-Canonical Correlation -Principal Components Analysis -Multidimensional Scaling
In regression, what are the two methods of estimating X from Y?
-Classical Estimation -Inverse Estimation
How do you do an ANCOVA?
-Compute each regression line and see if the slopes are significantly different. -If they are not significantly different, draw a regression line through each group of points, all with the same slope. This common slope is a weighted average of the slopes of the different groups. -Test the null that all of the Y intercepts of the regression lines with a common slope are the same. Because they are parallel, saying that they are significantly different at one point (Y intercept) means that the lines are different at any point.
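Rough Python sketch of those ANCOVA steps (assumes pandas and statsmodels; the column names y, x, group and the numbers are made up):

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    df = pd.DataFrame({
        "x": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        "y": [2.1, 3.9, 6.2, 8.1, 9.8, 4.9, 7.2, 8.8, 11.1, 13.0],
        "group": ["A"] * 5 + ["B"] * 5,
    })

    # Step 1: are the slopes significantly different? Test the x:group interaction.
    full = smf.ols("y ~ x * group", data=df).fit()    # separate slope for each group
    common = smf.ols("y ~ x + group", data=df).fit()  # one common (weighted-average) slope
    print(anova_lm(common, full))                     # P-value for the difference in slopes

    # Step 2: if the slopes are not significantly different, keep the common-slope
    # model and test whether the Y intercepts (the group effect) are all the same.
    pooled = smf.ols("y ~ x", data=df).fit()          # no group effect at all
    print(anova_lm(pooled, common))                   # P-value for the difference in intercepts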
What is Autocorrelation?
-Data in which the correlation between the values of the same variables is based on related objects. -Serial autocorrelation, spatial autocorrelation
For regression, if the assumption of linearity is violated, what can you do?
-Data transformation for J-shaped curve -Curvilinear regression for U or S-shaped curve
What is Adjusted r2?
-Estimates how well the model is expected to fit new data. -Despite its name, it is not a squared value. It is the r2 minus an adjustment factor based on the number of variables and the sample size. Can be negative.
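Quick worked check of that adjustment, assuming the usual formula adjusted r2 = 1 − (1 − r2)(n − 1)/(n − k − 1) with k independent variables and n observations:

    # Adjusted r2 sketch (assumed formula; k = number of X variables, n = sample size)
    def adjusted_r2(r2, n, k):
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.50, n=30, k=3))  # about 0.44, a bit below the raw r2
    print(adjusted_r2(0.10, n=10, k=5))  # negative: many variables, small sample, weak fit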
What are types of equations for Curvilinear Regression?
-Exponential -Power -Logarithmic -Trigonometric
What are the assumptions of Correlation?
-For any value of X, the Y values will be normally distributed and homoscedastic. -Linearity -Independence, random sample -No extreme outliers
What are the assumptions of linear regression?
-For any value of X, the Y values will be normally distributed and homoscedastic. -Linearity -Independence, random sample -No extreme outliers
When your data follows a curved line, what are the three options?
-If you only care about the association, not the line, use the P-value. -Data transformation -Curvilinear regression
What are the assumptions of Logistic Regression?
-Independence -The relationship between the natural log of the odds ratio and the measurement variable is linear. -Does NOT assume that the measurement variable is normally distributed or homoscedastic.
What are other nonparametric correlation coefficients when assumptions are violated? (probz don't need to know these but HERE YA GO)
-Kendall's tau-b/c: small sample size -Gamma: assumes ordinal variables -Somers' d: asymmetric, identify one variable as dependent -Phi / Cramér's V: nominal/categorical data
How do logistic regression and linear regression differ in terms of obtaining the equation?
-Logistic regression uses maximum-likelihood method. -Linear regression uses least-squares method.
What are characteristics of the best fit line?
-Minimizes the sum of the squared vertical distances between the data points and the line: if you square the vertical distance of each data point from the line and then sum these values, the result is smaller than the value you would get with any other line. -The line goes through the mean of X and the mean of Y.
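Small numpy check of those properties (made-up numbers; any other line gives a bigger sum of squared vertical distances):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

    b1, b0 = np.polyfit(x, y, 1)  # least-squares slope and intercept
    print(np.isclose(b0 + b1 * x.mean(), y.mean()))  # True: line passes through (mean X, mean Y)

    def sse(slope, intercept):
        # sum of squared vertical distances from the points to a candidate line
        return np.sum((y - (intercept + slope * x)) ** 2)

    print(sse(b1, b0) < sse(b1 + 0.1, b0))  # True: nudging the slope makes the fit worse
    print(sse(b1, b0) < sse(b1, b0 + 0.1))  # True: nudging the intercept makes the fit worse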
What are the assumptions of ANCOVA?
-Normality & homoscedasticity of Y for each value of X. -Independence
What are the assumptions of Multiple Linear Regression?
-Normally distributed -Homoscedastic -Linearity -X variables are not collinear or multicollinear
What are two disadvantages of Reduced Major Axis Regression?
-Not used for predicting values. -Cannot test the null that the slope of the reduced major axis line is zero.
What are the hypotheses for linear regression?
-Null: one variable does not significantly predict values of the other; slope of best fit line is 0. -Alternative: one variable significantly predicts values of the other.
What are the hypotheses for correlation?
-Null: r = 0, no statistically significant relationship. -Alternative: r does not = 0, statistically significant relationship.
What are the two null hypotheses of curvilinear regression?
-Null: there is no relationship between the X and Y variables. -Null: the increase in r2 is only as large as you would expect by chance (when comparing the lines from the simpler equation and the more complicated equation).
What is another name for the Correlation Coefficient (r)?
-Pearson Product-Moment Correlation Coefficient -Pearson's r
How is Pearson's Correlation Coefficient expressed for populations versus samples?
-Population parameter ρ -Sample statistic r
What is the benefit of Spearman's Rank Order Correlation?
-Reduces the influence of extreme values. -You can also do a data transformation, or analyze the data without the extreme values.
When there is a significant regression or correlation, X values with higher mean Y values will often have higher standard deviations of Y as well. This is because...
-Standard deviation is often a constant proportion of the mean. -When this happens, you can make the data homoscedastic with a log transformation of the Y variable.
What does the result of regression sum of squares/total sum of squares mean?
-The X variable "explains" __% of the variation in the Y variable. -This is the definition of r2.
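Short numpy illustration of that ratio (hypothetical data), showing that regression SS / total SS comes out equal to r2 and that the pieces add up to the total:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([3.0, 4.5, 4.0, 6.1, 7.0, 7.4])

    res = stats.linregress(x, y)
    y_hat = res.intercept + res.slope * x

    ss_total = np.sum((y - y.mean()) ** 2)           # squared deviates from the mean (horizontal line)
    ss_regression = np.sum((y_hat - y.mean()) ** 2)  # improvement of the model over the mean
    ss_residual = np.sum((y - y_hat) ** 2)           # leftover scatter around the regression line

    print(ss_regression / ss_total)                  # fraction of variation "explained"
    print(res.rvalue ** 2)                           # same number: r2
    print(np.isclose(ss_total, ss_regression + ss_residual))  # True: the sums of squares add up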
What is the Coefficient of Determination (r2)?
-The proportion of the variation in the Y variable that is "explained" by the variation in the X variable. -Can be thought of as "goodness of fit." -Expresses the strength of the relationship between the X and Y variables.
What are the two goals of Logistic Regression?
-To see whether the probability of getting a particular value of the nominal variable is associated with the measurement variable. -To predict the probability of getting a particular value of the nominal variable given the measurement variable.
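Minimal sketch of both goals, assuming statsmodels and invented column names passed (0/1 nominal) and dose (measurement):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "dose": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "passed": [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],  # made-up 0/1 nominal outcome
    })

    model = smf.logit("passed ~ dose", data=df).fit()    # maximum-likelihood fit
    print(model.pvalues["dose"])                         # goal 1: is the probability associated with dose?
    print(model.predict(pd.DataFrame({"dose": [5.5]})))  # goal 2: predicted probability at dose = 5.5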
What are the assumptions of curvilinear regression?
-Y is normally distributed and homoscedastic for each value of X. -Independence -Curvilinearity
What is the range of values for r2?
0 to 1
What is the breakdown of the range of values for r? (small, medium, large)
-0: no relationship -.10 to .29: small relationship -.30 to .49: medium relationship -.50 to 1.0: large relationship
What variables does Curvilinear Regression have?
2 measurement
How would you interpret a correlation coefficient reported with a confidence interval?
Assuming the data were randomly sampled from a larger population, there is a 95% chance that this range includes the population correlation coefficient.
In regression, how should you interpret r2?
Because you chose the X-values before you did the experiment, you should NOT interpret it as an estimate of something general about the population you've observed.
In regression, what would you do if you did know X and were told to guess Y?
Calculate the regression equation and use it.
Multiple Logistic Regression controls for...
Confounding variables. Since you are seeing how multiple independent variables affect the dependent variable, you can enter a confounding variable as one of the independent variables and examine its effect.
What is interpolation?
Constructing new data points within the range of a discrete set of known data points.
If you are only interested in whether two variables covary, and you are not trying to infer a cause-and-effect relationship, should you use correlation or regression?
Correlation
How do you use nominal variables in Multiple Linear Regression?
Create a dummy variable where one value of the nominal variable is coded 0 and the other is coded 1, and treat that variable as if it were a measurement variable (do this for independent variables only).
What does Reduced Major Axis Regression do?
Describes the relationship between two variables with a symmetrical relationship. It gives a line that is intermediate in slope between the least-squares regression line with X as the independent variable and with Y as the independent variable.
In selecting variables, what is Forward Selection?
Do a linear regression for each X variable, one at a time, then pick the X variable that had the highest r2. Do a multiple regression with that X variable and each of the other X variables. Add the X variable that increases r2 by the greatest amount, if the P-value of the increase in r2 is below the desired cutoff (P-to-enter). Continue adding X variables until adding one does not significantly increase r2.
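Rough sketch of that loop with statsmodels (hypothetical predictors x1-x3 and made-up data; the P-value of each increase in r2 comes from comparing the nested models):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(40, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 2 * df["x1"] + 0.5 * df["x3"] + rng.normal(size=40)  # fake response

    selected, remaining, p_to_enter = [], ["x1", "x2", "x3"], 0.05
    while remaining:
        base_formula = "y ~ " + (" + ".join(selected) if selected else "1")
        base = smf.ols(base_formula, data=df).fit()
        trials = []
        for x in remaining:
            candidate = smf.ols(base_formula + " + " + x, data=df).fit()
            p_gain = anova_lm(base, candidate)["Pr(>F)"].iloc[1]  # P-value of the increase in r2
            trials.append((candidate.rsquared, p_gain, x))
        best_r2, p_gain, best_x = max(trials)  # candidate with the biggest r2 gain
        if p_gain > p_to_enter:
            break                              # no remaining X significantly improves the fit
        selected.append(best_x)
        remaining.remove(best_x)

    print(selected)  # e.g. ['x1', 'x3']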
How do you do Inverse Estimation?
Do linear regression with Y as the independent variable and X as the dependent variable, also known as regressing X on Y. This is usually more accurate.
How do you do Classical Estimation?
Do the usual regression with X as the independent variable and Y as the dependent variable, then rearrange the equation to solve for X.
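Quick numpy/scipy comparison of the two approaches (made-up data; estimating the X that goes with an observed Y of 7):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.2, 3.1, 4.8, 5.9, 7.2, 8.1])
    y_new = 7.0

    # Classical estimation: regress Y on X, then rearrange the equation to solve for X.
    cls = stats.linregress(x, y)
    x_classical = (y_new - cls.intercept) / cls.slope

    # Inverse estimation: regress X on Y and predict X directly.
    inv = stats.linregress(y, x)
    x_inverse = inv.intercept + inv.slope * y_new

    print(x_classical, x_inverse)  # nearly identical here because r2 is high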
What is the other null hypothesis of Multiple Linear Regression?
Each additional X variable does not improve the fit of the multiple regression equation any more than expected by chance.
The absolute value of r is...
Effect Size
What is ε?
Error
What does Bivariate Linear Regression Analysis do?
Examines the ability of the predictor variable to predict the criterion variable. (Just regular linear regression tbh)
What is extrapolation?
Extension of a graph or range of values by inferring unknown values from trends in the known data.
What does Linear Regression do?
Finds the line that best fits the data points; tests to see if there is an association between two variables.
How do you do Spearman's Rank Order Correlation?
First, rank the scores (both X and Y) with the lowest score getting a rank of 1, and so on. Sum up the ranks of X, the ranks of Y, and the products of the paired ranks (XY). Then plug those sums into that long-ass equation.
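Small sketch of the rank-then-correlate idea, checked against scipy's built-in version (the long formula works out to Pearson's r computed on the ranks):

    import numpy as np
    from scipy import stats

    x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])  # extreme value in X
    y = np.array([1.2, 2.1, 2.9, 4.5, 5.0])

    rank_x = stats.rankdata(x)  # lowest score gets rank 1
    rank_y = stats.rankdata(y)

    print(stats.pearsonr(rank_x, rank_y)[0])  # Pearson's r on the ranks
    print(stats.spearmanr(x, y)[0])           # same value from the built-in Spearman function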
In regression, do you fit a model to the data, or fit the data to the model?
Fit a model (regression line) to the data.
What is the slope of the line produced by Reduced Major Axis Regression?
Geometric mean of the slopes of the two least-squares regression lines.
Every time you add a variable to a multiple regression, what happens to r2?
Increases
Why can't you use P-value to determine strength of association?
It is a function of both the r2 and the sample size.
The regression test statistic gets _____ as degrees of freedom or r2 gets _____.
Larger, larger
All linear regression is contingent upon...
Lower and upper bounds; regression is between lowest and highest X values and does not have much meaning outside of those parameters (can't do extrapolation).
The goal of Multiple Linear Regression is to explain as much of the variance via the independent variables as is...
Meaningful
What does Spearman's Rank-Order Correlation do?
Measures the relationship between two ordinal variables, or between two variables that are related but not linearly (can do this with all variables except nominal/categorical).
What is Correlation used for?
Measuring the strength of the relationship with r2; seeing whether, as one variable increases, the other tends to increase or decrease.
In Ordinary Least-Squares Regression (Method of Least Squared Error), the best-fit line is the line that...
Minimizes the squared vertical distances between the data points and the regression line.
What does it mean when two variables have a symmetrical relationship?
Neither is the independent variable.
How many variables do you determine ahead of time in Correlation?
None, both are naturally variable and are both measured.
Regression and correlation are not sensitive to deviations from what assumption?
Normality
In Correlation, what happens if you switch which variable you call X and which variable you call Y?
Nothing! P-value and r2 are not affected by which variables are assigned to X and Y.
What is the null hypothesis for Logistic Regression?
Null: probability of a particular value of the nominal variable is not associated with the value of the measurement variable; the regression line has a slope of zero.
What is the basic explanation of how to do a Multiple Linear Regression?
Once the independent variable with the biggest effect on the variance of the dependent variable has been identified, it is "frozen" so the other independent variables can be examined to find the one that explains the next largest amount of variance, and so on through the remaining independent variables.
Simple Logistic Regression has _____ independent variable. Multiple Logistic Regression has _____ independent variable.
One, more than one
In regression, how many variables do you determine ahead of time?
One, you choose the X-values.
What is the basic template for the equation for the general linear model of regression?
Outcome = (Model) + error
How can r2 be expressed?
Percentage, because it is the fraction of the variance shared between the two variables.
Which one fits a line to the data, correlation or regression or both?
Regression
If the model results in a better prediction than the mean, then...
Regression sum of squares will be greater than residual sum of squares.
Adjusted Means gives you a better idea of the _____ of the difference.
Relative Size
Multiple linear regression is about replication in the identification of the most important independent variables. It is often criticized because...
Replication also replicates error.
What criteria must be met in order for regression to imply causation?
Set the values of the independent variable and then control or randomize all of the possible confounding variables.
What is β1 or b1?
Slope
If r2 is high, the difference between classical estimation and inverse estimation is...
Small
What is a nonparametric correlation?
Spearman Rank-Order Correlation (Spearman's rho)
In selecting variables, what is Backward Elimination?
Start with multiple regression using all of the X variables, then perform multiple regressions with each X variable removed in turn. Eliminate the X variable whose removal causes the smallest decrease in r2, if the P-value is greater than the P-to-leave. Continue removing X variables until removal of any X variable would cause a significant decrease in r2.
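Rough sketch of backward elimination, the mirror image of the forward-selection loop (same kind of made-up x1-x3 data; drop the X whose removal has the highest P-value until everything left is significant):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(40, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 2 * df["x1"] + 0.5 * df["x3"] + rng.normal(size=40)  # fake response

    kept, p_to_leave = ["x1", "x2", "x3"], 0.05
    while kept:
        full = smf.ols("y ~ " + " + ".join(kept), data=df).fit()
        trials = []
        for x in kept:
            others = [v for v in kept if v != x]
            reduced = smf.ols("y ~ " + (" + ".join(others) if others else "1"), data=df).fit()
            p_drop = anova_lm(reduced, full)["Pr(>F)"].iloc[1]  # P-value of the drop in r2
            trials.append((p_drop, x))
        p_drop, worst_x = max(trials)  # variable whose removal hurts the fit the least
        if p_drop <= p_to_leave:
            break                      # removing anything else would significantly lower r2
        kept.remove(worst_x)

    print(kept)  # e.g. ['x1', 'x3']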
What does a high F-statistic indicate?
That the model accounts for much more of the variability than is left as residual error; the regression line fits the data much better than the mean alone.
If the results of correlation show that there is an association, what can you infer?
That variation in X may cause variation in Y, or variation in Y may cause variation in X, or variation in some other factor may affect both Y and X.
If you accept the first null of ANCOVA, then you test the second hypothesis. What is the second null?
The Y intercepts of the regression lines (a) are all the same; also defined to be that the adjusted means (least-squares means) of the groups are the same.
Multiple Linear Regression is not sensitive to deviations from normality if...
The data set does not contain extreme outliers.
Logistic Regression is analogous to Linear Regression except...
The dependent variable is nominal rather than measurement.
How do you do Ordinary Least-Squares Regression?
The difference between Y1 and Ŷ1 (the predicted value of Y at X1) is calculated, then squared. This squared deviate is calculated for each data point, and the sum of these squared deviates measures how well a line fits the data. The regression line is the one for which this sum of squared deviates is smallest.
In regression, what is the regression sum of squares?
The difference in variability between the model (regression line) and the mean (horizontal line).
As opposed to r2, R2 is used for what tests?
The non-linear tests, curvilinear and multiple logistic regression.
What is Adjusted Means?
The predicted Y variable for that group, at the mean X variable for all the groups combined. Because the lines you use for estimating the adjusted mean are parallel, the difference in adjusted means is equal to the difference in Y intercepts.
What does the term "model" refer to?
The regression line determined by the equation
What does a Semi-Partial Correlation measure?
The relationship between two variables after removing the effect of a third variable from just one of the two variables.
What does a Partial Correlation measure?
The relationship between two variables after removing the overlap of a third variable from both variables.
What does a Bivariate (Zero Order) Correlation measure?
The relationship between two variables while ignoring all other variables. (Just a regular correlation tbh)
Do Correlation & Regression give the same or different P-Values?
The same, DUH.
What is the first null that you test in ANCOVA?
The slopes of the regression lines (b) are all equal; the regression lines are parallel to each other.
In regression, what does r2 depend on?
The values of the independent variable that you chose.
In regression, what is the residual sum of squares (SSE)?
The sum of the squared vertical distances of the points from the model (regression line).
What is the null hypothesis of Multiple Linear Regression?
There is no relationship between the X variables and the Y variable.
What is the purpose of Multiple Linear Regression (Stepwise)?
To find an equation that best predicts the Y variable as a linear function of the X variables; see which independent variables have a major effect on the dependent variable.
Adding regression sum of squares and residual sum of squares gives you...
Total sum of squares
In regression, what is the total sum of squares?
Total variability between the data points and the mean (horizontal line); found by adding up the squared deviates of each point from the mean.
If you are mainly interested in using the P-Value for hypothesis testing to see whether there is a relationship between the two variables, should you do a regression or correlation?
Trick question! It doesn't matter which one you do because they will both give you the exact same P-Value!
Although comparing two regression lines is the most common, what can you do if you want to compare three or more regression lines?
Tukey-Kramer Test: post-hoc for ANCOVA with 3+ regression lines; compares 2 lines at a time to see which ones significantly differ in Y-intercepts.
When do you use ANCOVA?
Use when you want to compare two or more regression lines to each other; will tell you whether the regression lines are different from each other in either slope or intercept.
ANCOVA is a way of comparing the Y variable among groups while statistically controlling for ...
Variation in Y caused by variation in the X variable.
What does Collinearity mean?
When one independent variable is essentially the same as another.
When do you use Proportional Hazards Regression? (from the book, probz don't need to know)
When the outcome is survival time.
What does Multicollinear mean?
When two independent variables are highly correlated with each other.
When do you use Curvilinear Regression?
When you have graphed two measurement variables and you want to fit an equation for a curved line to the points on the graph (best fit curve).
When do you use Logistic Regression?
When you want to know whether variation in the measurement variable (independent) causes variation in the nominal variable (dependent).
What is the equation used for linear regression?
Y = a + bX (a = Y intercept, b = slope)
What is the equation for the general linear model of regression using population parameters?
Y = β0 + β1X + ε
What is the equation for the general linear model of regression using sample statistics?
Y' = b0 + b1X + error
What is β0 or b0?
Y-intercept
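Tiny example of getting b0 and b1 from sample data (made-up numbers, scipy):

    from scipy import stats

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 4.1, 5.9, 8.2, 9.8]

    res = stats.linregress(x, y)
    print(res.intercept)  # b0, the Y intercept
    print(res.slope)      # b1, the slope
    # predicted values: Y' = b0 + b1 * X; whatever is left over (Y - Y') is the error term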
In terms of measurement of the variables, how does Logistic Regression differ from Linear Regression?
You do not measure Y directly, it is instead the probability of obtaining a particular value of a nominal variable.
In regression, what would you do if you didn't know anything about X and were told to guess Y?
Your best guess would be the mean Y value.
For k values of the nominal variable, you create _____ dummy variables.
k - 1
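For example, with pandas (hypothetical nominal column "color" with k = 3 values):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})
    dummies = pd.get_dummies(df["color"], drop_first=True)  # k - 1 = 2 dummy columns
    print(dummies)  # one value of "color" becomes the baseline (all zeros)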
What is the degrees of freedom for correlation and regression?
n - 2
What is the main difference between r and r2?
r can be positive or negative, so it shows the direction of the relationship (the sign of the slope); r2 is always positive and only shows strength.
What happens to r2 as the range of X-Values gets narrower?
r2 gets smaller.
What is the test statistic for regression?
√(d.f. × r2) / √(1 − r2)
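Quick numeric check of that statistic (hypothetical data); it matches the absolute value of the t statistic for the slope, with n − 2 degrees of freedom:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 3.5, 3.9, 6.2, 6.8, 8.1])

    res = stats.linregress(x, y)
    d_f = len(x) - 2
    r2 = res.rvalue ** 2

    print(np.sqrt(d_f * r2) / np.sqrt(1 - r2))  # the flash-card formula
    print(abs(res.slope / res.stderr))          # t statistic for the slope: same value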