INST 314 Final Exam


How alternate measures of error are related to model fit/predictive performance (RSE, RMSE, MAE):

" In general, when interpreting any of these error measures, smaller is better. We typically use these to compare two models, therefore look at the values of our error measurement for each model and compare".

Interpret a beta coefficient for a categorical predictor?

"Let's say that x describes gender and can take values ('male', 'female'). Now let's convert it into a dummy variable which takes values 0 for males and 1 for females. Interpretation: average y is higher by 5 units for females than for males, all other variables held constant."

Graphically detect violation of assumptions (post-hoc plots), namely looking at normality of residuals, homoscedasticity (constant variance), influential outliers, and independence of errors

"Normal Q-Q Plot" provides a graphical way to determine the level of normality.

Consequences of Multicollinearity:

"Statistical consequences" of multicollinearity include difficulties in testing individual regression coefficients due to inflated standard errors. Thus, you may be unable to declare an X variable significant even though (by itself) it has a strong relationship with Y. "Numerical consequences" of multicollinearity include difficulties in the computer's calculations due to numerical instability. In extreme cases, the computer may try to divide by zero and thus fail to complete the analysis. Or, even worse, the computer may complete the analysis but then report meaningless, wildly incorrect numbers.

Population Mean:

μ = Σx (sum of all data values) / N (population size)

Assumption #5: Normal Distribution of error terms:

- If the error terms are non-normally distributed, confidence intervals may become too wide or too narrow. - Once the confidence intervals become unstable, it becomes difficult to estimate coefficients based on minimization of least squares. A non-normal distribution of errors suggests that there are a few unusual data points which must be studied closely to make a better model. - How to check: you can look at a Q-Q plot of the residuals. You can also perform statistical tests of normality such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.
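
A minimal R sketch of these checks (model and data are hypothetical; shapiro.test() and qqnorm() are base R):

# Hypothetical data and model
set.seed(2)
df <- data.frame(x = rnorm(100))
df$y <- 2 + 3 * df$x + rnorm(100)
fit <- lm(y ~ x, data = df)

res <- residuals(fit)
qqnorm(res); qqline(res)   # points should track the reference line if errors are normal
shapiro.test(res)          # Shapiro-Wilk test: large p-value gives no evidence of non-normality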

Assumption #4: Heteroskedasticity:

- The presence of non-constant variance in the error terms is heteroskedasticity. Generally, non-constant variance arises in the presence of outliers or extreme leverage values; these values get too much weight and thereby disproportionately influence the model's performance. - When this phenomenon occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow. - How to check: you can look at the residuals vs. fitted values plot. If heteroscedasticity exists, the plot will exhibit a funnel-shaped pattern. Also, you can use the Breusch-Pagan / Cook-Weisberg test or White's general test to detect this phenomenon.
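
A minimal R sketch of both checks, with deliberately heteroskedastic hypothetical data; bptest() from the lmtest package is one common Breusch-Pagan implementation:

# Hypothetical data where the error variance grows with x
set.seed(3)
df <- data.frame(x = runif(100, 1, 10))
df$y <- 2 + 3 * df$x + rnorm(100, sd = df$x)
fit <- lm(y ~ x, data = df)

plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")  # look for a funnel shape
abline(h = 0, lty = 2)

library(lmtest)   # install.packages("lmtest") if needed
bptest(fit)       # Breusch-Pagan: a small p-value suggests non-constant variance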

Assumption #3: Multicollinearity:

- This phenomenon exists when the independent variables are found to be moderately or highly correlated. In a model with correlated variables, it becomes a tough task to figure out the true relationship of a predictor with the response variable - in other words, it becomes difficult to find out which variable is actually contributing to predicting the response variable. - Another point: in the presence of correlated predictors, the standard errors tend to increase, and with large standard errors the confidence intervals become wider, leading to less precise estimates of the slope parameters. - Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are included in the model. If this happens, you'll end up with an incorrect conclusion that a variable strongly / weakly affects the target variable, since even dropping one correlated variable from the model would change the estimated regression coefficients of the others. That's not good! - How to check: you can use a scatter plot to visualize the correlation among variables. You can also use the VIF: a VIF <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Above all, a correlation table should also serve the purpose.
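
A minimal R sketch of these checks, with two deliberately correlated hypothetical predictors; vif() from the car package is one common way to get VIFs:

set.seed(4)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # x2 is nearly a copy of x1
y  <- 1 + 2 * x1 + rnorm(100)
fit <- lm(y ~ x1 + x2)

cor(cbind(x1, x2))     # correlation table
pairs(cbind(x1, x2))   # scatterplot of the predictors
library(car)           # install.packages("car") if needed
vif(fit)               # VIF >= 10 signals serious multicollinearity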

Assumptions to violate:

1. Assumption of Linearity 2. Assumption of No Multicollinearity 3. Assumption of Independence/No Autocorrelation 4. Assumption of Homoscedasticity (no heteroscedasticity) 5. Assumption of Normality of Errors

MS(between/within):

= SS(between/within)/df(between/within)

df(between/within/total):

df(between) = k - 1 (number of groups minus 1); df(within) = n - k (sample size minus number of groups); df(total) = n - 1 (sample size minus 1)

Interpret ANOVA as a comparison of group means

ANOVA involves breaking down the sum of squares into two pieces- the distance between the grand mean and the group mean (between/group variance) and the distance between the group mean and the individual observations (within/residual variance)

Omnibus Test:

An omnibus test of the overall model tells us if the IV(s) are significant predictors of the DV. It compares a null model with no IVs to the alternative (fitted) model and tests whether the R-squared is significantly different from 0. Ex: "The F-test indicates that our model fits the data well - it's better than a null model."

Understand the normality of residuals (assumption)

Assumptions of ANOVA: - DV is numeric (interval or ratio) - No extreme outliers - Normality of residuals (one of the assumptions) - Homogeneity of Variance - Independence of Observations (random selection, different samples) - Group sample sizes are approximately equal

Being able to interpret one coefficient in the presence of other IVs (multiple regression):

B_0 = the predicted value of the DV when all of our IVs (predictors) are zero. B_1 (or B_2 or B_3) = the difference in the predicted value of Y for each one-unit difference in that X (e.g., X2), holding the other predictors (e.g., X1) constant.

OVERALL THINGS

Being able to read R code and R output. Add functions and elements used in lab codes. Understanding power: What does it mean? How does it relate to Type II error? What is the potential issue with having low power? ~ Power is the probability of rejecting the null hypothesis when, in fact, it is false - the probability of avoiding a Type II error.

Bonferroni (safest post hoc option)/Interpret results

The Bonferroni adjustment is simply an adjustment to alpha to account for multiple comparisons. It sets the alpha for each test to the overall desired alpha divided by the number of comparisons (α/c). - Bonferroni is more conservative than Tukey HSD, but may be preferable when only a portion of pairs are of interest. Interpret results: to declare an effect significant, use Bonferroni alpha = alpha / number of tests.
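
A minimal R sketch using base R's pairwise.t.test() with hypothetical 3-group data; with 3 groups there are 3 comparisons, so each pair is effectively tested at 0.05 / 3 ≈ 0.0167 (R reports the equivalent adjusted p-values instead):

set.seed(5)
group <- factor(rep(c("A", "B", "C"), each = 20))
y <- c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 12.5))

pairwise.t.test(y, group, p.adjust.method = "bonferroni")  # compare adjusted p-values to 0.05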

The differences (pros/cons) between Tukey HSD and Bonferroni and interpret the output of either tests to determine significant pairs

Bonferroni is more conservative than Tukey but is preferred when only a portion of pairs are of interest. "We can compare our means pairwise through a 'post hoc' analysis. These methods use some adjustment so that the overall Type I error rate of ALL of the comparisons equals our chosen alpha (likely 0.05)." For each pairing - Null: the two group means are equal (no difference); Alternative: the two group means are not equal.

(Unsure)Understand when we can conduct an F-test to compare two models (nested)

Comparing models to see if one model is a significant improvement over a previous model. We use an F-test to compare nested models, one with k parameters (reduced) and another one with k + p parameters (complete or full). - Hypotheses: H0: β(k+1) = β(k+2) = ... = β(k+p) = 0 versus Ha: at least one β ≠ 0. - Test statistic: F = [(SSE_R - SSE_C) / p] / [SSE_C / (n - (k + p + 1))], where p is the number of additional β's. - At level α, we compare the F-statistic to F(ν1, ν2) from the table, where ν1 = p and ν2 = n - (k + p + 1). - If F ≥ F(α, ν1, ν2), reject H0 (null).
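
A minimal R sketch of a nested-model F-test; anova() on two lm() fits performs exactly this comparison (data and variable names are hypothetical):

set.seed(6)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 1 + 2 * df$x1 + 0.5 * df$x2 + rnorm(100)

reduced <- lm(y ~ x1, data = df)            # k predictors
full    <- lm(y ~ x1 + x2 + x3, data = df)  # k + p predictors
anova(reduced, full)   # F-test: does adding x2 and x3 significantly reduce the SSE?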

Central limit theorem(CLT)

Deviations from normality (especially skewness): The normal distribution is a probability function that describes how the values of a variable are distributed. It is described by the mean, where the peak of the density occurs, and the standard deviation, which indicates the spread or girth of the bell curve. The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (also known as a "bell curve") as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population distribution's shape.

Evaluation: Rsquared/Adjusted Rsquared

Evaluation of the R-Square: When the regression is conducted, an R2 statistic (coefficient of determination) is computed. The R2 can be interpreted as the percent of variance in the outcome variable that is explained by the set of predictor variables. Evaluation of the Adjusted R-Square: The adjusted R2 value is a version of R2 that is adjusted based on the number of predictors in the model.

Calculating RSS or SSE of Prediction:

Example: Consider the data X = 1, 2, 3, 4 and Y = 4, 5, 6, 7, with constant values α = 1 and β = 2. Find the residual sum of squares (RSS). Solution: substitute the given values into the formula RSS = Σ(y_i - (α + βx_i))^2: RSS = (4 - (1 + 2·1))^2 + (5 - (1 + 2·2))^2 + (6 - (1 + 2·3))^2 + (7 - (1 + 2·4))^2 = (1)^2 + (0)^2 + (-1)^2 + (-2)^2 = 6
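
The same calculation as a minimal R sketch (values taken from the example above):

x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
alpha <- 1
beta  <- 2
rss <- sum((y - (alpha + beta * x))^2)
rss   # 6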

interpret an r-value to discuss the correlation - strength, direction, does the correlation exist?

Example: r = -.823. The correlation is strong, linear, and negative; a correlation is present.

Given enough information filled in, complete an ANOVA table (need to know how df, SS and MSS, and F are calculated) you will not have to calculate p-value

F-value is calculated as MS(between) / MS(within). Interpret: if the p-value is smaller than .05, the model is significantly better than a null model.
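
A minimal sketch of completing an ANOVA table by hand in R, using made-up numbers (k groups, n observations, and the two SS values assumed given):

k <- 3; n <- 30
ss_between <- 80;  ss_within <- 270
df_between <- k - 1                     # 2
df_within  <- n - k                     # 27
ms_between <- ss_between / df_between   # 40
ms_within  <- ss_within  / df_within    # 10
f_value    <- ms_between / ms_within    # 4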

How to interpret the F-value and related p-value? What it means (and does not mean) about the difference in means among the groups?

F-value: the ratio of between-group to within-group variance. P-value: - A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis (at least one group mean is statistically different - it does not tell you which pairs differ). - A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

Assess the potential correlation between two variables based on a scatterplot (is there a correlation, roughly how large does it appear to be, is it positive or negative?)

Correlation coefficients go from -1 to 1, with 0 meaning no correlation at all. How can we determine the strength of association based on the Pearson correlation coefficient? The stronger the association between the two variables, the closer the Pearson correlation coefficient, r, will be to either +1 or -1, depending on whether the relationship is positive or negative, respectively. Achieving a value of +1 or -1 means that all your data points lie on the line of best fit - there are no data points that show any variation away from this line. Values of r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the value of r is to 0, the greater the variation around the line of best fit. Rough guidelines (strength of association / coefficient r, positive / negative): small, .1 to .3 / -.1 to -.3; medium, .3 to .5 / -.3 to -.5; large, .5 to 1 / -.5 to -1.

Understand the Assumption of homogeneity of variance

Homogeneity of variance is the assumption that the group variances are equal (check the Residuals vs. Fitted plot). H0: the variances in the groups are equal. HA: the variances in the groups are not equal.

What is homoscedasticity?

Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line. It is required that your data show homoscedasticity for you to run a Pearson product-moment correlation. (Homoscedasticity vs Heteroscedasticity): ~ homo: values are evenly distributed along the line of regression hetero: values are not evenly distributed along the line of regression

Interpret the statistical significance of a Beta coefficient

If the beta coefficient is not statistically significant (i.e., the t-value is not significant), the variable does not significantly predict the outcome. If the beta coefficient is significant, examine the sign of the beta. If the beta coefficient is positive, the interpretation is that for every 1-unit increase in the predictor variable, the outcome variable will increase by the beta coefficient value. If the beta coefficient is negative, the interpretation is that for every 1-unit increase in the predictor variable, the outcome variable will decrease by the beta coefficient value.

How to detect normality of residuals on a QQ plot?

If the data were normally distributed, most of the points would fall on the line - compare our residual points to the diagonal (dashed) reference line. Instead of worrying about whether our DV is normally distributed, we worry about whether our residuals are normally distributed. This means we can't test this assumption until AFTER we run our ANOVA analysis. Inspect normality visually using a QQ plot of the residuals: the residuals are approximately normal if the points stay close to the dashed line, with only slight deviation.

Common misconceptions for the coefficient of determination(r^2):

If your correlation coefficient has been determined to be statistically significant, this does not mean that you have a strong association. It simply tests the null hypothesis that there is no relationship. By rejecting the null hypothesis, you accept the alternative hypothesis that states that there is a relationship, but with no information about the strength of the relationship or its importance.

Interpret interaction effects and know how to interpret main effects in the presence of interaction effects

Interaction effects account for a third variable that affects the impact of an IV on the DV. This could be something like the effect of a treatment differing between genders. A two way ANOVA with interaction effects would give you 3 f-values in your output

Interpret VIF (variance inflation factor):

Interpreting VIF: a score of 1 means all clear; scores above 1 up to 4 are still safe; a score of 5 is the threshold for VIF issues (prefer 4 and under); scores of 10 and up mean you need to do something about it - massive multicollinearity.

When is multicollinearity ok? And if we can fix it, why not just always fix it?

Issue: when two of our IVs are too correlated with each other, they "overlap" in their prediction of the DV - redundancy in predicted variance; both IVs are trying to explain the same variation in the DV. Extent of multicollinearity: ~ Little - not a problem / low correlation in IVs / little effect. Moderate - not usually a problem / medium correlation in IVs / affects regression coefficients. Strong - statistical consequences / often a problem if you want to estimate effects of individual X variables (i.e., regression coefficients).

Assumption of homogeneity of variance: Testing it with Levene's test

Levene's Test is similar to the var.test() we used with t-tests; however, it tests the variances of multiple groups.
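
A minimal R sketch using leveneTest() from the car package on hypothetical 3-group data:

set.seed(7)
df <- data.frame(group = factor(rep(c("A", "B", "C"), each = 20)),
                 y = rnorm(60))

library(car)                      # install.packages("car") if needed
leveneTest(y ~ group, data = df)  # large p-value: fail to reject equal group variances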

How do quadratic terms and basic transformations affect the interpretation of the model:

Log transformation: can be used to make highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics. Only the dependent/response variable is log-transformed
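
A minimal R sketch of log-transforming a right-skewed response (hypothetical data; the exp() data-generating step just manufactures the skew):

set.seed(8)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.4))   # y is right-skewed

fit_log <- lm(log(y) ~ x)
coef(fit_log)   # the slope is now on the log scale: a 1-unit increase in x
                # multiplies the predicted y by roughly exp(slope)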

Interpret both main effects and interaction effects

Main effect is the effect of one of just one of the independent variables on the dependent variable. An interaction effect occurs if there is an interaction between the independent variables that affect the dependent variable.

The relationship between correlation (r, the coefficient of correlation) and r-squared (r^2, the coefficient of determination):

More literal: The coefficient of determination, r^2, is the square of the Pearson correlation coefficient, r. More specific: The coefficient of determination, with respect to correlation, is the proportion of the variance that is shared by both variables. It gives a measure of the amount of variation that can be explained by the model (the correlation is the model); it does not by itself tell us whether the association is statistically significant.

Understand multicollinearity in multiple regression? Why it occurs?

Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

Why we do pairwise tests - what they tell us?

Multiple comparisons of means between groups (an IV with 3+ levels). Post-hoc pairwise comparisons can be used to determine which groups have statistically significant paired differences.

Know why/when you would use a non-parametric ANOVA

Nonparametric ANOVA: hypothesis tests of the median (Kruskal-Wallis, Mood's median test). Use when: your area of study is better represented by the median; you don't want to assume the data follow the normal distribution; or you have ordinal data, ranked data, or outliers that you can't remove.
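
A minimal R sketch of a Kruskal-Wallis test on hypothetical skewed data (kruskal.test() is base R):

set.seed(9)
df <- data.frame(group = factor(rep(c("A", "B", "C"), each = 20)),
                 y = rexp(60, rate = rep(c(1, 0.7, 0.5), each = 20)))  # skewed outcome

kruskal.test(y ~ group, data = df)  # small p-value: at least one group's distribution differs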

Why we might prefer a smaller model with fewer variables (parsimony)

Our reduced/smaller model needs to be nested inside our larger model. -Parsimony: there is no benefit to including more predictors that do not significantly improve the model

How the non-parametric ANOVA interpretation differs from regular one-way or two-way ANOVA

Parametric ANOVA: hypothesis tests of the mean (one-way ANOVA). Parametric tests can perform well with skewed and nonnormal distributions, can perform well when the spread of each group is different, and have more statistical power.

The probability of committing a type II error is equal to one minus the power of the test, also known as beta. The power of the test can be increased by increasing the sample size, which decreases the risk of committing a type II error (failing to reject a false null hypothesis).

Powers lower than 0.8, while not impossible, would typically be considered too low for most areas of research.

What is a good value of R-squared?

R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. ex. An R2 value of 0.9, means that 90 percent of the variation in the y data is due to variation in the x data.

Understand r-squared, what it represents, and interpret values of r-squared

R-squared is the proportion of variance in the DV (y) explained by the IV (x). It is computed the same way as in ANOVA, but instead of SSbetween we now use SSmodel. It ranges from 0-1; the closer the value is to 1, the more variance explained. It can be inflated by picking up chance variation in the sample that doesn't exist in the population. Ex: The adjusted R-squared indicates that the number of cylinders explains 58% of the variance in hwy mpg; cyl is a good predictor of hwy.

Understand alternate measures of error:

RSE: Residual Standard Error, √(RSS/df) - based on the residual sum of squares. It is the default error measure printed in the lm() summary. General interpretation: smaller error is better. We can compare RSE between models to see which predicts our y variable best (which has the smallest standard deviation of the residuals). RMSE: Root Mean Square Error - just as RSE is based on the square root of RSS, RMSE is the square root of the MSE. MAE: Mean Absolute Error - the average of the absolute values of the residuals.
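
A minimal R sketch computing all three by hand from an lm() fit (hypothetical data); the RSE line should match the "Residual standard error" printed by summary(fit):

set.seed(10)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100)
fit <- lm(y ~ x, data = df)

res  <- residuals(fit)
rse  <- sqrt(sum(res^2) / df.residual(fit))  # sqrt(RSS / residual df)
rmse <- sqrt(mean(res^2))                    # square root of the mean squared error
mae  <- mean(abs(res))                       # average absolute residual
c(RSE = rse, RMSE = rmse, MAE = mae)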

ANOVA vs T-test:

Repeatedly doing statistical testing on the same data leads to the multiple comparison problem. We always have a chance of obtaining statistical significance when there is, in fact, no relationship (alpha/Type I error). By doing multiple comparisons we inflate the possibility of error.

ANOVA: Understand within-group and between-group variance and which one is the residual variance.

Residuals are the distance from the individual observations to the group mean (Within variation) Within group variation is the denominator of the F-ratio (F- ratio = variance between/ variance within) The leftover part of variance in Y, not explained by X, is called the "residual" variance.

One Way ANOVA Table Formulas: SS

SS represents the sum of squared differences from the mean and is an extremely important term in statistics. Dividing an SS by its degrees of freedom gives a variance (a mean square).

The potential effects due to violation of assumption:

See individual Assumptions for potential effects and how to check Assumption #1: Linear and Additive: ~ If you fit a linear model to a non-linear, non-additive data set, the regression algorithm would fail to capture the trend mathematically, thus resulting in an inefficient model. Also, this will result in erroneous predictions on an unseen data set. How to check: Look for residual vs fitted value plots (explained below). Also, you can include polynomial terms (X, X², X³) in your model to capture the non-linear effect.
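
A minimal R sketch of spotting non-linearity in a residuals vs. fitted plot and adding a polynomial term (hypothetical data):

set.seed(11)
x <- runif(100, -3, 3)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(100)

fit_lin  <- lm(y ~ x)
plot(fitted(fit_lin), residuals(fit_lin))  # a curved pattern means the linear fit missed the trend

fit_quad <- lm(y ~ x + I(x^2))             # add a quadratic term to capture the curve
summary(fit_quad)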

Understanding the difference between a population and a sample, what is inference, and what are the threats to generalizability (bias in the sample, violations of assumptions) Understand the difference between statistical and substantive significance.

Statistical significance reflects the improbability of findings drawn from samples given certain assumptions about the null hypothesis. The statistical significance of any test result is determined by gauging the probability of getting a result at least this large if there were no underlying effect. The outcome of any test is a conditional probability, or p-value. If the p-value falls below a conventionally accepted threshold (say .05), we might judge the result to be statistically significant. Substantive significance is concerned with meaning, as in: what do the findings say about population effects themselves? The substantive significance of a result, in contrast, has nothing to do with the p-value and everything to do with the estimated effect size. Only when we know whether we're dealing with a large or trivial-sized effect will we be able to interpret its meaning and so speak to the substantive significance of our results.

Know the statistical hypotheses for a significance test of a correlation and interpret output to conclude if a correlation is statistically significant.

The hypotheses for this type of test are pretty basic - is the correlation coefficient significantly different from 0 (H0: ρ = 0 vs. HA: ρ ≠ 0)? This tells us nothing about the relative strength, only whether a significant effect exists (or not).
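
A minimal R sketch of this significance test (hypothetical data; cor.test() is base R):

set.seed(12)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)

cor.test(x, y)   # tests H0: rho = 0; a small p-value means a correlation exists,
                 # while the reported r estimate describes its strength and direction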

Assumption #2: Autocorrelation:

The presence of correlation in the error terms drastically reduces the model's accuracy. This usually occurs in time series models, where the next instant depends on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error. - If this happens, it causes confidence intervals and prediction intervals to be narrower. A narrower confidence interval means that a 95% confidence interval would have a probability lower than 0.95 of containing the actual value of the coefficients. Let's understand narrower prediction intervals with an example: - For example, the least squares coefficient of X1 is 15.02 and its standard error is 2.08 (without autocorrelation). But in the presence of autocorrelation, the standard error reduces to 1.20. As a result, the prediction interval narrows down to (13.82, 16.22) from (12.94, 17.10). - Also, lower standard errors would cause the associated p-values to be lower than they should be. This will make us incorrectly conclude that a parameter is statistically significant. - How to check: look at the Durbin-Watson (DW) statistic. It must lie between 0 and 4. DW = 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, while 2 < DW < 4 indicates negative autocorrelation. Also, you can look at a residual vs. time plot for a seasonal or correlated pattern in the residual values.
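
A minimal R sketch of the Durbin-Watson check using dwtest() from the lmtest package, with simulated autocorrelated errors (all data hypothetical):

set.seed(13)
time_idx <- 1:100
e <- as.numeric(arima.sim(list(ar = 0.7), n = 100))   # AR(1) errors: positive autocorrelation
y <- 1 + 0.5 * time_idx + e
fit <- lm(y ~ time_idx)

library(lmtest)   # install.packages("lmtest") if needed
dwtest(fit)       # DW well below 2 (with a small p-value) indicates positive autocorrelation
plot(time_idx, residuals(fit), type = "l")   # residual vs. time plot: look for a correlated pattern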

Sample Variance:

The variance is a numerical measure of how the data values are dispersed around the mean. Sample variance: s^2 = (1/(n-1)) Σ(x_i - x̄)^2, defined in terms of the sample mean x̄. Population Variance: ~ the population variance is defined in terms of the population mean μ and population size N: σ^2 = (1/N) Σ(x_i - μ)^2

LINEAR REGRESSION How the "best fit line" is calculated - sum of squared residuals

To describe the relationship between a numerical IV (predictor) and a numerical DV (outcome), calculate the slope and intercept of the line that "best fits" the combined distribution of the data by minimizing the sum of squares of the residuals (the error remaining that is not explained by the model). Ex: By running the lm() function, we fit the linear model that tells us our 𝛽0 and β1 to fill into this formula, which defines the line that reflects the relationship between our IV and DV.

Tukey HSD/Interpret results

Tukey HSD is a procedure in which multiple t-tests are performed with a correction for the "family-wise error rate"
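
A minimal R sketch of Tukey HSD after a one-way ANOVA, using hypothetical 3-group data (aov() and TukeyHSD() are base R):

set.seed(14)
df <- data.frame(group = factor(rep(c("A", "B", "C"), each = 20)),
                 y = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 12.5)))

fit <- aov(y ~ group, data = df)
TukeyHSD(fit)   # adjusted p-values ("p adj") below .05 flag significantly different pairs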

·Know the difference in interpretation when you have data on one sample from multiple time points (repeated measures) [one-ANOVA within subjects] vs. multiple "samples" (groupings of one variable) [one-way ANOVA between subjects]

Two-way or multiple-way ANOVA means all of the IVs should be categorical/groups and will predict our numerical DV. EXAMPLE: Wool (A & B) + Tension (H, M, L) predicts the DV breaks.

What two options are there for comparing individual means?

Two of the options for this are: Tukey HSD (Honest Significant Differences) Bonferroni Adjustment

VIF:

VIF: used to check for multicollinearity in multiple linear regression; an estimate of the amount of inflated variance in a coefficient due to multicollinearity in the model. The higher the VIF, the less reliable the regression model.

WHAT IF WE WANT TO COMPARE INDIVIDUAL MEANS?

We can compare our means pairwise, through a "post hoc" analysis. These methods use some adjustment so that the overall Type I error rate of ALL of the comparisons equals our chosen alpha (likely 0.05).

Why we "leave one out" and the purpose of dummy variables

We can't include dummy codes for all levels in our model - we need to "leave one out" to serve as the reference level. We need our categorical variable to be a number. How would we do this, even if it isn't ordinal? ~ We would create a series of "dummy variables" - 0/1 variables that indicate whether each observation is a member of that level. So for cyl we would potentially have 4 dummy variables. Beta coefficient: ~ the degree of change in the outcome variable for every 1-unit change in the predictor variable.
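
A minimal R sketch of dummy coding with a left-out reference level, using the built-in mtcars data (its cyl has 3 levels, so it's only an illustration of the idea, not the lab's dataset):

mtcars$cyl_f <- factor(mtcars$cyl)          # levels: 4, 6, 8
head(model.matrix(~ cyl_f, data = mtcars))  # R creates cyl_f6 and cyl_f8 dummies; level 4 is
                                            # "left out" as the reference (it becomes the intercept)

fit <- lm(mpg ~ cyl_f, data = mtcars)
coef(fit)   # each dummy coefficient = difference in mean mpg vs. the 4-cylinder reference level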

When would you not use a PostHoc test?

When ANOVA is not significant.

Understand what the purpose is of the F-test of the model, how to interpret it, and what it means about our model fit.

When the regression is conducted, an F-value, and significance level of that F-value, is computed. If the F-value is statistically significant (typically p < .05), the model explains a significant amount of variance in the outcome variable.

Know when we would use things like quadratic terms and basic transformations (log transformations):

Use a quadratic term when there's effectively an interaction between a variable and itself - the relationship looks like a parabola, as often seen with age variables.

Identify potential interaction effects graphically

Interaction effects can explain more of the variability in the dependent variable; ignoring them can also lead to bias in estimating model parameters. On an interaction plot, the place where the lines cross over each other is the place where the variables interact.
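
A minimal R sketch of an interaction plot using base R's interaction.plot() and the built-in warpbreaks data (wool A/B and tension L/M/H, matching the two-way ANOVA example above):

with(warpbreaks,
     interaction.plot(tension, wool, breaks))
# Roughly parallel lines suggest no interaction; crossing or clearly non-parallel lines
# suggest the effect of tension on breaks differs by wool type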

CORRELATIONS Understand what correlation is and what it tells us about the relationship between two variables

Correlation describes the relationship between two numerical variables. Correlations are a numerical representation of the strength of the relationship between two numerical variables - and in a way reflect the ability to predict the second variable given the value of the first variable.

Assumption of homogeneity of variance: How to interpret it.

We want to fail to reject the null, because it's easier if our variances are equal and we don't need to make an adjustment.

How to interpret the output of anova()? Ex:

## Model 1: value ~ weight
## Model 2: value ~ weight + clarity + color
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)
## 1    148 5569
## 2    146 3187  2      2381 54.5 <2e-16 ***

~ The result shows that model 3 (value ~ weight + clarity + color) did indeed provide a significantly better fit to the data compared to model 1 (value ~ weight). However, as we know from our previous analysis, model 3 is not significantly better than model 2.

Sample Mean:

x̄ = Σx (sum of all data values) / n (sample size). The sample mean is a statistic; it is used to estimate the true population parameter, μ. Accurate sample means come from samples that are randomized and include a sufficient number of people.

What that represents (what are the null/alternative hypotheses) What does it mean about the Beta coefficient if we reject null?

𝐻0: 𝛽1=0 -- The coefficient is not significantly different from zero. 𝐻𝐴: 𝛽1≠0 -- The coefficient is significantly different from zero. Ex: In this case the p-value (9.34 x 10^-8) is much smaller than an alpha of 0.05, so we reject the null hypothesis. This means that expenditures per capita is a significant predictor of crimes reported per million people.

What B_0(intercept) represents and what B_1 (or B_2 or B_3) represents (slope) and how to interpret them - (Cant find) and being able to interpret one coefficient in the presence of other IVs (multiple regression)?

𝛽0: the coefficient (intercept) is self-explanatory - it's the y-intercept of our line. Ex: When exp.per.cap.1960 equals 0, crime.per.million is predicted to be 14.4. 𝛽1: the coefficient value (the Estimate) is the slope of our line. We interpret this by saying "A one-unit increase in expenditures per capita is associated with a 0.89-unit increase in crimes reported per million residents."

