Stats Exam 2


Difference between R square and Adjusted R squared

Adjusted R squared takes into account the number of predictors in the model --> adjusted R squared will always be less than or equal to R squared. Concretely, adjusted R squared = 1 - (1 - R squared)(n - 1)/(n - (k+1)), so each added predictor is penalized.
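
A minimal sketch in R (the model formula and the data frame dat are hypothetical): both quantities are stored in the summary of a fitted lm object.

    fit <- lm(y ~ x1 + x2, data = dat)   # dat is a placeholder data frame
    summary(fit)$r.squared               # multiple R squared
    summary(fit)$adj.r.squared           # adjusted R squared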

Influential data points

An influential data point is one whose removal from the dataset would cause a large change in fit. A measure of the influence of individual data points on the regression is Cook's Distance. If D is greater than the 50th percentile of the F distribution with v1 = k+1 and v2 = n-(k+1) degrees of freedom, the point is considered influential.
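
A minimal sketch in R, assuming a fitted lm model named fit (a hypothetical name):

    k <- length(coef(fit)) - 1               # number of predictors
    n <- nobs(fit)                           # number of observations
    d <- cooks.distance(fit)                 # Cook's Distance for each observation
    cutoff <- qf(0.50, df1 = k + 1, df2 = n - (k + 1))  # 50th percentile of F
    which(d > cutoff)                        # points flagged as influential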

outliers

An outlier is defined as an observation with a z score greater than 3 or less than -3.
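
A minimal sketch in R (the vector y is hypothetical): scale() converts values to z scores.

    z <- scale(y)        # z scores: (y - mean(y)) / sd(y)
    which(abs(z) > 3)    # positions of outliers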

Categorical Predictors Terminology

Categorical predictors are called factors. Different categories within a factor are called levels or classes.

Rules for keeping terms in the model: Quadratic models

If the p-value for the coefficient of the quadratic term is significant, then the quadratic term is kept in the model along with the lower-order term. If the p-value for the coefficient of the quadratic term is not significant, the quadratic term is usually removed from the model.

When will the interpretation of beta0 (the intercept) make sense?

In general, beta0 will not have a practical interpretation unless it makes sense to set the values of the x's simultaneously equal to 0.

R^2 in multiple regression

In the context of multiple regression, R squared is called the multiple coefficient of determination. R squared is the fraction of the sample variation of y (measured by SSyy) that is explained by the least squares regression model; equivalently, R squared = 1 - SSE/SSyy.

Nested Models

Two models are nested if one model contains all the terms of the second model plus at least one additional term. The larger model is called the full or complete model; the smaller model is called the reduced model. With k predictors in the complete model and g in the reduced model, we test whether the group of (k-g) extra predictors should be kept in the model or not.

Interaction between two variables

Two variables interact when the slope of the regression line for the first variable changes when the second variable changes value. The effect on E(y) of a change in x1 (the slope) depends on the value of x2. If changing the value of x2 produces nonparallel regression lines, there is an interaction.

General interpretation of beta when predictor is binary

When x = 1, we expect y to be beta higher on average than when x = 0.

why does the formula for MSE use the denominator n-(k+1)?

With only one predictor, we had k = 1 --> n-(1+1) = n-2. Note: the value n-(k+1) will be used as the degrees of freedom for our multiple linear regression problems. Every time we add a new variable to our model, we lose a degree of freedom.

Mean Square Error (MSE)

The estimated variance of the random error. The larger the value of sigma squared, the greater the error will be in estimating the model parameters and in making predictions.

rules for keeping coefficients in the model with an interaction term

If the interaction coefficient is significant, then it is kept in the model along with the lower-order terms. If the p-value for the interaction is not significant, the interaction term is usually removed from the model.

nested model test hypotheses

null: "extra" parameters in full model are 0, don't contribute info for the prediction of y alternative: at least one of the parameters we omitted in the reduced model is not 0 if we reject null==use full model if we fail to reject null==use reduced

When is R squared a good measure of power for a model and why

Only if the sample size is considerably larger than the number of predictors we put in the model. If k is large compared to n, then our F statistic will be small. Adding predictors always yields a higher R squared, but more predictors won't necessarily help increase the F statistic.

quadratic (second order) models

When two variables x and y have a parabolic relationship, modeled as E(y) = beta0 + beta1*x + beta2*x^2.
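
A minimal sketch in R (variable names hypothetical); I() is needed so that ^2 is treated as arithmetic rather than formula syntax.

    fit_quad <- lm(y ~ x + I(x^2), data = dat)
    summary(fit_quad)   # check the p-value on the I(x^2) coefficient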

why is it not recommended to perform inferences on individual predictors in a multiple regression model?

--These slope estimates are only accurate when all other predictors are held constant.
--It's dangerous to conduct multiple parameter comparisons at the same time. Doing so increases our chance of a Type I error: with 95% confidence per comparison, the overall confidence is 0.95^2 = 90.25% with 2 comparisons and 0.95^3 ≈ 85.74% with 3 comparisons.

Why do we use variable selection methods?

--We want our models to be simple (meaning we want fewer x's).
--We want powerful models that predict y well (which usually means more x's).
Variable selection methods balance these competing goals.

Intuition behind test statistic in the nested models test

--As predictor variables are accumulated in our model, the R squared value will go up even if the predictors really aren't related to y, which means that SSE (the sum of the squared residuals) will go down.
--Since the reduced model is nested in the complete model, the complete model will always have a smaller SSE.
--We want to see if the difference in SSE is worth keeping all of those extra variables, hence SSEr - SSEc forms the numerator of the F statistic: F = [(SSEr - SSEc)/(k - g)] / [SSEc/(n - (k+1))].
--The rest of the formula is derived from probability theory.
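
A minimal sketch of the nested models test in R (model names hypothetical); anova() on two nested lm fits computes this F statistic automatically.

    fit_reduced <- lm(y ~ x1 + x2, data = dat)
    fit_full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
    anova(fit_reduced, fit_full)   # F test for the extra terms x3, x4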

How to find the coefficients in multiple linear regression

1) Set the partial derivatives equal to zero and solve the (k+1) resulting simultaneous linear equations.
2) Linear algebra: B = (X'X)^-1 X'Y, where X is a design matrix.
3) Technology, e.g. R: lm(y ~ x1 + x2 + ...)
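
A minimal sketch of option 2 in R (the vectors x1, x2, y are hypothetical); cbind(1, ...) attaches the column of 1's that makes X a design matrix.

    X <- cbind(1, x1, x2)                   # design matrix
    B <- solve(t(X) %*% X) %*% t(X) %*% y   # B = (X'X)^-1 X'Y
    B                                       # matches coef(lm(y ~ x1 + x2))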

The steps to using Principal Components Analysis (PCA) in a regression model

1. Normalizing (standardizing) the x's
2. Computing the principal components
3. Deciding how many principal components to use in the regression model
4. Fitting the model with PCs as predictors (see the sketch below)
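
A minimal sketch of these steps in R (the data frame dat and its columns are hypothetical):

    pc <- prcomp(dat[, c("x1", "x2", "x3")], scale. = TRUE)  # steps 1-2: standardize, compute PCs
    summary(pc)                        # step 3: inspect cumulative proportion of variance
    scores <- as.data.frame(pc$x)      # PC scores for each observation
    fit_pc <- lm(dat$y ~ PC1 + PC2, data = scores)  # step 4: regress y on the chosen PCs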

Assumptions of a linear regression model

1. The average value of the residuals is equal to zero.
2. The variance of the residuals is a constant value sigma squared for all possible values of x.
3. The distribution of the residuals is normal.
4. The errors associated with any two different observations are independent.

How to decide how many principal components to use in the regression model

1. Use as few principal components as possible and still explain at least 90% of the variation in the data.
2. Use as few principal components as possible and explain at least R squared percent of the variation, where R squared comes from your original model. In general, this is only used if R squared was more than 90%.
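
A minimal sketch of rule 1 in R, assuming a prcomp object named pc (as in the PCA sketch above):

    cumvar <- summary(pc)$importance["Cumulative Proportion", ]
    which(cumvar >= 0.90)[1]   # smallest number of PCs explaining at least 90%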

Methods of dealing with Multicollinearity

1. Variable selection (stepAIC(), adjusted R squared)
2. Principal Components Analysis (PCA)
3. LASSO/Ridge Regression (advanced topics)

Effects of Multicollinearity

1. It artificially inflates the standard errors of the parameter estimates.
2. This decreases our test statistics and increases our p-values (our tests are not valid).

With q predictors, how many possible models are there using only main effects?

2^q (this is because for each predictor there are two different options: either it is in the model, or it is not in the model). For example, q = 3 predictors gives 2^3 = 8 possible main-effects models.

Interaction model + interpretation

E(y) = beta0 + beta1*x1 + beta2*x2 + beta3*x1*x2
(beta1 + beta3*x2) represents the change in E(y) for every 1-unit increase in x1, holding x2 fixed.
(beta2 + beta3*x1) represents the change in E(y) for every 1-unit increase in x2, holding x1 fixed.
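
A minimal sketch in R (names hypothetical): x1*x2 in a formula expands to both main effects plus their interaction.

    fit_int <- lm(y ~ x1 * x2, data = dat)   # same as y ~ x1 + x2 + x1:x2
    summary(fit_int)                         # check the p-value on the x1:x2 term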

How many dummy variables are needed for a categorical x with k levels?

For k levels we need k-1 indicator variables. Ex.: if x has 3 levels (high, medium, low), we create a high dummy variable and a medium dummy variable, with low as the baseline level (absorbed into the intercept). Dummy variables = indicator variables.
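
A minimal sketch in R: lm() creates the k-1 dummies automatically for a factor, and model.matrix() shows them explicitly.

    grp <- factor(c("low", "medium", "high", "low"), levels = c("low", "medium", "high"))
    model.matrix(~ grp)   # intercept column plus k-1 = 2 dummies; low is the baseline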

Why do we use a value of 10 for the threshold on the VIF?

For predictor j, VIFj = 1/(1 - Rj^2), so VIFj = 10 corresponds to Rj^2 = 0.9 (solving our formula backwards). This means that 90% of the variation in predictor j is being explained by the other predictor variables in the model.
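
A minimal sketch in R (model name hypothetical); one common option is the vif() function from the car package.

    library(car)
    v <- vif(fit)          # one VIF per predictor
    names(which(v > 10))   # predictors above the threshold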

Model Utility Test/Global Test

Null: all slope coefficients (beta1, ..., betak) are equal to zero.
Alternative: at least one of the coefficients is nonzero.
To use the model, you want to reject the null; the test uses an F statistic.
This tests whether a model is statistically "useful," but useful does not necessarily mean "best": another model may prove even more useful in terms of providing more reliable estimates and predictions.

What is the F statistic?

Numerator = MS(Model), which represents the variability in y explained (or accounted for) by the model.
Denominator = MSE, the unexplained (or error) variability in the model.
F is the ratio of explained variability to unexplained variability; the larger the proportion of the total variability accounted for by the model, the larger the F statistic.
A larger F statistic means the model is more useful than no model at all for predicting y.
Numerator df = k; denominator df = n-(k+1).

Why do we use Z scores instead of original data?

Principal components are mixtures/combinations of the variables that we run through the PC procedure. If we wish to combine our values, they need to be on the same scale. Normalizing strips our variables of all units and puts them on the same scale.

True or False: PCA can only be done on quantitative variables

TRUE

Model Utility Test

The Model Utility Test (also known as a global, or omnibus, test) determines if at least one of the slope parameters is useful in predicting our response variable. Note: this does not test if every slope parameter is non-zero, nor if a specific slope parameter is non-zero.

Interpretation of a quadratic model

The estimated coefficient of x (beta1) will not, in general, have a meaningful interpretation in the quadratic model. beta2 (the coefficient on x^2) shows the concavity: do a t test with the alternative hypothesis that beta2 > 0 or beta2 < 0 (whichever is relevant) and see if you reject the null. Given very strong evidence of downward curvature, the y value increases more slowly per unit increase in x for subjects with higher levels of x than for subjects with lower levels of x.

Interpretation of coefficient in multiple linear regression

The interpretation of beta i (the slope coefficient for predictor xi) is the expected increase in y corresponding to a unit increase in xi when all other x's are held fixed

What does the stepAIC function do?

The stepAIC() function (using backward selection) starts with the full model and removes predictors one at a time until it finds the model that minimizes the AIC. At each step, R removes the variable whose removal gives the lowest AIC, repeating until removing variables does not reduce the AIC any further.
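
A minimal sketch in R (model and data names hypothetical); stepAIC() lives in the MASS package.

    library(MASS)
    fit_full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
    fit_best <- stepAIC(fit_full, direction = "backward")  # prints each removal step
    summary(fit_best)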

design matrix

a column of 1's attached to the data of our predictors

Principal Components Analysis (PCA)

A general technique for reducing the dimension of data that is very important in many fields of study.
--Principal components (PCs) are rescaled linear combinations of the predictor variables.
--If you have k predictors, you can compute k principal components PC1, PC2, PC3, ..., PCk, so that each PC is a function of the original predictors.
--While the original x's may have been correlated, the PCs are computed such that they are uncorrelated with each other.
--The second model uses fewer PCs than the original k predictors.

Variable Screening Method

a.k.a. variable selection or data reduction methods; useful when we have a very large number of predictor variables to potentially include in a regression model to predict our response y

3 ways to check the normality assumption

Assumption: the errors are normally distributed.
1. Make a histogram of the residuals and check if it looks bell shaped.
2. Construct a normal probability plot, in which a straight-line pattern indicates normality.
3. Conduct a formal hypothesis test such as the Shapiro-Wilk test, in which the null is normality and the alternative is non-normality.
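
A minimal sketch of all three checks in R (the model name fit is hypothetical):

    r <- resid(fit)
    hist(r)                # 1: look for a bell shape
    qqnorm(r); qqline(r)   # 2: straight-line pattern indicates normality
    shapiro.test(r)        # 3: a small p-value is evidence against normality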

Using R squared and adjusted R squared to see if a predictor is useful in the model

Compare multiple R squared and adjusted R squared with the predictor and without the predictor. Multiple R squared will go down when the predictor is removed, but if adjusted R squared goes up, then the predictor was not giving enough predictive power to be useful.

determining if there is an interaction from an interaction plot

crossing lines or different slopes = interaction
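
A minimal sketch in R (names hypothetical); interaction.plot() from base R plots the mean response for each combination of two factors.

    interaction.plot(x.factor = dat$x1, trace.factor = dat$x2, response = dat$y)
    # crossing or differently sloped lines suggest an interaction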

criterion-based variable selection

Searches through all possible 2^q models and picks the model that provides the best value of that particular criterion.
--Using adjusted R squared (finding the biggest adjusted R squared value among all possible models): only practical with a few predictors.
--Using the stepAIC() function in R: the AIC (Akaike information criterion) is a statistical criterion with the same goal as the adjusted R squared, and the smaller the AIC, the better the model performs.

Multicollinearity

some or many of the predictor variables in our model are correlated. When predictor variables are correlated, the information they provide overlaps or is redundant.

criteria

statistical quantities which can assess the balance of the predictive power of the model (for example, the reduction in SSE yielded by the model compared to smaller models) versus complexity (the number of predictors in the model).

sigma squared

Sigma squared is the variance of the random error: the variance of the probability distribution of the random error for a given set of values for x1, x2, ..., xk. It is the mean value of the squares of the deviations of the y-values (for given values of x1, x2, ...) about the mean value E(y).
It is an important measure of model utility: if sigma squared is large, there is larger deviation between the prediction equation and the mean value (not good).
Estimator: s squared = MSE = SSE/(n-(k+1)), where MSE is the mean square for error.
Root MSE (RMSE) = s (instead of s squared); it provides a rough estimate of the accuracy with which the model will predict future values of y for given values of the x's. We expect the model to predict y to within about +/- 2s (2 RMSE) units.

